feat(gfql): polars-gpu = GPU target of the lazy Polars engine (collect-once) [redo of #1654] #1655
Open
lmeyerov wants to merge 4 commits into
lmeyerov added a commit that referenced this pull request on Jun 28, 2026

…ragma GPU collect

#1655 changed-line-coverage gate (newly enforced once upstream jobs pass) flagged 8 GPU/polars dispatch lines. Fix honestly: a CPU test exercises the lazy collect()/collect_all() CPU path plus the POLARS branches of df_concat/df_cons/s_cons/df_to_engine (reachable but not hit by the coverage suites); only the 2 genuinely GPU-target collect lines (lf.collect(engine=gpu) / pl.collect_all(..., engine=gpu)) are pragma:no-cover (they need a device that CI lacks). Changed-line coverage of #1655 is back to ~100%.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Redo of the per-op GPU engine (#1654, a perf regression) as a TARGET of the lazy engine: engine='polars-gpu' runs the same single-deferred-plan + collect-once on the cudf_polars GPU backend. Tiny wiring on top of the lazy engine; the lazy/ framework already does target-aware collect.

- Engine.POLARS_GPU = 'polars-gpu' + POLARS_ENGINES; explicit opt-in (AUTO never picks it); frames stay pl.DataFrame (treated like POLARS in frame ops).
- compute/{hop,chain}.py dispatch: engine in (POLARS, POLARS_GPU) -> wrap the lazy call in target_mode(GPU if POLARS_GPU else CPU). ComputeMixin + gfql_unified same-path WHERE accept POLARS_GPU. engine='polars' (CPU) byte-for-byte unchanged.
- raise_on_fail=False (GPU-incapable nodes stay on CPU in polars; no pandas bridge).

dgx: parity engine='polars-gpu' == engine='polars' (test_engine_polars_gpu.py 36 passed); full gfql suite 2921 passed, 0 failed. Single-hop GPU 2.84x @1m (vs the per-op regression); chain-level GPU win currently dilutes (fwd+bwd 2 collects + eager combine) -> next opt.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…(P-B)

GFQL_POLARS_CPU_STREAMING=1 runs the polars-CPU lazy collects (hop/chain) on the streaming executor. Benchmarked (dgx, interleaved A/B, parity-identical): ~1.11x at 10M nodes/80M edges (20.0->18.0s), ~1.04x at 1M, but ~0.86x (slower) at 100K: the streaming overhead loses on small/interactive sizes. So default OFF (behavior unchanged); opt-in for large batch traversals.

From the blogpost perf-opt handoff item B (polars-CPU heavy-join scaling). The full streaming win in isolation is larger (80M 2-hop semijoin 1669->1040ms, 1.6x); the real chain dilutes it via the forward/backward/combine overhead.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
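A default-off environment gate like the one this commit describes might look like the sketch below. The helper name `cpu_collect_engine` and the exact parsing rule (only the literal value "1" enables streaming) are assumptions for illustration; the returned strings are values that Polars' `LazyFrame.collect(engine=...)` accepts.

```python
import os


def cpu_collect_engine(environ=None):
    """Pick the Polars CPU collect engine from the opt-in flag.

    Hypothetical helper: returns "streaming" only when
    GFQL_POLARS_CPU_STREAMING is exactly "1", otherwise Polars' default "auto".
    """
    env = os.environ if environ is None else environ
    if env.get("GFQL_POLARS_CPU_STREAMING") == "1":
        return "streaming"
    return "auto"
```

A call site would then pass the result through, for example `lf.collect(engine=cpu_collect_engine())`, so the default path is untouched unless the flag is set.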
…ragma GPU collect

#1655 changed-line-coverage gate (newly enforced once upstream jobs pass) flagged 8 GPU/polars dispatch lines. Fix honestly: a CPU test exercises the lazy collect()/collect_all() CPU path plus the POLARS branches of df_concat/df_cons/s_cons/df_to_engine (reachable but not hit by the coverage suites); only the 2 genuinely GPU-target collect lines (lf.collect(engine=gpu) / pl.collect_all(..., engine=gpu)) are pragma:no-cover (they need a device that CI lacks). Changed-line coverage of #1655 is back to ~100%.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…allback (NO-CHEATING)

The lazy GPU collect used pl.GPUEngine(raise_on_fail=False): any plan node the cudf_polars backend cannot execute silently ran on CPU and was still reported as a polars-gpu result -- so engine='polars-gpu' was indistinguishable from engine='polars' whenever the plan was not fully GPU-capable. A bulk bench showing near-identical polars/polars-gpu timings is exactly this tell.

Flip to raise_on_fail=True and translate the cudf_polars failure into a clear NotImplementedError pointing at engine='polars'. polars-gpu is now GPU-or-error: any timing it produces is real on-device work, never CPU mislabeled as GPU.

Verified on dgx-spark (LiveJournal 35M): the seeded hop / 2-hop chain plan runs fully on GPU without raising (nvidia-smi 92% util), so existing GPU timings are unchanged -- only the honesty guarantee is added. +1 regression test covering the translated error path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
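The GPU-or-error translation in this commit could be sketched as below. It is a standalone illustration only: `GpuUnsupportedError` is a hypothetical stand-in for whatever exception the cudf_polars backend raises under raise_on_fail=True, and `collect_gpu_or_error` is an assumed helper name, not the function in this PR.

```python
class GpuUnsupportedError(Exception):
    """Hypothetical stand-in for the backend's 'cannot run this node on GPU' failure."""


def collect_gpu_or_error(collect):
    """Run a GPU collect callable; never fall back to CPU silently.

    On an unsupported plan, raise NotImplementedError that points the caller
    at the CPU engine, chaining the original failure for debugging.
    """
    try:
        return collect()
    except GpuUnsupportedError as exc:
        raise NotImplementedError(
            "engine='polars-gpu' could not execute this plan fully on GPU; "
            "use engine='polars' for CPU execution"
        ) from exc
```

With this shape, a caller either gets a result that was computed on-device or an explicit error, so a benchmark cannot record CPU work under the GPU label.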
Summary
GPU execution target of the lazy Polars engine (engine='polars-gpu'): the redo of the per-op GPU PR (#1654, which was a perf regression) as a thin target of the lazy engine on #1648. engine='polars' and engine='polars-gpu' are now one lazy engine with two targets: build a single deferred pl.LazyFrame plan per single-hop and collect_all it once on CPU or GPU.
Why this replaces #1654
Benchmark showed per-op eager GPU collect was a regression (each op re-transfers tables host-to-device, H2D). Collect-once is the fix: single-hop GPU is now a 2.84× win @1m (vs eager) with CPU parity. The win flows from the lazy engine on #1648; this PR just selects the GPU target.
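To make the transfer argument concrete, here is a toy model in plain Python. It is not Polars or cudf_polars code: `CountingDevice`, `run_per_op`, and `run_collect_once` are hypothetical names, and the "device" only counts uploads. It shows why a per-op collect pays one host-to-device transfer per operation while a single deferred plan pays one in total.

```python
class CountingDevice:
    """Toy device that only counts host-to-device (H2D) uploads."""

    def __init__(self):
        self.uploads = 0

    def upload(self, table):
        self.uploads += 1
        return list(table)  # pretend this copy now lives on the device


def run_per_op(table, ops, device):
    """Eager per-op collect: each op re-uploads its input and materializes on the host."""
    for op in ops:
        table = op(device.upload(table))
    return table


def run_collect_once(table, ops, device):
    """Deferred plan: upload once, run every op on the device, materialize once."""
    on_device = device.upload(table)
    for op in ops:
        on_device = op(on_device)
    return on_device


# A three-step "plan": filter, project, dedupe.
ops = [
    lambda t: [x for x in t if x % 2 == 0],
    lambda t: [x * 10 for x in t],
    lambda t: sorted(set(t)),
]

eager_dev, lazy_dev = CountingDevice(), CountingDevice()
eager_result = run_per_op([4, 1, 2, 0, 2], ops, eager_dev)
lazy_result = run_collect_once([4, 1, 2, 0, 2], ops, lazy_dev)
```

Both strategies return the same rows; only the upload count differs (three versus one here), which is the cost the collect-once design removes.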
Design (contained)
- Engine.POLARS_GPU='polars-gpu', explicit opt-in only (AUTO never selects it); frames stay pl.DataFrame (treated like POLARS in all frame ops). Extends POLARS_ENGINES (introduced on #1648, feat(gfql): native lazy Polars engine, collect-once traversals + cypher row pipeline) to include the GPU target, so every engine-aware helper covers it for free.
- target_mode(GPU) (the lazy/ framework's collect/collect_all already run on the active target). engine='polars' (CPU) is byte-for-byte unchanged.
- raise_on_fail=False: GPU-incapable nodes stay on CPU in Polars (no pandas bridge; NO-CHEATING). Uses the cudf-polars in-memory executor (executor="in-memory"), which is faster and more stable than the default streaming engine="gpu" for in-device-memory GFQL results.
Also here: opt-in CPU streaming collect
GFQL_POLARS_CPU_STREAMING=1 runs the polars-CPU lazy collects on the streaming executor (~1.04–1.11× faster on large multi-hop traversals, parity-identical), default off because small/interactive sizes regress (~0.86× from streaming overhead). No change to default behavior.
Honest scope
Single-hop GPU wins; the chain-level GPU win currently dilutes (a chain runs forward+backward = 2 hop collects + eager _combine_*). Fusing those and moving combine onto the target is the next benchmark-driven opt on this PR.
Validation (dgx, RAPIDS --gpus all)
- engine='polars-gpu' == engine='polars': test_engine_polars_gpu.py (skips without cudf_polars).
- graphistry/tests/compute/gfql/ polars suite green; CPU engine='polars' is byte-for-byte unchanged from #1648 (the GPU target adds only the 2 GPU/streaming commits on top).
Stacks on #1648 (lazy Polars engine) → #1652 (general opts) → master. Supersedes #1654 (per-op GPU).
🤖 Generated with Claude Code