feat(gfql): polars-gpu = GPU target of the lazy Polars engine (collect-once) [redo of #1654]#1655

Open
lmeyerov wants to merge 4 commits into dev/gfql-polars-engine from dev/gfql-lazy-gpu

@lmeyerov lmeyerov commented Jun 27, 2026

Contributor

Summary

GPU execution target of the lazy Polars engine (engine='polars-gpu'). This redoes the per-op GPU PR (#1654, which was a perf regression) as a thin target of the lazy engine on #1648. engine='polars' and engine='polars-gpu' are now one lazy engine with two targets: each single-hop builds a single deferred pl.LazyFrame plan and collects it once via collect_all, on CPU or GPU.

Why this replaces #1654

Benchmarks showed that per-op eager GPU collect was a regression, because each op re-transfers its tables host-to-device (H2D). Collect-once is the fix: single-hop GPU is now a 2.84× win at 1M (vs eager) with CPU parity. The win flows from the lazy engine on #1648; this PR just selects the GPU target.
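
To make the regression concrete, here is a toy model that counts host-to-device uploads under the two strategies. It is a minimal sketch and not the GFQL implementation: `ToyDevice`, `run_eager_per_op`, and `run_collect_once` are hypothetical names, and an "upload" is only a counter increment standing in for copying input tables to the GPU.

```python
from dataclasses import dataclass
from typing import Callable, List

# An op is any pure table -> table function; a "table" here is just a list of ints.
Op = Callable[[List[int]], List[int]]


@dataclass
class ToyDevice:
    """Hypothetical device that counts simulated host-to-device (H2D) uploads."""

    h2d_transfers: int = 0

    def run(self, ops: List[Op], table: List[int]) -> List[int]:
        # One upload per device call, however many ops that call contains.
        self.h2d_transfers += 1
        for op in ops:
            table = op(table)
        return table


def run_eager_per_op(device: ToyDevice, ops: List[Op], table: List[int]) -> List[int]:
    # Per-op eager collect: every op is its own device call, so the
    # intermediate table is uploaded again before each op.
    for op in ops:
        table = device.run([op], table)
    return table


def run_collect_once(device: ToyDevice, ops: List[Op], table: List[int]) -> List[int]:
    # Collect-once: the whole deferred plan goes to the device in one call.
    return device.run(ops, table)
```

With a three-op plan, the eager variant performs three uploads and the collect-once variant performs one, while both return the same result. Real transfer cost depends on table sizes and on what the backend keeps resident, which this model ignores.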

Design (contained)

  • Engine.POLARS_GPU='polars-gpu', explicit opt-in only (AUTO never selects it); frames stay pl.DataFrame (treated like POLARS in all frame ops). Extends POLARS_ENGINES (introduced on #1648) to include the GPU target, so every engine-aware helper covers it for free.
  • Dispatch wraps the lazy hop/chain in target_mode(GPU) (the lazy/ framework's collect/collect_all already run on the active target). engine='polars' (CPU) is byte-for-byte unchanged.
  • raise_on_fail=False — GPU-incapable nodes stay on CPU in Polars (no pandas bridge; NO-CHEATING). Uses the cudf-polars in-memory executor (executor="in-memory") — faster + more stable than the default streaming engine="gpu" for in-device-memory GFQL results.
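
The dispatch described in the bullets above can be sketched roughly as follows. This is an illustration under assumptions, not the code in this PR: only `Engine.POLARS_GPU`, `POLARS_ENGINES`, and `target_mode` are names taken from the description; `Target`, `active_target`, `dispatch_hop`, the `PANDAS` member, and the use of a `ContextVar` are hypothetical.

```python
from contextlib import contextmanager
from contextvars import ContextVar
from enum import Enum
from typing import Callable, Iterator, TypeVar

T = TypeVar("T")


class Engine(Enum):
    PANDAS = "pandas"          # illustrative non-Polars engine
    POLARS = "polars"
    POLARS_GPU = "polars-gpu"  # explicit opt-in only


# Engine-aware helpers test membership here, so the GPU target is covered too.
POLARS_ENGINES = frozenset({Engine.POLARS, Engine.POLARS_GPU})


class Target(Enum):
    CPU = "cpu"
    GPU = "gpu"


# Assumption: the active target lives in a context variable, CPU by default.
_active_target: ContextVar[Target] = ContextVar("active_target", default=Target.CPU)


@contextmanager
def target_mode(target: Target) -> Iterator[None]:
    """Select the collect target for the duration of the with-block."""
    token = _active_target.set(target)
    try:
        yield
    finally:
        _active_target.reset(token)  # restore the previous target, even on error


def active_target() -> Target:
    return _active_target.get()


def dispatch_hop(engine: Engine, lazy_hop: Callable[[], T]) -> T:
    """Run the same lazy hop; only the collect target differs per engine."""
    if engine not in POLARS_ENGINES:
        raise ValueError(f"not a Polars engine: {engine.value}")
    target = Target.GPU if engine is Engine.POLARS_GPU else Target.CPU
    with target_mode(target):
        return lazy_hop()
```

In this sketch the lazy hop never sees the engine; it only asks `active_target()` when it collects, which is one way to keep the CPU path unchanged while adding a second target.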

Also here: opt-in CPU streaming collect

GFQL_POLARS_CPU_STREAMING=1 runs the polars-CPU lazy collects on the streaming executor (~1.04–1.11× faster on large multi-hop traversals, parity-identical), default off because small/interactive sizes regress (~0.86× from streaming overhead). No change to default behavior.
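
A minimal sketch of how such an opt-in flag could be read. The function name `cpu_collect_engine` is hypothetical, and treating every value other than "1" as off is an assumption; only the variable name and the `=1` form come from this PR.

```python
import os
from typing import Mapping, Optional

STREAMING_FLAG = "GFQL_POLARS_CPU_STREAMING"


def cpu_collect_engine(environ: Optional[Mapping[str, str]] = None) -> str:
    """Return the engine name to pass to a CPU lazy collect.

    Assumption: only the exact value "1" opts in; anything else keeps the
    default behavior, so unset environments are unchanged.
    """
    env = os.environ if environ is None else environ
    return "streaming" if env.get(STREAMING_FLAG, "") == "1" else "auto"


# Intended use (assumes a Polars version whose collect accepts engine=...):
#   lf.collect(engine=cpu_collect_engine())
```

Reading the flag at collect time rather than at import time lets a long-running process toggle it between batch and interactive workloads; whether the PR does this is not stated.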

Honest scope

Single-hop GPU wins; the chain-level GPU win currently dilutes (a chain runs forward+backward = 2 hop collects + eager _combine_*). Fusing those + moving combine onto the target is the next benchmark-driven opt on this PR.
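
The dilution can be reasoned about with Amdahl's law: if only part of the chain time is spent in the accelerated hop collects, the rest (for example the eager combine) caps the overall win. The sketch below is illustrative only; the hop fraction of chain time is not reported in this PR, so any value passed as `hop_fraction` is a hypothetical input.

```python
def chain_speedup(hop_fraction: float, hop_speedup: float) -> float:
    """Amdahl's law: overall speedup when only `hop_fraction` of the
    original runtime is accelerated by a factor of `hop_speedup`."""
    if not 0.0 <= hop_fraction <= 1.0:
        raise ValueError("hop_fraction must be between 0 and 1")
    if hop_speedup <= 0.0:
        raise ValueError("hop_speedup must be positive")
    # The un-accelerated share (1 - hop_fraction) runs at its old speed.
    return 1.0 / ((1.0 - hop_fraction) + hop_fraction / hop_speedup)


if __name__ == "__main__":
    # Hypothetical hop fractions; 2.84 is the single-hop figure quoted above.
    for fraction in (0.25, 0.5, 0.75, 1.0):
        print(fraction, round(chain_speedup(fraction, 2.84), 2))
```

Under this model, moving the combine step onto the target raises the accelerated fraction, which is why fusing the collects and the combine is the natural next step.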

Validation (dgx, RAPIDS --gpus all)

Stacks on #1648 (lazy Polars engine) → #1652 (general opts) → master. Supersedes #1654 (per-op GPU).

⚠️ This branch was force-rewritten on 2026-06-28 (restack moved the CPU conformance fixes down to #1648). If you have it checked out: git fetch && git reset --hard origin/dev/gfql-lazy-gpu.

🤖 Generated with Claude Code

@lmeyerov lmeyerov force-pushed the dev/gfql-polars-engine branch from dcce5fa to be04687 Compare June 27, 2026 15:11
@lmeyerov lmeyerov force-pushed the dev/gfql-lazy-gpu branch from d93976c to a7df2f2 Compare June 27, 2026 15:11
@lmeyerov lmeyerov force-pushed the dev/gfql-polars-engine branch from be04687 to aefd073 Compare June 27, 2026 16:35
@lmeyerov lmeyerov force-pushed the dev/gfql-lazy-gpu branch 6 times, most recently from d8b2074 to a684248 Compare June 27, 2026 17:35
@lmeyerov lmeyerov force-pushed the dev/gfql-polars-engine branch 2 times, most recently from 35f65b5 to e9f29bd Compare June 27, 2026 18:12
@lmeyerov lmeyerov force-pushed the dev/gfql-lazy-gpu branch 4 times, most recently from ea0cf19 to 5529138 Compare June 28, 2026 07:34
@lmeyerov lmeyerov force-pushed the dev/gfql-lazy-gpu branch from 5529138 to 194671a Compare June 28, 2026 08:09
@lmeyerov lmeyerov force-pushed the dev/gfql-polars-engine branch from 32799c6 to 20f58ca Compare June 28, 2026 20:00
@lmeyerov lmeyerov force-pushed the dev/gfql-lazy-gpu branch 2 times, most recently from e5321e4 to 5c60a46 Compare June 28, 2026 20:27
lmeyerov added a commit that referenced this pull request Jun 28, 2026
…ragma GPU collect

#1655 changed-line-coverage gate (newly enforced once upstream jobs pass) flagged 8
GPU/polars dispatch lines. Fix honestly: a CPU test exercises the lazy collect()/
collect_all() CPU path + the POLARS branches of df_concat/df_cons/s_cons/df_to_engine
(reachable but not hit by the coverage suites); only the 2 genuinely GPU-target collect
lines (lf.collect(engine=gpu) / pl.collect_all(..., engine=gpu)) are pragma:no-cover
(need a device CI lacks). Changed-line coverage of #1655 back to ~100%.
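
A sketch of the coverage pattern described above, assuming coverage.py with its default `# pragma: no cover` exclusion. The function `collect_plan` and its parameters are hypothetical; the point is that only the device-only line is excluded while the CPU path stays measured.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def collect_plan(run_cpu: Callable[[], T], run_gpu: Callable[[], T], use_gpu: bool) -> T:
    """Collect a plan on the requested target."""
    if use_gpu:
        # Excluded from coverage: this line can only run on a GPU device.
        return run_gpu()  # pragma: no cover
    # The CPU path stays under coverage and is exercised by a CPU test.
    return run_cpu()
```

Keeping the pragma on the single device-only line, rather than on the `if`, means the branch condition itself is still counted.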

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@lmeyerov lmeyerov force-pushed the dev/gfql-polars-engine branch from 318d2cc to fd2a620 Compare June 28, 2026 20:51
@lmeyerov lmeyerov force-pushed the dev/gfql-lazy-gpu branch from 36cf1fd to 7207dc8 Compare June 28, 2026 20:51
@lmeyerov lmeyerov force-pushed the dev/gfql-polars-engine branch from fd2a620 to 2dcde42 Compare June 28, 2026 21:02
@lmeyerov lmeyerov force-pushed the dev/gfql-lazy-gpu branch from 7207dc8 to 8591416 Compare June 28, 2026 21:02
@lmeyerov lmeyerov force-pushed the dev/gfql-polars-engine branch from 2dcde42 to 365ac8f Compare June 28, 2026 21:59
@lmeyerov lmeyerov force-pushed the dev/gfql-lazy-gpu branch from 8591416 to 3d47a07 Compare June 28, 2026 21:59
@lmeyerov lmeyerov force-pushed the dev/gfql-polars-engine branch from 365ac8f to 4c720af Compare June 29, 2026 03:14
@lmeyerov lmeyerov force-pushed the dev/gfql-lazy-gpu branch from 16d5beb to eea6c96 Compare June 29, 2026 03:14
@lmeyerov lmeyerov force-pushed the dev/gfql-lazy-gpu branch from eea6c96 to 38a08e7 Compare June 29, 2026 23:08
lmeyerov and others added 4 commits June 30, 2026 18:13
Redo of the per-op GPU engine (#1654, a perf regression) as a TARGET of the lazy
engine: engine='polars-gpu' runs the same single-deferred-plan + collect-once on
the cudf_polars GPU backend. Tiny wiring on top of the lazy engine — the lazy/
framework already does target-aware collect.

- Engine.POLARS_GPU = 'polars-gpu' + POLARS_ENGINES; explicit opt-in (AUTO never
  picks it); frames stay pl.DataFrame (treated like POLARS in frame ops).
- compute/{hop,chain}.py dispatch: engine in (POLARS, POLARS_GPU) -> wrap the lazy
  call in target_mode(GPU if POLARS_GPU else CPU). ComputeMixin + gfql_unified
  same-path WHERE accept POLARS_GPU. engine='polars' (CPU) byte-for-byte unchanged.
- raise_on_fail=False (GPU-incapable nodes stay on CPU in polars; no pandas bridge).

dgx: parity engine='polars-gpu' == engine='polars' (test_engine_polars_gpu.py 36
passed); full gfql suite 2921 passed, 0 failed. Single-hop GPU 2.84x @1m (vs the
per-op regression); chain-level GPU win currently dilutes (fwd+bwd 2 collects +
eager combine) -> next opt.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…(P-B)

GFQL_POLARS_CPU_STREAMING=1 runs the polars-CPU lazy collects (hop/chain) on the
streaming executor. Benchmarked (dgx, interleaved A/B, parity-identical): ~1.11x at
10M nodes/80M edges (20.0->18.0s), ~1.04x at 1M, but ~0.86x (slower) at 100K — the
streaming overhead loses on small/interactive sizes. So default OFF (behavior
unchanged); opt-in for large batch traversals.

From the blogpost perf-opt handoff item B (polars-CPU heavy-join scaling). The full
streaming win in isolation is larger (80M 2-hop semijoin 1669->1040ms, 1.6x); the
real chain dilutes it via the forward/backward/combine overhead.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ragma GPU collect

#1655 changed-line-coverage gate (newly enforced once upstream jobs pass) flagged 8
GPU/polars dispatch lines. Fix honestly: a CPU test exercises the lazy collect()/
collect_all() CPU path + the POLARS branches of df_concat/df_cons/s_cons/df_to_engine
(reachable but not hit by the coverage suites); only the 2 genuinely GPU-target collect
lines (lf.collect(engine=gpu) / pl.collect_all(..., engine=gpu)) are pragma:no-cover
(need a device CI lacks). Changed-line coverage of #1655 back to ~100%.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…allback (NO-CHEATING)

The lazy GPU collect used pl.GPUEngine(raise_on_fail=False): any plan node the
cudf_polars backend cannot execute silently ran on CPU and was still reported as
a polars-gpu result -- so engine='polars-gpu' was indistinguishable from
engine='polars' whenever the plan was not fully GPU-capable. A bulk bench showing
near-identical polars/polars-gpu timings is exactly this tell.

Flip to raise_on_fail=True and translate the cudf_polars failure into a clear
NotImplementedError pointing at engine='polars'. polars-gpu is now GPU-or-error:
any timing it produces is real on-device work, never CPU mislabeled as GPU.
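
The GPU-or-error translation might look like the following sketch. It does not call Polars: `GpuUnsupportedError` is a hypothetical stand-in for whatever the cudf_polars backend raises under `pl.GPUEngine(raise_on_fail=True)`, and `collect_gpu_or_error` is an illustrative name, not the helper added in this commit.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class GpuUnsupportedError(RuntimeError):
    """Hypothetical stand-in for the backend's 'cannot run on GPU' failure."""


def collect_gpu_or_error(gpu_collect: Callable[[], T]) -> T:
    """Run a GPU collect; never fall back to CPU silently.

    `gpu_collect` is assumed to raise GpuUnsupportedError when any plan
    node cannot execute on the device.
    """
    try:
        return gpu_collect()
    except GpuUnsupportedError as exc:
        # Translate into a clear, actionable error and keep the original
        # failure chained as the cause for debugging.
        raise NotImplementedError(
            "engine='polars-gpu' could not run this plan fully on the GPU; "
            "use engine='polars' to run it on CPU"
        ) from exc
```

Catching only the backend's unsupported-plan failure, rather than every exception, keeps unrelated bugs from being reported as a GPU capability gap.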

Verified on dgx-spark (LiveJournal 35M): the seeded hop / 2-hop chain plan runs
fully on GPU without raising (nvidia-smi 92% util), so existing GPU timings are
unchanged -- only the honesty guarantee is added. +1 regression test covering the
translated error path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@lmeyerov lmeyerov force-pushed the dev/gfql-polars-engine branch from f7dcde6 to 8577f4f Compare July 1, 2026 01:19
@lmeyerov lmeyerov force-pushed the dev/gfql-lazy-gpu branch from 38a08e7 to c5bc2fd Compare July 1, 2026 01:20