feat(gfql): polars-gpu = GPU target of the lazy Polars engine (collect-once) [redo of #1654] by lmeyerov · Pull Request #1655 · graphistry/pygraphistry

lmeyerov · 2026-06-27T02:06:05Z

Summary

Adds engine='polars-gpu', the GPU execution target of the lazy Polars engine. It redoes the per-op GPU PR (#1654, which was a perf regression) as a thin target of the lazy engine on #1648. engine='polars' and engine='polars-gpu' are now one lazy engine with two targets: build a single deferred pl.LazyFrame plan per single hop, then collect_all it once on CPU or GPU.

Why this replaces #1654

Benchmarks showed that per-op eager GPU collect was a regression, because each op re-transfers its tables host-to-device (H2D). Collect-once is the fix: single-hop GPU is now a 2.84× win at 1M (vs. eager), with CPU parity. The win comes from the lazy engine on #1648; this PR only selects the GPU target.

Design (contained)

Engine.POLARS_GPU='polars-gpu', explicit opt-in only (AUTO never selects it); frames stay pl.DataFrame (treated like POLARS in all frame ops). Extends POLARS_ENGINES (introduced on #1648) to include the GPU target, so every engine-aware helper covers it for free.
Dispatch wraps the lazy hop/chain in target_mode(GPU) (the lazy/ framework's collect/collect_all already run on the active target). engine='polars' (CPU) is byte-for-byte unchanged.
raise_on_fail=False — GPU-incapable nodes stay on CPU in Polars (no pandas bridge; NO-CHEATING). Uses the cudf-polars in-memory executor (executor="in-memory") — faster + more stable than the default streaming engine="gpu" for in-device-memory GFQL results.
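The dispatch described above can be sketched roughly as follows. The names `Engine`, `POLARS_ENGINES`, and `target_mode` come from the PR text, but their bodies here are assumptions for illustration, not the actual pygraphistry implementation; `dispatch_hop` and `active_target` are hypothetical helpers.

```python
from contextlib import contextmanager
from contextvars import ContextVar
from enum import Enum


class Engine(Enum):
    PANDAS = "pandas"
    POLARS = "polars"
    POLARS_GPU = "polars-gpu"


# Engine-aware helpers test membership here, so adding the GPU
# target to this set covers them without further changes.
POLARS_ENGINES = frozenset({Engine.POLARS, Engine.POLARS_GPU})

# Assumed mechanism: the active collect target lives in a context variable.
_target = ContextVar("target", default="cpu")


@contextmanager
def target_mode(target):
    """Set the collect target for the duration of the block."""
    token = _target.set(target)
    try:
        yield
    finally:
        _target.reset(token)


def active_target():
    return _target.get()


def dispatch_hop(engine, lazy_hop):
    """Run a lazy hop callable under the target implied by the engine."""
    if engine not in POLARS_ENGINES:
        raise ValueError(f"not a Polars engine: {engine}")
    target = "gpu" if engine is Engine.POLARS_GPU else "cpu"
    with target_mode(target):
        return lazy_hop()
```

For example, `dispatch_hop(Engine.POLARS_GPU, active_target)` returns `"gpu"`, and the target falls back to `"cpu"` once the block exits.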

Also here: opt-in CPU streaming collect

GFQL_POLARS_CPU_STREAMING=1 runs the polars-CPU lazy collects on the streaming executor (~1.04–1.11× faster on large multi-hop traversals, parity-identical). It is off by default because small/interactive sizes regress (~0.86× from streaming overhead), so default behavior is unchanged.
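One way such an opt-in flag could be read, as a sketch only. The environment variable name is from the PR; the helper names, the rule that only the exact value `1` enables it, and the mapping to `engine="streaming"` collect keyword arguments are assumptions.

```python
import os


def cpu_streaming_enabled(environ=None):
    """Return True only when the opt-in flag is set to exactly '1'."""
    env = os.environ if environ is None else environ
    return env.get("GFQL_POLARS_CPU_STREAMING", "0") == "1"


def cpu_collect_kwargs(environ=None):
    """Keyword arguments for a CPU collect call (hypothetical helper).

    An empty dict means the default in-memory collect is used.
    """
    if cpu_streaming_enabled(environ):
        return {"engine": "streaming"}
    return {}
```

With the flag unset, `cpu_collect_kwargs()` returns `{}`, so the default path is untouched.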

Honest scope

Single-hop GPU wins; the chain-level GPU win is currently diluted, because a chain runs a forward and a backward pass (2 hop collects) plus an eager _combine_* step. Fusing those collects and moving the combine onto the target is the next benchmark-driven optimization on this PR.
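The dilution can be reasoned about with a simple Amdahl-style model: if only a fraction of chain time is spent in the accelerated hop collects, the end-to-end speedup is bounded by the remainder. This is a back-of-envelope sketch; the fractions used are hypothetical, and only the 2.84× single-hop figure comes from the PR.

```python
def chain_speedup(accelerated_fraction, stage_speedup):
    """End-to-end speedup when only part of the runtime is accelerated.

    accelerated_fraction: share of baseline time spent in the accelerated
        stage (hypothetical here; the PR does not report it).
    stage_speedup: speedup of that stage in isolation.
    """
    remaining = 1.0 - accelerated_fraction
    return 1.0 / (remaining + accelerated_fraction / stage_speedup)


# Hypothetical fractions, paired with the PR's 2.84x single-hop speedup.
for fraction in (1.0, 0.75, 0.5, 0.25):
    print(f"{fraction:.2f} of time in hop collects -> "
          f"{chain_speedup(fraction, 2.84):.2f}x end to end")
```

Under this model, moving the combine step onto the target raises the accelerated fraction, which is why it is the natural next optimization.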

Validation (dgx, RAPIDS `--gpus all`)

Parity engine='polars-gpu' == engine='polars': test_engine_polars_gpu.py (skips without cudf_polars).
Full graphistry/tests/compute/gfql/ polars suite is green; CPU engine='polars' is byte-for-byte unchanged from #1648 (the GPU target adds only the 2 GPU/streaming commits on top).

Stacks on #1648 (lazy Polars engine) → #1652 (general opts) → master. Supersedes #1654 (per-op GPU).

⚠️ This branch was force-rewritten on 2026-06-28 (restack moved the CPU conformance fixes down to #1648). If you have it checked out: git fetch && git reset --hard origin/dev/gfql-lazy-gpu.

🤖 Generated with Claude Code

Redo of the per-op GPU engine (#1654, a perf regression) as a TARGET of the lazy engine: engine='polars-gpu' runs the same single-deferred-plan + collect-once on the cudf_polars GPU backend. Tiny wiring on top of the lazy engine; the lazy/ framework already does target-aware collect.

- Engine.POLARS_GPU = 'polars-gpu' + POLARS_ENGINES; explicit opt-in (AUTO never picks it); frames stay pl.DataFrame (treated like POLARS in frame ops).
- compute/{hop,chain}.py dispatch: engine in (POLARS, POLARS_GPU) -> wrap the lazy call in target_mode(GPU if POLARS_GPU else CPU). ComputeMixin + gfql_unified same-path WHERE accept POLARS_GPU. engine='polars' (CPU) byte-for-byte unchanged.
- raise_on_fail=False (GPU-incapable nodes stay on CPU in polars; no pandas bridge).

dgx: parity engine='polars-gpu' == engine='polars' (test_engine_polars_gpu.py 36 passed); full gfql suite 2921 passed, 0 failed. Single-hop GPU 2.84x @1m (vs the per-op regression); chain-level GPU win currently dilutes (fwd+bwd 2 collects + eager combine) -> next opt.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…(P-B)

GFQL_POLARS_CPU_STREAMING=1 runs the polars-CPU lazy collects (hop/chain) on the streaming executor.

Benchmarked (dgx, interleaved A/B, parity-identical): ~1.11x at 10M nodes/80M edges (20.0->18.0s), ~1.04x at 1M, but ~0.86x (slower) at 100K: the streaming overhead loses on small/interactive sizes. So default OFF (behavior unchanged); opt-in for large batch traversals.

From the blogpost perf-opt handoff item B (polars-CPU heavy-join scaling). The full streaming win in isolation is larger (80M 2-hop semijoin 1669->1040ms, 1.6x); the real chain dilutes it via the forward/backward/combine overhead.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
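The ratios quoted in this commit follow directly from the reported timings; a short check, using only the numbers stated above (the helper name `speedup` is illustrative):

```python
def speedup(before, after):
    """Speedup factor as baseline time divided by new time (same units)."""
    return before / after


# Chain at 10M nodes / 80M edges: 20.0 s -> 18.0 s
chain = speedup(20.0, 18.0)

# Isolated 80M 2-hop semijoin: 1669 ms -> 1040 ms
isolated = speedup(1669, 1040)

print(f"chain: {chain:.2f}x, isolated semijoin: {isolated:.2f}x")
```

The gap between the two factors is the dilution the commit attributes to forward/backward/combine overhead.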

…ragma GPU collect

#1655 changed-line-coverage gate (newly enforced once upstream jobs pass) flagged 8 GPU/polars dispatch lines. Fix honestly: a CPU test exercises the lazy collect()/collect_all() CPU path + the POLARS branches of df_concat/df_cons/s_cons/df_to_engine (reachable but not hit by the coverage suites); only the 2 genuinely GPU-target collect lines (lf.collect(engine=gpu) / pl.collect_all(..., engine=gpu)) are pragma:no-cover (need a device CI lacks). Changed-line coverage of #1655 back to ~100%.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…allback (NO-CHEATING)

The lazy GPU collect used pl.GPUEngine(raise_on_fail=False): any plan node the cudf_polars backend cannot execute silently ran on CPU and was still reported as a polars-gpu result -- so engine='polars-gpu' was indistinguishable from engine='polars' whenever the plan was not fully GPU-capable. A bulk bench showing near-identical polars/polars-gpu timings is exactly this tell.

Flip to raise_on_fail=True and translate the cudf_polars failure into a clear NotImplementedError pointing at engine='polars'. polars-gpu is now GPU-or-error: any timing it produces is real on-device work, never CPU mislabeled as GPU.

Verified on dgx-spark (LiveJournal 35M): the seeded hop / 2-hop chain plan runs fully on GPU without raising (nvidia-smi 92% util), so existing GPU timings are unchanged -- only the honesty guarantee is added. +1 regression test covering the translated error path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
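The GPU-or-error translation could look roughly like the sketch below. The wrapper name, its parameters, and the message wording are hypothetical; the exception type the GPU backend raises is not stated in the PR, so it is passed in rather than assumed.

```python
def collect_gpu_or_error(collect_fn, gpu_errors=(Exception,)):
    """Run a GPU collect; translate backend failures into a clear error.

    collect_fn: zero-argument callable performing the GPU collect
        (for example, a collect configured with raise_on_fail=True).
    gpu_errors: exception types treated as "plan not GPU-capable"
        (an assumption; narrow this to the backend's real error type).
    """
    try:
        return collect_fn()
    except gpu_errors as exc:
        # No silent CPU fallback: surface the failure and name the alternative.
        raise NotImplementedError(
            "engine='polars-gpu' could not execute this plan fully on GPU; "
            "use engine='polars' for CPU execution"
        ) from exc
```

Chaining with `from exc` keeps the original backend error available for debugging while callers see a single, stable exception type.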

lmeyerov force-pushed the dev/gfql-polars-engine branch from dcce5fa to be04687 Compare June 27, 2026 15:11

lmeyerov force-pushed the dev/gfql-lazy-gpu branch from d93976c to a7df2f2 Compare June 27, 2026 15:11

lmeyerov force-pushed the dev/gfql-polars-engine branch from be04687 to aefd073 Compare June 27, 2026 16:35

lmeyerov force-pushed the dev/gfql-lazy-gpu branch 6 times, most recently from d8b2074 to a684248 Compare June 27, 2026 17:35

lmeyerov force-pushed the dev/gfql-polars-engine branch 2 times, most recently from 35f65b5 to e9f29bd Compare June 27, 2026 18:12

lmeyerov force-pushed the dev/gfql-lazy-gpu branch 4 times, most recently from ea0cf19 to 5529138 Compare June 28, 2026 07:34

lmeyerov mentioned this pull request Jun 28, 2026

feat(gfql): native lazy Polars engine — collect-once traversals + cypher row pipeline #1648

Closed

lmeyerov force-pushed the dev/gfql-lazy-gpu branch from 5529138 to 194671a Compare June 28, 2026 08:09

lmeyerov mentioned this pull request Jun 28, 2026

feat(gfql): physical adjacency indexes for O(degree) seeded traversal #1658

Open

lmeyerov force-pushed the dev/gfql-polars-engine branch from 32799c6 to 20f58ca Compare June 28, 2026 20:00

lmeyerov force-pushed the dev/gfql-lazy-gpu branch 2 times, most recently from e5321e4 to 5c60a46 Compare June 28, 2026 20:27

lmeyerov force-pushed the dev/gfql-polars-engine branch from 318d2cc to fd2a620 Compare June 28, 2026 20:51

lmeyerov force-pushed the dev/gfql-lazy-gpu branch from 36cf1fd to 7207dc8 Compare June 28, 2026 20:51

lmeyerov force-pushed the dev/gfql-polars-engine branch from fd2a620 to 2dcde42 Compare June 28, 2026 21:02

lmeyerov force-pushed the dev/gfql-lazy-gpu branch from 7207dc8 to 8591416 Compare June 28, 2026 21:02

lmeyerov force-pushed the dev/gfql-polars-engine branch from 2dcde42 to 365ac8f Compare June 28, 2026 21:59

lmeyerov force-pushed the dev/gfql-lazy-gpu branch from 8591416 to 3d47a07 Compare June 28, 2026 21:59

lmeyerov force-pushed the dev/gfql-polars-engine branch from 365ac8f to 4c720af Compare June 29, 2026 03:14

lmeyerov force-pushed the dev/gfql-lazy-gpu branch from 16d5beb to eea6c96 Compare June 29, 2026 03:14

lmeyerov mentioned this pull request Jun 29, 2026

feat(gfql): native lazy Polars engine — collect-once traversals + cypher row pipeline #1660

Open

lmeyerov force-pushed the dev/gfql-lazy-gpu branch from eea6c96 to 38a08e7 Compare June 29, 2026 23:08

lmeyerov and others added 4 commits June 30, 2026 18:13

lmeyerov force-pushed the dev/gfql-polars-engine branch from f7dcde6 to 8577f4f Compare July 1, 2026 01:19

lmeyerov force-pushed the dev/gfql-lazy-gpu branch from 38a08e7 to c5bc2fd Compare July 1, 2026 01:20

lmeyerov wants to merge 4 commits into dev/gfql-polars-engine from dev/gfql-lazy-gpu
