GFQL: cuDF cross-engine result divergences (list-literal order, toString(float), min_hops seed hop-label, group_by Series-truthiness) #1663

Description

@lmeyerov

Summary

Four cuDF-vs-pandas result divergences in GFQL, all found while building the native polars-engine
differential-conformance matrix (graphistry/tests/compute/gfql/test_engine_polars_conformance_matrix.py).
All are cuDF-engine issues: g.gfql(query, engine='cudf') differs from engine='pandas' (the oracle).
They are orthogonal to the polars engine (polars is parity-or-honest-NIE on each) and are currently
scoped out of the 4-engine conformance _assert_invariant (dedicated pandas-vs-polars tests cover the
polars intent), so they don't block the polars work — but each is a real cuDF correctness gap.

Repro pattern: run the same query with engine='pandas' and engine='cudf', then compare the results value by value.
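
A minimal sketch of that value-level comparison, assuming each engine's result has already been converted to a list of row dicts (for example via `df.to_dict("records")` on the pandas side and `df.to_pandas().to_dict("records")` on the cuDF side; that conversion step is an assumption, not part of the conformance matrix). The helper names `canon`, `signature`, and `diverging_rows` are hypothetical. Row order is ignored, None and NaN are treated as the same NA value, and list cells keep their element order so a permuted list literal shows up as a divergence.

```python
import math
from collections import Counter


def canon(value):
    """Normalize one cell: None and NaN both become None; lists keep their order."""
    if value is None:
        return None
    if isinstance(value, float) and math.isnan(value):
        return None
    if isinstance(value, (list, tuple)):
        return tuple(canon(v) for v in value)
    return value


def signature(records):
    """Order-insensitive multiset of rows, each row a tuple of (column, value) pairs."""
    return Counter(
        tuple(sorted((col, canon(val)) for col, val in row.items()))
        for row in records
    )


def diverging_rows(oracle_records, other_records):
    """Return (rows only in the oracle, rows only in the other engine)."""
    a, b = signature(oracle_records), signature(other_records)
    return (
        sorted((a - b).elements(), key=repr),
        sorted((b - a).elements(), key=repr),
    )


# Illustrative data only: row order differs, NaN vs None differs, and one
# list cell is permuted. Only the permuted list is reported as a divergence.
pandas_rows = [
    {"id": 0, "xs": [1, 2, 99], "hop": float("nan")},
    {"id": 1, "xs": [2, 3, 99], "hop": 1.0},
]
cudf_rows = [
    {"id": 1, "xs": [2, 3, 99], "hop": 1.0},
    {"id": 0, "xs": [99, 1, 2], "hop": None},
]
only_pandas, only_cudf = diverging_rows(pandas_rows, cudf_rows)
print(only_pandas)
print(only_cudf)
```

Note that Python treats `1 == 1.0`, so this sketch does not flag an int-vs-float dtype difference; a dtype check would need to be added separately.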


1. cuDF reorders list-literal [a, b, c] elements vs pandas

A Cypher list literal materialized into a column (e.g. a row-pipeline expr building [n.num, n.num+1, 99])
comes back with the list elements permuted under cuDF; pandas preserves construction order.

  • Expected: element order matches pandas (construction order). Severity: wrong-answer for list-valued projections.

2. cuDF formats toString(float) differently than pandas

toString(n.f) over a float column yields a different string representation under cuDF than pandas
(precision / trailing zeros / exponent style).

  • Expected: match pandas str(float) formatting (or document a canonical format). polars declines this as
    honest-NIE (it also can't match pandas float-repr), so cuDF is the only engine that diverges silently here.
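
To illustrate why the string form needs pinning down, the sketch below renders the same floats under three standard Python formatting conventions. It does not claim which convention cuDF or pandas uses internally; `to_string_float` is a hypothetical name for one possible canonical format (Python's shortest round-trip `repr`).

```python
def to_string_float(x: float) -> str:
    """Hypothetical canonical toString(float): shortest string that round-trips."""
    return repr(x)


samples = [1.0, 0.1, 1e16, 1.5e-07]

# The same values under three conventions: trailing zeros, precision, and
# exponent style all differ, which is the kind of divergence described above.
styles = {
    "repr": [to_string_float(x) for x in samples],
    "fixed": [f"{x:f}" for x in samples],
    "general": [f"{x:g}" for x in samples],
}
for name, rendered in styles.items():
    print(name, rendered)

# A canonical format should at least round-trip back to the same float.
assert all(float(to_string_float(x)) == x for x in samples)
```

Whichever format is chosen, a round-trip check like the final assertion is a cheap invariant to add to a cross-engine test.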

3. cuDF multi-hop min_hops>1 labels the SEED node's hop wrong

For e_forward(min_hops=2, max_hops=3) etc., the SEED node appears with __gfql_output_node_hop__ =
max_hops under cuDF but None/NaN under pandas. (Secondary: num comes back int under cuDF vs float
under pandas for the same result.)

  • Repro: [n({"id":[0]}), e_forward(min_hops=2, max_hops=3), n()] on a small attributed graph; compare the
    seed row's __gfql_output_node_hop__. Found via the NA-hardened conformance signature.

4. cuDF group_by row-op raises "truth value of a Series is ambiguous"

call("group_by", {"keys":[k], "aggregations":[("c","count"),("s","sum",col)]}) on a row table that carries
EXTRA non-key/non-aggregated columns (e.g. float f + string name alongside grouped flag/num) raises:
GFQLTypeError: [invalid-node-reference] Error executing 'group_by': The truth value of a Series is ambiguous.

  • Repro: a 5-col node frame (id/num/f/name/flag) + the group_by above; pandas+polars return [flag,c,s],
    cuDF raises. A minimal 3-col graph does NOT trigger it, so the extra columns appear to drive an
    `if <series>:` path (should be .any()/.all()). Likely in the GFQL group_by handler
    graphistry/compute/gfql/row/pipeline.py.
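
The failure mode can be sketched without a GPU. `FakeSeries` below is a hypothetical stand-in that mimics only the relevant behavior of a pandas/cuDF Series (no scalar truth value, plus `.any()` / `.all()`); the two `has_flagged_*` functions are illustrative and are not the actual code in the group_by handler.

```python
class FakeSeries:
    """Hypothetical stand-in for a Series: element-wise values, ambiguous truthiness."""

    def __init__(self, values):
        self.values = list(values)

    def __bool__(self):
        # Real Series objects raise a ValueError with a similar message.
        raise ValueError("The truth value of a Series is ambiguous.")

    def any(self):
        return any(self.values)

    def all(self):
        return all(self.values)


def has_flagged_buggy(mask):
    """Buggy pattern: uses the Series itself as a condition."""
    if mask:  # raises ValueError for a Series-like object
        return True
    return False


def has_flagged_fixed(mask):
    """Fixed pattern: reduce to a scalar explicitly before branching."""
    return bool(mask.any())


mask = FakeSeries([False, True, False])

try:
    has_flagged_buggy(mask)
    raised = False
except ValueError as exc:
    raised = True
    print("buggy path raised:", exc)

print("fixed path:", has_flagged_fixed(mask))
```

Whether `.any()` or `.all()` is the right reduction depends on what the original condition was meant to test, which the handler's code would need to confirm.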

Findings 1–2 were known from earlier sessions; 3–4 were found 2026-06-30. None blocks the polars engine PRs.
🤖 Generated with Claude Code
