Skip to content

perf(gfql): general optimizations base (parse memoization + temporal dtype-gate) [PR0]#1652

Merged
lmeyerov merged 20 commits into
masterfrom
dev/gfql-opt-base
Jun 29, 2026
Merged

perf(gfql): general optimizations base (parse memoization + temporal dtype-gate) [PR0]#1652
lmeyerov merged 20 commits into
masterfrom
dev/gfql-opt-base

Conversation

@lmeyerov

@lmeyerov lmeyerov commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

Summary

PR0 — base of the GFQL polars-engine stack: general, engine-agnostic performance optimizations. These are not coupled to the polars engine (base Cypher / row-pipeline code), so they live at the bottom of the stack; PR #1648 (polars hop/chain) and #1649 (polars cypher row pipeline) rebase on top.

Two opts, both profiler-driven and parity-validated (full gfql suite: 2883 passed, 0 failed on the stack tip):

1. Cypher parse memoization (4a627383)

parse_cypher (lark parse+transform) is the dominant per-call cost of small Cypher queries — ~50% of a query at 100k rows (layer profiling). It's a pure function of the query text returning an immutable frozen AST, so memoize it (LRU). Repeated identical queries skip the ~15 ms parse → ~1.3–1.7× faster end-to-end at small/interactive sizes across pandas/polars/cuDF.

Safe: every Cypher AST node is @dataclass(frozen=True) and compile_cypher_query does not mutate the parsed tree (verified); the non-str/empty validation guard stays outside the cache so error behavior is unchanged and errors are not cached.

2. Temporal-detection dtype gate (b539212c, folds in #1651, fixes spurious half of #1650)

order_detect_temporal_mode ran astype(str) + up to 6 regex fullmatch passes on both operands of every comparison — including int64/bool columns that can never hold temporal text. A 10k-row where_rows(val > 50) (plain table output) fired ~160k spurious re.fullmatch. Now short-circuits for numeric/bool/complex dtypes. Byte-identical output; where_rows 3.1× (pandas), 4.4–13.3× (cuDF, scales with rows).

Test

  • + regression tests in cypher/test_parser.py (cache hit/immutability/correctness) and row/test_ordering.py (gated == ungated, byte-identical).
  • Full graphistry/tests/compute/gfql/ suite green on dgx (RAPIDS container, --gpus all): 2883 passed, 0 failed.

3. Additional generic-engine opts (folded in)

  • .unique()-before-.isin() removal (75b8b7af): dropped the redundant dedup pass feeding only an .isin() membership test (_filter_edges_by_endpoint, undirected combine masks, per-hop wavefront). isin(s) == isin(s.unique()) → byte-identical; one fewer GPU kernel launch.
  • Numeric string-detection dtype gate (e9ac88c4): sibling of Don't columns to rename them, use bindings to refer to them instead #2 for list/stringified-list detection.
  • Generic chain fast path (d806cc54, e7b0612a): node-only MATCH (n) and single-hop MATCH (a {f})-[e]->(b) skip the BFS forward/backward/combine machinery (pandas/cuDF; dask/spark/polars fall through). ~100× on pandas (node filter 204→2 ms @10m); cuDF stays on the resident frame (a couple semi-joins vs the BFS + ~31 drop_duplicates).

Fast-path review hardening (added via multi-wave adversarial review)

The fast path is a pure optimization, so it must be observationally equivalent to the full BFS path. Review found + fixed where it wasn't, with regression tests:

  • policy hooks — gated on not policy so prechain/postchain/postload always fire (a policy can deny); fixed 21 test_policy_* failures.
  • otel span — kept @otel_traced("gfql.chain") on chain() (had drifted onto the fast-path helper).
  • dangling edges — restrict 1-hop edges to those whose endpoints exist in the node table (+ dedup node ids); a node table omitting an endpoint no longer yields phantom edges / a non-empty result where the full path is empty.
  • NaN endpointsdropna() so a NaN id can't validate a NaN edge endpoint via .isin.
  • DRY — consolidated the dtype-kind gate into is_non_textual_scalar_dtype (single source of truth) with a drift guard vs filter_by_dict._is_numeric_dtype_safe.
  • edges-only full path — preserve the node-id binding on an edges-only chain through the full BFS path (so the unmaterialized edges-only case the fast path already handled is also correct on the full path), with a policy-hook guard test.
  • Differential-verified equivalent to the full path across 440 random well-formed graph×query cases + targeted edge cases; int-dtype-preservation on the fast path is the one intentional, locked difference.

Supersedes #1651 (folded here with original authorship). parse_expr memoization is stacked on top in #1657.

🤖 Generated with Claude Code

lmeyerov and others added 2 commits June 26, 2026 14:48
…eat queries)

parse_cypher's lark parse+transform is the dominant per-call cost of small
cypher queries — ~50% of a cypher call at 100k rows (PR7b layer profiling).
It is a pure function of the query text returning an immutable frozen AST, so
memoize it (LRU 512): repeated identical queries skip the parse, ~1.3–1.7x
faster end-to-end at small/interactive sizes across pandas/polars/cudf.

Safe: every cypher AST node is @DataClass(frozen=True) and compile_cypher_query
does not mutate the parsed tree (verified); the non-str/empty validation guard
stays outside the cache so error behavior is unchanged and errors are not cached.

- split validating guard (parse_cypher) from cached body (_parse_cypher_cached)
- full gfql suite green on dgx: 2875 passed, 0 failed; +2 cache regression tests

NOTE: independent of the polars engine (base cypher code) — cherry-pickable to
its own PR off master.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ngification (#1650)

`order_detect_temporal_mode` ran `astype(str)` + up to 6 regex `fullmatch`
passes on BOTH operands of every comparison, including numeric/bool columns
that can never hold temporal *text*. This fired ~160k spurious `re.fullmatch`
on a 10k-row `where_rows(val > 50)` whose output is a plain table — pure waste.

Gate: early-return None when `dtype.kind in {i,u,f,b,c}`. Object/string/datetime
columns are unaffected, so temporal-text detection is preserved.

Byte-identical output. Measured `where_rows` speedups:
  - pandas @10k:  48.9 -> 15.6 ms (3.1x)
  - cuDF   @10k:  37.4 ->  8.5 ms (4.4x)
  - cuDF   @100k: 191.0 -> 14.4 ms (13.3x, 1M edges)

Tests: pandas 91 + cuDF 83 temporal/ordering/row suites pass; adds unit
coverage for the gate (numeric/bool skip, object temporal still detected) and
an end-to-end numeric `where_rows` filter assertion.

Scope: this is the spurious-path half of #1650. Whole-entity `RETURN a` text
rendering (structured/Arrow returns) is tracked separately.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…on scans

Follows #1650/#1651 (which gated temporal-text detection). Profiling numeric
RETURN/ORDER BY queries @100k showed ~40–48% of the call was STILL spurious
`astype(str)` + regex on int/float/bool columns — not in temporal detection
(already gated) but in the LIST-detection + projection scans:
- order_detect_list_series / order_detect_stringified_list_series (ordering.py)
- RowPipelineMixin._gfql_series_is_list_like (pipeline.py)
- _project_property_column temporal scan (result_postprocess.py)

All run `series.astype(str)` to test for list/temporal *text* on every column,
including numeric/bool/complex columns that can never hold it. Gate them with a
shared `_is_non_listable_dtype` (kind in i,u,f,b,c) → early return False/series.
Byte-identical (the scans return False/None for these dtypes anyway); object/
string/datetime columns are untouched, preserving list/temporal detection.

Result (where the projection/order path runs over numeric cols): ORDER BY and
arithmetic-projection drop their ~50–100ms spurious string render to ~0.

+ tests: numeric/bool dtype pass-through; behavioral numeric ORDER BY (would fail
if values were lexically stringified). Base cypher/row code -> PR0.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
lmeyerov added a commit that referenced this pull request Jun 27, 2026
Complements the Cypher whole-query parse memo in PR #1652. Once parse_cypher is
cached, profiling the residual fixed per-call compile cost showed it is
dominated by parse_expr() — the GFQL row-expression parser, invoked ~4x per
query compile (each RETURN/WHERE/WITH expression) and, per call, rebuilding a
Lark transformer that defines a frozen dataclass via runtime exec
(dataclasses._process_class / _create_fn / exec churn).

parse_expr() is a pure function of the expression string (no params/schema) and
returns a tree of frozen dataclasses (17 frozen node types, tuple-valued
fields, immutable). So identical expressions — re-parsed on every compile, and
recurring across queries (e.g. `a.val > 50`) — are memoized via
lru_cache(maxsize=1024). Non-str/empty guard stays outside the cache; only
successful parses are cached. No source consumer outside graphistry/compute/gfql
calls parse_expr, and nothing bypasses frozen-ness (no object.__setattr__).

Measured (dgx-spark, median-of-9): stacked on the query-parse memo, the fixed
per-call cost of a repeated string Cypher query drops a further ~4.5 ms ->
~1.8 ms (RETURN a @100 rows: ~6.3 -> 3.6 ms) — near parity with the equivalent
native chain.

Touches only expr_parser.py (+ its test) — disjoint from PR #1652's parser.py,
so the two apply independently in either order.

Tests: 3 focused expr-cache tests (memoization identity, distinct + hit
registration, invalid non-caching). Full graphistry/tests/compute/gfql/:
2512 passed, 16 skipped, 15 xfailed; only 2 unrelated in-container artifacts
fail (networkx setup.py packaging; a cugraph "without cugraph" shortest-path
test in an image that has cugraph). ruff + mypy clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
lmeyerov and others added 3 commits June 27, 2026 09:30
…chain

isin() is set-membership, so isin(s) == isin(s.unique()) — the explicit
.unique()/dedup pass is redundant. Removed at _filter_edges_by_endpoint,
the undirected combine masks, and the per-hop wavefront filter. Each removed
.unique() is one fewer kernel launch on GPU (where launch latency, not compute,
dominates small/mid traversals). Byte-identical: verified across 345 adversarial
graph×query cases (dup/parallel edges, self-loops, isolated/dead-end nodes,
cycles, undirected, multi-hop, fixed-point, min/max hops, names, filters, seeds)
+ the hop/chain test suites. cuDF 1-hop chain ~126→103 ms @1m (~18%).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A single MATCH (n) (no edge hop) — the dominant tabular/crossfilter shape
(node filter, histogram, table search) — returns the filtered node table
directly + empty edges, skipping the generic forward/backward/combine BFS.
Helps pandas (default engine) + cuDF. Byte-identical (345-case adversarial
golden + hop/chain suites, 240 tests). ~100x faster pandas node filter at 10M
(204->2ms), cuDF stays on the resident frame (~0.8ms @1m). The 1-hop shape is
NOT generic-fast-pathed: the full machinery's node merge upcasts int->float
(pandas artifact), so a join-free generic 1-hop would change dtypes; the polars
engine has its own dtype-consistent 1-hop fast path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Extend the generic node-only fast path to the simple 1-hop MATCH (a {f})-[e]->(b)
(the basic graph query + "filter then expand" crossfilter): return the edges
whose endpoints pass the node filters + those endpoint nodes, skipping the
forward/backward/combine BFS. Gated to pandas/cuDF (dask/spark keep the full
path; polars has its own). Same values + node/edge sets (345-case adversarial
golden = dtype-only diffs, values identical; no test regressions vs pristine).

Behavior change (intentional, more correct): the 1-hop now PRESERVES node-attr
dtypes (int stays int) instead of the full machinery's spurious int->float merge
upcast — making pandas/cuDF consistent with the polars engine (which already
preserved int). cuDF basic graph query now ~2 semi-joins on the resident frame
instead of the BFS + ~31 drop_duplicates, capturing the GPU semijoin win.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
lmeyerov added a commit that referenced this pull request Jun 28, 2026
… OTel span placement

Adversarial review of the lower stack layers (#1648 polars engine, #1652 generic
fast paths) found 2 bugs:

- BLOCKER (#1648): a chain crashed under engine=polars with SchemaError when an edge
  endpoint dtype differed from the node-id dtype across int<->float (e.g. a null in a
  source/dest column -> float64 vs int64 ids) where pandas joins fine. The hop aligns
  join keys; the chain fast paths + combine did not. Added _align_edge_endpoints
  (cast endpoints to node-id dtype for the traversal, restore output dtype to match
  pandas; no-op when dtypes match) wired into the single-hop fast path + multi-hop.
- (#1652): the gfql.chain OTel @otel_traced decorator had landed on the internal
  _try_chain_fast_path probe (inserted between the decorator and def chain) instead of
  the public chain() — chain() lost its span, span recorded wrong fn/attrs. Moved it.

Both verified + regression-tested. 457 polars tests + 334 generic chain tests pass
(4 fails are the pre-existing local libnvrtc CUDA-env issue, not these changes).

Row-order divergences the review also found (fast path returns table order vs the
full machinery BFS-discovery order, for reverse/undirected without ORDER BY — Cypher-
undefined, sets/values identical, no repo test depends on it) are claim-precision, not
bugs; CHANGELOG wording to be tightened in the per-layer #1648/#1652 review pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
lmeyerov added a commit that referenced this pull request Jun 28, 2026
… OTel span placement

Adversarial review of the lower stack layers (#1648 polars engine, #1652 generic
fast paths) found 2 bugs:

- BLOCKER (#1648): a chain crashed under engine=polars with SchemaError when an edge
  endpoint dtype differed from the node-id dtype across int<->float (e.g. a null in a
  source/dest column -> float64 vs int64 ids) where pandas joins fine. The hop aligns
  join keys; the chain fast paths + combine did not. Added _align_edge_endpoints
  (cast endpoints to node-id dtype for the traversal, restore output dtype to match
  pandas; no-op when dtypes match) wired into the single-hop fast path + multi-hop.
- (#1652): the gfql.chain OTel @otel_traced decorator had landed on the internal
  _try_chain_fast_path probe (inserted between the decorator and def chain) instead of
  the public chain() — chain() lost its span, span recorded wrong fn/attrs. Moved it.

Both verified + regression-tested. 457 polars tests + 334 generic chain tests pass
(4 fails are the pre-existing local libnvrtc CUDA-env issue, not these changes).

Row-order divergences the review also found (fast path returns table order vs the
full machinery BFS-discovery order, for reverse/undirected without ORDER BY — Cypher-
undefined, sets/values identical, no repo test depends on it) are claim-precision, not
bugs; CHANGELOG wording to be tightened in the per-layer #1648/#1652 review pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The generic _try_chain_fast_path short-circuited chain() before the policy hook
dispatch, so a query that hits the fast path (node-only MATCH, single-hop) never
fired prechain/postchain (and skipped per-op policy inspection) — observable
behavior change vs the pre-fast-path engine. Gate the fast path on no-policy;
policy-bearing queries take the full path. 64 policy tests pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
lmeyerov added a commit that referenced this pull request Jun 28, 2026
Complements the Cypher whole-query parse memo in PR #1652. Once parse_cypher is
cached, profiling the residual fixed per-call compile cost showed it is
dominated by parse_expr() — the GFQL row-expression parser, invoked ~4x per
query compile (each RETURN/WHERE/WITH expression) and, per call, rebuilding a
Lark transformer that defines a frozen dataclass via runtime exec
(dataclasses._process_class / _create_fn / exec churn).

parse_expr() is a pure function of the expression string (no params/schema) and
returns a tree of frozen dataclasses (17 frozen node types, tuple-valued
fields, immutable). So identical expressions — re-parsed on every compile, and
recurring across queries (e.g. `a.val > 50`) — are memoized via
lru_cache(maxsize=1024). Non-str/empty guard stays outside the cache; only
successful parses are cached. No source consumer outside graphistry/compute/gfql
calls parse_expr, and nothing bypasses frozen-ness (no object.__setattr__).

Measured (dgx-spark, median-of-9): stacked on the query-parse memo, the fixed
per-call cost of a repeated string Cypher query drops a further ~4.5 ms ->
~1.8 ms (RETURN a @100 rows: ~6.3 -> 3.6 ms) — near parity with the equivalent
native chain.

Touches only expr_parser.py (+ its test) — disjoint from PR #1652's parser.py,
so the two apply independently in either order.

Tests: 3 focused expr-cache tests (memoization identity, distinct + hit
registration, invalid non-caching). Full graphistry/tests/compute/gfql/:
2512 passed, 16 skipped, 15 xfailed; only 2 unrelated in-container artifacts
fail (networkx setup.py packaging; a cugraph "without cugraph" shortest-path
test in an image that has cugraph). ruff + mypy clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
lmeyerov added a commit that referenced this pull request Jun 28, 2026
… OTel span placement

Adversarial review of the lower stack layers (#1648 polars engine, #1652 generic
fast paths) found 2 bugs:

- BLOCKER (#1648): a chain crashed under engine=polars with SchemaError when an edge
  endpoint dtype differed from the node-id dtype across int<->float (e.g. a null in a
  source/dest column -> float64 vs int64 ids) where pandas joins fine. The hop aligns
  join keys; the chain fast paths + combine did not. Added _align_edge_endpoints
  (cast endpoints to node-id dtype for the traversal, restore output dtype to match
  pandas; no-op when dtypes match) wired into the single-hop fast path + multi-hop.
- (#1652): the gfql.chain OTel @otel_traced decorator had landed on the internal
  _try_chain_fast_path probe (inserted between the decorator and def chain) instead of
  the public chain() — chain() lost its span, span recorded wrong fn/attrs. Moved it.

Both verified + regression-tested. 457 polars tests + 334 generic chain tests pass
(4 fails are the pre-existing local libnvrtc CUDA-env issue, not these changes).

Row-order divergences the review also found (fast path returns table order vs the
full machinery BFS-discovery order, for reverse/undirected without ORDER BY — Cypher-
undefined, sets/values identical, no repo test depends on it) are claim-precision, not
bugs; CHANGELOG wording to be tightened in the per-layer #1648/#1652 review pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
lmeyerov added a commit that referenced this pull request Jun 28, 2026
Complements the Cypher whole-query parse memo in PR #1652. Once parse_cypher is
cached, profiling the residual fixed per-call compile cost showed it is
dominated by parse_expr() — the GFQL row-expression parser, invoked ~4x per
query compile (each RETURN/WHERE/WITH expression) and, per call, rebuilding a
Lark transformer that defines a frozen dataclass via runtime exec
(dataclasses._process_class / _create_fn / exec churn).

parse_expr() is a pure function of the expression string (no params/schema) and
returns a tree of frozen dataclasses (17 frozen node types, tuple-valued
fields, immutable). So identical expressions — re-parsed on every compile, and
recurring across queries (e.g. `a.val > 50`) — are memoized via
lru_cache(maxsize=1024). Non-str/empty guard stays outside the cache; only
successful parses are cached. No source consumer outside graphistry/compute/gfql
calls parse_expr, and nothing bypasses frozen-ness (no object.__setattr__).

Measured (dgx-spark, median-of-9): stacked on the query-parse memo, the fixed
per-call cost of a repeated string Cypher query drops a further ~4.5 ms ->
~1.8 ms (RETURN a @100 rows: ~6.3 -> 3.6 ms) — near parity with the equivalent
native chain.

Touches only expr_parser.py (+ its test) — disjoint from PR #1652's parser.py,
so the two apply independently in either order.

Tests: 3 focused expr-cache tests (memoization identity, distinct + hit
registration, invalid non-caching). Full graphistry/tests/compute/gfql/:
2512 passed, 16 skipped, 15 xfailed; only 2 unrelated in-container artifacts
fail (networkx setup.py packaging; a cugraph "without cugraph" shortest-path
test in an image that has cugraph). ruff + mypy clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
lmeyerov added a commit that referenced this pull request Jun 28, 2026
… OTel span placement

Adversarial review of the lower stack layers (#1648 polars engine, #1652 generic
fast paths) found 2 bugs:

- BLOCKER (#1648): a chain crashed under engine=polars with SchemaError when an edge
  endpoint dtype differed from the node-id dtype across int<->float (e.g. a null in a
  source/dest column -> float64 vs int64 ids) where pandas joins fine. The hop aligns
  join keys; the chain fast paths + combine did not. Added _align_edge_endpoints
  (cast endpoints to node-id dtype for the traversal, restore output dtype to match
  pandas; no-op when dtypes match) wired into the single-hop fast path + multi-hop.
- (#1652): the gfql.chain OTel @otel_traced decorator had landed on the internal
  _try_chain_fast_path probe (inserted between the decorator and def chain) instead of
  the public chain() — chain() lost its span, span recorded wrong fn/attrs. Moved it.

Both verified + regression-tested. 457 polars tests + 334 generic chain tests pass
(4 fails are the pre-existing local libnvrtc CUDA-env issue, not these changes).

Row-order divergences the review also found (fast path returns table order vs the
full machinery BFS-discovery order, for reverse/undirected without ORDER BY — Cypher-
undefined, sets/values identical, no repo test depends on it) are claim-precision, not
bugs; CHANGELOG wording to be tightened in the per-layer #1648/#1652 review pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
lmeyerov added a commit that referenced this pull request Jun 28, 2026
Complements the Cypher whole-query parse memo in PR #1652. Once parse_cypher is
cached, profiling the residual fixed per-call compile cost showed it is
dominated by parse_expr() — the GFQL row-expression parser, invoked ~4x per
query compile (each RETURN/WHERE/WITH expression) and, per call, rebuilding a
Lark transformer that defines a frozen dataclass via runtime exec
(dataclasses._process_class / _create_fn / exec churn).

parse_expr() is a pure function of the expression string (no params/schema) and
returns a tree of frozen dataclasses (17 frozen node types, tuple-valued
fields, immutable). So identical expressions — re-parsed on every compile, and
recurring across queries (e.g. `a.val > 50`) — are memoized via
lru_cache(maxsize=1024). Non-str/empty guard stays outside the cache; only
successful parses are cached. No source consumer outside graphistry/compute/gfql
calls parse_expr, and nothing bypasses frozen-ness (no object.__setattr__).

Measured (dgx-spark, median-of-9): stacked on the query-parse memo, the fixed
per-call cost of a repeated string Cypher query drops a further ~4.5 ms ->
~1.8 ms (RETURN a @100 rows: ~6.3 -> 3.6 ms) — near parity with the equivalent
native chain.

Touches only expr_parser.py (+ its test) — disjoint from PR #1652's parser.py,
so the two apply independently in either order.

Tests: 3 focused expr-cache tests (memoization identity, distinct + hit
registration, invalid non-caching). Full graphistry/tests/compute/gfql/:
2512 passed, 16 skipped, 15 xfailed; only 2 unrelated in-container artifacts
fail (networkx setup.py packaging; a cugraph "without cugraph" shortest-path
test in an image that has cugraph). ruff + mypy clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
lmeyerov added a commit that referenced this pull request Jun 28, 2026
… OTel span placement

Adversarial review of the lower stack layers (#1648 polars engine, #1652 generic
fast paths) found 2 bugs:

- BLOCKER (#1648): a chain crashed under engine=polars with SchemaError when an edge
  endpoint dtype differed from the node-id dtype across int<->float (e.g. a null in a
  source/dest column -> float64 vs int64 ids) where pandas joins fine. The hop aligns
  join keys; the chain fast paths + combine did not. Added _align_edge_endpoints
  (cast endpoints to node-id dtype for the traversal, restore output dtype to match
  pandas; no-op when dtypes match) wired into the single-hop fast path + multi-hop.
- (#1652): the gfql.chain OTel @otel_traced decorator had landed on the internal
  _try_chain_fast_path probe (inserted between the decorator and def chain) instead of
  the public chain() — chain() lost its span, span recorded wrong fn/attrs. Moved it.

Both verified + regression-tested. 457 polars tests + 334 generic chain tests pass
(4 fails are the pre-existing local libnvrtc CUDA-env issue, not these changes).

Row-order divergences the review also found (fast path returns table order vs the
full machinery BFS-discovery order, for reverse/undirected without ORDER BY — Cypher-
undefined, sets/values identical, no repo test depends on it) are claim-precision, not
bugs; CHANGELOG wording to be tightened in the per-layer #1648/#1652 review pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
lmeyerov added a commit that referenced this pull request Jun 28, 2026
… OTel span placement

Adversarial review of the lower stack layers (#1648 polars engine, #1652 generic
fast paths) found 2 bugs:

- BLOCKER (#1648): a chain crashed under engine=polars with SchemaError when an edge
  endpoint dtype differed from the node-id dtype across int<->float (e.g. a null in a
  source/dest column -> float64 vs int64 ids) where pandas joins fine. The hop aligns
  join keys; the chain fast paths + combine did not. Added _align_edge_endpoints
  (cast endpoints to node-id dtype for the traversal, restore output dtype to match
  pandas; no-op when dtypes match) wired into the single-hop fast path + multi-hop.
- (#1652): the gfql.chain OTel @otel_traced decorator had landed on the internal
  _try_chain_fast_path probe (inserted between the decorator and def chain) instead of
  the public chain() — chain() lost its span, span recorded wrong fn/attrs. Moved it.

Both verified + regression-tested. 457 polars tests + 334 generic chain tests pass
(4 fails are the pre-existing local libnvrtc CUDA-env issue, not these changes).

Row-order divergences the review also found (fast path returns table order vs the
full machinery BFS-discovery order, for reverse/undirected without ORDER BY — Cypher-
undefined, sets/values identical, no repo test depends on it) are claim-precision, not
bugs; CHANGELOG wording to be tightened in the per-layer #1648/#1652 review pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The generic single-hop fast path accepted edges with `prune_to_endpoints=True` (a
public e()/e_forward()/e_reverse() kwarg) but returned BOTH endpoints, while the flag
keeps only the arrival side (dest for forward, src for reverse). So
`[n(), e_forward(prune_to_endpoints=True), n()]` silently returned the wrong node/edge
set (verified: path 0→1→2→3 gave [0,1,2,3] vs correct [1,2,3]). Add the flag to the
fast-path edge gate so it declines and falls through to the full path, which honors it.
`is_simple_single_hop()` is intentionally left unchanged (its other callers depend on
the current semantics) — the gate is the correct, surgical layer.

Also document the deep-immutability invariant on the Cypher AST nodes: parse_cypher
results are shared by reference via lru_cache, so a future mutable field would poison
cache hits.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
lmeyerov added a commit that referenced this pull request Jun 28, 2026
Complements the Cypher whole-query parse memo in PR #1652. Once parse_cypher is
cached, profiling the residual fixed per-call compile cost showed it is
dominated by parse_expr() — the GFQL row-expression parser, invoked ~4x per
query compile (each RETURN/WHERE/WITH expression) and, per call, rebuilding a
Lark transformer that defines a frozen dataclass via runtime exec
(dataclasses._process_class / _create_fn / exec churn).

parse_expr() is a pure function of the expression string (no params/schema) and
returns a tree of frozen dataclasses (17 frozen node types, tuple-valued
fields, immutable). So identical expressions — re-parsed on every compile, and
recurring across queries (e.g. `a.val > 50`) — are memoized via
lru_cache(maxsize=1024). Non-str/empty guard stays outside the cache; only
successful parses are cached. No source consumer outside graphistry/compute/gfql
calls parse_expr, and nothing bypasses frozen-ness (no object.__setattr__).

Measured (dgx-spark, median-of-9): stacked on the query-parse memo, the fixed
per-call cost of a repeated string Cypher query drops a further ~4.5 ms ->
~1.8 ms (RETURN a @100 rows: ~6.3 -> 3.6 ms) — near parity with the equivalent
native chain.

Touches only expr_parser.py (+ its test) — disjoint from PR #1652's parser.py,
so the two apply independently in either order.

Tests: 3 focused expr-cache tests (memoization identity, distinct + hit
registration, invalid non-caching). Full graphistry/tests/compute/gfql/:
2512 passed, 16 skipped, 15 xfailed; only 2 unrelated in-container artifacts
fail (networkx setup.py packaging; a cugraph "without cugraph" shortest-path
test in an image that has cugraph). ruff + mypy clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Code-quality pass on the opt-base fast path:
- Type the contract boundaries: _filter_edges_by_endpoint(edges_df: DataFrameT,
  nodes_df: Optional[DataFrameT], ...) -> DataFrameT; _try_chain_fast_path(g_in:
  Plottable, ops: List[ASTObject], engine_concrete: Engine, start_nodes:
  Optional[DataFrameT]).
- Replace getattr(n0, "query", None) duck-typing with n0.query (ASTNode declares it;
  the isinstance(n0, ASTNode) guard narrows the type).
- Convert Engine -> EngineAbstract once (engine_abs) for the filter_by_dict calls
  (its param is EngineAbstract|str), matching the existing materialize conversion.
- Trim verbose comments to terse one/two-liners (fast-path docstring, .unique removal,
  prune gate, policy gate, temporal dtype gate, AST immutability invariant).

No behavior change; mypy + ruff clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
lmeyerov added a commit that referenced this pull request Jun 29, 2026
Complements the Cypher whole-query parse memo in PR #1652. Once parse_cypher is
cached, profiling the residual fixed per-call compile cost showed it is
dominated by parse_expr() — the GFQL row-expression parser, invoked ~4x per
query compile (each RETURN/WHERE/WITH expression) and, per call, rebuilding a
Lark transformer that defines a frozen dataclass via runtime exec
(dataclasses._process_class / _create_fn / exec churn).

parse_expr() is a pure function of the expression string (no params/schema) and
returns a tree of frozen dataclasses (17 frozen node types, tuple-valued
fields, immutable). So identical expressions — re-parsed on every compile, and
recurring across queries (e.g. `a.val > 50`) — are memoized via
lru_cache(maxsize=1024). Non-str/empty guard stays outside the cache; only
successful parses are cached. No source consumer outside graphistry/compute/gfql
calls parse_expr, and nothing bypasses frozen-ness (no object.__setattr__).

Measured (dgx-spark, median-of-9): stacked on the query-parse memo, the fixed
per-call cost of a repeated string Cypher query drops a further ~4.5 ms ->
~1.8 ms (RETURN a @100 rows: ~6.3 -> 3.6 ms) — near parity with the equivalent
native chain.

Touches only expr_parser.py (+ its test) — disjoint from PR #1652's parser.py,
so the two apply independently in either order.

Tests: 3 focused expr-cache tests (memoization identity, distinct + hit
registration, invalid non-caching). Full graphistry/tests/compute/gfql/:
2512 passed, 16 skipped, 15 xfailed; only 2 unrelated in-container artifacts
fail (networkx setup.py packaging; a cugraph "without cugraph" shortest-path
test in an image that has cugraph). ruff + mypy clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
lmeyerov and others added 3 commits June 28, 2026 17:55
Adds a regression test for the fix in 2755744 (fast path must not skip
policy hooks): asserts that for the exact fast-path-eligible shapes
(`[n()]` and `[n(),e(),n()]`) the prechain/postchain/postload hooks all
fire when a policy is installed, and that the fast path is still taken when
no policy is present. Without a guard this regression was silent (the shapes
that hit the fast path are precisely the ones the policy suite exercises).

Also documents the policy-gate in the fast-path CHANGELOG entry.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The chain fast path (_try_chain_fast_path) shipped with only incidental
coverage (existing hop/chain/golden suites); its core behaviors were never
pinned directly. Adds focused, engine-parametrized locks:

- test_fast_path_differential_parity_vs_full_path: fast path output ==
  full (policy-forced BFS) path output by node/edge SET, across every
  accelerated shape (node-only, node-filter, predicate, fwd/rev/undirected
  1-hop, src/dst/both-end filters) AND every bypass shape (hops=2,
  filtered-undirected, edge_match, named node). Uses an installed no-op
  policy as the equivalence oracle (any policy forces the full path).
- test_fast_path_preserves_int_node_dtypes: hard-locks the feature promise
  (1-hop fast path keeps int node attrs as int); tolerant on the full path.
- test_fast_path_gating_returns_none_for_ineligible: unit-level accept/decline
  contract for the gate (seeds, non-eager engines, edge_match, named/queried
  nodes, hops>1, 2-op chains all decline).
- _gfql_series_is_list_like dtype-gate: direct unit coverage (numeric/bool
  short-circuit; real list/tuple objects still detected; stringified lists
  intentionally deferred to order_detect_stringified_list_series).

pandas lanes run everywhere; cuDF lanes gated on TEST_CUDF=1 (dgx-spark).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The new _try_chain_fast_path helper was inserted between the
@otel_traced("gfql.chain") decorator and def chain(), so the decorator
silently migrated onto the fast-path helper: chain() stopped emitting its
span entirely, and _chain_otel_attrs (which expects chain's signature) was
invoked with the fast path's positional args — binding `gfql.validate_schema`
to the start_nodes DataFrame. Only surfaces with OTEL enabled (attrs are
computed lazily), so CI stayed green. Move the decorator back onto chain().

Adds test_chain_otel_span_attrs_mapped_correctly: enables otel via monkeypatch,
captures the span, and asserts chain_len + that validate_schema is a bool
(fails under the misplaced decorator, passes with it on chain()).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
lmeyerov and others added 6 commits June 28, 2026 17:55
The #1650/#1651 dtype gate (numeric/bool/complex columns skip the spurious
astype(str)+regex list/temporal scan) was duplicated across 4 sites with 3
different sentinel conventions ("O", None, factored). Consolidate the kind
set into is_non_textual_scalar_dtype() in series_str_compat.py (the engine-poly
Series-string layer these gates short-circuit) and call it from ordering.py
(x2), pipeline.py, and cypher/result_postprocess.py. Byte-identical; adds a
direct parametrized unit test for the helper across dtype kinds + None.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The 1-hop fast path took `edges = g._edges` wholesale and only intersected
filtered endpoints with the node table for the NODE output — it never
restricted the EDGES to those whose endpoints exist in the node table. The
full BFS path enforces that implicitly via its edge<->node joins, so the two
diverged whenever a supplied node table omits an edge endpoint (filtered/subset
node frames, or nodes+edges loaded separately):

  nodes v=[0,1], edges (s,d)=[(0,1),(1,99)]   # 99 absent
  [n(), e_forward(hops=1), n()]      fast kept (1,99); full dropped it
  [n({'attr':2}), e_forward, n()]    fast -> nodes[1] edges[(1,99)]; full -> EMPTY

Restrict edges to both-endpoints-present before the endpoint filters, and dedup
node ids on the result (the full path collapses dup rows via its merge). Now
fast == full across dangling-endpoint and duplicate-id graphs; no change on
well-formed graphs. Found by wave-2 differential review; tests added for both.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The wave-2 both-endpoints-present restriction used `.isin(node_ids)`, but
pandas/cuDF `.isin` treat NaN as a matchable value (`NaN.isin([NaN])` is True).
So a NaN node id "validated" a NaN edge endpoint and the fast path kept a
dangling NaN edge that the full BFS path's joins (which never match NaN<->NaN)
correctly drop. dropna() on node_ids before the isin restores convergence for
forward/reverse on NaN-id graphs; well-formed and non-null-dangling graphs are
unchanged. (NaN ids are malformed input; the full path's undirected branch is
itself inconsistent there, so we match its directed/consistent behavior.)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…s, dup ids, NaN)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…d; expr AST immutability note

Hardening tests for the fast-path + dtype-gate optimizations:
- prune_to_endpoints: add to the gating-decline list and as bypass shapes (fwd+rev) in
  the fast-vs-full differential — regression guard for the fast-path prune gate.
- NaN endpoints: new test_fast_path_drops_nan_endpoint_edges pins that a NaN node id does
  not validate a NaN edge endpoint (the .dropna() guard), matching the full BFS path.
- Tighten the differential test: bypass shapes now also assert they DECLINE the fast path
  (previously full-vs-full, vacuous for bypass cases).
- Drift guard: assert filter_by_dict._is_numeric_dtype_safe agrees with the
  series_str_compat._NON_TEXTUAL_SCALAR_KINDS SSOT across all numpy dtype kinds (the two
  keep separate copies to avoid an import cycle; they must not desync).
- cuDF lane for the numeric where_rows dtype-gate path (paired-coverage convention).

Also document the deep-immutability invariant on ExprNode (parse_expr lru_cache shares
results by reference).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…te (#1657)

Code-quality pass on the parse-cache layer:
- Condense the verbose dropna/dangling-edge and dup-node fast-path comments to terse
  one/two-liners; same for the SSOT dtype-kind note and the ExprNode immutability
  invariant.
- Note in the SSOT comment that filter_by_dict's mirror is kept in sync by a test
  (not imported) to avoid an import cycle.

No behavior change; mypy + ruff clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
lmeyerov added a commit that referenced this pull request Jun 29, 2026
Complements the Cypher whole-query parse memo in PR #1652. Once parse_cypher is
cached, profiling the residual fixed per-call compile cost showed it is
dominated by parse_expr() — the GFQL row-expression parser, invoked ~4x per
query compile (each RETURN/WHERE/WITH expression) and, per call, rebuilding a
Lark transformer that defines a frozen dataclass via runtime exec
(dataclasses._process_class / _create_fn / exec churn).

parse_expr() is a pure function of the expression string (no params/schema) and
returns a tree of frozen dataclasses (17 frozen node types, tuple-valued
fields, immutable). So identical expressions — re-parsed on every compile, and
recurring across queries (e.g. `a.val > 50`) — are memoized via
lru_cache(maxsize=1024). Non-str/empty guard stays outside the cache; only
successful parses are cached. No source consumer outside graphistry/compute/gfql
calls parse_expr, and nothing bypasses frozen-ness (no object.__setattr__).

Measured (dgx-spark, median-of-9): stacked on the query-parse memo, the fixed
per-call cost of a repeated string Cypher query drops a further ~4.5 ms ->
~1.8 ms (RETURN a @100 rows: ~6.3 -> 3.6 ms) — near parity with the equivalent
native chain.

Touches only expr_parser.py (+ its test) — disjoint from PR #1652's parser.py,
so the two apply independently in either order.

Tests: 3 focused expr-cache tests (memoization identity, distinct + hit
registration, invalid non-caching). Full graphistry/tests/compute/gfql/:
2512 passed, 16 skipped, 15 xfailed; only 2 unrelated in-container artifacts
fail (networkx setup.py packaging; a cugraph "without cugraph" shortest-path
test in an image that has cugraph). ruff + mypy clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…policy-hook guard test)

Root-cause fix for a real bug surfaced while hardening the fast-path/policy work:

A chain/Cypher query over an edges-only graph (no node-id binding, `g._node is None`)
takes the full traversal path — the node-only/single-hop fast path is skipped whenever
an edge-id index must be synthesized (`added_edge_index`), or when a policy is attached.
That path rebuilt `g_out` from `self` (the *unbound* input) to restore the original edge
binding, but `self._node is None`, so the materialized node-id binding was dropped. The
endpoint-reconciliation concat then synthesized a spurious `None`-named node column —
a corrupt result on older pandas, and a hard `NotImplementedError` (void-block NA fill)
on newer pandas (Python 3.14). Fix: carry the materialized `g._node` binding explicitly
(`self.nodes(final_nodes_df, g._node)`) while still restoring the original edge binding.
Full-path output now matches the fast path: correct single node column, valid `_node`,
edge binding preserved. Pre-existing on master (not introduced by this PR's fast paths).

Also adds a parametrized regression test pinning that BOTH chain fast-path shapes
(node-only `[n()]` and single-hop `[n(),e(),n()]`) still fire `postload` under a policy
AND return a valid, non-corrupt result on an edges-only graph — guarding both the
policy-hook gate (2755744) and this node-binding fix. CHANGELOG: documents both.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
lmeyerov added a commit that referenced this pull request Jun 29, 2026
Complements the Cypher whole-query parse memo in PR #1652. Once parse_cypher is
cached, profiling the residual fixed per-call compile cost showed it is
dominated by parse_expr() — the GFQL row-expression parser, invoked ~4x per
query compile (each RETURN/WHERE/WITH expression) and, per call, rebuilding a
Lark transformer that defines a frozen dataclass via runtime exec
(dataclasses._process_class / _create_fn / exec churn).

parse_expr() is a pure function of the expression string (no params/schema) and
returns a tree of frozen dataclasses (17 frozen node types, tuple-valued
fields, immutable). So identical expressions — re-parsed on every compile, and
recurring across queries (e.g. `a.val > 50`) — are memoized via
lru_cache(maxsize=1024). Non-str/empty guard stays outside the cache; only
successful parses are cached. No source consumer outside graphistry/compute/gfql
calls parse_expr, and nothing bypasses frozen-ness (no object.__setattr__).

Measured (dgx-spark, median-of-9): stacked on the query-parse memo, the fixed
per-call cost of a repeated string Cypher query drops a further ~4.5 ms ->
~1.8 ms (RETURN a @100 rows: ~6.3 -> 3.6 ms) — near parity with the equivalent
native chain.

Touches only expr_parser.py (+ its test) — disjoint from PR #1652's parser.py,
so the two apply independently in either order.

Tests: 3 focused expr-cache tests (memoization identity, distinct + hit
registration, invalid non-caching). Full graphistry/tests/compute/gfql/:
2512 passed, 16 skipped, 15 xfailed; only 2 unrelated in-container artifacts
fail (networkx setup.py packaging; a cugraph "without cugraph" shortest-path
test in an image that has cugraph). ruff + mypy clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@lmeyerov lmeyerov merged commit 984e4df into master Jun 29, 2026
77 checks passed
@lmeyerov lmeyerov deleted the dev/gfql-opt-base branch June 29, 2026 02:08
@lmeyerov lmeyerov restored the dev/gfql-opt-base branch June 29, 2026 02:10
@lmeyerov lmeyerov deleted the dev/gfql-opt-base branch June 29, 2026 02:11
@lmeyerov lmeyerov mentioned this pull request Jun 29, 2026
lmeyerov added a commit that referenced this pull request Jun 29, 2026
… OTel span placement

Adversarial review of the lower stack layers (#1648 polars engine, #1652 generic
fast paths) found 2 bugs:

- BLOCKER (#1648): a chain crashed under engine=polars with SchemaError when an edge
  endpoint dtype differed from the node-id dtype across int<->float (e.g. a null in a
  source/dest column -> float64 vs int64 ids) where pandas joins fine. The hop aligns
  join keys; the chain fast paths + combine did not. Added _align_edge_endpoints
  (cast endpoints to node-id dtype for the traversal, restore output dtype to match
  pandas; no-op when dtypes match) wired into the single-hop fast path + multi-hop.
- (#1652): the gfql.chain OTel @otel_traced decorator had landed on the internal
  _try_chain_fast_path probe (inserted between the decorator and def chain) instead of
  the public chain() — chain() lost its span, span recorded wrong fn/attrs. Moved it.

Both verified + regression-tested. 457 polars tests + 334 generic chain tests pass
(4 fails are the pre-existing local libnvrtc CUDA-env issue, not these changes).

Row-order divergences the review also found (fast path returns table order vs the
full machinery BFS-discovery order, for reverse/undirected without ORDER BY — Cypher-
undefined, sets/values identical, no repo test depends on it) are claim-precision, not
bugs; CHANGELOG wording to be tightened in the per-layer #1648/#1652 review pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
lmeyerov added a commit that referenced this pull request Jul 1, 2026
… OTel span placement

Adversarial review of the lower stack layers (#1648 polars engine, #1652 generic
fast paths) found 2 bugs:

- BLOCKER (#1648): a chain crashed under engine=polars with SchemaError when an edge
  endpoint dtype differed from the node-id dtype across int<->float (e.g. a null in a
  source/dest column -> float64 vs int64 ids) where pandas joins fine. The hop aligns
  join keys; the chain fast paths + combine did not. Added _align_edge_endpoints
  (cast endpoints to node-id dtype for the traversal, restore output dtype to match
  pandas; no-op when dtypes match) wired into the single-hop fast path + multi-hop.
- (#1652): the gfql.chain OTel @otel_traced decorator had landed on the internal
  _try_chain_fast_path probe (inserted between the decorator and def chain) instead of
  the public chain() — chain() lost its span, span recorded wrong fn/attrs. Moved it.

Both verified + regression-tested. 457 polars tests + 334 generic chain tests pass
(4 fails are the pre-existing local libnvrtc CUDA-env issue, not these changes).

Row-order divergences the review also found (fast path returns table order vs the
full machinery BFS-discovery order, for reverse/undirected without ORDER BY — Cypher-
undefined, sets/values identical, no repo test depends on it) are claim-precision, not
bugs; CHANGELOG wording to be tightened in the per-layer #1648/#1652 review pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant