perf(gfql): general optimizations base (parse memoization + temporal dtype-gate) [PR0] by lmeyerov · Pull Request #1652 · graphistry/pygraphistry

lmeyerov · 2026-06-26T22:12:20Z

Summary

PR0 — base of the GFQL polars-engine stack: general, engine-agnostic performance optimizations. These are not coupled to the polars engine (base Cypher / row-pipeline code), so they live at the bottom of the stack; PR #1648 (polars hop/chain) and #1649 (polars cypher row pipeline) rebase on top.

Two opts, both profiler-driven and parity-validated (full gfql suite: 2883 passed, 0 failed on the stack tip):

1. Cypher parse memoization (`4a627383`)

parse_cypher (lark parse+transform) is the dominant per-call cost of small Cypher queries — ~50% of a query at 100k rows (layer profiling). It's a pure function of the query text returning an immutable frozen AST, so memoize it (LRU). Repeated identical queries skip the ~15 ms parse → ~1.3–1.7× faster end-to-end at small/interactive sizes across pandas/polars/cuDF.

Safe: every Cypher AST node is @dataclass(frozen=True) and compile_cypher_query does not mutate the parsed tree (verified); the non-str/empty validation guard stays outside the cache so error behavior is unchanged and errors are not cached.

2. Temporal-detection dtype gate (`b539212c`, folds in #1651, fixes spurious half of #1650)

order_detect_temporal_mode ran astype(str) + up to 6 regex fullmatch passes on both operands of every comparison — including int64/bool columns that can never hold temporal text. A 10k-row where_rows(val > 50) (plain table output) fired ~160k spurious re.fullmatch. Now short-circuits for numeric/bool/complex dtypes. Byte-identical output; where_rows 3.1× (pandas), 4.4–13.3× (cuDF, scales with rows).

Test

+ regression tests in cypher/test_parser.py (cache hit/immutability/correctness) and row/test_ordering.py (gated == ungated, byte-identical).
Full graphistry/tests/compute/gfql/ suite green on dgx (RAPIDS container, --gpus all): 2883 passed, 0 failed.

3. Additional generic-engine opts (folded in)

.unique()-before-.isin() removal (75b8b7af): dropped the redundant dedup pass feeding only an .isin() membership test (_filter_edges_by_endpoint, undirected combine masks, per-hop wavefront). isin(s) == isin(s.unique()) → byte-identical; one fewer GPU kernel launch.
Numeric string-detection dtype gate (e9ac88c4): sibling of Don't columns to rename them, use bindings to refer to them instead #2 for list/stringified-list detection.
Generic chain fast path (d806cc54, e7b0612a): node-only MATCH (n) and single-hop MATCH (a {f})-[e]->(b) skip the BFS forward/backward/combine machinery (pandas/cuDF; dask/spark/polars fall through). ~100× on pandas (node filter 204→2 ms @10m); cuDF stays on the resident frame (a couple semi-joins vs the BFS + ~31 drop_duplicates).

Fast-path review hardening (added via multi-wave adversarial review)

The fast path is a pure optimization, so it must be observationally equivalent to the full BFS path. Review found + fixed where it wasn't, with regression tests:

policy hooks — gated on not policy so prechain/postchain/postload always fire (a policy can deny); fixed 21 test_policy_* failures.
otel span — kept @otel_traced("gfql.chain") on chain() (had drifted onto the fast-path helper).
dangling edges — restrict 1-hop edges to those whose endpoints exist in the node table (+ dedup node ids); a node table omitting an endpoint no longer yields phantom edges / a non-empty result where the full path is empty.
NaN endpoints — dropna() so a NaN id can't validate a NaN edge endpoint via .isin.
DRY — consolidated the dtype-kind gate into is_non_textual_scalar_dtype (single source of truth) with a drift guard vs filter_by_dict._is_numeric_dtype_safe.
edges-only full path — preserve the node-id binding on an edges-only chain through the full BFS path (so the unmaterialized edges-only case the fast path already handled is also correct on the full path), with a policy-hook guard test.
Differential-verified equivalent to the full path across 440 random well-formed graph×query cases + targeted edge cases; int-dtype-preservation on the fast path is the one intentional, locked difference.

Supersedes #1651 (folded here with original authorship). parse_expr memoization is stacked on top in #1657.

🤖 Generated with Claude Code

…eat queries) parse_cypher's lark parse+transform is the dominant per-call cost of small cypher queries — ~50% of a cypher call at 100k rows (PR7b layer profiling). It is a pure function of the query text returning an immutable frozen AST, so memoize it (LRU 512): repeated identical queries skip the parse, ~1.3–1.7x faster end-to-end at small/interactive sizes across pandas/polars/cudf. Safe: every cypher AST node is @DataClass(frozen=True) and compile_cypher_query does not mutate the parsed tree (verified); the non-str/empty validation guard stays outside the cache so error behavior is unchanged and errors are not cached. - split validating guard (parse_cypher) from cached body (_parse_cypher_cached) - full gfql suite green on dgx: 2875 passed, 0 failed; +2 cache regression tests NOTE: independent of the polars engine (base cypher code) — cherry-pickable to its own PR off master. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@10k

…ngification (#1650) `order_detect_temporal_mode` ran `astype(str)` + up to 6 regex `fullmatch` passes on BOTH operands of every comparison, including numeric/bool columns that can never hold temporal *text*. This fired ~160k spurious `re.fullmatch` on a 10k-row `where_rows(val > 50)` whose output is a plain table — pure waste. Gate: early-return None when `dtype.kind in {i,u,f,b,c}`. Object/string/datetime columns are unaffected, so temporal-text detection is preserved. Byte-identical output. Measured `where_rows` speedups: - pandas @10k: 48.9 -> 15.6 ms (3.1x) - cuDF @10k: 37.4 -> 8.5 ms (4.4x) - cuDF @100k: 191.0 -> 14.4 ms (13.3x, 1M edges) Tests: pandas 91 + cuDF 83 temporal/ordering/row suites pass; adds unit coverage for the gate (numeric/bool skip, object temporal still detected) and an end-to-end numeric `where_rows` filter assertion. Scope: this is the spurious-path half of #1650. Whole-entity `RETURN a` text rendering (structured/Arrow returns) is tracked separately. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…on scans Follows #1650/#1651 (which gated temporal-text detection). Profiling numeric RETURN/ORDER BY queries @100k showed ~40–48% of the call was STILL spurious `astype(str)` + regex on int/float/bool columns — not in temporal detection (already gated) but in the LIST-detection + projection scans: - order_detect_list_series / order_detect_stringified_list_series (ordering.py) - RowPipelineMixin._gfql_series_is_list_like (pipeline.py) - _project_property_column temporal scan (result_postprocess.py) All run `series.astype(str)` to test for list/temporal *text* on every column, including numeric/bool/complex columns that can never hold it. Gate them with a shared `_is_non_listable_dtype` (kind in i,u,f,b,c) → early return False/series. Byte-identical (the scans return False/None for these dtypes anyway); object/ string/datetime columns are untouched, preserving list/temporal detection. Result (where the projection/order path runs over numeric cols): ORDER BY and arithmetic-projection drop their ~50–100ms spurious string render to ~0. + tests: numeric/bool dtype pass-through; behavioral numeric ORDER BY (would fail if values were lexically stringified). Base cypher/row code -> PR0. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@100

Complements the Cypher whole-query parse memo in PR #1652. Once parse_cypher is cached, profiling the residual fixed per-call compile cost showed it is dominated by parse_expr() — the GFQL row-expression parser, invoked ~4x per query compile (each RETURN/WHERE/WITH expression) and, per call, rebuilding a Lark transformer that defines a frozen dataclass via runtime exec (dataclasses._process_class / _create_fn / exec churn). parse_expr() is a pure function of the expression string (no params/schema) and returns a tree of frozen dataclasses (17 frozen node types, tuple-valued fields, immutable). So identical expressions — re-parsed on every compile, and recurring across queries (e.g. `a.val > 50`) — are memoized via lru_cache(maxsize=1024). Non-str/empty guard stays outside the cache; only successful parses are cached. No source consumer outside graphistry/compute/gfql calls parse_expr, and nothing bypasses frozen-ness (no object.__setattr__). Measured (dgx-spark, median-of-9): stacked on the query-parse memo, the fixed per-call cost of a repeated string Cypher query drops a further ~4.5 ms -> ~1.8 ms (RETURN a @100 rows: ~6.3 -> 3.6 ms) — near parity with the equivalent native chain. Touches only expr_parser.py (+ its test) — disjoint from PR #1652's parser.py, so the two apply independently in either order. Tests: 3 focused expr-cache tests (memoization identity, distinct + hit registration, invalid non-caching). Full graphistry/tests/compute/gfql/: 2512 passed, 16 skipped, 15 xfailed; only 2 unrelated in-container artifacts fail (networkx setup.py packaging; a cugraph "without cugraph" shortest-path test in an image that has cugraph). ruff + mypy clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@1m

…chain isin() is set-membership, so isin(s) == isin(s.unique()) — the explicit .unique()/dedup pass is redundant. Removed at _filter_edges_by_endpoint, the undirected combine masks, and the per-hop wavefront filter. Each removed .unique() is one fewer kernel launch on GPU (where launch latency, not compute, dominates small/mid traversals). Byte-identical: verified across 345 adversarial graph×query cases (dup/parallel edges, self-loops, isolated/dead-end nodes, cycles, undirected, multi-hop, fixed-point, min/max hops, names, filters, seeds) + the hop/chain test suites. cuDF 1-hop chain ~126→103 ms @1m (~18%). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@1m

A single MATCH (n) (no edge hop) — the dominant tabular/crossfilter shape (node filter, histogram, table search) — returns the filtered node table directly + empty edges, skipping the generic forward/backward/combine BFS. Helps pandas (default engine) + cuDF. Byte-identical (345-case adversarial golden + hop/chain suites, 240 tests). ~100x faster pandas node filter at 10M (204->2ms), cuDF stays on the resident frame (~0.8ms @1m). The 1-hop shape is NOT generic-fast-pathed: the full machinery's node merge upcasts int->float (pandas artifact), so a join-free generic 1-hop would change dtypes; the polars engine has its own dtype-consistent 1-hop fast path. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Extend the generic node-only fast path to the simple 1-hop MATCH (a {f})-[e]->(b) (the basic graph query + "filter then expand" crossfilter): return the edges whose endpoints pass the node filters + those endpoint nodes, skipping the forward/backward/combine BFS. Gated to pandas/cuDF (dask/spark keep the full path; polars has its own). Same values + node/edge sets (345-case adversarial golden = dtype-only diffs, values identical; no test regressions vs pristine). Behavior change (intentional, more correct): the 1-hop now PRESERVES node-attr dtypes (int stays int) instead of the full machinery's spurious int->float merge upcast — making pandas/cuDF consistent with the polars engine (which already preserved int). cuDF basic graph query now ~2 semi-joins on the resident frame instead of the BFS + ~31 drop_duplicates, capturing the GPU semijoin win. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

… OTel span placement Adversarial review of the lower stack layers (#1648 polars engine, #1652 generic fast paths) found 2 bugs: - BLOCKER (#1648): a chain crashed under engine=polars with SchemaError when an edge endpoint dtype differed from the node-id dtype across int<->float (e.g. a null in a source/dest column -> float64 vs int64 ids) where pandas joins fine. The hop aligns join keys; the chain fast paths + combine did not. Added _align_edge_endpoints (cast endpoints to node-id dtype for the traversal, restore output dtype to match pandas; no-op when dtypes match) wired into the single-hop fast path + multi-hop. - (#1652): the gfql.chain OTel @otel_traced decorator had landed on the internal _try_chain_fast_path probe (inserted between the decorator and def chain) instead of the public chain() — chain() lost its span, span recorded wrong fn/attrs. Moved it. Both verified + regression-tested. 457 polars tests + 334 generic chain tests pass (4 fails are the pre-existing local libnvrtc CUDA-env issue, not these changes). Row-order divergences the review also found (fast path returns table order vs the full machinery BFS-discovery order, for reverse/undirected without ORDER BY — Cypher- undefined, sets/values identical, no repo test depends on it) are claim-precision, not bugs; CHANGELOG wording to be tightened in the per-layer #1648/#1652 review pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The generic _try_chain_fast_path short-circuited chain() before the policy hook dispatch, so a query that hits the fast path (node-only MATCH, single-hop) never fired prechain/postchain (and skipped per-op policy inspection) — observable behavior change vs the pre-fast-path engine. Gate the fast path on no-policy; policy-bearing queries take the full path. 64 policy tests pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@100

Complements the Cypher whole-query parse memo in PR #1652. Once parse_cypher is cached, profiling the residual fixed per-call compile cost showed it is dominated by parse_expr() — the GFQL row-expression parser, invoked ~4x per query compile (each RETURN/WHERE/WITH expression) and, per call, rebuilding a Lark transformer that defines a frozen dataclass via runtime exec (dataclasses._process_class / _create_fn / exec churn). parse_expr() is a pure function of the expression string (no params/schema) and returns a tree of frozen dataclasses (17 frozen node types, tuple-valued fields, immutable). So identical expressions — re-parsed on every compile, and recurring across queries (e.g. `a.val > 50`) — are memoized via lru_cache(maxsize=1024). Non-str/empty guard stays outside the cache; only successful parses are cached. No source consumer outside graphistry/compute/gfql calls parse_expr, and nothing bypasses frozen-ness (no object.__setattr__). Measured (dgx-spark, median-of-9): stacked on the query-parse memo, the fixed per-call cost of a repeated string Cypher query drops a further ~4.5 ms -> ~1.8 ms (RETURN a @100 rows: ~6.3 -> 3.6 ms) — near parity with the equivalent native chain. Touches only expr_parser.py (+ its test) — disjoint from PR #1652's parser.py, so the two apply independently in either order. Tests: 3 focused expr-cache tests (memoization identity, distinct + hit registration, invalid non-caching). Full graphistry/tests/compute/gfql/: 2512 passed, 16 skipped, 15 xfailed; only 2 unrelated in-container artifacts fail (networkx setup.py packaging; a cugraph "without cugraph" shortest-path test in an image that has cugraph). ruff + mypy clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

… OTel span placement Adversarial review of the lower stack layers (#1648 polars engine, #1652 generic fast paths) found 2 bugs: - BLOCKER (#1648): a chain crashed under engine=polars with SchemaError when an edge endpoint dtype differed from the node-id dtype across int<->float (e.g. a null in a source/dest column -> float64 vs int64 ids) where pandas joins fine. The hop aligns join keys; the chain fast paths + combine did not. Added _align_edge_endpoints (cast endpoints to node-id dtype for the traversal, restore output dtype to match pandas; no-op when dtypes match) wired into the single-hop fast path + multi-hop. - (#1652): the gfql.chain OTel @otel_traced decorator had landed on the internal _try_chain_fast_path probe (inserted between the decorator and def chain) instead of the public chain() — chain() lost its span, span recorded wrong fn/attrs. Moved it. Both verified + regression-tested. 457 polars tests + 334 generic chain tests pass (4 fails are the pre-existing local libnvrtc CUDA-env issue, not these changes). Row-order divergences the review also found (fast path returns table order vs the full machinery BFS-discovery order, for reverse/undirected without ORDER BY — Cypher- undefined, sets/values identical, no repo test depends on it) are claim-precision, not bugs; CHANGELOG wording to be tightened in the per-layer #1648/#1652 review pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@100

Complements the Cypher whole-query parse memo in PR #1652. Once parse_cypher is cached, profiling the residual fixed per-call compile cost showed it is dominated by parse_expr() — the GFQL row-expression parser, invoked ~4x per query compile (each RETURN/WHERE/WITH expression) and, per call, rebuilding a Lark transformer that defines a frozen dataclass via runtime exec (dataclasses._process_class / _create_fn / exec churn). parse_expr() is a pure function of the expression string (no params/schema) and returns a tree of frozen dataclasses (17 frozen node types, tuple-valued fields, immutable). So identical expressions — re-parsed on every compile, and recurring across queries (e.g. `a.val > 50`) — are memoized via lru_cache(maxsize=1024). Non-str/empty guard stays outside the cache; only successful parses are cached. No source consumer outside graphistry/compute/gfql calls parse_expr, and nothing bypasses frozen-ness (no object.__setattr__). Measured (dgx-spark, median-of-9): stacked on the query-parse memo, the fixed per-call cost of a repeated string Cypher query drops a further ~4.5 ms -> ~1.8 ms (RETURN a @100 rows: ~6.3 -> 3.6 ms) — near parity with the equivalent native chain. Touches only expr_parser.py (+ its test) — disjoint from PR #1652's parser.py, so the two apply independently in either order. Tests: 3 focused expr-cache tests (memoization identity, distinct + hit registration, invalid non-caching). Full graphistry/tests/compute/gfql/: 2512 passed, 16 skipped, 15 xfailed; only 2 unrelated in-container artifacts fail (networkx setup.py packaging; a cugraph "without cugraph" shortest-path test in an image that has cugraph). ruff + mypy clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

… OTel span placement Adversarial review of the lower stack layers (#1648 polars engine, #1652 generic fast paths) found 2 bugs: - BLOCKER (#1648): a chain crashed under engine=polars with SchemaError when an edge endpoint dtype differed from the node-id dtype across int<->float (e.g. a null in a source/dest column -> float64 vs int64 ids) where pandas joins fine. The hop aligns join keys; the chain fast paths + combine did not. Added _align_edge_endpoints (cast endpoints to node-id dtype for the traversal, restore output dtype to match pandas; no-op when dtypes match) wired into the single-hop fast path + multi-hop. - (#1652): the gfql.chain OTel @otel_traced decorator had landed on the internal _try_chain_fast_path probe (inserted between the decorator and def chain) instead of the public chain() — chain() lost its span, span recorded wrong fn/attrs. Moved it. Both verified + regression-tested. 457 polars tests + 334 generic chain tests pass (4 fails are the pre-existing local libnvrtc CUDA-env issue, not these changes). Row-order divergences the review also found (fast path returns table order vs the full machinery BFS-discovery order, for reverse/undirected without ORDER BY — Cypher- undefined, sets/values identical, no repo test depends on it) are claim-precision, not bugs; CHANGELOG wording to be tightened in the per-layer #1648/#1652 review pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@100

Complements the Cypher whole-query parse memo in PR #1652. Once parse_cypher is cached, profiling the residual fixed per-call compile cost showed it is dominated by parse_expr() — the GFQL row-expression parser, invoked ~4x per query compile (each RETURN/WHERE/WITH expression) and, per call, rebuilding a Lark transformer that defines a frozen dataclass via runtime exec (dataclasses._process_class / _create_fn / exec churn). parse_expr() is a pure function of the expression string (no params/schema) and returns a tree of frozen dataclasses (17 frozen node types, tuple-valued fields, immutable). So identical expressions — re-parsed on every compile, and recurring across queries (e.g. `a.val > 50`) — are memoized via lru_cache(maxsize=1024). Non-str/empty guard stays outside the cache; only successful parses are cached. No source consumer outside graphistry/compute/gfql calls parse_expr, and nothing bypasses frozen-ness (no object.__setattr__). Measured (dgx-spark, median-of-9): stacked on the query-parse memo, the fixed per-call cost of a repeated string Cypher query drops a further ~4.5 ms -> ~1.8 ms (RETURN a @100 rows: ~6.3 -> 3.6 ms) — near parity with the equivalent native chain. Touches only expr_parser.py (+ its test) — disjoint from PR #1652's parser.py, so the two apply independently in either order. Tests: 3 focused expr-cache tests (memoization identity, distinct + hit registration, invalid non-caching). Full graphistry/tests/compute/gfql/: 2512 passed, 16 skipped, 15 xfailed; only 2 unrelated in-container artifacts fail (networkx setup.py packaging; a cugraph "without cugraph" shortest-path test in an image that has cugraph). ruff + mypy clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

… OTel span placement Adversarial review of the lower stack layers (#1648 polars engine, #1652 generic fast paths) found 2 bugs: - BLOCKER (#1648): a chain crashed under engine=polars with SchemaError when an edge endpoint dtype differed from the node-id dtype across int<->float (e.g. a null in a source/dest column -> float64 vs int64 ids) where pandas joins fine. The hop aligns join keys; the chain fast paths + combine did not. Added _align_edge_endpoints (cast endpoints to node-id dtype for the traversal, restore output dtype to match pandas; no-op when dtypes match) wired into the single-hop fast path + multi-hop. - (#1652): the gfql.chain OTel @otel_traced decorator had landed on the internal _try_chain_fast_path probe (inserted between the decorator and def chain) instead of the public chain() — chain() lost its span, span recorded wrong fn/attrs. Moved it. Both verified + regression-tested. 457 polars tests + 334 generic chain tests pass (4 fails are the pre-existing local libnvrtc CUDA-env issue, not these changes). Row-order divergences the review also found (fast path returns table order vs the full machinery BFS-discovery order, for reverse/undirected without ORDER BY — Cypher- undefined, sets/values identical, no repo test depends on it) are claim-precision, not bugs; CHANGELOG wording to be tightened in the per-layer #1648/#1652 review pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The generic single-hop fast path accepted edges with `prune_to_endpoints=True` (a public e()/e_forward()/e_reverse() kwarg) but returned BOTH endpoints, while the flag keeps only the arrival side (dest for forward, src for reverse). So `[n(), e_forward(prune_to_endpoints=True), n()]` silently returned the wrong node/edge set (verified: path 0→1→2→3 gave [0,1,2,3] vs correct [1,2,3]). Add the flag to the fast-path edge gate so it declines and falls through to the full path, which honors it. `is_simple_single_hop()` is intentionally left unchanged (its other callers depend on the current semantics) — the gate is the correct, surgical layer. Also document the deep-immutability invariant on the Cypher AST nodes: parse_cypher results are shared by reference via lru_cache, so a future mutable field would poison cache hits. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@100

Complements the Cypher whole-query parse memo in PR #1652. Once parse_cypher is cached, profiling the residual fixed per-call compile cost showed it is dominated by parse_expr() — the GFQL row-expression parser, invoked ~4x per query compile (each RETURN/WHERE/WITH expression) and, per call, rebuilding a Lark transformer that defines a frozen dataclass via runtime exec (dataclasses._process_class / _create_fn / exec churn). parse_expr() is a pure function of the expression string (no params/schema) and returns a tree of frozen dataclasses (17 frozen node types, tuple-valued fields, immutable). So identical expressions — re-parsed on every compile, and recurring across queries (e.g. `a.val > 50`) — are memoized via lru_cache(maxsize=1024). Non-str/empty guard stays outside the cache; only successful parses are cached. No source consumer outside graphistry/compute/gfql calls parse_expr, and nothing bypasses frozen-ness (no object.__setattr__). Measured (dgx-spark, median-of-9): stacked on the query-parse memo, the fixed per-call cost of a repeated string Cypher query drops a further ~4.5 ms -> ~1.8 ms (RETURN a @100 rows: ~6.3 -> 3.6 ms) — near parity with the equivalent native chain. Touches only expr_parser.py (+ its test) — disjoint from PR #1652's parser.py, so the two apply independently in either order. Tests: 3 focused expr-cache tests (memoization identity, distinct + hit registration, invalid non-caching). Full graphistry/tests/compute/gfql/: 2512 passed, 16 skipped, 15 xfailed; only 2 unrelated in-container artifacts fail (networkx setup.py packaging; a cugraph "without cugraph" shortest-path test in an image that has cugraph). ruff + mypy clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Code-quality pass on the opt-base fast path: - Type the contract boundaries: _filter_edges_by_endpoint(edges_df: DataFrameT, nodes_df: Optional[DataFrameT], ...) -> DataFrameT; _try_chain_fast_path(g_in: Plottable, ops: List[ASTObject], engine_concrete: Engine, start_nodes: Optional[DataFrameT]). - Replace getattr(n0, "query", None) duck-typing with n0.query (ASTNode declares it; the isinstance(n0, ASTNode) guard narrows the type). - Convert Engine -> EngineAbstract once (engine_abs) for the filter_by_dict calls (its param is EngineAbstract|str), matching the existing materialize conversion. - Trim verbose comments to terse one/two-liners (fast-path docstring, .unique removal, prune gate, policy gate, temporal dtype gate, AST immutability invariant). No behavior change; mypy + ruff clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@100

Complements the Cypher whole-query parse memo in PR #1652. Once parse_cypher is cached, profiling the residual fixed per-call compile cost showed it is dominated by parse_expr() — the GFQL row-expression parser, invoked ~4x per query compile (each RETURN/WHERE/WITH expression) and, per call, rebuilding a Lark transformer that defines a frozen dataclass via runtime exec (dataclasses._process_class / _create_fn / exec churn). parse_expr() is a pure function of the expression string (no params/schema) and returns a tree of frozen dataclasses (17 frozen node types, tuple-valued fields, immutable). So identical expressions — re-parsed on every compile, and recurring across queries (e.g. `a.val > 50`) — are memoized via lru_cache(maxsize=1024). Non-str/empty guard stays outside the cache; only successful parses are cached. No source consumer outside graphistry/compute/gfql calls parse_expr, and nothing bypasses frozen-ness (no object.__setattr__). Measured (dgx-spark, median-of-9): stacked on the query-parse memo, the fixed per-call cost of a repeated string Cypher query drops a further ~4.5 ms -> ~1.8 ms (RETURN a @100 rows: ~6.3 -> 3.6 ms) — near parity with the equivalent native chain. Touches only expr_parser.py (+ its test) — disjoint from PR #1652's parser.py, so the two apply independently in either order. Tests: 3 focused expr-cache tests (memoization identity, distinct + hit registration, invalid non-caching). Full graphistry/tests/compute/gfql/: 2512 passed, 16 skipped, 15 xfailed; only 2 unrelated in-container artifacts fail (networkx setup.py packaging; a cugraph "without cugraph" shortest-path test in an image that has cugraph). ruff + mypy clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Adds a regression test for the fix in 2755744 (fast path must not skip policy hooks): asserts that for the exact fast-path-eligible shapes (`[n()]` and `[n(),e(),n()]`) the prechain/postchain/postload hooks all fire when a policy is installed, and that the fast path is still taken when no policy is present. Without a guard this regression was silent (the shapes that hit the fast path are precisely the ones the policy suite exercises). Also documents the policy-gate in the fast-path CHANGELOG entry. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The chain fast path (_try_chain_fast_path) shipped with only incidental coverage (existing hop/chain/golden suites); its core behaviors were never pinned directly. Adds focused, engine-parametrized locks: - test_fast_path_differential_parity_vs_full_path: fast path output == full (policy-forced BFS) path output by node/edge SET, across every accelerated shape (node-only, node-filter, predicate, fwd/rev/undirected 1-hop, src/dst/both-end filters) AND every bypass shape (hops=2, filtered-undirected, edge_match, named node). Uses an installed no-op policy as the equivalence oracle (any policy forces the full path). - test_fast_path_preserves_int_node_dtypes: hard-locks the feature promise (1-hop fast path keeps int node attrs as int); tolerant on the full path. - test_fast_path_gating_returns_none_for_ineligible: unit-level accept/decline contract for the gate (seeds, non-eager engines, edge_match, named/queried nodes, hops>1, 2-op chains all decline). - _gfql_series_is_list_like dtype-gate: direct unit coverage (numeric/bool short-circuit; real list/tuple objects still detected; stringified lists intentionally deferred to order_detect_stringified_list_series). pandas lanes run everywhere; cuDF lanes gated on TEST_CUDF=1 (dgx-spark). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The new _try_chain_fast_path helper was inserted between the @otel_traced("gfql.chain") decorator and def chain(), so the decorator silently migrated onto the fast-path helper: chain() stopped emitting its span entirely, and _chain_otel_attrs (which expects chain's signature) was invoked with the fast path's positional args — binding `gfql.validate_schema` to the start_nodes DataFrame. Only surfaces with OTEL enabled (attrs are computed lazily), so CI stayed green. Move the decorator back onto chain(). Adds test_chain_otel_span_attrs_mapped_correctly: enables otel via monkeypatch, captures the span, and asserts chain_len + that validate_schema is a bool (fails under the misplaced decorator, passes with it on chain()). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The #1650/#1651 dtype gate (numeric/bool/complex columns skip the spurious astype(str)+regex list/temporal scan) was duplicated across 4 sites with 3 different sentinel conventions ("O", None, factored). Consolidate the kind set into is_non_textual_scalar_dtype() in series_str_compat.py (the engine-poly Series-string layer these gates short-circuit) and call it from ordering.py (x2), pipeline.py, and cypher/result_postprocess.py. Byte-identical; adds a direct parametrized unit test for the helper across dtype kinds + None. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The 1-hop fast path took `edges = g._edges` wholesale and only intersected filtered endpoints with the node table for the NODE output — it never restricted the EDGES to those whose endpoints exist in the node table. The full BFS path enforces that implicitly via its edge<->node joins, so the two diverged whenever a supplied node table omits an edge endpoint (filtered/subset node frames, or nodes+edges loaded separately): nodes v=[0,1], edges (s,d)=[(0,1),(1,99)] # 99 absent [n(), e_forward(hops=1), n()] fast kept (1,99); full dropped it [n({'attr':2}), e_forward, n()] fast -> nodes[1] edges[(1,99)]; full -> EMPTY Restrict edges to both-endpoints-present before the endpoint filters, and dedup node ids on the result (the full path collapses dup rows via its merge). Now fast == full across dangling-endpoint and duplicate-id graphs; no change on well-formed graphs. Found by wave-2 differential review; tests added for both. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The wave-2 both-endpoints-present restriction used `.isin(node_ids)`, but pandas/cuDF `.isin` treat NaN as a matchable value (`NaN.isin([NaN])` is True). So a NaN node id "validated" a NaN edge endpoint and the fast path kept a dangling NaN edge that the full BFS path's joins (which never match NaN<->NaN) correctly drop. dropna() on node_ids before the isin restores convergence for forward/reverse on NaN-id graphs; well-formed and non-null-dangling graphs are unchanged. (NaN ids are malformed input; the full path's undirected branch is itself inconsistent there, so we match its directed/consistent behavior.) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…s, dup ids, NaN) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…d; expr AST immutability note Hardening tests for the fast-path + dtype-gate optimizations: - prune_to_endpoints: add to the gating-decline list and as bypass shapes (fwd+rev) in the fast-vs-full differential — regression guard for the fast-path prune gate. - NaN endpoints: new test_fast_path_drops_nan_endpoint_edges pins that a NaN node id does not validate a NaN edge endpoint (the .dropna() guard), matching the full BFS path. - Tighten the differential test: bypass shapes now also assert they DECLINE the fast path (previously full-vs-full, vacuous for bypass cases). - Drift guard: assert filter_by_dict._is_numeric_dtype_safe agrees with the series_str_compat._NON_TEXTUAL_SCALAR_KINDS SSOT across all numpy dtype kinds (the two keep separate copies to avoid an import cycle; they must not desync). - cuDF lane for the numeric where_rows dtype-gate path (paired-coverage convention). Also document the deep-immutability invariant on ExprNode (parse_expr lru_cache shares results by reference). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…te (#1657) Code-quality pass on the parse-cache layer: - Condense the verbose dropna/dangling-edge and dup-node fast-path comments to terse one/two-liners; same for the SSOT dtype-kind note and the ExprNode immutability invariant. - Note in the SSOT comment that filter_by_dict's mirror is kept in sync by a test (not imported) to avoid an import cycle. No behavior change; mypy + ruff clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@100

Complements the Cypher whole-query parse memo in PR #1652. Once parse_cypher is cached, profiling the residual fixed per-call compile cost showed it is dominated by parse_expr() — the GFQL row-expression parser, invoked ~4x per query compile (each RETURN/WHERE/WITH expression) and, per call, rebuilding a Lark transformer that defines a frozen dataclass via runtime exec (dataclasses._process_class / _create_fn / exec churn). parse_expr() is a pure function of the expression string (no params/schema) and returns a tree of frozen dataclasses (17 frozen node types, tuple-valued fields, immutable). So identical expressions — re-parsed on every compile, and recurring across queries (e.g. `a.val > 50`) — are memoized via lru_cache(maxsize=1024). Non-str/empty guard stays outside the cache; only successful parses are cached. No source consumer outside graphistry/compute/gfql calls parse_expr, and nothing bypasses frozen-ness (no object.__setattr__). Measured (dgx-spark, median-of-9): stacked on the query-parse memo, the fixed per-call cost of a repeated string Cypher query drops a further ~4.5 ms -> ~1.8 ms (RETURN a @100 rows: ~6.3 -> 3.6 ms) — near parity with the equivalent native chain. Touches only expr_parser.py (+ its test) — disjoint from PR #1652's parser.py, so the two apply independently in either order. Tests: 3 focused expr-cache tests (memoization identity, distinct + hit registration, invalid non-caching). Full graphistry/tests/compute/gfql/: 2512 passed, 16 skipped, 15 xfailed; only 2 unrelated in-container artifacts fail (networkx setup.py packaging; a cugraph "without cugraph" shortest-path test in an image that has cugraph). ruff + mypy clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…policy-hook guard test) Root-cause fix for a real bug surfaced while hardening the fast-path/policy work: A chain/Cypher query over an edges-only graph (no node-id binding, `g._node is None`) takes the full traversal path — the node-only/single-hop fast path is skipped whenever an edge-id index must be synthesized (`added_edge_index`), or when a policy is attached. That path rebuilt `g_out` from `self` (the *unbound* input) to restore the original edge binding, but `self._node is None`, so the materialized node-id binding was dropped. The endpoint-reconciliation concat then synthesized a spurious `None`-named node column — a corrupt result on older pandas, and a hard `NotImplementedError` (void-block NA fill) on newer pandas (Python 3.14). Fix: carry the materialized `g._node` binding explicitly (`self.nodes(final_nodes_df, g._node)`) while still restoring the original edge binding. Full-path output now matches the fast path: correct single node column, valid `_node`, edge binding preserved. Pre-existing on master (not introduced by this PR's fast paths). Also adds a parametrized regression test pinning that BOTH chain fast-path shapes (node-only `[n()]` and single-hop `[n(),e(),n()]`) still fire `postload` under a policy AND return a valid, non-corrupt result on an edges-only graph — guarding both the policy-hook gate (2755744) and this node-binding fix. CHANGELOG: documents both. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@100

Complements the Cypher whole-query parse memo in PR #1652. Once parse_cypher is cached, profiling the residual fixed per-call compile cost showed it is dominated by parse_expr() — the GFQL row-expression parser, invoked ~4x per query compile (each RETURN/WHERE/WITH expression) and, per call, rebuilding a Lark transformer that defines a frozen dataclass via runtime exec (dataclasses._process_class / _create_fn / exec churn). parse_expr() is a pure function of the expression string (no params/schema) and returns a tree of frozen dataclasses (17 frozen node types, tuple-valued fields, immutable). So identical expressions — re-parsed on every compile, and recurring across queries (e.g. `a.val > 50`) — are memoized via lru_cache(maxsize=1024). Non-str/empty guard stays outside the cache; only successful parses are cached. No source consumer outside graphistry/compute/gfql calls parse_expr, and nothing bypasses frozen-ness (no object.__setattr__). Measured (dgx-spark, median-of-9): stacked on the query-parse memo, the fixed per-call cost of a repeated string Cypher query drops a further ~4.5 ms -> ~1.8 ms (RETURN a @100 rows: ~6.3 -> 3.6 ms) — near parity with the equivalent native chain. Touches only expr_parser.py (+ its test) — disjoint from PR #1652's parser.py, so the two apply independently in either order. Tests: 3 focused expr-cache tests (memoization identity, distinct + hit registration, invalid non-caching). Full graphistry/tests/compute/gfql/: 2512 passed, 16 skipped, 15 xfailed; only 2 unrelated in-container artifacts fail (networkx setup.py packaging; a cugraph "without cugraph" shortest-path test in an image that has cugraph). ruff + mypy clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

… OTel span placement Adversarial review of the lower stack layers (#1648 polars engine, #1652 generic fast paths) found 2 bugs: - BLOCKER (#1648): a chain crashed under engine=polars with SchemaError when an edge endpoint dtype differed from the node-id dtype across int<->float (e.g. a null in a source/dest column -> float64 vs int64 ids) where pandas joins fine. The hop aligns join keys; the chain fast paths + combine did not. Added _align_edge_endpoints (cast endpoints to node-id dtype for the traversal, restore output dtype to match pandas; no-op when dtypes match) wired into the single-hop fast path + multi-hop. - (#1652): the gfql.chain OTel @otel_traced decorator had landed on the internal _try_chain_fast_path probe (inserted between the decorator and def chain) instead of the public chain() — chain() lost its span, span recorded wrong fn/attrs. Moved it. Both verified + regression-tested. 457 polars tests + 334 generic chain tests pass (4 fails are the pre-existing local libnvrtc CUDA-env issue, not these changes). Row-order divergences the review also found (fast path returns table order vs the full machinery BFS-discovery order, for reverse/undirected without ORDER BY — Cypher- undefined, sets/values identical, no repo test depends on it) are claim-precision, not bugs; CHANGELOG wording to be tightened in the per-layer #1648/#1652 review pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

lmeyerov and others added 2 commits June 26, 2026 14:48

lmeyerov mentioned this pull request Jun 26, 2026

perf(gfql): dtype-gate temporal-text detection to avoid spurious stringification (#1650) #1651

Closed

lmeyerov mentioned this pull request Jun 27, 2026

perf(gfql): memoize row-expression parser (parse_expr) #1657

Merged

lmeyerov and others added 3 commits June 27, 2026 09:30

fix(gfql): preserve policy hooks in chain fast path

c0a61e8

lmeyerov and others added 3 commits June 28, 2026 17:55

lmeyerov and others added 6 commits June 28, 2026 17:55

docs(changelog): note fast-path correctness guarantees (dangling edge…

c2cc99c

…s, dup ids, NaN) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

lmeyerov merged commit 984e4df into master Jun 29, 2026
77 checks passed

lmeyerov deleted the dev/gfql-opt-base branch June 29, 2026 02:08

lmeyerov restored the dev/gfql-opt-base branch June 29, 2026 02:10

lmeyerov deleted the dev/gfql-opt-base branch June 29, 2026 02:11

lmeyerov mentioned this pull request Jun 29, 2026

Release v0.57.0 #1659

Merged

lmeyerov mentioned this pull request Jun 29, 2026

feat(gfql): native lazy Polars engine — collect-once traversals + cypher row pipeline #1660

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

perf(gfql): general optimizations base (parse memoization + temporal dtype-gate) [PR0]#1652

perf(gfql): general optimizations base (parse memoization + temporal dtype-gate) [PR0]#1652
lmeyerov merged 20 commits into
masterfrom
dev/gfql-opt-base

lmeyerov commented Jun 26, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

lmeyerov commented Jun 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

1. Cypher parse memoization (4a627383)

2. Temporal-detection dtype gate (b539212c, folds in #1651, fixes spurious half of #1650)

Test

3. Additional generic-engine opts (folded in)

Fast-path review hardening (added via multi-wave adversarial review)

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

lmeyerov commented Jun 26, 2026 •

edited

Loading

1. Cypher parse memoization (`4a627383`)

2. Temporal-detection dtype gate (`b539212c`, folds in #1651, fixes spurious half of #1650)