perf(executor): per-ExecutionCtx ArrayKernels snapshot #8401

Open
lukekim wants to merge 2 commits into vortex-data:develop from spiceai:lukim/exec-kernel-snapshot-develop

@lukekim lukekim commented Jun 12, 2026

Contributor

Summary

Resolves ArrayKernels once at ExecutionCtx construction into a crate-private KernelSnapshot (a single ArcSwapMap::load_full of the execute-parent kernel map), instead of cloning the session and probing the sharded session-variable DashMap on every array node in the execute_until / single-step execution paths.
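
As a rough illustration of the pattern described above, here is a minimal std-only sketch. Everything in it is a hypothetical stand-in rather than the actual vortex-array code: `Registry`, the `Kernel` function type, the string keys, and the `RwLock<Arc<..>>` storage are assumptions (the PR uses an ArcSwapMap behind a session variable). The point is only that the context pays for one snapshot at construction and each node then does a plain map lookup.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

// Hypothetical stand-ins: a kernel is a plain function, keyed by
// (parent, child) encoding names.
type Kernel = fn(i64) -> i64;
type KernelMap = HashMap<(&'static str, &'static str), Kernel>;

/// Stand-in for the session-scoped registry. An RwLock<Arc<..>> is used
/// only so the sketch needs nothing beyond std.
struct Registry {
    map: RwLock<Arc<KernelMap>>,
}

impl Registry {
    fn new() -> Self {
        Registry { map: RwLock::new(Arc::new(HashMap::new())) }
    }

    /// Copy-on-write registration: build a new map and swap the Arc.
    fn register(&self, key: (&'static str, &'static str), kernel: Kernel) {
        let mut guard = self.map.write().unwrap();
        let mut next = (**guard).clone();
        next.insert(key, kernel);
        *guard = Arc::new(next);
    }

    /// Owned snapshot that outlives the lock guard.
    fn load_full(&self) -> Arc<KernelMap> {
        let guard = self.map.read().unwrap();
        Arc::clone(&*guard)
    }
}

/// Hypothetical execution context: resolves the kernel map once.
struct ExecutionCtx {
    kernels: Arc<KernelMap>,
}

impl ExecutionCtx {
    fn new(registry: &Registry) -> Self {
        // The only registry access: one snapshot at construction.
        ExecutionCtx { kernels: registry.load_full() }
    }

    /// Per-node step: a plain HashMap lookup, no lock and no session clone.
    fn execute_node(&self, parent: &'static str, child: &'static str, value: i64) -> i64 {
        match self.kernels.get(&(parent, child)) {
            Some(kernel) => kernel(value),
            None => value,
        }
    }
}

fn double(x: i64) -> i64 {
    x * 2
}

fn main() {
    let registry = Registry::new();
    registry.register(("cast", "struct"), double);
    let ctx = ExecutionCtx::new(&registry);
    let total: i64 = (0..4).map(|v| ctx.execute_node("cast", "struct", v)).sum();
    println!("{total}");
}
```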

Why

  • Removes a per-array-node session clone + DashMap shard RwLock probe from the hot execution loop.
  • Stops holding the session-variable read guard across kernel invocation (previously a plugin kernel touching the session registry could contend/deadlock on the same shard).
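
The second point can be sketched in isolation. The example below is an assumption-laden illustration, not the PR's code: `Registry`, `Kernel`, and `reentrant_kernel` are hypothetical, and a single `std::sync::RwLock` stands in for a DashMap shard lock. It copies the kernel out of the guard, lets the guard drop, and only then invokes the kernel, so a kernel that writes to the same registry does not block on a lock its caller still holds.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

// Hypothetical: a kernel that is handed the registry it was looked up in,
// the way a plugin kernel might touch session state.
type Kernel = fn(&Registry) -> usize;

struct Registry {
    kernels: RwLock<HashMap<&'static str, Kernel>>,
}

impl Registry {
    fn new() -> Self {
        Registry { kernels: RwLock::new(HashMap::new()) }
    }

    fn register(&self, name: &'static str, kernel: Kernel) {
        self.kernels.write().unwrap().insert(name, kernel);
    }

    fn invoke(&self, name: &str) -> Option<usize> {
        // Copy the fn pointer out. The read guard is a temporary that is
        // dropped at the end of this statement, before the kernel runs.
        let kernel: Option<Kernel> = self.kernels.read().unwrap().get(name).copied();
        kernel.map(|k| k(self))
    }
}

fn noop_kernel(_registry: &Registry) -> usize {
    0
}

/// Registers another kernel while running, then reports the registry size.
/// If the caller still held the read guard, the write lock taken inside
/// `register` could never be acquired on this thread.
fn reentrant_kernel(registry: &Registry) -> usize {
    registry.register("late", noop_kernel);
    let count = registry.kernels.read().unwrap().len();
    count
}

fn main() {
    let registry = Registry::new();
    registry.register("plugin", reentrant_kernel);
    println!("{:?}", registry.invoke("plugin"));
}
```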

Semantics

The registry is session-scoped and mutable via its public register_* methods. An ExecutionCtx sees a point-in-time snapshot taken at construction; later registrations are picked up by the next context (contexts are created per evaluation). Kernel lookup order is unchanged: registered plugin kernels are tried before static execute_parent kernels, with the same (parent, child) hashing.
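
The two guarantees in this paragraph (point-in-time visibility, and plugin kernels tried before static ones) can be illustrated with a small sketch. All names here are hypothetical: `static_lookup` stands in for the built-in execute_parent table, and the string-pair `Key` stands in for the real (parent, child) hashing.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

type Kernel = fn() -> &'static str;
type Key = (&'static str, &'static str);

fn static_kernel() -> &'static str {
    "static"
}

fn plugin_kernel() -> &'static str {
    "plugin"
}

/// Hypothetical built-in table of static execute_parent kernels.
fn static_lookup(key: Key) -> Option<Kernel> {
    if key == ("cast", "struct") {
        Some(static_kernel as Kernel)
    } else {
        None
    }
}

/// Hypothetical session-scoped plugin registry.
struct Registry {
    plugins: RwLock<Arc<HashMap<Key, Kernel>>>,
}

impl Registry {
    fn new() -> Self {
        Registry { plugins: RwLock::new(Arc::new(HashMap::new())) }
    }

    fn register(&self, key: Key, kernel: Kernel) {
        let mut guard = self.plugins.write().unwrap();
        let mut next = (**guard).clone();
        next.insert(key, kernel);
        *guard = Arc::new(next);
    }

    fn snapshot(&self) -> Arc<HashMap<Key, Kernel>> {
        let guard = self.plugins.read().unwrap();
        Arc::clone(&*guard)
    }
}

struct ExecutionCtx {
    plugins: Arc<HashMap<Key, Kernel>>,
}

impl ExecutionCtx {
    fn new(registry: &Registry) -> Self {
        ExecutionCtx { plugins: registry.snapshot() }
    }

    /// Plugin kernels from the snapshot first, then the static table.
    fn resolve(&self, key: Key) -> Option<Kernel> {
        self.plugins.get(&key).copied().or_else(|| static_lookup(key))
    }
}

fn main() {
    let registry = Registry::new();
    let before = ExecutionCtx::new(&registry);
    registry.register(("cast", "struct"), plugin_kernel);
    let after = ExecutionCtx::new(&registry);

    // `before` was built prior to the registration, so it still falls
    // through to the static kernel; `after` sees the plugin.
    println!("{:?}", before.resolve(("cast", "struct")).map(|k| k()));
    println!("{:?}", after.resolve(("cast", "struct")).map(|k| k()));
}
```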

Adds a pub(crate) ArcSwapMap::load_full accessor (snapshot-that-outlives-the-call, complementing read). KernelSnapshot and ArrayKernels::snapshot() are pub(crate) — no new public API.
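
To make the difference between the two accessors concrete, here is a hedged sketch. `SwapMap` is a hypothetical stand-in for ArcSwapMap, and the closure-based shape of `read` is an assumption, since the real signatures are not shown in this description. What it demonstrates is the contrast named above: one accessor whose borrow ends with the call, and one that hands back an owned `Arc` that stays valid after later writes.

```rust
use std::collections::HashMap;
use std::hash::Hash;
use std::sync::{Arc, RwLock};

/// Hypothetical stand-in for ArcSwapMap, built on std only.
struct SwapMap<K, V> {
    inner: RwLock<Arc<HashMap<K, V>>>,
}

impl<K: Eq + Hash + Clone, V: Clone> SwapMap<K, V> {
    fn new() -> Self {
        SwapMap { inner: RwLock::new(Arc::new(HashMap::new())) }
    }

    /// Copy-on-write insert: existing snapshots are never mutated.
    fn insert(&self, key: K, value: V) {
        let mut guard = self.inner.write().unwrap();
        let mut next = (**guard).clone();
        next.insert(key, value);
        *guard = Arc::new(next);
    }

    /// Scoped access (assumed shape): the borrow cannot escape the closure.
    fn read<R>(&self, f: impl FnOnce(&HashMap<K, V>) -> R) -> R {
        let guard = self.inner.read().unwrap();
        f(&**guard)
    }

    /// Owned snapshot that outlives the call.
    pub(crate) fn load_full(&self) -> Arc<HashMap<K, V>> {
        let guard = self.inner.read().unwrap();
        Arc::clone(&*guard)
    }
}

fn main() {
    let map = SwapMap::new();
    map.insert("a", 1);

    let len_now = map.read(|m| m.len());
    let snapshot = map.load_full();

    // A later write replaces the inner Arc; the snapshot is unaffected.
    map.insert("b", 2);
    println!("{} {} {}", len_now, snapshot.len(), map.read(|m| m.len()));
}
```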

Testing

  • cargo nextest run -p vortex-array — 2893 passed (includes struct_cast_execute_parent_uses_session_plugin, covering the register → snapshot → execute path).
  • cargo clippy --all-targets -p vortex-array — clean on the spiceai-53 variant of this change; develop port re-verified via the full test build.

Resolve ArrayKernels once at ExecutionCtx construction into a KernelSnapshot
(an ArcSwapMap::load_full of the execute-parent kernel map) instead of
cloning the session and probing the sharded session-variable DashMap on
every array node in the execute_until / single-step paths. This also stops
holding the session-variable read guard across kernel invocation.

The registry is session-scoped and mutable via its public register_* methods:
an ExecutionCtx sees a point-in-time snapshot taken at construction, and
later registrations are picked up by the next context (contexts are created
per evaluation). Adds a pub(crate) ArcSwapMap::load_full accessor; the
KernelSnapshot type and ArrayKernels::snapshot() are pub(crate), so no new
public API is added.

Signed-off-by: Luke Kim <80174+lukekim@users.noreply.github.com>
@lukekim lukekim requested a review from a team June 12, 2026 21:21
@gatesn gatesn added the action/benchmark Trigger full benchmarks to run on this PR label Jun 12, 2026
codspeed-hq Bot commented Jun 12, 2026


Merging this PR will not alter performance

⚠️ Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

⚡ 52 improved benchmarks
❌ 62 regressed benchmarks
✅ 1423 untouched benchmarks
⏩ 10 skipped benchmarks [1]

Warning

Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

| Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|
| Simulation | chunked_bool_canonical_into[(1000, 10)] | 20.7 µs | 36.3 µs | -43.08% |
| Simulation | compare[48] | 213 µs | 300.4 µs | -29.11% |
| Simulation | compare[50] | 227.7 µs | 318.9 µs | -28.58% |
| Simulation | compare[49] | 228.2 µs | 317.5 µs | -28.13% |
| Simulation | compare[46] | 218.5 µs | 302.2 µs | -27.69% |
| Simulation | compare[47] | 223.5 µs | 309.1 µs | -27.68% |
| Simulation | compare[44] | 212.2 µs | 292.1 µs | -27.37% |
| Simulation | compare[45] | 218.9 µs | 300.7 µs | -27.21% |
| Simulation | compare[40] | 195.6 µs | 267.3 µs | -26.82% |
| Simulation | compare[43] | 214.2 µs | 292.3 µs | -26.71% |
| Simulation | compare[42] | 209.4 µs | 285.6 µs | -26.68% |
| Simulation | compare[41] | 209.3 µs | 283.8 µs | -26.22% |
| Simulation | compare[39] | 204.7 µs | 274.5 µs | -25.43% |
| Simulation | compare[38] | 200.3 µs | 268.3 µs | -25.33% |
| Simulation | compare[32] | 173.4 µs | 231 µs | -24.96% |
| Simulation | compare[36] | 194 µs | 258.2 µs | -24.86% |
| Simulation | compare[37] | 200.1 µs | 266.2 µs | -24.83% |
| Simulation | compare[35] | 195.4 µs | 257.7 µs | -24.19% |
| Simulation | compare[34] | 191.1 µs | 251.6 µs | -24.03% |
| Simulation | compare[33] | 190.6 µs | 250.1 µs | -23.78% |
| ... | ... | ... | ... | ... |

ℹ️ Only the first 20 benchmarks are displayed. Go to the app to view all benchmarks.

Tip

Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.


Comparing spiceai:lukim/exec-kernel-snapshot-develop (2820b17) with develop (d0013ff)

Footnotes

  1. 10 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, they can be archived in CodSpeed to remove them from the performance reports.
