perf(executor): per-ExecutionCtx ArrayKernels snapshot #8401
Resolve ArrayKernels once at ExecutionCtx construction into a KernelSnapshot (an ArcSwapMap::load_full of the execute-parent kernel map) instead of cloning the session and probing the sharded session-variable DashMap on every array node in the execute_until / single-step paths. This also stops holding the session-variable read guard across kernel invocation.

The registry is session-scoped and mutable via its public register_* methods: an ExecutionCtx sees a point-in-time snapshot taken at construction, and later registrations are picked up by the next context (contexts are created per evaluation).

Adds a pub(crate) ArcSwapMap::load_full accessor; the KernelSnapshot type and ArrayKernels::snapshot() are pub(crate), so no new public API is added.

Signed-off-by: Luke Kim <80174+lukekim@users.noreply.github.com>
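The point about no longer holding the read guard across kernel invocation can be illustrated with a minimal, standard-library-only sketch. Everything here is hypothetical: `SessionVars`, `Kernel`, `probe_kernel`, and both `invoke_*` functions are stand-ins for illustration, and a plain `RwLock<HashMap>` replaces the sharded `DashMap`. The probe kernel simply reports whether it could take the write lock while it runs.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

// Hypothetical stand-ins for illustration; not the vortex-array types.
type Kernel = fn(&SessionVars) -> bool;

struct SessionVars {
    kernels: RwLock<HashMap<&'static str, Kernel>>,
}

/// Probe kernel: returns true if the write lock is free while it runs.
fn probe_kernel(vars: &SessionVars) -> bool {
    vars.kernels.try_write().is_ok()
}

/// Old shape: the read guard stays alive while the kernel executes.
fn invoke_holding_guard(vars: &SessionVars, name: &str) -> Option<bool> {
    let guard = vars.kernels.read().unwrap();
    guard.get(name).map(|kernel| kernel(vars))
}

/// New shape: copy the kernel handle out, drop the guard, then invoke.
fn invoke_after_release(vars: &SessionVars, name: &str) -> Option<bool> {
    let kernel = {
        let guard = vars.kernels.read().unwrap();
        guard.get(name).copied()
    }; // guard is dropped here, before the kernel runs
    kernel.map(|kernel| kernel(vars))
}

fn main() {
    let mut map: HashMap<&'static str, Kernel> = HashMap::new();
    map.insert("probe", probe_kernel);
    let vars = SessionVars { kernels: RwLock::new(map) };

    // While the guard is held, a writer (e.g. a registration) cannot proceed.
    assert_eq!(invoke_holding_guard(&vars, "probe"), Some(false));
    // Once the guard is released first, the write lock is available.
    assert_eq!(invoke_after_release(&vars, "probe"), Some(true));
}
```

In this sketch the second shape keeps the lock's critical section down to a map lookup, so a slow kernel does not delay registrations.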
Merging this PR will not alter performance
| | Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|---|
| ❌ | Simulation | chunked_bool_canonical_into[(1000, 10)] | 20.7 µs | 36.3 µs | -43.08% |
| ❌ | Simulation | compare[48] | 213 µs | 300.4 µs | -29.11% |
| ❌ | Simulation | compare[50] | 227.7 µs | 318.9 µs | -28.58% |
| ❌ | Simulation | compare[49] | 228.2 µs | 317.5 µs | -28.13% |
| ❌ | Simulation | compare[46] | 218.5 µs | 302.2 µs | -27.69% |
| ❌ | Simulation | compare[47] | 223.5 µs | 309.1 µs | -27.68% |
| ❌ | Simulation | compare[44] | 212.2 µs | 292.1 µs | -27.37% |
| ❌ | Simulation | compare[45] | 218.9 µs | 300.7 µs | -27.21% |
| ❌ | Simulation | compare[40] | 195.6 µs | 267.3 µs | -26.82% |
| ❌ | Simulation | compare[43] | 214.2 µs | 292.3 µs | -26.71% |
| ❌ | Simulation | compare[42] | 209.4 µs | 285.6 µs | -26.68% |
| ❌ | Simulation | compare[41] | 209.3 µs | 283.8 µs | -26.22% |
| ❌ | Simulation | compare[39] | 204.7 µs | 274.5 µs | -25.43% |
| ❌ | Simulation | compare[38] | 200.3 µs | 268.3 µs | -25.33% |
| ❌ | Simulation | compare[32] | 173.4 µs | 231 µs | -24.96% |
| ❌ | Simulation | compare[36] | 194 µs | 258.2 µs | -24.86% |
| ❌ | Simulation | compare[37] | 200.1 µs | 266.2 µs | -24.83% |
| ❌ | Simulation | compare[35] | 195.4 µs | 257.7 µs | -24.19% |
| ❌ | Simulation | compare[34] | 191.1 µs | 251.6 µs | -24.03% |
| ❌ | Simulation | compare[33] | 190.6 µs | 250.1 µs | -23.78% |
| ... | ... | ... | ... | ... | ... |
ℹ️ Only the first 20 benchmarks are displayed. Go to the app to view all benchmarks.
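For readers wondering why a slowdown from 20.7 µs to 36.3 µs is shown as roughly -43% rather than +75%: the Efficiency values in the table appear consistent with BASE / HEAD - 1. This formula is inferred from the displayed rows, not taken from CodSpeed documentation, and the function name `efficiency_pct` is hypothetical. Because the table shows rounded timings, the recomputed figures differ from the reported ones by a few hundredths of a percentage point.

```rust
/// Assumed formula (inferred from the table): BASE / HEAD - 1, in percent.
fn efficiency_pct(base_us: f64, head_us: f64) -> f64 {
    (base_us / head_us - 1.0) * 100.0
}

fn main() {
    // Rounded timings and reported percentages copied from the table.
    let rows = [
        ("chunked_bool_canonical_into[(1000, 10)]", 20.7, 36.3, -43.08),
        ("compare[48]", 213.0, 300.4, -29.11),
        ("compare[33]", 190.6, 250.1, -23.78),
    ];
    for (name, base, head, reported) in rows {
        let computed = efficiency_pct(base, head);
        println!("{name}: computed {computed:.2}%, reported {reported:.2}%");
        // Agreement is only approximate because the inputs are rounded.
        assert!((computed - reported).abs() < 0.2);
    }
}
```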
Comparing spiceai:lukim/exec-kernel-snapshot-develop (2820b17) with develop (d0013ff)
Footnotes
- 10 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, archive them to remove them from the performance reports.
Summary
Resolves `ArrayKernels` once at `ExecutionCtx` construction into a crate-private `KernelSnapshot` (a single `ArcSwapMap::load_full` of the execute-parent kernel map), instead of cloning the session and probing the sharded session-variable `DashMap` on every array node in the `execute_until` / single-step execution paths.

Why
Removes the `DashMap` shard `RwLock` probe from the hot execution loop.

Semantics
The registry is session-scoped and mutable via its public `register_*` methods. An `ExecutionCtx` sees a point-in-time snapshot taken at construction; later registrations are picked up by the next context (contexts are created per evaluation). Kernel lookup order is unchanged: registered plugin kernels are tried before static `execute_parent` kernels, with the same `(parent, child)` hashing.

Adds a `pub(crate)` `ArcSwapMap::load_full` accessor (a snapshot that outlives the call, complementing `read`). `KernelSnapshot` and `ArrayKernels::snapshot()` are `pub(crate)`, so no new public API is added.

Testing
- `cargo nextest run -p vortex-array`: 2893 passed (includes `struct_cast_execute_parent_uses_session_plugin`, covering the register → snapshot → execute path).
- `cargo clippy --all-targets -p vortex-array`: clean on the spiceai-53 variant of this change; develop port re-verified via the full test build.
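
The point-in-time snapshot semantics and the plugin-before-static lookup order described in this PR can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in the change: `Registry`, `KernelMap`, `Kernel`, `static_kernel`, and `plugin_kernel` are hypothetical names, the `ExecutionCtx` here is a toy stand-in, and an `RwLock<Arc<_>>` approximates `ArcSwapMap` so that only the standard library is needed.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

// Hypothetical stand-ins for illustration; not the vortex-array types.
type Kernel = fn(i64) -> i64;
type KernelMap = HashMap<(&'static str, &'static str), Kernel>;

/// Session-scoped registry. RwLock<Arc<_>> approximates the ArcSwapMap
/// named in the PR (an assumption made to avoid external crates).
struct Registry {
    plugins: RwLock<Arc<KernelMap>>,
}

impl Registry {
    fn new() -> Self {
        Self { plugins: RwLock::new(Arc::new(KernelMap::new())) }
    }

    /// Copy-on-write registration: existing snapshots keep the old map.
    fn register(&self, parent: &'static str, child: &'static str, kernel: Kernel) {
        let mut guard = self.plugins.write().unwrap();
        let mut next: KernelMap = (**guard).clone();
        next.insert((parent, child), kernel);
        *guard = Arc::new(next);
    }

    /// Analogue of a load_full: an owned Arc that outlives the lock guard.
    fn snapshot(&self) -> Arc<KernelMap> {
        let guard = self.plugins.read().unwrap();
        Arc::clone(&*guard)
    }
}

/// Toy context: resolves the kernel map once, at construction.
struct ExecutionCtx {
    kernels: Arc<KernelMap>,
}

impl ExecutionCtx {
    fn new(registry: &Registry) -> Self {
        Self { kernels: registry.snapshot() }
    }

    /// Registered plugin kernels are tried first, then the static fallback.
    fn execute(&self, parent: &'static str, child: &'static str, input: i64) -> i64 {
        match self.kernels.get(&(parent, child)) {
            Some(kernel) => kernel(input),
            None => static_kernel(input),
        }
    }
}

fn static_kernel(x: i64) -> i64 { x + 1 }
fn plugin_kernel(x: i64) -> i64 { x * 10 }

fn main() {
    let registry = Registry::new();
    let before = ExecutionCtx::new(&registry);
    registry.register("struct", "cast", plugin_kernel);
    let after = ExecutionCtx::new(&registry);

    // The earlier context never sees the later registration.
    assert_eq!(before.execute("struct", "cast", 5), 6);
    // The next context picks it up, and the plugin wins over the static kernel.
    assert_eq!(after.execute("struct", "cast", 5), 50);
}
```

In this sketch, per-node lookups touch only the `Arc` held by the context, so the shared lock is taken once per context rather than once per array node.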