Skip to content

fix(avm): avoid data race on shared interaction selector writes (port #24456 to v5-next)#24457

Merged
fcarreiro merged 1 commit into
v5-nextfrom
fc/avm-safer-interactions-v5
Jul 2, 2026
Merged

fix(avm): avoid data race on shared interaction selector writes (port #24456 to v5-next)#24457
fcarreiro merged 1 commit into
v5-nextfrom
fc/avm-safer-interactions-v5

Conversation

@fcarreiro

Copy link
Copy Markdown
Contributor

Ports #24456 to v5-next.

What was wrong

The interactions tracegen phase runs every lookup/permutation job concurrently: AvmTraceGenHelper::fill_trace_interactions concatenates all builders' jobs and dispatches them with parallel_for. Lookups whose fine-grained destination selector is a shared column (DST_SELECTOR != outer_dst_selector) can resolve to the same dst_row from different jobs, so multiple threads write the same (selector, row) cell at once.

The previous code did a guarded read-modify-write on that shared cell:

if (DST_SELECTOR != outer_dst_selector && trace.get(DST_SELECTOR, dst_row) != 1) {
    trace.set(DST_SELECTOR, dst_row, 1);
}

Both the get and the non-atomic 32-byte set race against the other threads' writes to the same cell — a data race, i.e. undefined behavior. In practice it produced the right value (every writer stores 1), but it is still UB.

The fix

Add an opt-in use_atomic_limbs flag to TraceContainer::set. When set, the field's four 64-bit limbs are written with relaxed atomic stores — four plain movq on x86-64, no lock and no libatomic call. (A whole-field std::atomic_ref<FF> is not an option on the hot path: at 32 bytes it exceeds the hardware lock-free width and falls back to a locked libatomic call.)

The shared selector write now uses it and drops the guard read:

if (DST_SELECTOR != outer_dst_selector) {
    trace.set(DST_SELECTOR, dst_row, 1, /*use_atomic_limbs=*/true);
}

The write is an unconditional, idempotent atomic store of 1. Because every concurrent writer stores the same value, per-limb atomicity is sufficient — no torn value is possible — so this is data-race-free without the cost of whole-field atomicity. The default (non-atomic) set hot path is untouched and keeps its sparse-column "zero = absent" fast path.

Performance (this PR vs baseline)

Mega bulk AVM tx, full proving, HARDWARE_CONCURRENCY=16. Tracegen stage timings, median of runs (baseline n=3, this PR n=6):

Stage baseline this PR
tracegen traces 1,536 ms 1,537 ms
tracegen interactions 350 ms 329 ms
tracegen all 1,917 ms 1,936 ms

No measurable cost — the safety fix is free. The per-limb atomic only fires on the shared selector writes in the interactions stage; the traces stage is byte-for-byte the same hot path as baseline.

The interactions tracegen phase runs every lookup/permutation job via parallel_for (fill_trace_interactions). Lookups that share a destination-selector column can write the same (selector, row) cell from multiple threads, so the previous guarded read-modify-write (get(...) != 1 then set(...)) on that shared cell is a data race (undefined behavior).

Add an opt-in use_atomic_limbs flag to TraceContainer::set that stores the field's four limbs with relaxed atomic writes (4 plain movq on x86-64, no lock -- unlike a whole-field std::atomic_ref<FF>, whose 32 bytes exceed the lock-free width). Use it for the shared selector write and drop the guard read: the write is now an unconditional idempotent atomic store of 1. Because every writer stores the same value, per-limb atomicity is sufficient (no torn value is possible), and the hot non-atomic set() path is unchanged.
@fcarreiro fcarreiro added this pull request to the merge queue Jul 2, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to no response for status checks Jul 2, 2026
@fcarreiro fcarreiro added this pull request to the merge queue Jul 2, 2026
@AztecBot

AztecBot commented Jul 2, 2026

Copy link
Copy Markdown
Collaborator

Flakey Tests

🤖 says: This CI run detected 1 tests that failed, but were tolerated due to a .test_patterns.yml entry.

\033FLAKED\033 (8;;http://ci.aztec-labs.com/812cb13379eef261�812cb13379eef2618;;�):  yarn-project/end-to-end/scripts/run_test.sh simple src/single-node/proving/proof_fails.parallel.test.ts "does not allow submitting proof after epoch end" (117s) (code: 0) group:e2e-p2p-epoch-flakes

Merged via the queue into v5-next with commit 413a8d2 Jul 2, 2026
14 checks passed
@fcarreiro fcarreiro deleted the fc/avm-safer-interactions-v5 branch July 2, 2026 17:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants