Fix rpc hash race condition#3693
Conversation
* main: Add chain_id label to OTel metrics (#3692) Close temporary rootmulti store in channel types setup (#3690) Update changelog in prep to cut 6.6 RC1 (#3685) Pin priority-fee assertions to stable heads (#3680) Retry npm install for disable wasm integration test (#3674) State Store: Compact pruned key range after each prune (#3675)
PR SummaryHigh Risk Overview Locking change: Docs: Comments on Tests: New Reviewed by Cursor Bugbot for commit a04e4d4. Bugbot is set up for automated code reviews on this repo. Configure here. |
|
The latest Buf updates on your PR. Results from workflow Buf / buf (pull_request).
|
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #3693 +/- ##
==========================================
- Coverage 59.30% 58.62% -0.68%
==========================================
Files 2273 2204 -69
Lines 188305 181378 -6927
==========================================
- Hits 111675 106335 -5340
+ Misses 66579 65566 -1013
+ Partials 10051 9477 -574
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
A well-scoped, correct concurrency fix: it promotes the hash-mutating memIAVL operations (RootHash, GetProof, and the proof builders) from RLock to the write lock, since Node.Hash() lazily populates MemNode.hash in place, and adds *NoLock helpers to avoid recursive locking. Locking/ordering is verified safe and a race-reproducing test is included; only minor non-blocking notes remain.
Findings: 0 blocking | 3 non-blocking | 0 posted inline
Blockers
- None at the file/PR level.
Non-blocking
- VerifyMembership/VerifyNonMembership (proof.go:52-56, 109-111) still acquire the lock twice — Get() under RLock then RootHash() under Lock — so the returned value and root are not read atomically and a concurrent commit between the two can produce an inconsistent (val, root) pair. This is pre-existing (not introduced here) and these helpers appear to be test/verification-only, so it's not blocking, but worth a follow-up or a note.
- Codex second-opinion pass reported "No material findings" (and noted it could not run the race test because the sandbox only had Go 1.24.13, not the required 1.25.6). The Cursor second-opinion file was empty — no output was produced by that pass.
- Optional: consider a brief comment noting that this write-lock serialization protects same-tree concurrency; trees produced via Tree.Copy() get their own mutex, so cross-copy hash consistency relies on hashes already being populated (via SaveVersion(true)) and on cowVersion cloning before structural mutation. Not a defect given current usage, but documenting the assumption would help future callers.
There was a problem hiding this comment.
A correct, tightly-scoped locking fix that serializes the in-place MemNode.hash mutations in RootHash/GetProof/GetMembershipProof/GetNonMembershipProof under the tree's write lock, closing the AppHash-divergence data race (Immunefi 83246 / STO-601). No blocking issues found; only minor notes.
Findings: 0 blocking | 5 non-blocking | 1 posted inline
Blockers
- None at the file/PR level.
Non-blocking
- The
cursor-review.mdsecond-opinion pass produced no output (empty file), so no Cursor findings were available to merge. - The
codex-review.mdpass reported no material findings but explicitly could not run the targeted-racetests (Go 1.25.6 toolchain download blocked by the sandbox network policy), so its clean result is based on static reading only, not test execution. - The fix only serializes hash mutations on a single
*Treeinstance (per-tree mtx). The author correctly documents in themtxcomment that cross-Copy()consistency relies on the pre-existing invariants (shared nodes fully hashed at commit before a copy reads them, and cowVersion cloning before structural mutation). This reasoning holds for the current code but is an unenforced invariant — worth a follow-up assertion/test if copy-based latest-height proof serving is ever added. - Performance tradeoff is real but acknowledged: on RPC nodes under heavy
prove=trueload, a commit'sRootHashnow serializes per-store against concurrent proof builds. Bounded (O(tree height)) and per-store, so acceptable as the cost of correctness. - 1 suggestion(s)/nit(s) flagged inline on specific lines.
| // TestTreeConcurrentRootHash covers concurrent RootHash() calls over a tree | ||
| // whose freshly-inserted nodes still have an empty hash cache, so every caller | ||
| // races to populate MemNode.hash. The digest is deterministic, so all callers | ||
| // must agree and (under -race) must not race. |
There was a problem hiding this comment.
[nit] nit: TestTreeConcurrentRootHash derives its value almost entirely from go test -race detecting the concurrent lazy-fill of MemNode.hash. The require.Equal assertion on the resulting hashes would pass even with the buggy RLock version most of the time (the race is on the cache write, not necessarily the final digest). Consider a comment noting the test is only meaningful under -race, to prevent it being mistaken for a functional-correctness assertion.
There was a problem hiding this comment.
A correct, well-documented locking-only fix that closes a real MemNode.hash data race (Immunefi 83246 / STO-601) by moving RootHash/GetProof/proof-builders to the write lock with non-reentrant *NoLock helpers; no reentrancy deadlocks, and tests reproduce the race and pass under -race. The only open items are a documented residual cross-Copy limitation (raised by Codex) and a bounded per-store serialization cost for proof queries.
Findings: 0 blocking | 3 non-blocking | 0 posted inline
Blockers
- None at the file/PR level.
Non-blocking
- Codex flagged that the per-tree write lock does not synchronize a tree against its Copy() (each gets a fresh mutex over a shared MemNode graph). This is a valid observation but appears non-reachable in practice: DB.Copy() and DB.Commit() serialize on db.mtx, Commit ends with SaveVersion(true) (full hashing), and COW (cowVersion) makes the live tree clone-before-mutate and only hash newly-created private nodes — so a Copy always observes fully-hashed shared subtrees and Hash() returns early. The author documented this invariant in the mtx comment. Not blocking, but worth a follow-up assertion/test to guard the 'fully hashed before Copy' invariant.
- Performance tradeoff (acknowledged in the PR description): GetProof now holds the write lock for the whole proof build, so concurrent prove=true proof queries on the same store fully serialize. Acceptable for correctness; a future optimization could prehash so proofs can run under RLock again.
- Second-opinion pass: the Cursor review file (cursor-review.md) was empty — no findings from that tool. REVIEW_GUIDELINES.md was also empty, so no repo-specific standards were applied.
| // mtx guards concurrent access to this tree's mutable state (root, version, | ||
| // cowVersion, snapshot) AND the lazily-populated MemNode.hash caches reachable | ||
| // from root. Operations that fill those caches in place — RootHash and the | ||
| // proof builders (GetProof/GetMembership/GetNonMembership) — take the write | ||
| // lock; pure reads (Get/Has/Iterator) take the read lock. This serialization | ||
| // only protects a single tree instance: a tree produced by Copy() gets its | ||
| // own mtx and shares the underlying nodes copy-on-write. Cross-copy hash | ||
| // consistency therefore relies on (a) the shared nodes already being fully | ||
| // hashed before the copy is used for hashing/proofs (Copy is taken between | ||
| // commits, and the commit path hashes via SaveVersion(true)/RootHash), and | ||
| // (b) cowVersion cloning any shared MemNode before it is structurally mutated, | ||
| // so a live-tree write never mutates a node another copy is still reading. |
There was a problem hiding this comment.
🟡 [nit] The new mtx docstring (tree.go:33-44) claims mtx guards root, version, cowVersion, snapshot in addition to the MemNode.hash caches, but the surrounding code doesn't consistently enforce that on the peripheral fields — most notably Copy() (line 119) writes t.cowVersion = t.version while holding only RLock, which is exactly the write-under-RLock anti-pattern the rest of this PR is closing for MemNode.hash. Version(), SaveVersion(), SetInitialVersion(), and SetZeroCopy() also read/write the enumerated fields with no lock. These are all pre-existing and not worth expanding the PR to fix, but since the new docstring will be treated as authoritative by future maintainers, worth tightening the wording to just what the write-lock ops actually cover (root + MemNode.hash caches) and calling out the known gaps on version/cowVersion.
Extended reasoning...
What the docstring claims vs what the code does
The PR replaces the old one-line comment on mtx with a much more detailed contract (tree.go:33-44) that reads, in part:
mtx guards concurrent access to this tree's mutable state (root, version, cowVersion, snapshot) AND the lazily-populated MemNode.hash caches reachable from root.
The load-bearing half of that claim — that hash-mutating operations (RootHash, GetProof, GetMembership/NonMembership) now take the write lock, and pure reads keep the read lock — is exactly what the PR delivers and is accurate. This comment is only about the peripheral field enumeration.
The Copy() write-under-RLock
Copy() (tree.go:114-127) takes RLock() and at line 119 writes t.cowVersion = t.version. Under Go's memory model this is a data race even though both concurrent callers would write the same value — the write is unsynchronized between the RLock holders, and -race would flag it. This is the same class of anti-pattern the PR is closing for MemNode.hash, applied to a field the new docstring newly asserts mtx guards.
Step-by-step: goroutine A calls Copy(), takes RLock, is scheduled out just before line 119. Goroutine B calls Copy(), takes RLock (RLocks compose), executes t.cowVersion = t.version at line 119. Goroutine A resumes and also writes t.cowVersion = t.version. Two writers, no synchronizing acquire/release between them → data race. In practice the effect is benign (idempotent value), but the docstring newly labels this exact pattern as forbidden.
Other unlocked accesses to enumerated fields
Version()(line 200-202) readst.versionwith no lock.SaveVersion()(line 185-197) readst.versionat 186 and writes it at 195; the interveningRootHash()acquires+releases the write lock, but the surrounding reads/writes are unlocked.SetInitialVersion()(line 107) writest.initialVersionwith no lock (not listed in the docstring, so weaker).SetZeroCopy()(line 96) writest.zeroCopywith no lock (also not listed).
Addressing the refutation
The refuter is right that these are pre-existing, that in practice SaveVersion is single-writer on the consensus path, and that the Copy() cowVersion write happens to be idempotent. This is why the severity here is nit, not normal — I'm not asking the author to scope-creep tighter locking onto version/cowVersion for a race-fix hotfix. But the refuter also says the enumeration is "describing the intended contract in a comment ABOVE the mutex declaration; it is not a precondition/postcondition on any function" — that's precisely why it's worth flagging. A future maintainer reading "mtx guards … cowVersion" will reasonably assume the existing code enforces that, and won't notice they need to take the write lock when adding a new mutator, or that Copy's RLock is already violating it.
Suggested fix
Tightening the docstring is a one-line change and matches what the PR actually establishes. Something like:
mtx guards concurrent access to this tree's root pointer and the lazily-populated MemNode.hash caches reachable from root. (Note: version/cowVersion/snapshot are also mutated on this tree but are not currently protected by mtx — Copy() writes cowVersion under RLock, and Version()/SaveVersion() read/write version unlocked. These are pre-existing gaps out of scope for this fix.)
Alternatively, promote Copy()'s RLock to Lock() as a small in-scope tweak, since that specific write-under-RLock is the same anti-pattern the PR is closing elsewhere.
Fix
Serialize the hash-mutating operations on each memIAVL tree with the tree's
write lock, and add no-lock internal helpers so the public entry points
don't re-acquire the lock recursively.
sei-db/state_db/sc/memiavl/tree.goRootHash()now takest.mtx.Lock()(it mutatesMemNode.hash), delegating to a newrootHashNoLock().GetProof()now takest.mtx.Lock()for the whole proof build, delegating togetProofNoLock().getWithIndexNoLock(),getByIndexNoLock(),hasNoLock().sei-db/state_db/sc/memiavl/proof.goGetMembershipProof()/GetNonMembershipProof()now take the write lock and delegate togetMembershipProofNoLock()/getNonMembershipProofNoLock().createExistenceProof()documented as "caller must hold the write lock".The read-only fast paths (
Get,GetWithIndex,GetByIndex,Has,Iterator)are unchanged and still use
RLock. Locking is per-tree (per-store).Lock-ordering safety
db.mtxthen per-treet.mtx; the query path acquiresonly
t.mtx. No inversion, no new deadlock.t.mtx(verified); internal callers use the*NoLockhelpers.Performance impact
Lock()coststhe same as the previous
RLock()(a single atomic op).RootHashruns~once/twice per store per block → single-digit microseconds per block. No
measurable impact on block/consensus/tx-execution time.
RLock,unchanged);
GetProofis not on the execution path.prove=true: a commit'sRootHashand a proof buildnow serialize per store. This is bounded (proof build is O(tree height))
and per-store (committing store A never blocks proofs on store B). It is the
necessary cost of correctness — the previous "cheaper" path was the bug.
Testing
Added
sei-db/state_db/sc/memiavl/tree_race_test.go:TestTreeProofRaceWithCommit— commit (Set+RootHash) concurrent withmembership/non-membership
GetProofandRootHash.TestTreeConcurrentRootHash— concurrentRootHashover a tree with unfilledhash caches; asserts all callers agree.
Verification:
to
RLockmakesgo test -racereportDATA RACEonMemNode.hash.go test -race ./sei-db/state_db/sc/memiavl/...passes clean(0 data races), as do the
commitmentandrootmultisuites.Risk & rollout
prevents divergence rather than changing committed state.