Fix rpc hash race condition by yzang2019 · Pull Request #3693 · sei-protocol/sei-chain

yzang2019 · 2026-07-02T01:18:00Z

Fix

Serialize the hash-mutating operations on each memIAVL tree with the tree's
write lock, and add no-lock internal helpers so the public entry points
don't re-acquire the lock recursively.
sei-db/state_db/sc/memiavl/tree.go

RootHash() now takes t.mtx.Lock() (it mutates MemNode.hash), delegating to a new rootHashNoLock().
GetProof() now takes t.mtx.Lock() for the whole proof build, delegating to getProofNoLock().
Added no-lock read helpers used by the proof builders: getWithIndexNoLock(), getByIndexNoLock(), hasNoLock().
sei-db/state_db/sc/memiavl/proof.go
GetMembershipProof() / GetNonMembershipProof() now take the write lock and delegate to getMembershipProofNoLock() / getNonMembershipProofNoLock().
createExistenceProof() documented as "caller must hold the write lock".
The read-only fast paths (Get, GetWithIndex, GetByIndex, Has, Iterator)
are unchanged and still use RLock. Locking is per-tree (per-store).

Lock-ordering safety

Commit path acquires db.mtx then per-tree t.mtx; the query path acquires
only t.mtx. No inversion, no new deadlock.
No in-package caller invokes the now-locking methods while already holding
t.mtx (verified); internal callers use the *NoLock helpers.

Performance impact

Validators (no queries): the write lock is uncontended, so Lock() costs
the same as the previous RLock() (a single atomic op). RootHash runs
~once/twice per store per block → single-digit microseconds per block. No
measurable impact on block/consensus/tx-execution time.
Tx execution: unaffected — reads go through cache stores (RLock,
unchanged); GetProof is not on the execution path.
RPC nodes under heavy prove=true: a commit's RootHash and a proof build
now serialize per store. This is bounded (proof build is O(tree height))
and per-store (committing store A never blocks proofs on store B). It is the
necessary cost of correctness — the previous "cheaper" path was the bug.

Testing

Added sei-db/state_db/sc/memiavl/tree_race_test.go:

TestTreeProofRaceWithCommit — commit (Set + RootHash) concurrent with
membership/non-membership GetProof and RootHash.
TestTreeConcurrentRootHash — concurrent RootHash over a tree with unfilled
hash caches; asserts all callers agree.
Verification:
The tests were confirmed to reproduce the bug: reverting the two locks back
to RLock makes go test -race report DATA RACE on MemNode.hash.
With the fix, go test -race ./sei-db/state_db/sc/memiavl/... passes clean
(0 data races), as do the commitment and rootmulti suites.

Risk & rollout

Read-path behavior and hash outputs are unchanged; this is a locking-only fix.
Safe to roll out without coordination (not AppHash-format changing). It
prevents divergence rather than changing committed state.

* main: Add chain_id label to OTel metrics (#3692) Close temporary rootmulti store in channel types setup (#3690) Update changelog in prep to cut 6.6 RC1 (#3685) Pin priority-fee assertions to stable heads (#3680) Retry npm install for disable wasm integration test (#3674) State Store: Compact pruned key range after each prune (#3675)

cursor · 2026-07-02T01:18:45Z

PR Summary

High Risk
Touches consensus-critical AppHash computation and live proof RPC paths; incorrect locking could affect correctness or query latency under load, though the change is locking-only with unchanged hash semantics.

Overview
Fixes a concurrency bug in memIAVL where RootHash, GetProof, and membership/non-membership proof builders could run under a read lock while MemNode.Hash() writes lazy hash caches in place. That race (Immunefi 83246 / STO-601) could corrupt internal node hashes and diverge AppHash when commits overlap with prove=true queries.

Locking change: RootHash, GetProof, GetMembershipProof, and GetNonMembershipProof now take t.mtx write locks for the full operation. Pure reads (Get, Has, iterators) stay on RLock. Internal *NoLock helpers (rootHashNoLock, getProofNoLock, getWithIndexNoLock, etc.) avoid recursive lock acquisition when proof paths already hold the write lock.

Docs: Comments on mtx, proof builders, and test-only Verify* helpers clarify write-lock requirements and that verify helpers are not atomic across commits.

Tests: New tree_race_test.go stress-tests concurrent commit + GetProof/RootHash and concurrent RootHash on unfilled hash caches, intended to pass under go test -race.

^{Reviewed by Cursor Bugbot for commit a04e4d4. Bugbot is set up for automated code reviews on this repo. Configure here.}

github-actions · 2026-07-02T01:20:01Z

The latest Buf updates on your PR. Results from workflow Buf / buf (pull_request).

Build	Format	Lint	Breaking	Updated (UTC)
`✅ passed`	`✅ passed`	`✅ passed`	`✅ passed`	Jul 2, 2026, 5:13 AM

codecov · 2026-07-02T01:22:28Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 58.62%. Comparing base (9f53204) to head (a04e4d4).

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3693      +/-   ##
==========================================
- Coverage   59.30%   58.62%   -0.68%     
==========================================
  Files        2273     2204      -69     
  Lines      188305   181378    -6927     
==========================================
- Hits       111675   106335    -5340     
+ Misses      66579    65566    -1013     
+ Partials    10051     9477     -574

Flag	Coverage Δ
sei-db	`70.41% <ø> (ø)`
sei-db-state-db	`?`
sei-db-state-db-pr	`75.36% <100.00%> (?)`

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines	Coverage Δ
sei-db/state_db/sc/memiavl/proof.go	`90.90% <100.00%> (+0.64%)`	⬆️
sei-db/state_db/sc/memiavl/tree.go	`87.98% <100.00%> (+0.80%)`	⬆️

... and 101 files with indirect coverage changes

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

seidroid

A well-scoped, correct concurrency fix: it promotes the hash-mutating memIAVL operations (RootHash, GetProof, and the proof builders) from RLock to the write lock, since Node.Hash() lazily populates MemNode.hash in place, and adds *NoLock helpers to avoid recursive locking. Locking/ordering is verified safe and a race-reproducing test is included; only minor non-blocking notes remain.

Findings: 0 blocking | 3 non-blocking | 0 posted inline

Blockers

None at the file/PR level.

Non-blocking

VerifyMembership/VerifyNonMembership (proof.go:52-56, 109-111) still acquire the lock twice — Get() under RLock then RootHash() under Lock — so the returned value and root are not read atomically and a concurrent commit between the two can produce an inconsistent (val, root) pair. This is pre-existing (not introduced here) and these helpers appear to be test/verification-only, so it's not blocking, but worth a follow-up or a note.
Codex second-opinion pass reported "No material findings" (and noted it could not run the race test because the sandbox only had Go 1.24.13, not the required 1.25.6). The Cursor second-opinion file was empty — no output was produced by that pass.
Optional: consider a brief comment noting that this write-lock serialization protects same-tree concurrency; trees produced via Tree.Copy() get their own mutex, so cross-copy hash consistency relies on hashes already being populated (via SaveVersion(true)) and on cowVersion cloning before structural mutation. Not a defect given current usage, but documenting the assumption would help future callers.

seidroid

A correct, tightly-scoped locking fix that serializes the in-place MemNode.hash mutations in RootHash/GetProof/GetMembershipProof/GetNonMembershipProof under the tree's write lock, closing the AppHash-divergence data race (Immunefi 83246 / STO-601). No blocking issues found; only minor notes.

Findings: 0 blocking | 5 non-blocking | 1 posted inline

Blockers

None at the file/PR level.

Non-blocking

The cursor-review.md second-opinion pass produced no output (empty file), so no Cursor findings were available to merge.
The codex-review.md pass reported no material findings but explicitly could not run the targeted -race tests (Go 1.25.6 toolchain download blocked by the sandbox network policy), so its clean result is based on static reading only, not test execution.
The fix only serializes hash mutations on a single *Tree instance (per-tree mtx). The author correctly documents in the mtx comment that cross-Copy() consistency relies on the pre-existing invariants (shared nodes fully hashed at commit before a copy reads them, and cowVersion cloning before structural mutation). This reasoning holds for the current code but is an unenforced invariant — worth a follow-up assertion/test if copy-based latest-height proof serving is ever added.
Performance tradeoff is real but acknowledged: on RPC nodes under heavy prove=true load, a commit's RootHash now serializes per-store against concurrent proof builds. Bounded (O(tree height)) and per-store, so acceptable as the cost of correctness.
1 suggestion(s)/nit(s) flagged inline on specific lines.

seidroid · 2026-07-02T01:36:32Z

+// TestTreeConcurrentRootHash covers concurrent RootHash() calls over a tree
+// whose freshly-inserted nodes still have an empty hash cache, so every caller
+// races to populate MemNode.hash. The digest is deterministic, so all callers
+// must agree and (under -race) must not race.


[nit] nit: TestTreeConcurrentRootHash derives its value almost entirely from go test -race detecting the concurrent lazy-fill of MemNode.hash. The require.Equal assertion on the resulting hashes would pass even with the buggy RLock version most of the time (the race is on the cache write, not necessarily the final digest). Consider a comment noting the test is only meaningful under -race, to prevent it being mistaken for a functional-correctness assertion.

seidroid

A correct, well-documented locking-only fix that closes a real MemNode.hash data race (Immunefi 83246 / STO-601) by moving RootHash/GetProof/proof-builders to the write lock with non-reentrant *NoLock helpers; no reentrancy deadlocks, and tests reproduce the race and pass under -race. The only open items are a documented residual cross-Copy limitation (raised by Codex) and a bounded per-store serialization cost for proof queries.

Findings: 0 blocking | 3 non-blocking | 0 posted inline

Blockers

None at the file/PR level.

Non-blocking

Codex flagged that the per-tree write lock does not synchronize a tree against its Copy() (each gets a fresh mutex over a shared MemNode graph). This is a valid observation but appears non-reachable in practice: DB.Copy() and DB.Commit() serialize on db.mtx, Commit ends with SaveVersion(true) (full hashing), and COW (cowVersion) makes the live tree clone-before-mutate and only hash newly-created private nodes — so a Copy always observes fully-hashed shared subtrees and Hash() returns early. The author documented this invariant in the mtx comment. Not blocking, but worth a follow-up assertion/test to guard the 'fully hashed before Copy' invariant.
Performance tradeoff (acknowledged in the PR description): GetProof now holds the write lock for the whole proof build, so concurrent prove=true proof queries on the same store fully serialize. Acceptable for correctness; a future optimization could prehash so proofs can run under RLock again.
Second-opinion pass: the Cursor review file (cursor-review.md) was empty — no findings from that tool. REVIEW_GUIDELINES.md was also empty, so no repo-specific standards were applied.

claude · 2026-07-02T05:29:45Z

+	// mtx guards concurrent access to this tree's mutable state (root, version,
+	// cowVersion, snapshot) AND the lazily-populated MemNode.hash caches reachable
+	// from root. Operations that fill those caches in place — RootHash and the
+	// proof builders (GetProof/GetMembership/GetNonMembership) — take the write
+	// lock; pure reads (Get/Has/Iterator) take the read lock. This serialization
+	// only protects a single tree instance: a tree produced by Copy() gets its
+	// own mtx and shares the underlying nodes copy-on-write. Cross-copy hash
+	// consistency therefore relies on (a) the shared nodes already being fully
+	// hashed before the copy is used for hashing/proofs (Copy is taken between
+	// commits, and the commit path hashes via SaveVersion(true)/RootHash), and
+	// (b) cowVersion cloning any shared MemNode before it is structurally mutated,
+	// so a live-tree write never mutates a node another copy is still reading.


🟡 [nit] The new mtx docstring (tree.go:33-44) claims mtx guards root, version, cowVersion, snapshot in addition to the MemNode.hash caches, but the surrounding code doesn't consistently enforce that on the peripheral fields — most notably Copy() (line 119) writes t.cowVersion = t.version while holding only RLock, which is exactly the write-under-RLock anti-pattern the rest of this PR is closing for MemNode.hash. Version(), SaveVersion(), SetInitialVersion(), and SetZeroCopy() also read/write the enumerated fields with no lock. These are all pre-existing and not worth expanding the PR to fix, but since the new docstring will be treated as authoritative by future maintainers, worth tightening the wording to just what the write-lock ops actually cover (root + MemNode.hash caches) and calling out the known gaps on version/cowVersion.

Extended reasoning...

What the docstring claims vs what the code does

The PR replaces the old one-line comment on mtx with a much more detailed contract (tree.go:33-44) that reads, in part:

mtx guards concurrent access to this tree's mutable state (root, version, cowVersion, snapshot) AND the lazily-populated MemNode.hash caches reachable from root.

The load-bearing half of that claim — that hash-mutating operations (RootHash, GetProof, GetMembership/NonMembership) now take the write lock, and pure reads keep the read lock — is exactly what the PR delivers and is accurate. This comment is only about the peripheral field enumeration.

The Copy() write-under-RLock

Copy() (tree.go:114-127) takes RLock() and at line 119 writes t.cowVersion = t.version. Under Go's memory model this is a data race even though both concurrent callers would write the same value — the write is unsynchronized between the RLock holders, and -race would flag it. This is the same class of anti-pattern the PR is closing for MemNode.hash, applied to a field the new docstring newly asserts mtx guards.

Step-by-step: goroutine A calls Copy(), takes RLock, is scheduled out just before line 119. Goroutine B calls Copy(), takes RLock (RLocks compose), executes t.cowVersion = t.version at line 119. Goroutine A resumes and also writes t.cowVersion = t.version. Two writers, no synchronizing acquire/release between them → data race. In practice the effect is benign (idempotent value), but the docstring newly labels this exact pattern as forbidden.

Other unlocked accesses to enumerated fields

Version() (line 200-202) reads t.version with no lock.

SaveVersion() (line 185-197) reads t.version at 186 and writes it at 195; the intervening RootHash() acquires+releases the write lock, but the surrounding reads/writes are unlocked.

SetInitialVersion() (line 107) writes t.initialVersion with no lock (not listed in the docstring, so weaker).

SetZeroCopy() (line 96) writes t.zeroCopy with no lock (also not listed).

Addressing the refutation

The refuter is right that these are pre-existing, that in practice SaveVersion is single-writer on the consensus path, and that the Copy() cowVersion write happens to be idempotent. This is why the severity here is nit, not normal — I'm not asking the author to scope-creep tighter locking onto version/cowVersion for a race-fix hotfix. But the refuter also says the enumeration is "describing the intended contract in a comment ABOVE the mutex declaration; it is not a precondition/postcondition on any function" — that's precisely why it's worth flagging. A future maintainer reading "mtx guards … cowVersion" will reasonably assume the existing code enforces that, and won't notice they need to take the write lock when adding a new mutator, or that Copy's RLock is already violating it.

Suggested fix

Tightening the docstring is a one-line change and matches what the PR actually establishes. Something like:

mtx guards concurrent access to this tree's root pointer and the lazily-populated MemNode.hash caches reachable from root. (Note: version/cowVersion/snapshot are also mutated on this tree but are not currently protected by mtx — Copy() writes cowVersion under RLock, and Version()/SaveVersion() read/write version unlocked. These are pre-existing gaps out of scope for this fix.)

Alternatively, promote Copy()'s RLock to Lock() as a small in-scope tweak, since that specific write-under-RLock is the same anti-pattern the PR is closing elsewhere.

yzang2019 added 2 commits July 1, 2026 07:58

Fix race condition for memIAVL hash update

c937056

yzang2019 requested review from Kbhat1, masih and sei-will July 2, 2026 01:18

yzang2019 added the non-app-hash-breaking label Jul 2, 2026

seidroid Bot approved these changes Jul 2, 2026

View reviewed changes

UPdate some comments

5d7fd54

seidroid Bot approved these changes Jul 2, 2026

View reviewed changes

Merge branch 'main' into yzang/fix-rpc-apphash

a04e4d4

seidroid Bot approved these changes Jul 2, 2026

View reviewed changes

claude Bot reviewed Jul 2, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix rpc hash race condition#3693

Fix rpc hash race condition#3693
yzang2019 wants to merge 4 commits into
mainfrom
yzang/fix-rpc-apphash

yzang2019 commented Jul 2, 2026

Uh oh!

cursor Bot commented Jul 2, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Jul 2, 2026 •

edited

Loading

Uh oh!

codecov Bot commented Jul 2, 2026 •

edited

Loading

Uh oh!

seidroid Bot left a comment

Uh oh!

seidroid Bot left a comment

Uh oh!

seidroid Bot Jul 2, 2026

Uh oh!

seidroid Bot left a comment

Uh oh!

claude Bot Jul 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

yzang2019 commented Jul 2, 2026

Fix

Lock-ordering safety

Performance impact

Testing

Risk & rollout

Uh oh!

cursor Bot commented Jul 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Summary

Uh oh!

github-actions Bot commented Jul 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov Bot commented Jul 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

seidroid Bot left a comment

Choose a reason for hiding this comment

Blockers

Non-blocking

Uh oh!

seidroid Bot left a comment

Choose a reason for hiding this comment

Blockers

Non-blocking

Uh oh!

seidroid Bot Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

seidroid Bot left a comment

Choose a reason for hiding this comment

Blockers

Non-blocking

Uh oh!

claude Bot Jul 2, 2026

Choose a reason for hiding this comment

What the docstring claims vs what the code does

The Copy() write-under-RLock

Other unlocked accesses to enumerated fields

Addressing the refutation

Suggested fix

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

cursor Bot commented Jul 2, 2026 •

edited

Loading

github-actions Bot commented Jul 2, 2026 •

edited

Loading

codecov Bot commented Jul 2, 2026 •

edited

Loading