Skip to content

Fix rpc hash race condition#3693

Open
yzang2019 wants to merge 4 commits into
mainfrom
yzang/fix-rpc-apphash
Open

Fix rpc hash race condition#3693
yzang2019 wants to merge 4 commits into
mainfrom
yzang/fix-rpc-apphash

Conversation

@yzang2019

Copy link
Copy Markdown
Contributor

Fix

Serialize the hash-mutating operations on each memIAVL tree with the tree's
write lock, and add no-lock internal helpers so the public entry points
don't re-acquire the lock recursively.
sei-db/state_db/sc/memiavl/tree.go

  • RootHash() now takes t.mtx.Lock() (it mutates MemNode.hash), delegating to a new rootHashNoLock().
  • GetProof() now takes t.mtx.Lock() for the whole proof build, delegating to getProofNoLock().
  • Added no-lock read helpers used by the proof builders: getWithIndexNoLock(), getByIndexNoLock(), hasNoLock().
    sei-db/state_db/sc/memiavl/proof.go
  • GetMembershipProof() / GetNonMembershipProof() now take the write lock and delegate to getMembershipProofNoLock() / getNonMembershipProofNoLock().
  • createExistenceProof() documented as "caller must hold the write lock".
    The read-only fast paths (Get, GetWithIndex, GetByIndex, Has, Iterator)
    are unchanged and still use RLock. Locking is per-tree (per-store).

Lock-ordering safety

  • Commit path acquires db.mtx then per-tree t.mtx; the query path acquires
    only t.mtx. No inversion, no new deadlock.
  • No in-package caller invokes the now-locking methods while already holding
    t.mtx (verified); internal callers use the *NoLock helpers.

Performance impact

  • Validators (no queries): the write lock is uncontended, so Lock() costs
    the same as the previous RLock() (a single atomic op). RootHash runs
    ~once/twice per store per block → single-digit microseconds per block. No
    measurable impact on block/consensus/tx-execution time.
  • Tx execution: unaffected — reads go through cache stores (RLock,
    unchanged); GetProof is not on the execution path.
  • RPC nodes under heavy prove=true: a commit's RootHash and a proof build
    now serialize per store. This is bounded (proof build is O(tree height))
    and per-store (committing store A never blocks proofs on store B). It is the
    necessary cost of correctness — the previous "cheaper" path was the bug.

Testing

Added sei-db/state_db/sc/memiavl/tree_race_test.go:

  • TestTreeProofRaceWithCommit — commit (Set + RootHash) concurrent with
    membership/non-membership GetProof and RootHash.
  • TestTreeConcurrentRootHash — concurrent RootHash over a tree with unfilled
    hash caches; asserts all callers agree.
    Verification:
  • The tests were confirmed to reproduce the bug: reverting the two locks back
    to RLock makes go test -race report DATA RACE on MemNode.hash.
  • With the fix, go test -race ./sei-db/state_db/sc/memiavl/... passes clean
    (0 data races), as do the commitment and rootmulti suites.

Risk & rollout

  • Read-path behavior and hash outputs are unchanged; this is a locking-only fix.
  • Safe to roll out without coordination (not AppHash-format changing). It
    prevents divergence rather than changing committed state.

yzang2019 added 2 commits July 1, 2026 07:58
* main:
  Add chain_id label to OTel metrics (#3692)
  Close temporary rootmulti store in channel types setup (#3690)
  Update changelog in prep to cut 6.6 RC1 (#3685)
  Pin priority-fee assertions to stable heads (#3680)
  Retry npm install for disable wasm integration test (#3674)
  State Store: Compact pruned key range after each prune (#3675)
@cursor

cursor Bot commented Jul 2, 2026

Copy link
Copy Markdown

PR Summary

High Risk
Touches consensus-critical AppHash computation and live proof RPC paths; incorrect locking could affect correctness or query latency under load, though the change is locking-only with unchanged hash semantics.

Overview
Fixes a concurrency bug in memIAVL where RootHash, GetProof, and membership/non-membership proof builders could run under a read lock while MemNode.Hash() writes lazy hash caches in place. That race (Immunefi 83246 / STO-601) could corrupt internal node hashes and diverge AppHash when commits overlap with prove=true queries.

Locking change: RootHash, GetProof, GetMembershipProof, and GetNonMembershipProof now take t.mtx write locks for the full operation. Pure reads (Get, Has, iterators) stay on RLock. Internal *NoLock helpers (rootHashNoLock, getProofNoLock, getWithIndexNoLock, etc.) avoid recursive lock acquisition when proof paths already hold the write lock.

Docs: Comments on mtx, proof builders, and test-only Verify* helpers clarify write-lock requirements and that verify helpers are not atomic across commits.

Tests: New tree_race_test.go stress-tests concurrent commit + GetProof/RootHash and concurrent RootHash on unfilled hash caches, intended to pass under go test -race.

Reviewed by Cursor Bugbot for commit a04e4d4. Bugbot is set up for automated code reviews on this repo. Configure here.

@github-actions

github-actions Bot commented Jul 2, 2026

Copy link
Copy Markdown

The latest Buf updates on your PR. Results from workflow Buf / buf (pull_request).

BuildFormatLintBreakingUpdated (UTC)
✅ passed✅ passed✅ passed✅ passedJul 2, 2026, 5:13 AM

@codecov

codecov Bot commented Jul 2, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 58.62%. Comparing base (9f53204) to head (a04e4d4).

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #3693      +/-   ##
==========================================
- Coverage   59.30%   58.62%   -0.68%     
==========================================
  Files        2273     2204      -69     
  Lines      188305   181378    -6927     
==========================================
- Hits       111675   106335    -5340     
+ Misses      66579    65566    -1013     
+ Partials    10051     9477     -574     
Flag Coverage Δ
sei-db 70.41% <ø> (ø)
sei-db-state-db ?
sei-db-state-db-pr 75.36% <100.00%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
sei-db/state_db/sc/memiavl/proof.go 90.90% <100.00%> (+0.64%) ⬆️
sei-db/state_db/sc/memiavl/tree.go 87.98% <100.00%> (+0.80%) ⬆️

... and 101 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@seidroid seidroid Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A well-scoped, correct concurrency fix: it promotes the hash-mutating memIAVL operations (RootHash, GetProof, and the proof builders) from RLock to the write lock, since Node.Hash() lazily populates MemNode.hash in place, and adds *NoLock helpers to avoid recursive locking. Locking/ordering is verified safe and a race-reproducing test is included; only minor non-blocking notes remain.

Findings: 0 blocking | 3 non-blocking | 0 posted inline

Blockers

  • None at the file/PR level.

Non-blocking

  • VerifyMembership/VerifyNonMembership (proof.go:52-56, 109-111) still acquire the lock twice — Get() under RLock then RootHash() under Lock — so the returned value and root are not read atomically and a concurrent commit between the two can produce an inconsistent (val, root) pair. This is pre-existing (not introduced here) and these helpers appear to be test/verification-only, so it's not blocking, but worth a follow-up or a note.
  • Codex second-opinion pass reported "No material findings" (and noted it could not run the race test because the sandbox only had Go 1.24.13, not the required 1.25.6). The Cursor second-opinion file was empty — no output was produced by that pass.
  • Optional: consider a brief comment noting that this write-lock serialization protects same-tree concurrency; trees produced via Tree.Copy() get their own mutex, so cross-copy hash consistency relies on hashes already being populated (via SaveVersion(true)) and on cowVersion cloning before structural mutation. Not a defect given current usage, but documenting the assumption would help future callers.

@seidroid seidroid Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A correct, tightly-scoped locking fix that serializes the in-place MemNode.hash mutations in RootHash/GetProof/GetMembershipProof/GetNonMembershipProof under the tree's write lock, closing the AppHash-divergence data race (Immunefi 83246 / STO-601). No blocking issues found; only minor notes.

Findings: 0 blocking | 5 non-blocking | 1 posted inline

Blockers

  • None at the file/PR level.

Non-blocking

  • The cursor-review.md second-opinion pass produced no output (empty file), so no Cursor findings were available to merge.
  • The codex-review.md pass reported no material findings but explicitly could not run the targeted -race tests (Go 1.25.6 toolchain download blocked by the sandbox network policy), so its clean result is based on static reading only, not test execution.
  • The fix only serializes hash mutations on a single *Tree instance (per-tree mtx). The author correctly documents in the mtx comment that cross-Copy() consistency relies on the pre-existing invariants (shared nodes fully hashed at commit before a copy reads them, and cowVersion cloning before structural mutation). This reasoning holds for the current code but is an unenforced invariant — worth a follow-up assertion/test if copy-based latest-height proof serving is ever added.
  • Performance tradeoff is real but acknowledged: on RPC nodes under heavy prove=true load, a commit's RootHash now serializes per-store against concurrent proof builds. Bounded (O(tree height)) and per-store, so acceptable as the cost of correctness.
  • 1 suggestion(s)/nit(s) flagged inline on specific lines.

// TestTreeConcurrentRootHash covers concurrent RootHash() calls over a tree
// whose freshly-inserted nodes still have an empty hash cache, so every caller
// races to populate MemNode.hash. The digest is deterministic, so all callers
// must agree and (under -race) must not race.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nit] nit: TestTreeConcurrentRootHash derives its value almost entirely from go test -race detecting the concurrent lazy-fill of MemNode.hash. The require.Equal assertion on the resulting hashes would pass even with the buggy RLock version most of the time (the race is on the cache write, not necessarily the final digest). Consider a comment noting the test is only meaningful under -race, to prevent it being mistaken for a functional-correctness assertion.

@seidroid seidroid Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A correct, well-documented locking-only fix that closes a real MemNode.hash data race (Immunefi 83246 / STO-601) by moving RootHash/GetProof/proof-builders to the write lock with non-reentrant *NoLock helpers; no reentrancy deadlocks, and tests reproduce the race and pass under -race. The only open items are a documented residual cross-Copy limitation (raised by Codex) and a bounded per-store serialization cost for proof queries.

Findings: 0 blocking | 3 non-blocking | 0 posted inline

Blockers

  • None at the file/PR level.

Non-blocking

  • Codex flagged that the per-tree write lock does not synchronize a tree against its Copy() (each gets a fresh mutex over a shared MemNode graph). This is a valid observation but appears non-reachable in practice: DB.Copy() and DB.Commit() serialize on db.mtx, Commit ends with SaveVersion(true) (full hashing), and COW (cowVersion) makes the live tree clone-before-mutate and only hash newly-created private nodes — so a Copy always observes fully-hashed shared subtrees and Hash() returns early. The author documented this invariant in the mtx comment. Not blocking, but worth a follow-up assertion/test to guard the 'fully hashed before Copy' invariant.
  • Performance tradeoff (acknowledged in the PR description): GetProof now holds the write lock for the whole proof build, so concurrent prove=true proof queries on the same store fully serialize. Acceptable for correctness; a future optimization could prehash so proofs can run under RLock again.
  • Second-opinion pass: the Cursor review file (cursor-review.md) was empty — no findings from that tool. REVIEW_GUIDELINES.md was also empty, so no repo-specific standards were applied.

Comment on lines +33 to +44
// mtx guards concurrent access to this tree's mutable state (root, version,
// cowVersion, snapshot) AND the lazily-populated MemNode.hash caches reachable
// from root. Operations that fill those caches in place — RootHash and the
// proof builders (GetProof/GetMembership/GetNonMembership) — take the write
// lock; pure reads (Get/Has/Iterator) take the read lock. This serialization
// only protects a single tree instance: a tree produced by Copy() gets its
// own mtx and shares the underlying nodes copy-on-write. Cross-copy hash
// consistency therefore relies on (a) the shared nodes already being fully
// hashed before the copy is used for hashing/proofs (Copy is taken between
// commits, and the commit path hashes via SaveVersion(true)/RootHash), and
// (b) cowVersion cloning any shared MemNode before it is structurally mutated,
// so a live-tree write never mutates a node another copy is still reading.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 [nit] The new mtx docstring (tree.go:33-44) claims mtx guards root, version, cowVersion, snapshot in addition to the MemNode.hash caches, but the surrounding code doesn't consistently enforce that on the peripheral fields — most notably Copy() (line 119) writes t.cowVersion = t.version while holding only RLock, which is exactly the write-under-RLock anti-pattern the rest of this PR is closing for MemNode.hash. Version(), SaveVersion(), SetInitialVersion(), and SetZeroCopy() also read/write the enumerated fields with no lock. These are all pre-existing and not worth expanding the PR to fix, but since the new docstring will be treated as authoritative by future maintainers, worth tightening the wording to just what the write-lock ops actually cover (root + MemNode.hash caches) and calling out the known gaps on version/cowVersion.

Extended reasoning...

What the docstring claims vs what the code does

The PR replaces the old one-line comment on mtx with a much more detailed contract (tree.go:33-44) that reads, in part:

mtx guards concurrent access to this tree's mutable state (root, version, cowVersion, snapshot) AND the lazily-populated MemNode.hash caches reachable from root.

The load-bearing half of that claim — that hash-mutating operations (RootHash, GetProof, GetMembership/NonMembership) now take the write lock, and pure reads keep the read lock — is exactly what the PR delivers and is accurate. This comment is only about the peripheral field enumeration.

The Copy() write-under-RLock

Copy() (tree.go:114-127) takes RLock() and at line 119 writes t.cowVersion = t.version. Under Go's memory model this is a data race even though both concurrent callers would write the same value — the write is unsynchronized between the RLock holders, and -race would flag it. This is the same class of anti-pattern the PR is closing for MemNode.hash, applied to a field the new docstring newly asserts mtx guards.

Step-by-step: goroutine A calls Copy(), takes RLock, is scheduled out just before line 119. Goroutine B calls Copy(), takes RLock (RLocks compose), executes t.cowVersion = t.version at line 119. Goroutine A resumes and also writes t.cowVersion = t.version. Two writers, no synchronizing acquire/release between them → data race. In practice the effect is benign (idempotent value), but the docstring newly labels this exact pattern as forbidden.

Other unlocked accesses to enumerated fields

  • Version() (line 200-202) reads t.version with no lock.
  • SaveVersion() (line 185-197) reads t.version at 186 and writes it at 195; the intervening RootHash() acquires+releases the write lock, but the surrounding reads/writes are unlocked.
  • SetInitialVersion() (line 107) writes t.initialVersion with no lock (not listed in the docstring, so weaker).
  • SetZeroCopy() (line 96) writes t.zeroCopy with no lock (also not listed).

Addressing the refutation

The refuter is right that these are pre-existing, that in practice SaveVersion is single-writer on the consensus path, and that the Copy() cowVersion write happens to be idempotent. This is why the severity here is nit, not normal — I'm not asking the author to scope-creep tighter locking onto version/cowVersion for a race-fix hotfix. But the refuter also says the enumeration is "describing the intended contract in a comment ABOVE the mutex declaration; it is not a precondition/postcondition on any function" — that's precisely why it's worth flagging. A future maintainer reading "mtx guards … cowVersion" will reasonably assume the existing code enforces that, and won't notice they need to take the write lock when adding a new mutator, or that Copy's RLock is already violating it.

Suggested fix

Tightening the docstring is a one-line change and matches what the PR actually establishes. Something like:

mtx guards concurrent access to this tree's root pointer and the lazily-populated MemNode.hash caches reachable from root. (Note: version/cowVersion/snapshot are also mutated on this tree but are not currently protected by mtx — Copy() writes cowVersion under RLock, and Version()/SaveVersion() read/write version unlocked. These are pre-existing gaps out of scope for this fix.)

Alternatively, promote Copy()'s RLock to Lock() as a small in-scope tweak, since that specific write-under-RLock is the same anti-pattern the PR is closing elsewhere.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant