fix(prover-node): rebuild pruned checkpoint provers and recover the epoch#24436
fix(prover-node): rebuild pruned checkpoint provers and recover the epoch#24436PhilWindle wants to merge 1 commit into
Conversation
9455223 to
c98a7bc
Compare
…poch A CheckpointProver's sub-tree work forks world-state per block. When an L1 reorg prunes a base block, those fork reads fault and the prover permanently rejects its one-shot blockProofs promise. The store previously kept the prover alive (markPruned) so a re-add of identical content could reuse its in-flight work, but a prune that removes the checkpoint also breaks the fork reads, so the reuse only ever handed back a poisoned prover: a re-add or a full EpochSession recreate re-referenced the same rejected blockProofs and failed immediately, and the node silently abandoned the epoch until a process restart. Drop the reuse machinery — a pruned prover cannot survive: - CheckpointStore.cancelAndRemoveAboveBlock cancels and removes orphaned provers (replacing markPrunedAboveBlock); a re-add builds a fresh prover. - Remove the pruned flag / markPruned / markCanonical from CheckpointProver and the SlotWatcher that only reaped lingering pruned provers. Recover the epoch by rebuilding on re-add rather than racing the failure. A prune unwinds world-state before the prover-node cancels the session, so a prover mid-fork faults while live and the session goes terminal `failed`; that race cannot be won from the session side, so make the terminal state harmless: - SessionManager.openFullSessionIfReady drops a terminal session and rebuilds, so re-adding an epoch's checkpoints reopens a fresh live session over fresh provers (via the checkpoint/prune triggers, ungated by the tick high-water mark). - runSession skips the failure-upload when the session's checkpoints no longer match canonical content, so a prune-invalidated failure emits no spurious post-mortem. Cancel the pruned prover's in-flight work promptly: thread the abort signal into PublicProcessor.process so a cancel stops the current block's public execution immediately instead of running it to completion. The expensive proving is unaffected: broker proofs are content-addressed and survive the prune (cancelJobsOnStop defaults to false), so an identical re-add reuses them; only witness generation, which cannot survive a prune, is redone.
c98a7bc to
dea205a
Compare
Flakey Tests🤖 says: This CI run detected 1 tests that failed, but were tolerated due to a .test_patterns.yml entry. |
spalladino
left a comment
There was a problem hiding this comment.
Looks good! Just a few comments, plus an issue codex flagged. Also, should we add a specific e2e to try and reproduce this issue? Not sure how to trigger it reliably instead of as a flake though. And also, the prover-node README is stale given these changes.
| void prover.whenDone(); | ||
| this.provers.delete(id); |
There was a problem hiding this comment.
Heads up that by deleting it from this.provers we won't await for their teardown in stop via the await Promise.allSettled(provers.map(p => p.whenDone()));, though I guess that's ok since we no longer care about them.
There was a problem hiding this comment.
Hmm, yes I think we should track these separately. I did query Claude about this, more from the angle of not leaving dangling resources. But I think it makes sense to continue to track and deal with these, even if they are rare.
|
|
||
| /** Canonical (non-pruned) provers in the store, sorted by checkpoint number. */ | ||
| /** Canonical provers in the store, sorted by checkpoint number. */ | ||
| public listCanonical(): CheckpointProver[] { |
There was a problem hiding this comment.
Should we remove the "canonical" terminology in this PR?
There was a problem hiding this comment.
Oh, becuase they are all now canonical? Yes, good point.
| // block a rebuild. Drop it so a fresh session is opened over the current canonical content. This | ||
| // is what recovers a prune-failed epoch when its checkpoints are re-added — the checkpoint/prune | ||
| // triggers reach here ungated by lastTickEpoch, so the epoch does not wait on the periodic tick. | ||
| this.fullSessions.delete(epoch); |
There was a problem hiding this comment.
Doesn't the recreateInvalidSessions loop that runs before openFullSessionIfReady deal with deleting these?
| this.fullSessions.delete(epoch); | ||
| } | ||
| if (this.atMaxSessionLimit()) { | ||
| this.log.debug(`Skipping full-session open for epoch ${epoch}: max pending jobs reached`); |
There was a problem hiding this comment.
Codex note: I think this still races the actual prune. This check only looks at CheckpointStore, but world-state and the prover-node consume separate block streams, so a fork can fail after world-state unwinds and before the prover-node has run cancelAndRemoveAboveBlock. In that window canonical content can still match here, so onSessionFailed still uploads the spurious post-mortem. Consider checking the source chain’s current checkpoint content, or another prune-aware signal, before uploading.
Resolves A-1290.
Problem
A
CheckpointProver's sub-tree work forks world-state per block. When an L1 reorg prunes a base block, those fork reads fault and the prover permanently rejects its one-shotblockProofspromise.The store keeps the prover alive across a prune (
markPruned) so a re-add of identical content can reuse its in-flight work. But a prune that removes the checkpoint also breaks the fork reads, so that reuse only ever hands back a poisoned prover: a re-add or a fullEpochSessionrecreate re-references the same rejectedblockProofsand fails immediately. The affected node then silently abandons the epoch until a process restart (a healthy prover elsewhere still proves it, so it's not a network stall).Fix
Drop the reuse machinery — a pruned prover cannot survive, so rebuild instead of reuse.
CheckpointStore.cancelAndRemoveAboveBlockcancels and removes orphaned provers (replacingmarkPrunedAboveBlock); a re-add builds a fresh prover with a freshblockProofs.prunedflag /markPruned/markCanonicalfromCheckpointProver, and theSlotWatcherthat only reaped lingering pruned provers.listCanonical/addOrUpdatesimplified — every prover in the store is now canonical.Recover the epoch by rebuilding on re-add, rather than trying to prevent the terminal state.
A prune unwinds world-state as soon as it's detected — before the prover-node cancels the affected session — so a checkpoint prover mid-fork faults while still live and its session goes terminal
failed. That race is inherent (the world-state rejection and the prover being cancelled are unordered), so instead of trying to reclassify it as a cancellation, we make the terminal state harmless:SessionManager.openFullSessionIfReadyno longer treats a terminal session as "already open" — it drops it and rebuilds. So re-adding an epoch's checkpoints reopens a fresh live session over fresh provers, via the checkpoint/prune triggers (ungated by the tick high-water mark). This reopen depends on the store no longer holding the poisoned prover: with the reuse machinery gone the rebuilt session gets a fresh prover with a healthyblockProofs, so it can actually prove rather than re-inheriting the rejected one.runSessionskips the failure-upload when the session's checkpoints no longer match canonical content — a prune-invalidated failure isn't a genuine proving failure, so it no longer emits a spurious post-mortem upload / alert.The only non-recovered case is "content pruned and never re-added" (chain stops publishing that epoch's checkpoints) — there's nothing to prove on this node then, so abandoning the epoch is correct.
Cancel the pruned prover's in-flight work promptly.
CheckpointProvernow threads its abort signal intoPublicProcessor.process, so a prune-driven cancel stops the current block's public execution immediately instead of running it to completion before the nextsignal.abortedcheck. This is independent of the recovery logic above — it just avoids wasting CPU re-executing a block whose checkpoint is being discarded. The in-flight block has not yet enqueued a broker proof, so aborting it loses no reusable work.Note: broker-level proof reuse is unaffected
Removing the
CheckpointProver-level reuse does not waste proving work. The expensive part — the SNARK proofs — is cached in the proving broker, content-addressed by job id, independent of world-state.cancelJobsOnStopdefaults tofalse, so cancelling a prover does not abort its broker jobs; on an identical re-add the fresh prover regenerates identical inputs, and the broker dedups against the cached jobs/results (bounded by the broker's per-epoch cleanup). The reuse we deleted was witness-generation reuse, whose world-state forks cannot survive a prune anyway.Testing
session-manager.test.ts: pins the retry/recovery invariant — the periodic tick does not re-attempt a failed epoch (gated bylastTickEpoch), but a checkpoint re-add recovers it (ungated); afailedsession's epoch is reopened once its checkpoints are re-added; a genuine proving failure still uploads a post-mortem; a failure coinciding with a canonical content change (prune) skips the upload. (Red/green verified for both the gate and the upload suppression.)checkpoint-store.test.ts/prover-node.test.ts: pruned provers are cancelled and removed (not flagged); a re-add builds a fresh prover.checkpoint-prover.test.ts: cancelling a prover mid-block aborts the signal its public execution is running under (red/green verified).optimistic.parallel.test.ts(the e2e that spawned this issue) to the cancelled+removed semantics.Notes
failedstate is not prevented, only recovered from: the world-state unwind and the prover cancellation are unordered, so the failure can't be reliably reclassified from the session side. Recovering on re-add is deterministic and needs no such race to be won.