Skip to content

fix(prover-node): rebuild pruned checkpoint provers and recover the epoch#24436

Open
PhilWindle wants to merge 1 commit into
merge-train/spartan-v5from
phil/a-1290-prune-induced-checkpoint-prover-failure
Open

fix(prover-node): rebuild pruned checkpoint provers and recover the epoch#24436
PhilWindle wants to merge 1 commit into
merge-train/spartan-v5from
phil/a-1290-prune-induced-checkpoint-prover-failure

Conversation

@PhilWindle

@PhilWindle PhilWindle commented Jul 1, 2026

Copy link
Copy Markdown
Collaborator

Resolves A-1290.

Problem

A CheckpointProver's sub-tree work forks world-state per block. When an L1 reorg prunes a base block, those fork reads fault and the prover permanently rejects its one-shot blockProofs promise.

The store keeps the prover alive across a prune (markPruned) so a re-add of identical content can reuse its in-flight work. But a prune that removes the checkpoint also breaks the fork reads, so that reuse only ever hands back a poisoned prover: a re-add or a full EpochSession recreate re-references the same rejected blockProofs and fails immediately. The affected node then silently abandons the epoch until a process restart (a healthy prover elsewhere still proves it, so it's not a network stall).

Fix

Drop the reuse machinery — a pruned prover cannot survive, so rebuild instead of reuse.

  • CheckpointStore.cancelAndRemoveAboveBlock cancels and removes orphaned provers (replacing markPrunedAboveBlock); a re-add builds a fresh prover with a fresh blockProofs.
  • Removed the pruned flag / markPruned / markCanonical from CheckpointProver, and the SlotWatcher that only reaped lingering pruned provers.
  • listCanonical / addOrUpdate simplified — every prover in the store is now canonical.

Recover the epoch by rebuilding on re-add, rather than trying to prevent the terminal state.

A prune unwinds world-state as soon as it's detected — before the prover-node cancels the affected session — so a checkpoint prover mid-fork faults while still live and its session goes terminal failed. That race is inherent (the world-state rejection and the prover being cancelled are unordered), so instead of trying to reclassify it as a cancellation, we make the terminal state harmless:

  • SessionManager.openFullSessionIfReady no longer treats a terminal session as "already open" — it drops it and rebuilds. So re-adding an epoch's checkpoints reopens a fresh live session over fresh provers, via the checkpoint/prune triggers (ungated by the tick high-water mark). This reopen depends on the store no longer holding the poisoned prover: with the reuse machinery gone the rebuilt session gets a fresh prover with a healthy blockProofs, so it can actually prove rather than re-inheriting the rejected one.
  • runSession skips the failure-upload when the session's checkpoints no longer match canonical content — a prune-invalidated failure isn't a genuine proving failure, so it no longer emits a spurious post-mortem upload / alert.

The only non-recovered case is "content pruned and never re-added" (chain stops publishing that epoch's checkpoints) — there's nothing to prove on this node then, so abandoning the epoch is correct.

Cancel the pruned prover's in-flight work promptly. CheckpointProver now threads its abort signal into PublicProcessor.process, so a prune-driven cancel stops the current block's public execution immediately instead of running it to completion before the next signal.aborted check. This is independent of the recovery logic above — it just avoids wasting CPU re-executing a block whose checkpoint is being discarded. The in-flight block has not yet enqueued a broker proof, so aborting it loses no reusable work.

Note: broker-level proof reuse is unaffected

Removing the CheckpointProver-level reuse does not waste proving work. The expensive part — the SNARK proofs — is cached in the proving broker, content-addressed by job id, independent of world-state. cancelJobsOnStop defaults to false, so cancelling a prover does not abort its broker jobs; on an identical re-add the fresh prover regenerates identical inputs, and the broker dedups against the cached jobs/results (bounded by the broker's per-epoch cleanup). The reuse we deleted was witness-generation reuse, whose world-state forks cannot survive a prune anyway.

Testing

  • session-manager.test.ts: pins the retry/recovery invariant — the periodic tick does not re-attempt a failed epoch (gated by lastTickEpoch), but a checkpoint re-add recovers it (ungated); a failed session's epoch is reopened once its checkpoints are re-added; a genuine proving failure still uploads a post-mortem; a failure coinciding with a canonical content change (prune) skips the upload. (Red/green verified for both the gate and the upload suppression.)
  • checkpoint-store.test.ts / prover-node.test.ts: pruned provers are cancelled and removed (not flagged); a re-add builds a fresh prover.
  • checkpoint-prover.test.ts: cancelling a prover mid-block aborts the signal its public execution is running under (red/green verified).
  • Full prover-node suite passes (166 tests); package typechecks clean; lint OK.
  • Updated optimistic.parallel.test.ts (the e2e that spawned this issue) to the cancelled+removed semantics.

Notes

  • No operator/contract-dev-facing config or API changed, so no changelog entry.
  • The terminal failed state is not prevented, only recovered from: the world-state unwind and the prover cancellation are unordered, so the failure can't be reliably reclassified from the session side. Recovering on re-add is deterministic and needs no such race to be won.

@PhilWindle PhilWindle added the ci-draft Run CI on draft PRs. label Jul 1, 2026
@PhilWindle PhilWindle force-pushed the phil/a-1290-prune-induced-checkpoint-prover-failure branch 2 times, most recently from 9455223 to c98a7bc Compare July 2, 2026 14:29
…poch

A CheckpointProver's sub-tree work forks world-state per block. When an L1 reorg
prunes a base block, those fork reads fault and the prover permanently rejects
its one-shot blockProofs promise. The store previously kept the prover alive
(markPruned) so a re-add of identical content could reuse its in-flight work,
but a prune that removes the checkpoint also breaks the fork reads, so the reuse
only ever handed back a poisoned prover: a re-add or a full EpochSession recreate
re-referenced the same rejected blockProofs and failed immediately, and the node
silently abandoned the epoch until a process restart.

Drop the reuse machinery — a pruned prover cannot survive:
- CheckpointStore.cancelAndRemoveAboveBlock cancels and removes orphaned provers
  (replacing markPrunedAboveBlock); a re-add builds a fresh prover.
- Remove the pruned flag / markPruned / markCanonical from CheckpointProver and
  the SlotWatcher that only reaped lingering pruned provers.

Recover the epoch by rebuilding on re-add rather than racing the failure. A
prune unwinds world-state before the prover-node cancels the session, so a
prover mid-fork faults while live and the session goes terminal `failed`; that
race cannot be won from the session side, so make the terminal state harmless:
- SessionManager.openFullSessionIfReady drops a terminal session and rebuilds,
  so re-adding an epoch's checkpoints reopens a fresh live session over fresh
  provers (via the checkpoint/prune triggers, ungated by the tick high-water
  mark).
- runSession skips the failure-upload when the session's checkpoints no longer
  match canonical content, so a prune-invalidated failure emits no spurious
  post-mortem.

Cancel the pruned prover's in-flight work promptly: thread the abort signal into
PublicProcessor.process so a cancel stops the current block's public execution
immediately instead of running it to completion.

The expensive proving is unaffected: broker proofs are content-addressed and
survive the prune (cancelJobsOnStop defaults to false), so an identical re-add
reuses them; only witness generation, which cannot survive a prune, is redone.
@PhilWindle PhilWindle force-pushed the phil/a-1290-prune-induced-checkpoint-prover-failure branch from c98a7bc to dea205a Compare July 2, 2026 14:38
@PhilWindle PhilWindle marked this pull request as ready for review July 2, 2026 14:51
@PhilWindle PhilWindle changed the title fix(prover-node): cancel and rebuild pruned checkpoint provers fix(prover-node): rebuild pruned checkpoint provers and recover the epoch Jul 2, 2026
@AztecBot

AztecBot commented Jul 2, 2026

Copy link
Copy Markdown
Collaborator

Flakey Tests

🤖 says: This CI run detected 1 tests that failed, but were tolerated due to a .test_patterns.yml entry.

\033FLAKED\033 (8;;http://ci.aztec-labs.com/d83d8e5401e1437d�d83d8e5401e1437d8;;�):  yarn-project/end-to-end/scripts/run_test.sh simple src/single-node/proving/proof_fails.parallel.test.ts "does not allow submitting proof after epoch end" (119s) (code: 0) group:e2e-p2p-epoch-flakes

@spalladino spalladino left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Just a few comments, plus an issue codex flagged. Also, should we add a specific e2e to try and reproduce this issue? Not sure how to trigger it reliably instead of as a flake though. And also, the prover-node README is stale given these changes.

Comment on lines +107 to +108
void prover.whenDone();
this.provers.delete(id);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Heads up that by deleting it from this.provers we won't await for their teardown in stop via the await Promise.allSettled(provers.map(p => p.whenDone()));, though I guess that's ok since we no longer care about them.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, yes I think we should track these separately. I did query Claude about this, more from the angle of not leaving dangling resources. But I think it makes sense to continue to track and deal with these, even if they are rare.


/** Canonical (non-pruned) provers in the store, sorted by checkpoint number. */
/** Canonical provers in the store, sorted by checkpoint number. */
public listCanonical(): CheckpointProver[] {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we remove the "canonical" terminology in this PR?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, becuase they are all now canonical? Yes, good point.

// block a rebuild. Drop it so a fresh session is opened over the current canonical content. This
// is what recovers a prune-failed epoch when its checkpoints are re-added — the checkpoint/prune
// triggers reach here ungated by lastTickEpoch, so the epoch does not wait on the periodic tick.
this.fullSessions.delete(epoch);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't the recreateInvalidSessions loop that runs before openFullSessionIfReady deal with deleting these?

this.fullSessions.delete(epoch);
}
if (this.atMaxSessionLimit()) {
this.log.debug(`Skipping full-session open for epoch ${epoch}: max pending jobs reached`);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Codex note: I think this still races the actual prune. This check only looks at CheckpointStore, but world-state and the prover-node consume separate block streams, so a fork can fail after world-state unwinds and before the prover-node has run cancelAndRemoveAboveBlock. In that window canonical content can still match here, so onSessionFailed still uploads the spurious post-mortem. Consider checking the source chain’s current checkpoint content, or another prune-aware signal, before uploading.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ci-draft Run CI on draft PRs.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants