fix(prover-node): rebuild pruned checkpoint provers and recover the epoch by PhilWindle · Pull Request #24436 · AztecProtocol/aztec-packages

PhilWindle · 2026-07-01T16:13:26Z

Resolves A-1290.

Problem

A CheckpointProver's sub-tree work forks world-state per block. When an L1 reorg prunes a base block, those fork reads fault and the prover permanently rejects its one-shot blockProofs promise.

The store keeps the prover alive across a prune (markPruned) so a re-add of identical content can reuse its in-flight work. But a prune that removes the checkpoint also breaks the fork reads, so that reuse only ever hands back a poisoned prover: a re-add or a full EpochSession recreate re-references the same rejected blockProofs and fails immediately. The affected node then silently abandons the epoch until a process restart (a healthy prover elsewhere still proves it, so it's not a network stall).

Fix

Drop the reuse machinery — a pruned prover cannot survive, so rebuild instead of reuse.

CheckpointStore.cancelAndRemoveAboveBlock cancels and removes orphaned provers (replacing markPrunedAboveBlock); a re-add builds a fresh prover with a fresh blockProofs.
Removed the pruned flag / markPruned / markCanonical from CheckpointProver, and the SlotWatcher that only reaped lingering pruned provers.
listCanonical / addOrUpdate simplified — every prover in the store is now canonical.

Recover the epoch by rebuilding on re-add, rather than trying to prevent the terminal state.

A prune unwinds world-state as soon as it's detected — before the prover-node cancels the affected session — so a checkpoint prover mid-fork faults while still live and its session goes terminal failed. That race is inherent (the world-state rejection and the prover being cancelled are unordered), so instead of trying to reclassify it as a cancellation, we make the terminal state harmless:

SessionManager.openFullSessionIfReady no longer treats a terminal session as "already open" — it drops it and rebuilds. So re-adding an epoch's checkpoints reopens a fresh live session over fresh provers, via the checkpoint/prune triggers (ungated by the tick high-water mark). This reopen depends on the store no longer holding the poisoned prover: with the reuse machinery gone the rebuilt session gets a fresh prover with a healthy blockProofs, so it can actually prove rather than re-inheriting the rejected one.
runSession skips the failure-upload when the session's checkpoints no longer match canonical content — a prune-invalidated failure isn't a genuine proving failure, so it no longer emits a spurious post-mortem upload / alert.

The only non-recovered case is "content pruned and never re-added" (chain stops publishing that epoch's checkpoints) — there's nothing to prove on this node then, so abandoning the epoch is correct.

Cancel the pruned prover's in-flight work promptly. CheckpointProver now threads its abort signal into PublicProcessor.process, so a prune-driven cancel stops the current block's public execution immediately instead of running it to completion before the next signal.aborted check. This is independent of the recovery logic above — it just avoids wasting CPU re-executing a block whose checkpoint is being discarded. The in-flight block has not yet enqueued a broker proof, so aborting it loses no reusable work.

Note: broker-level proof reuse is unaffected

Removing the CheckpointProver-level reuse does not waste proving work. The expensive part — the SNARK proofs — is cached in the proving broker, content-addressed by job id, independent of world-state. cancelJobsOnStop defaults to false, so cancelling a prover does not abort its broker jobs; on an identical re-add the fresh prover regenerates identical inputs, and the broker dedups against the cached jobs/results (bounded by the broker's per-epoch cleanup). The reuse we deleted was witness-generation reuse, whose world-state forks cannot survive a prune anyway.

Testing

session-manager.test.ts: pins the retry/recovery invariant — the periodic tick does not re-attempt a failed epoch (gated by lastTickEpoch), but a checkpoint re-add recovers it (ungated); a failed session's epoch is reopened once its checkpoints are re-added; a genuine proving failure still uploads a post-mortem; a failure coinciding with a canonical content change (prune) skips the upload. (Red/green verified for both the gate and the upload suppression.)
checkpoint-store.test.ts / prover-node.test.ts: pruned provers are cancelled and removed (not flagged); a re-add builds a fresh prover.
checkpoint-prover.test.ts: cancelling a prover mid-block aborts the signal its public execution is running under (red/green verified).
Full prover-node suite passes (166 tests); package typechecks clean; lint OK.
Updated optimistic.parallel.test.ts (the e2e that spawned this issue) to the cancelled+removed semantics.

Notes

No operator/contract-dev-facing config or API changed, so no changelog entry.
The terminal failed state is not prevented, only recovered from: the world-state unwind and the prover cancellation are unordered, so the failure can't be reliably reclassified from the session side. Recovering on re-add is deterministic and needs no such race to be won.

…poch A CheckpointProver's sub-tree work forks world-state per block. When an L1 reorg prunes a base block, those fork reads fault and the prover permanently rejects its one-shot blockProofs promise. The store previously kept the prover alive (markPruned) so a re-add of identical content could reuse its in-flight work, but a prune that removes the checkpoint also breaks the fork reads, so the reuse only ever handed back a poisoned prover: a re-add or a full EpochSession recreate re-referenced the same rejected blockProofs and failed immediately, and the node silently abandoned the epoch until a process restart. Drop the reuse machinery — a pruned prover cannot survive: - CheckpointStore.cancelAndRemoveAboveBlock cancels and removes orphaned provers (replacing markPrunedAboveBlock); a re-add builds a fresh prover. - Remove the pruned flag / markPruned / markCanonical from CheckpointProver and the SlotWatcher that only reaped lingering pruned provers. Recover the epoch by rebuilding on re-add rather than racing the failure. A prune unwinds world-state before the prover-node cancels the session, so a prover mid-fork faults while live and the session goes terminal `failed`; that race cannot be won from the session side, so make the terminal state harmless: - SessionManager.openFullSessionIfReady drops a terminal session and rebuilds, so re-adding an epoch's checkpoints reopens a fresh live session over fresh provers (via the checkpoint/prune triggers, ungated by the tick high-water mark). - runSession skips the failure-upload when the session's checkpoints no longer match canonical content, so a prune-invalidated failure emits no spurious post-mortem. Cancel the pruned prover's in-flight work promptly: thread the abort signal into PublicProcessor.process so a cancel stops the current block's public execution immediately instead of running it to completion. The expensive proving is unaffected: broker proofs are content-addressed and survive the prune (cancelJobsOnStop defaults to false), so an identical re-add reuses them; only witness generation, which cannot survive a prune, is redone.

AztecBot · 2026-07-02T15:13:56Z

Flakey Tests

🤖 says: This CI run detected 1 tests that failed, but were tolerated due to a .test_patterns.yml entry.

\033FLAKED\033 (8;;http://ci.aztec-labs.com/d83d8e5401e1437d�d83d8e5401e1437d8;;�):  yarn-project/end-to-end/scripts/run_test.sh simple src/single-node/proving/proof_fails.parallel.test.ts "does not allow submitting proof after epoch end" (119s) (code: 0) group:e2e-p2p-epoch-flakes

spalladino

Looks good! Just a few comments, plus an issue codex flagged. Also, should we add a specific e2e to try and reproduce this issue? Not sure how to trigger it reliably instead of as a flake though. And also, the prover-node README is stale given these changes.

spalladino · 2026-07-02T17:44:15Z

+        void prover.whenDone();
+        this.provers.delete(id);


Heads up that by deleting it from this.provers we won't await for their teardown in stop via the await Promise.allSettled(provers.map(p => p.whenDone()));, though I guess that's ok since we no longer care about them.

Hmm, yes I think we should track these separately. I did query Claude about this, more from the angle of not leaving dangling resources. But I think it makes sense to continue to track and deal with these, even if they are rare.

spalladino · 2026-07-02T17:44:34Z


-  /** Canonical (non-pruned) provers in the store, sorted by checkpoint number. */
+  /** Canonical provers in the store, sorted by checkpoint number. */
  public listCanonical(): CheckpointProver[] {


Should we remove the "canonical" terminology in this PR?

Oh, becuase they are all now canonical? Yes, good point.

spalladino · 2026-07-02T17:47:05Z

+      // block a rebuild. Drop it so a fresh session is opened over the current canonical content. This
+      // is what recovers a prune-failed epoch when its checkpoints are re-added — the checkpoint/prune
+      // triggers reach here ungated by lastTickEpoch, so the epoch does not wait on the periodic tick.
+      this.fullSessions.delete(epoch);


Doesn't the recreateInvalidSessions loop that runs before openFullSessionIfReady deal with deleting these?

spalladino · 2026-07-02T17:51:12Z

+      this.fullSessions.delete(epoch);
    }
    if (this.atMaxSessionLimit()) {
      this.log.debug(`Skipping full-session open for epoch ${epoch}: max pending jobs reached`);


Codex note: I think this still races the actual prune. This check only looks at CheckpointStore, but world-state and the prover-node consume separate block streams, so a fork can fail after world-state unwinds and before the prover-node has run cancelAndRemoveAboveBlock. In that window canonical content can still match here, so onSessionFailed still uploads the spurious post-mortem. Consider checking the source chain’s current checkpoint content, or another prune-aware signal, before uploading.

PhilWindle added the ci-draft Run CI on draft PRs. label Jul 1, 2026

PhilWindle force-pushed the phil/a-1290-prune-induced-checkpoint-prover-failure branch 2 times, most recently from 9455223 to c98a7bc Compare July 2, 2026 14:29

PhilWindle force-pushed the phil/a-1290-prune-induced-checkpoint-prover-failure branch from c98a7bc to dea205a Compare July 2, 2026 14:38

PhilWindle marked this pull request as ready for review July 2, 2026 14:51

PhilWindle changed the title ~~fix(prover-node): cancel and rebuild pruned checkpoint provers~~ fix(prover-node): rebuild pruned checkpoint provers and recover the epoch Jul 2, 2026

spalladino approved these changes Jul 2, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(prover-node): rebuild pruned checkpoint provers and recover the epoch#24436

fix(prover-node): rebuild pruned checkpoint provers and recover the epoch#24436
PhilWindle wants to merge 1 commit into
merge-train/spartan-v5from
phil/a-1290-prune-induced-checkpoint-prover-failure

PhilWindle commented Jul 1, 2026 •

edited

Loading

Uh oh!

AztecBot commented Jul 2, 2026

Uh oh!

spalladino left a comment

Uh oh!

spalladino Jul 2, 2026

Uh oh!

PhilWindle Jul 2, 2026

Uh oh!

spalladino Jul 2, 2026

Uh oh!

PhilWindle Jul 2, 2026

Uh oh!

spalladino Jul 2, 2026

Uh oh!

spalladino Jul 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

PhilWindle commented Jul 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Fix

Note: broker-level proof reuse is unaffected

Testing

Notes

Uh oh!

AztecBot commented Jul 2, 2026

Flakey Tests

Uh oh!

spalladino left a comment

Choose a reason for hiding this comment

Uh oh!

spalladino Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

PhilWindle Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

spalladino Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

PhilWindle Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

spalladino Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

spalladino Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

PhilWindle commented Jul 1, 2026 •

edited

Loading