fix(sequencer): make lifecycle idempotent and restartable, with graceful stop by spalladino · Pull Request #24475 · AztecProtocol/aztec-packages

spalladino · 2026-07-02T19:49:52Z

Supersedes #24449 and #24450, reimplementing both on a simpler design. Single PR since the lifecycle work is only exercised by the e2e primitive.

Context

Several multi-node e2e tests spend minutes of wall-clock waiting for the L1 clock to roll while live sequencers sit idle. Warping the shared clock under a running sequencer interrupts whatever iteration is mid-build, producing the reorg / "Sequencer was interrupted" failures behind earlier revert attempts. Pausing sequencers around the warp requires a lifecycle that can actually stop and restart, which the existing lifecycle could not do: a second start() orphaned the previous poll loop (two loops racing), and a stopped-then-restarted sequencer never cleared the publishers' interrupted flag, so it would build and propose blocks but silently never publish to L1.

Approach

Idempotent, restartable lifecycle. start()/stop() are idempotent across SequencerClient, Sequencer, and PublisherManager, and start() while STOPPING is refused so a mid-stop start cannot orphan a fresh loop. PublisherManager.start() clears the interrupted flag via the previously-unused restart() methods, and SequencerClient.start() starts the publisher manager before the sequencer loop, so publishers are un-interrupted before the first post-restart publish. The funding loop is created once in the constructor and restarted across stop/start cycles, and a start that failed to load publisher state can be retried.
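The lifecycle rules above can be illustrated with a minimal sketch. The class, state names, and counters below are hypothetical stand-ins rather than the actual Sequencer or PublisherManager code; the sketch only shows a second start() being a no-op, a start() during STOPPING being refused, and a restart after a completed stop.

```typescript
// Hypothetical sketch; none of these names come from the real sequencer code.
type LifecycleState = 'STOPPED' | 'RUNNING' | 'STOPPING';

class RestartableLoop {
  private state: LifecycleState = 'STOPPED';
  private loopsStarted = 0;
  private inFlightStop: Promise<void> | undefined;

  // `stopLoop` stands in for whatever halts the real poll loop.
  constructor(private readonly stopLoop: () => Promise<void>) {}

  start(): void {
    if (this.state === 'RUNNING') {
      return; // Idempotent: a second start() must not spawn a second loop.
    }
    if (this.state === 'STOPPING') {
      // Refused: the in-flight stop would drive a fresh loop to STOPPED.
      throw new Error('Cannot start while stopping');
    }
    this.loopsStarted++;
    this.state = 'RUNNING';
  }

  stop(): Promise<void> {
    if (this.state === 'STOPPED') {
      return Promise.resolve(); // Idempotent: nothing to stop.
    }
    if (this.inFlightStop) {
      return this.inFlightStop; // A repeated stop() joins the one in flight.
    }
    this.state = 'STOPPING';
    this.inFlightStop = this.stopLoop().then(() => {
      this.state = 'STOPPED';
      this.inFlightStop = undefined;
    });
    return this.inFlightStop;
  }

  getState(): LifecycleState {
    return this.state;
  }

  getLoopsStarted(): number {
    return this.loopsStarted;
  }
}
```

Like the PR, the sketch assumes lifecycle calls are serialized by the caller and adds no mutex.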

Graceful stop. stop({ graceful: true }) halts the poll loop first, so the in-flight work() iteration and its pending L1 submission complete untouched instead of being interrupted mid-build (which emits a spurious checkpoint-error and drops the enqueued checkpoint). The drain runs before entering STOPPING, since in that state the iteration's own setState calls throw SequencerInterruptedError — which would fail the very iteration being drained. Default stop() keeps the fast interrupting shutdown for production teardown. This replaces the wait-for-IDLE heuristic from #24450, which had a race: sequencers reach IDLE at different times, and any of them could re-enter work() before the stop landed, flaking any test that asserts no sequencer failure events.
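The ordering matters here, so a small sketch may help. It assumes a simplified sequencer with one in-flight iteration at a time; SketchSequencer, InterruptedError, and the counters are hypothetical and only model the rule that state changes throw once the sequencer is STOPPING.

```typescript
// Hypothetical sketch of the drain-before-STOPPING ordering.
type State = 'IDLE' | 'WORKING' | 'STOPPING' | 'STOPPED';

class InterruptedError extends Error {}

class SketchSequencer {
  private state: State = 'IDLE';
  private polling = true;
  private inFlight: Promise<void> | undefined;
  public completed = 0;
  public interrupted = 0;

  // Once STOPPING, the iteration's own state changes throw (unless forced by stop()).
  private setState(next: State, force = false): void {
    if (!force && this.state === 'STOPPING') {
      throw new InterruptedError('Sequencer was interrupted');
    }
    this.state = next;
  }

  // One poll iteration; `build` stands in for block building plus the L1 submission.
  runIteration(build: Promise<void>): void {
    if (!this.polling) return;
    this.setState('WORKING');
    this.inFlight = build
      .then(() => {
        this.setState('IDLE'); // Throws if a non-graceful stop landed mid-build.
        this.completed++;
      })
      .catch(() => {
        this.interrupted++; // Any failure of the iteration counts as interrupted here.
      });
  }

  async stop(opts: { graceful?: boolean } = {}): Promise<void> {
    this.polling = false; // Halt the poll loop first so no new iteration starts.
    if (opts.graceful) {
      await this.inFlight; // Drain before STOPPING so the iteration can finish.
    }
    this.setState('STOPPING', true);
    await this.inFlight;
    this.setState('STOPPED', true);
  }
}
```

With graceful: true the in-flight iteration completes; with the default stop() the same iteration fails on its next state change, which is the fast interrupting behavior kept for production teardown.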

No stale sends across restart. The two fire-and-forget fallback sends (vote-and-prune when we cannot build, escape-hatch votes) are tracked, then interrupted and awaited once the poll loop has stopped. A sender sleeping until its target slot therefore cannot survive a stop and publish a stale-slot tx after a restart clears the pooled publishers' interrupted flag: interrupting the wrapper publisher is permanent, since wrappers are never restarted.

e2e primitive + pilot. warpWithSequencersStopped(nodes, warpFn, opts) on SingleNodeTestContext (inherited by MultiNodeTestContext) gracefully stops every sequencer, runs the warp with nobody building, and restarts them (restart: false leaves them stopped). Archivers, provers, and the chain monitor keep running, so clock-driven effects such as an orphan-block prune still fire. Applied to pipeline_prune to collapse its ~2-minute dead wait for the orphan slot's prune deadline: warp two L1 slots into the slot after the orphaned one, and restart the sequencers only once the prune is confirmed. TX_COUNT is unchanged, so assertMultipleBlocksPerSlot and assertProposerPipelining still hold.

Lifecycle calls are assumed to be serialized by the caller (all current callers are); this does not add a mutex for truly concurrent start/stop.

The pipeline_prune speedup is validated by this PR's CI run — the multi-node suite can't be run locally.

…ful stop

Sequencer.start() and PublisherManager.start() were not idempotent: a second start() orphaned the previous RunningPromise poll loop, and a stopped-then-restarted sequencer never cleared the publishers' interrupted flag, so it would build blocks but silently never publish to L1.

- Make start()/stop() idempotent across SequencerClient, Sequencer, and PublisherManager; refuse start() while STOPPING to avoid orphaning a fresh loop mid-stop.
- Add a working restart path: PublisherManager.start() clears the interrupted flag via the previously-unused restart() methods, and SequencerClient.start() starts the publisher manager before the sequencer loop so publishers are un-interrupted before the first publish. The funding loop is created once in the constructor and restarted across stop/start cycles.
- Add stop({graceful: true}): halt the poll loop first so the in-flight work() iteration and its pending L1 submission complete untouched, instead of being interrupted mid-build (which emits a spurious checkpoint-error and drops the enqueued checkpoint). Default stop() keeps the fast interrupting shutdown.
- Track the two fire-and-forget fallback sends (vote-and-prune, escape-hatch) and interrupt+await them once the poll loop has stopped, so no sleeping sender survives a stop and publishes a stale-slot tx after a restart.

Lifecycle calls are assumed to be serialized by the caller (all current callers are).

Add warpWithSequencersStopped(nodes, warpFn, opts) to SingleNodeTestContext (inherited by MultiNodeTestContext): gracefully stop every sequencer (the in-flight iteration and its pending L1 submission complete untouched, and any pending fallback vote is drained), run warpFn with nobody building, then restart them (default; restart: false leaves them stopped). Archivers, provers, and the chain monitor keep running, so clock-driven effects such as an orphan-block prune still fire. Apply it to pipeline_prune to collapse the ~2-minute dead gap where the chain just waits for the L1 clock to roll past the orphan slot's prune deadline: once the orphan blocks exist and publishing is disabled, warp two L1 slots into the slot after the orphaned one, and restart the sequencers only after the prune is confirmed. TX_COUNT is unchanged, so assertMultipleBlocksPerSlot and assertProposerPipelining still hold. The speedup can only be validated in CI; the multi-node suite cannot run locally.

…sends via RequestsTracker

Addresses review feedback on the lifecycle PR:

- start() now throws instead of warning when called while STOPPING, so a caller cannot believe the sequencer started when the in-flight stop will drive it to STOPPED. A start while already running stays an idempotent no-op.
- Extract RequestsTracker, a self-removing bag of {promise, interrupt?}, and use it for both the fire-and-forget fallback sends in the sequencer and the L1 submission in the checkpoint proposal job, replacing the bespoke pendingFallbackSends map and the pendingL1Submission field.
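For readers unfamiliar with the pattern, a "self-removing bag of {promise, interrupt?}" could look roughly like the sketch below. The real RequestsTracker API is not shown in this description, so the class name, method names, and shape here are assumptions made for illustration.

```typescript
// Hypothetical sketch; the actual RequestsTracker may differ in every detail.
interface TrackedRequest {
  promise: Promise<unknown>;
  interrupt?: () => void;
}

class RequestsTrackerSketch {
  private readonly pending = new Set<TrackedRequest>();

  // Track a request and drop it from the bag as soon as it settles either way.
  track(request: TrackedRequest): void {
    this.pending.add(request);
    const remove = () => {
      this.pending.delete(request);
    };
    void request.promise.then(remove, remove);
  }

  get size(): number {
    return this.pending.size;
  }

  // Interrupt everything still pending, then wait for all of it to settle.
  async interruptAndAwaitAll(): Promise<void> {
    const snapshot = Array.from(this.pending);
    for (const request of snapshot) {
      request.interrupt?.();
    }
    // Swallow rejections: an interrupted sender is expected to fail.
    await Promise.all(snapshot.map(r => r.promise.catch(() => undefined)));
  }
}
```

Calling interruptAndAwaitAll() once the poll loop has stopped is what keeps a sleeping sender from surviving the stop.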

Replace the warpFn callback in warpWithSequencersStopped with a cheat-codes instance plus an absolute target timestamp, and perform the warp internally only after the graceful stop has drained. A graceful stop can span several slots, so a target computed before the stop may already be in the past by the time the sequencers are down; skip the warp when the L1 clock has already reached or passed the target so evm_setNextBlockTimestamp cannot reject it.
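The stop, check, warp, restart sequence can be sketched as follows. The interfaces and the function name are hypothetical simplifications (the real helper takes nodes and a cheat-codes instance whose API is not reproduced here), and timestamps are plain numbers for brevity.

```typescript
// Hypothetical minimal interfaces; not the real node or cheat-codes types.
interface StoppableSequencer {
  stop(opts: { graceful: boolean }): Promise<void>;
  start(): Promise<void>;
}

interface ClockCheats {
  getTimestamp(): Promise<number>;
  warpTo(timestamp: number): Promise<void>;
}

// Returns true if the warp was performed, false if it was skipped.
async function warpWithSequencersStoppedSketch(
  sequencers: StoppableSequencer[],
  cheats: ClockCheats,
  targetTimestamp: number,
  opts: { restart?: boolean } = {},
): Promise<boolean> {
  // Drain every sequencer first; this can take several slots.
  await Promise.all(sequencers.map(s => s.stop({ graceful: true })));

  // Read the clock only after the drain: the target may already be in the past.
  const now = await cheats.getTimestamp();
  const shouldWarp = now < targetTimestamp;
  if (shouldWarp) {
    await cheats.warpTo(targetTimestamp);
  }

  // Restart by default; `restart: false` leaves the sequencers stopped.
  if (opts.restart ?? true) {
    for (const s of sequencers) {
      await s.start();
    }
  }
  return shouldWarp;
}
```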

…top/start

ValidatorHASigner.stop() closed the slashing-protection LMDB/Postgres store, but start() never reopened it (the AztecLMDBStoreV2 instance is not reopenable). A restarted sequencer therefore threw 'Store is closed' from SlashingProtectionService.checkAndRecord on every block-proposal signing attempt, stalling recovery until the executeTimeout fired. Make stop() a pause that leaves the store open (matching the restartable lifecycle model of the sequencer and publisher manager); the store is released when the process exits (the node calls process.exit after stopping). Adds a red/green LMDB-backed restart test.

spalladino · 2026-07-02T21:21:08Z

CI root cause + fix (pipeline_prune)

The pipeline_prune recovery test was red, and the cause was not the warp-in-the-past hypothesis (the warp succeeded and the prune fired, block 5→14). The real failure was a 6-minute executeTimeout: every restarted sequencer threw Error: Store is closed on each block-proposal signing attempt:

AztecLMDBStoreV2.transactionAsync
  -> LmdbSlashingProtectionDatabase.tryInsertOrGetExisting
  -> SlashingProtectionService.checkAndRecord

ValidatorHASigner.stop() closed the slashing-protection store, but start() never reopened it (the AztecLMDBStoreV2 instance isn't reopenable), so recovery never made progress after the restart.

Fixed in 9de9057: stop() is now a pause that leaves the store open (matching the restartable-lifecycle model this PR already applies to the sequencer and publisher manager); the store is released on process exit (the node calls process.exit after stopping). Added a red/green LMDB-backed restart test in validator_ha_signer.test.ts that reproduces the exact Store is closed failure and passes with the fix.

The previous commit made ValidatorHASigner.stop() a pause that leaves the slashing-protection store open so the sequencer can be restarted, which meant nothing closed the store on a real shutdown. Add an explicit teardown path that closes it: ValidatorHASigner.close(), ValidatorClient.close() (which closes the store only when this client created it — not when a shared database was injected, as in HA test setups), SequencerClient.close(), and a call to it from AztecNodeService.stop(). stop() remains a restartable pause; close() is the closed-for-good teardown that releases the database connection / file lock. Adds a test asserting close() releases the store while stop() keeps it usable.
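The stop()/close() split and the ownership rule can be sketched as below. StoreSketch and SignerSketch are hypothetical in-memory stand-ins, not ValidatorHASigner or the LMDB store; only the 'Store is closed' message is taken from the failure described in this PR.

```typescript
// Hypothetical in-memory stand-in for the slashing-protection store.
class StoreSketch {
  private open = true;
  public records: string[] = [];

  record(entry: string): void {
    if (!this.open) throw new Error('Store is closed');
    this.records.push(entry);
  }

  close(): void {
    this.open = false;
  }

  isOpen(): boolean {
    return this.open;
  }
}

class SignerSketch {
  private running = false;

  // `ownsStore` is false when a shared store was injected by the caller.
  constructor(
    private readonly store: StoreSketch,
    private readonly ownsStore: boolean,
  ) {}

  start(): void {
    this.running = true;
  }

  // Restartable pause: the store stays open so a later start() can keep signing.
  stop(): void {
    this.running = false;
  }

  // Closed-for-good teardown: release the store only if this instance created it.
  close(): void {
    this.stop();
    if (this.ownsStore) {
      this.store.close();
    }
  }

  sign(payload: string): void {
    if (!this.running) throw new Error('Signer is not running');
    this.store.record(payload);
  }
}
```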

spalladino added 2 commits July 2, 2026 16:27


Comment thread yarn-project/sequencer-client/src/sequencer/sequencer.ts

Comment thread yarn-project/end-to-end/src/single-node/single_node_test_context.ts Outdated

Comment thread yarn-project/sequencer-client/src/sequencer/sequencer.ts Outdated

This was referenced Jul 2, 2026

fix(sequencer): make lifecycle idempotent and restartable #24449

Closed

perf(e2e): stopDrainWarpRestart primitive + pipeline_prune pilot #24450

Closed

spalladino added 3 commits July 2, 2026 18:03

spalladino wants to merge 6 commits into merge-train/spartan-v5 from spl/sequencer-graceful-stop-restart
