fix(sequencer): make lifecycle idempotent and restartable, with graceful stop #24475
spalladino wants to merge 6 commits into
…ful stop
Sequencer.start() and PublisherManager.start() were not idempotent: a second
start() orphaned the previous RunningPromise poll loop, and a stopped-then-
restarted sequencer never cleared the publishers' interrupted flag, so it
would build blocks but silently never publish to L1.
- Make start()/stop() idempotent across SequencerClient, Sequencer, and
PublisherManager; refuse start() while STOPPING to avoid orphaning a fresh
loop mid-stop.
- Add a working restart path: PublisherManager.start() clears the interrupted
flag via the previously-unused restart() methods, and SequencerClient.start()
starts the publisher manager before the sequencer loop so publishers are
un-interrupted before the first publish. The funding loop is created once in
the constructor and restarted across stop/start cycles.
- Add stop({graceful: true}): halt the poll loop first so the in-flight work()
iteration and its pending L1 submission complete untouched, instead of being
interrupted mid-build (which emits a spurious checkpoint-error and drops the
enqueued checkpoint). Default stop() keeps the fast interrupting shutdown.
- Track the two fire-and-forget fallback sends (vote-and-prune, escape-hatch)
and interrupt+await them once the poll loop has stopped, so no sleeping
sender survives a stop and publishes a stale-slot tx after a restart.
Lifecycle calls are assumed to be serialized by the caller (all current
callers are).
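The idempotent start/stop behaviour described in this commit can be illustrated with a minimal sketch. Everything here is hypothetical: `RestartableServiceSketch`, `activeLoops`, `interrupted`, and the `drain` callback are illustrative stand-ins, not the PR's classes. Refusal while STOPPING is modelled as a throw, which is what a later commit in this PR adopts; concurrent lifecycle calls are not handled, matching the serialized-caller assumption.

```typescript
// Hypothetical sketch: names and structure are illustrative, not the PR's code.
type LifecycleState = "STOPPED" | "RUNNING" | "STOPPING";

class RestartableServiceSketch {
  private state: LifecycleState = "STOPPED";
  /** Poll loops currently alive; an idempotent start() keeps this at most 1. */
  public activeLoops = 0;
  /** Stands in for the publishers' interrupted flag. */
  public interrupted = false;

  /** `drain` stands in for waiting on the poll loop to finish. */
  constructor(private readonly drain: () => Promise<void> = async () => {}) {}

  public getState(): LifecycleState {
    return this.state;
  }

  public start(): void {
    if (this.state === "RUNNING") {
      return; // already running: no second loop is created
    }
    if (this.state === "STOPPING") {
      // Starting now would leave a fresh loop behind once the stop completes.
      throw new Error("Cannot start while stopping");
    }
    this.interrupted = false; // clear the flag before the first publish
    this.activeLoops += 1;
    this.state = "RUNNING";
  }

  public async stop(): Promise<void> {
    if (this.state !== "RUNNING") {
      return; // already stopped or stopping: nothing to do
    }
    this.state = "STOPPING";
    this.interrupted = true;
    await this.drain();
    this.activeLoops -= 1;
    this.state = "STOPPED";
  }
}
```

A second `start()` leaves `activeLoops` at 1, and a `start()` after a completed `stop()` clears `interrupted` again, which is the restart path the commit describes.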
Add warpWithSequencersStopped(nodes, warpFn, opts) to SingleNodeTestContext (inherited by MultiNodeTestContext): gracefully stop every sequencer (the in-flight iteration and its pending L1 submission complete untouched, and any pending fallback vote is drained), run warpFn with nobody building, then restart them (default; restart: false leaves them stopped). Archivers, provers, and the chain monitor keep running, so clock-driven effects such as an orphan-block prune still fire. Apply it to pipeline_prune to collapse the ~2-minute dead gap where the chain just waits for the L1 clock to roll past the orphan slot's prune deadline: once the orphan blocks exist and publishing is disabled, warp two L1 slots into the slot after the orphaned one, and restart the sequencers only after the prune is confirmed. TX_COUNT is unchanged, so assertMultipleBlocksPerSlot and assertProposerPipelining still hold. The speedup can only be validated in CI; the multi-node suite cannot run locally.
…sends via RequestsTracker
Addresses review feedback on the lifecycle PR:
- start() now throws instead of warning when called while STOPPING, so a caller cannot
believe the sequencer started when the in-flight stop will drive it to STOPPED. A start
while already running stays an idempotent no-op.
- Extract RequestsTracker, a self-removing bag of {promise, interrupt?}, and use it for both
the fire-and-forget fallback sends in the sequencer and the L1 submission in the checkpoint
proposal job, replacing the bespoke pendingFallbackSends map and the pendingL1Submission field.
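As a rough illustration of a "self-removing bag of {promise, interrupt?}", here is a minimal sketch. The class name `RequestsTrackerSketch`, its method names, and the `sleepingSender` helper are assumptions made for this example; the actual RequestsTracker API is not shown in this PR text.

```typescript
// Hypothetical sketch of a self-removing request bag; not the PR's RequestsTracker.
interface TrackedRequest {
  promise: Promise<unknown>;
  interrupt?: () => void;
}

class RequestsTrackerSketch {
  private readonly pending = new Set<TrackedRequest>();

  /** Registers a request; the entry removes itself once the promise settles. */
  public track<T>(promise: Promise<T>, interrupt?: () => void): Promise<T> {
    const entry: TrackedRequest = { promise, interrupt };
    this.pending.add(entry);
    const remove = () => {
      this.pending.delete(entry);
    };
    promise.then(remove, remove);
    return promise;
  }

  public get size(): number {
    return this.pending.size;
  }

  /** Calls the interrupt hook of every request that supplied one. */
  public interruptAll(): void {
    this.pending.forEach((entry) => entry.interrupt?.());
  }

  /** Resolves once every currently tracked request has settled, ignoring failures. */
  public async awaitAll(): Promise<void> {
    const noop = () => undefined;
    await Promise.all(Array.from(this.pending, (entry) => entry.promise.then(noop, noop)));
  }
}

/** Illustrative sender that sleeps until its target slot unless interrupted. */
function sleepingSender(ms: number): { promise: Promise<string>; interrupt: () => void } {
  let interrupt: () => void = () => {};
  const promise = new Promise<string>((resolve) => {
    const timer = setTimeout(() => resolve("sent"), ms);
    interrupt = () => {
      clearTimeout(timer);
      resolve("interrupted");
    };
  });
  return { promise, interrupt };
}
```

On stop, a caller would run `interruptAll()` followed by `await awaitAll()`, so a sleeping sender resolves as interrupted instead of waking up later.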
Replace the warpFn callback in warpWithSequencersStopped with a cheat-codes instance plus an absolute target timestamp, and perform the warp internally only after the graceful stop has drained. A graceful stop can span several slots, so a target computed before the stop may already be in the past by the time the sequencers are down; skip the warp when the L1 clock has already reached or passed the target so evm_setNextBlockTimestamp cannot reject it.
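The skip-if-already-past check can be sketched as follows. The `L1ClockSketch` interface, `warpIfStillAhead`, and `FakeClock` are hypothetical names for this example and are not the cheat-codes API; the fake clock rejecting a non-increasing timestamp is an assumption that stands in for the rejection the commit message mentions.

```typescript
// Hypothetical stand-in for the cheat-codes object; not the real API.
interface L1ClockSketch {
  /** Latest L1 block timestamp, in seconds. */
  now(): number;
  setNextBlockTimestamp(timestamp: number): void;
}

/**
 * Warps to an absolute target only if the clock has not already reached it.
 * Returns true when a warp was performed.
 */
function warpIfStillAhead(clock: L1ClockSketch, targetTimestamp: number): boolean {
  if (clock.now() >= targetTimestamp) {
    return false; // the graceful stop took long enough that the target is in the past
  }
  clock.setNextBlockTimestamp(targetTimestamp);
  return true;
}

/** In-memory clock that rejects timestamps not ahead of the latest block (assumed behaviour). */
class FakeClock implements L1ClockSketch {
  constructor(private current: number) {}

  public now(): number {
    return this.current;
  }

  public setNextBlockTimestamp(timestamp: number): void {
    if (timestamp <= this.current) {
      throw new Error("timestamp is not ahead of the latest block");
    }
    this.current = timestamp;
  }
}
```

Reading the clock after the stop has drained, rather than before, is what keeps the comparison meaningful when the stop spans several slots.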
…top/start
ValidatorHASigner.stop() closed the slashing-protection LMDB/Postgres store, but start() never reopened it (the AztecLMDBStoreV2 instance is not reopenable). A restarted sequencer therefore threw 'Store is closed' from SlashingProtectionService.checkAndRecord on every block-proposal signing attempt, stalling recovery until the executeTimeout fired. Make stop() a pause that leaves the store open (matching the restartable lifecycle model of the sequencer and publisher manager); the store is released when the process exits (the node calls process.exit after stopping). Adds a red/green LMDB-backed restart test.
CI root cause + fix ( The
Fixed in 9de9057:
The previous commit made ValidatorHASigner.stop() a pause that leaves the slashing-protection store open so the sequencer can be restarted, which meant nothing closed the store on a real shutdown. Add an explicit teardown path that closes it: ValidatorHASigner.close(), ValidatorClient.close() (which closes the store only when this client created it — not when a shared database was injected, as in HA test setups), SequencerClient.close(), and a call to it from AztecNodeService.stop(). stop() remains a restartable pause; close() is the closed-for-good teardown that releases the database connection / file lock. Adds a test asserting close() releases the store while stop() keeps it usable.
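The split between a restartable stop() and a final close() can be sketched like this. `StoreSketch` and `ValidatorClientSketch` are hypothetical simplifications: the real change spans ValidatorHASigner, ValidatorClient, SequencerClient, and AztecNodeService, and the ownership flag shown here is only one assumed way to express "closes the store only when this client created it".

```typescript
// Hypothetical sketch; class and member names are illustrative only.
class StoreSketch {
  private closed = false;

  public isClosed(): boolean {
    return this.closed;
  }

  public close(): void {
    this.closed = true;
  }

  public checkAndRecord(slot: number): string {
    if (this.closed) {
      throw new Error("Store is closed");
    }
    return `recorded slot ${slot}`;
  }
}

class ValidatorClientSketch {
  private readonly store: StoreSketch;
  private readonly ownsStore: boolean;
  private running = false;

  /** Pass a shared store to model an injected database (as in HA test setups). */
  constructor(sharedStore?: StoreSketch) {
    this.ownsStore = sharedStore === undefined;
    this.store = sharedStore ?? new StoreSketch();
  }

  public start(): void {
    this.running = true;
  }

  /** Pause only: the store stays open so start() can be called again. */
  public stop(): void {
    this.running = false;
  }

  /** Final teardown: releases the store, but only if this client created it. */
  public close(): void {
    this.stop();
    if (this.ownsStore) {
      this.store.close();
    }
  }

  public sign(slot: number): string {
    if (!this.running) {
      throw new Error("Signer is not running");
    }
    return this.store.checkAndRecord(slot);
  }

  public storeIsClosed(): boolean {
    return this.store.isClosed();
  }
}
```

After `stop()` then `start()`, `sign()` still works because the store was never closed; only `close()` makes further use fail, and an injected store is left for its owner to close.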
Supersedes #24449 and #24450, reimplementing both on a simpler design. Single PR since the lifecycle work is only exercised by the e2e primitive.
Context
Several multi-node e2e tests spend minutes of wall-clock waiting for the L1 clock to roll while live sequencers sit idle. Warping the shared clock under a running sequencer interrupts whatever iteration is mid-build, producing the reorg / "Sequencer was interrupted" failures behind earlier revert attempts. Pausing sequencers around the warp requires a lifecycle that can actually stop and restart, which it couldn't: a second start() orphaned the previous poll loop (two loops racing), and a stopped-then-restarted sequencer never cleared the publishers' interrupted flag, so it would build and propose blocks but silently never publish to L1.
Approach
Idempotent, restartable lifecycle.
start()/stop() are idempotent across SequencerClient, Sequencer, and PublisherManager, and start() while STOPPING is refused so a mid-stop start cannot orphan a fresh loop. PublisherManager.start() clears the interrupted flag via the previously-unused restart() methods, and SequencerClient.start() starts the publisher manager before the sequencer loop, so publishers are un-interrupted before the first post-restart publish. The funding loop is created once in the constructor and restarted across stop/start cycles, and a start that failed to load publisher state can be retried.
Graceful stop.
stop({ graceful: true }) halts the poll loop first, so the in-flight work() iteration and its pending L1 submission complete untouched instead of being interrupted mid-build (which emits a spurious checkpoint-error and drops the enqueued checkpoint). The drain runs before entering STOPPING, since in that state the iteration's own setState calls throw SequencerInterruptedError, which would fail the very iteration being drained. Default stop() keeps the fast interrupting shutdown for production teardown. This replaces the wait-for-IDLE heuristic from #24450, which had a race: sequencers reach IDLE at different times, and any of them could re-enter work() before the stop landed, flaking any test that asserts no sequencer failure events.
No stale sends across restart.
The two fire-and-forget fallback sends (vote-and-prune when we cannot build, escape-hatch votes) are tracked, then interrupted and awaited once the poll loop has stopped. A sender sleeping until its target slot therefore cannot survive a stop and publish a stale-slot tx after a restart clears the pooled publishers' interrupted flag: interrupting the wrapper publisher is permanent, since wrappers are never restarted.
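The ordering constraint behind the graceful stop (drain the in-flight iteration before entering STOPPING) can be sketched as below. `SequencerSketch`, its phases, `beginIteration`, and the string outcomes are all hypothetical and heavily simplified; the only behaviour taken from the description is that state changes throw once the sequencer is STOPPING.

```typescript
// Hypothetical sketch of drain-before-STOPPING; not the PR's Sequencer.
type Phase = "IDLE" | "BUILDING" | "PUBLISHING" | "STOPPING" | "STOPPED";

function interruptedError(): Error {
  const err = new Error("Sequencer was interrupted");
  err.name = "SequencerInterruptedSketch";
  return err;
}

class SequencerSketch {
  private phase: Phase = "IDLE";
  private inFlight: Promise<string> | undefined;

  /** State changes are refused once a stop is in progress. */
  private setState(next: Phase): void {
    if (this.phase === "STOPPING" || this.phase === "STOPPED") {
      throw interruptedError();
    }
    this.phase = next;
  }

  /** One iteration; each await is a point where a stop() can land. */
  private async work(): Promise<string> {
    try {
      this.setState("BUILDING");
      await Promise.resolve();
      this.setState("PUBLISHING");
      await Promise.resolve();
      this.setState("IDLE");
      return "published";
    } catch (err) {
      if (err instanceof Error && err.name === "SequencerInterruptedSketch") {
        return "checkpoint-error";
      }
      throw err;
    }
  }

  public beginIteration(): void {
    this.inFlight = this.work();
  }

  /** Returns the outcome of the in-flight iteration, if there was one. */
  public async stop(opts: { graceful?: boolean } = {}): Promise<string | undefined> {
    if (opts.graceful) {
      // Drain first: the iteration can still change state and finish normally.
      const outcome = await this.inFlight;
      this.phase = "STOPPED";
      return outcome;
    }
    // Fast path: enter STOPPING immediately, so the iteration's next setState throws.
    this.phase = "STOPPING";
    const outcome = await this.inFlight;
    this.phase = "STOPPED";
    return outcome;
  }
}
```

In this sketch a graceful stop lets the iteration finish as "published", while the default stop turns the same iteration into a "checkpoint-error".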
e2e primitive + pilot.
warpWithSequencersStopped(nodes, warpFn, opts) on SingleNodeTestContext (inherited by MultiNodeTestContext) gracefully stops every sequencer, runs the warp with nobody building, and restarts them (restart: false leaves them stopped). Archivers, provers, and the chain monitor keep running, so clock-driven effects such as an orphan-block prune still fire. Applied to pipeline_prune to collapse its ~2-minute dead wait for the orphan slot's prune deadline: warp two L1 slots into the slot after the orphaned one, and restart the sequencers only once the prune is confirmed. TX_COUNT is unchanged, so assertMultipleBlocksPerSlot and assertProposerPipelining still hold.
Lifecycle calls are assumed to be serialized by the caller (all current callers are); this does not add a mutex for truly concurrent start/stop.
The pipeline_prune speedup is validated by this PR's CI run; the multi-node suite can't be run locally.