
fix(sequencer): make lifecycle idempotent and restartable, with graceful stop #24475

Open
spalladino wants to merge 6 commits into merge-train/spartan-v5 from spl/sequencer-graceful-stop-restart


@spalladino


Supersedes #24449 and #24450, reimplementing both on a simpler design. Single PR since the lifecycle work is only exercised by the e2e primitive.

Context

Several multi-node e2e tests spend minutes of wall-clock time waiting for the L1 clock to roll while live sequencers sit idle. Warping the shared clock under a running sequencer interrupts whatever iteration is mid-build, producing the reorg / "Sequencer was interrupted" failures behind earlier revert attempts. Pausing sequencers around the warp requires a lifecycle that can actually stop and restart, which the existing one could not: a second start() orphaned the previous poll loop (two loops racing), and a stopped-then-restarted sequencer never cleared the publishers' interrupted flag, so it would build and propose blocks but silently never publish to L1.

Approach

Idempotent, restartable lifecycle. start()/stop() are idempotent across SequencerClient, Sequencer, and PublisherManager, and start() while STOPPING is refused so a mid-stop start cannot orphan a fresh loop. PublisherManager.start() clears the interrupted flag via the previously-unused restart() methods, and SequencerClient.start() starts the publisher manager before the sequencer loop, so publishers are un-interrupted before the first post-restart publish. The funding loop is created once in the constructor and restarted across stop/start cycles, and a start that failed to load publisher state can be retried.
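
A minimal sketch of the lifecycle rules described above, not the actual classes. `SketchSequencer`, `SketchPublisher`, `loopsStarted`, and the `drain` callback are hypothetical names introduced only for illustration; the sketch assumes start-while-STOPPING is refused by throwing.

```typescript
// Illustrative only: all names here are hypothetical, not the real implementation.
type LifecycleState = 'STOPPED' | 'RUNNING' | 'STOPPING';

class SketchPublisher {
  private interrupted = false;
  interrupt(): void {
    this.interrupted = true;
  }
  restart(): void {
    this.interrupted = false;
  }
  // Returns false when a send would be silently dropped by the interrupted flag.
  publish(): boolean {
    return !this.interrupted;
  }
}

class SketchSequencer {
  private state: LifecycleState = 'STOPPED';
  private readonly publisher: SketchPublisher;
  private readonly drain: () => Promise<void>;
  loopsStarted = 0;

  constructor(publisher: SketchPublisher, drain: () => Promise<void>) {
    this.publisher = publisher;
    this.drain = drain;
  }

  start(): void {
    if (this.state === 'RUNNING') {
      return; // idempotent: never spawn a second poll loop
    }
    if (this.state === 'STOPPING') {
      throw new Error('Cannot start while STOPPING');
    }
    // Clear the interrupted flag before the loop can attempt its first publish.
    this.publisher.restart();
    this.loopsStarted++;
    this.state = 'RUNNING';
  }

  async stop(): Promise<void> {
    if (this.state !== 'RUNNING') {
      return; // idempotent: already stopped or stopping
    }
    this.state = 'STOPPING';
    this.publisher.interrupt();
    try {
      await this.drain();
    } finally {
      this.state = 'STOPPED';
    }
  }
}
```

The ordering inside `start()` is the point: the publisher is un-interrupted before the loop counter advances, mirroring the publisher-manager-first start order.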

Graceful stop. stop({ graceful: true }) halts the poll loop first, so the in-flight work() iteration and its pending L1 submission complete untouched instead of being interrupted mid-build (which emits a spurious checkpoint-error and drops the enqueued checkpoint). The drain runs before entering STOPPING, since in that state the iteration's own setState calls throw SequencerInterruptedError — which would fail the very iteration being drained. Default stop() keeps the fast interrupting shutdown for production teardown. This replaces the wait-for-IDLE heuristic from #24450, which had a race: sequencers reach IDLE at different times, and any of them could re-enter work() before the stop landed, flaking any test that asserts no sequencer failure events.
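
The ordering difference between the two stop modes can be sketched as follows. This is a simplified model under stated assumptions: `SketchLoop`, `runIteration`, and `checkpoint` are hypothetical stand-ins for the poll loop, `work()`, and the `setState` calls, and the interrupt error is modelled as a plain `Error` with a matching name.

```typescript
// Illustrative only: a toy model of graceful vs. interrupting stop.
const INTERRUPTED = 'SequencerInterruptedError';

class SketchLoop {
  private stopping = false;
  private inFlight: Promise<void> | undefined;
  completed = 0;
  failed = 0;

  // Stand-in for a state transition inside an iteration: throws once STOPPING is entered.
  private checkpoint(): void {
    if (this.stopping) {
      const err = new Error('interrupted');
      err.name = INTERRUPTED;
      throw err;
    }
  }

  // Starts one iteration that waits on `step` (e.g. a pending L1 submission).
  runIteration(step: Promise<void>): void {
    this.inFlight = (async () => {
      try {
        this.checkpoint();
        await step;
        this.checkpoint(); // an interrupting stop becomes visible here
        this.completed++;
      } catch (err) {
        if (!(err instanceof Error) || err.name !== INTERRUPTED) {
          throw err;
        }
        this.failed++;
      }
    })();
  }

  async stop(opts: { graceful?: boolean } = {}): Promise<void> {
    if (opts.graceful) {
      await this.inFlight; // drain first, while checkpoints still pass
      this.stopping = true;
    } else {
      this.stopping = true; // fast path: the in-flight iteration fails at its next checkpoint
      await this.inFlight;
    }
  }
}
```

With `graceful: true` the iteration finishes and is counted in `completed`; with the default it lands in `failed`, which corresponds to the spurious failure event the graceful path avoids.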

No stale sends across restart. The two fire-and-forget fallback sends (vote-and-prune when we cannot build, escape-hatch votes) are tracked, then interrupted and awaited once the poll loop has stopped. A sender sleeping until its target slot therefore cannot survive a stop and publish a stale-slot tx after a restart clears the pooled publishers' interrupted flag: interrupting the wrapper publisher is permanent, since wrappers are never restarted.

e2e primitive + pilot. warpWithSequencersStopped(nodes, warpFn, opts) on SingleNodeTestContext (inherited by MultiNodeTestContext) gracefully stops every sequencer, runs the warp with nobody building, and restarts them (restart: false leaves them stopped). Archivers, provers, and the chain monitor keep running, so clock-driven effects such as an orphan-block prune still fire. Applied to pipeline_prune to collapse its ~2-minute dead wait for the orphan slot's prune deadline: warp two L1 slots into the slot after the orphaned one, and restart the sequencers only once the prune is confirmed. TX_COUNT is unchanged, so assertMultipleBlocksPerSlot and assertProposerPipelining still hold.

Lifecycle calls are assumed to be serialized by the caller (all current callers are); this does not add a mutex for truly concurrent start/stop.

The pipeline_prune speedup is validated by this PR's CI run — the multi-node suite can't be run locally.

…ful stop

Sequencer.start() and PublisherManager.start() were not idempotent: a second
start() orphaned the previous RunningPromise poll loop, and a stopped-then-
restarted sequencer never cleared the publishers' interrupted flag, so it
would build blocks but silently never publish to L1.

- Make start()/stop() idempotent across SequencerClient, Sequencer, and
  PublisherManager; refuse start() while STOPPING to avoid orphaning a fresh
  loop mid-stop.
- Add a working restart path: PublisherManager.start() clears the interrupted
  flag via the previously-unused restart() methods, and SequencerClient.start()
  starts the publisher manager before the sequencer loop so publishers are
  un-interrupted before the first publish. The funding loop is created once in
  the constructor and restarted across stop/start cycles.
- Add stop({graceful: true}): halt the poll loop first so the in-flight work()
  iteration and its pending L1 submission complete untouched, instead of being
  interrupted mid-build (which emits a spurious checkpoint-error and drops the
  enqueued checkpoint). Default stop() keeps the fast interrupting shutdown.
- Track the two fire-and-forget fallback sends (vote-and-prune, escape-hatch)
  and interrupt+await them once the poll loop has stopped, so no sleeping
  sender survives a stop and publishes a stale-slot tx after a restart.

Lifecycle calls are assumed to be serialized by the caller (all current
callers are).
Add warpWithSequencersStopped(nodes, warpFn, opts) to SingleNodeTestContext
(inherited by MultiNodeTestContext): gracefully stop every sequencer (the
in-flight iteration and its pending L1 submission complete untouched, and any
pending fallback vote is drained), run warpFn with nobody building, then
restart them (default; restart: false leaves them stopped). Archivers,
provers, and the chain monitor keep running, so clock-driven effects such as
an orphan-block prune still fire.

Apply it to pipeline_prune to collapse the ~2-minute dead gap where the chain
just waits for the L1 clock to roll past the orphan slot's prune deadline:
once the orphan blocks exist and publishing is disabled, warp two L1 slots
into the slot after the orphaned one, and restart the sequencers only after
the prune is confirmed. TX_COUNT is unchanged, so assertMultipleBlocksPerSlot
and assertProposerPipelining still hold.

The speedup can only be validated in CI; the multi-node suite cannot run
locally.
Comment thread yarn-project/sequencer-client/src/sequencer/sequencer.ts
Comment thread yarn-project/end-to-end/src/single-node/single_node_test_context.ts Outdated
Comment thread yarn-project/sequencer-client/src/sequencer/sequencer.ts Outdated
…sends via RequestsTracker

Addresses review feedback on the lifecycle PR:
- start() now throws instead of warning when called while STOPPING, so a caller cannot
  believe the sequencer started when the in-flight stop will drive it to STOPPED. A start
  while already running stays an idempotent no-op.
- Extract RequestsTracker, a self-removing bag of {promise, interrupt?}, and use it for both
  the fire-and-forget fallback sends in the sequencer and the L1 submission in the checkpoint
  proposal job, replacing the bespoke pendingFallbackSends map and the pendingL1Submission field.
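
A rough sketch of what a self-removing bag of {promise, interrupt?} might look like. `SketchRequestsTracker`, `track`, and `interruptAndAwaitAll` are hypothetical names; the real RequestsTracker API is not shown in this PR description and may differ.

```typescript
// Illustrative only: method names are assumptions, not the real RequestsTracker API.
type TrackedRequest = {
  promise: Promise<unknown>;
  interrupt: (() => void) | undefined;
};

class SketchRequestsTracker {
  private readonly pending = new Set<TrackedRequest>();

  // Registers a request; the entry removes itself once the promise settles.
  track(promise: Promise<unknown>, interrupt?: () => void): void {
    const entry: TrackedRequest = { promise, interrupt };
    this.pending.add(entry);
    const remove = () => {
      this.pending.delete(entry);
    };
    promise.then(remove, remove);
  }

  get size(): number {
    return this.pending.size;
  }

  // Interrupts everything still pending, then waits for all of it to settle.
  async interruptAndAwaitAll(): Promise<void> {
    const entries = Array.from(this.pending);
    for (const entry of entries) {
      entry.interrupt?.();
    }
    // Swallow rejections: an interrupted request is expected to fail.
    await Promise.all(entries.map(e => e.promise.then(() => undefined, () => undefined)));
  }
}
```

Calling `interruptAndAwaitAll()` after the poll loop has stopped is what guarantees that no sleeping sender outlives the stop.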
Replace the warpFn callback in warpWithSequencersStopped with a cheat-codes
instance plus an absolute target timestamp, and perform the warp internally
only after the graceful stop has drained. A graceful stop can span several
slots, so a target computed before the stop may already be in the past by the
time the sequencers are down; skip the warp when the L1 clock has already
reached or passed the target so evm_setNextBlockTimestamp cannot reject it.
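
The resulting control flow can be sketched like this. The interfaces `SketchSequencerHandle` and `SketchCheatCodes`, the function name, and the use of plain `number` timestamps are assumptions for illustration; the real helper lives on SingleNodeTestContext and takes nodes rather than sequencer handles.

```typescript
// Illustrative only: interfaces and names are hypothetical stand-ins.
interface SketchSequencerHandle {
  stop(opts: { graceful: boolean }): Promise<void>;
  start(): Promise<void>;
}

interface SketchCheatCodes {
  timestamp(): Promise<number>;
  warp(target: number): Promise<void>;
}

// Returns true when a warp was performed, false when it was skipped.
async function warpWithSequencersStoppedSketch(
  sequencers: SketchSequencerHandle[],
  cheatCodes: SketchCheatCodes,
  targetTimestamp: number,
  opts: { restart?: boolean } = {},
): Promise<boolean> {
  // 1. Drain every sequencer; this may take several slots.
  await Promise.all(sequencers.map(s => s.stop({ graceful: true })));

  // 2. Read the clock only after the drain, since the target may already be in the past.
  const now = await cheatCodes.timestamp();
  const shouldWarp = now < targetTimestamp;
  if (shouldWarp) {
    await cheatCodes.warp(targetTimestamp);
  }

  // 3. Restart unless the caller asked to keep the sequencers stopped.
  if (opts.restart !== false) {
    await Promise.all(sequencers.map(s => s.start()));
  }
  return shouldWarp;
}
```

Reading the clock after the stop, rather than before, is what makes the skip check meaningful.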
…top/start

ValidatorHASigner.stop() closed the slashing-protection LMDB/Postgres store, but start()
never reopened it (the AztecLMDBStoreV2 instance is not reopenable). A restarted sequencer
therefore threw 'Store is closed' from SlashingProtectionService.checkAndRecord on every
block-proposal signing attempt, stalling recovery until the executeTimeout fired.

Make stop() a pause that leaves the store open (matching the restartable lifecycle model of
the sequencer and publisher manager); the store is released when the process exits (the node
calls process.exit after stopping). Adds a red/green LMDB-backed restart test.
@spalladino


CI root cause + fix (pipeline_prune)

The pipeline_prune recovery test was red, and the cause was not the warp-in-the-past hypothesis (the warp succeeded and the prune fired, block 5→14). The real failure was a 6-minute executeTimeout: every restarted sequencer threw Error: Store is closed on each block-proposal signing attempt:

AztecLMDBStoreV2.transactionAsync
  -> LmdbSlashingProtectionDatabase.tryInsertOrGetExisting
  -> SlashingProtectionService.checkAndRecord

ValidatorHASigner.stop() closed the slashing-protection store, but start() never reopened it (the AztecLMDBStoreV2 instance isn't reopenable), so recovery never made progress after the restart.

Fixed in 9de9057: stop() is now a pause that leaves the store open (matching the restartable-lifecycle model this PR already applies to the sequencer and publisher manager); the store is released on process exit (the node calls process.exit after stopping). Added a red/green LMDB-backed restart test in validator_ha_signer.test.ts that reproduces the exact Store is closed failure and passes with the fix.

The previous commit made ValidatorHASigner.stop() a pause that leaves the slashing-protection
store open so the sequencer can be restarted, which meant nothing closed the store on a real
shutdown. Add an explicit teardown path that closes it: ValidatorHASigner.close(),
ValidatorClient.close() (which closes the store only when this client created it — not when a
shared database was injected, as in HA test setups), SequencerClient.close(), and a call to it
from AztecNodeService.stop(). stop() remains a restartable pause; close() is the closed-for-good
teardown that releases the database connection / file lock.

Adds a test asserting close() releases the store while stop() keeps it usable.
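
The stop()/close() split can be illustrated with a toy model. `SketchStore`, `SketchSigner`, and the `ownsStore` flag are hypothetical; they only mirror the described behavior (stop() pauses, close() releases the store, and an injected shared store is left open).

```typescript
// Illustrative only: a toy model of pause (stop) vs. teardown (close).
class SketchStore {
  private closed = false;
  write(key: string): string {
    if (this.closed) {
      throw new Error('Store is closed');
    }
    return key;
  }
  close(): void {
    this.closed = true;
  }
  isClosed(): boolean {
    return this.closed;
  }
}

class SketchSigner {
  private running = false;
  private readonly store: SketchStore;
  private readonly ownsStore: boolean;

  // ownsStore is false when a shared store was injected by the caller.
  constructor(store: SketchStore, ownsStore: boolean) {
    this.store = store;
    this.ownsStore = ownsStore;
  }

  start(): void {
    this.running = true;
  }

  // Restartable pause: the store stays open so a later start() can keep signing.
  stop(): void {
    this.running = false;
  }

  // Closed-for-good teardown: releases the store only if this instance created it.
  close(): void {
    this.stop();
    if (this.ownsStore) {
      this.store.close();
    }
  }

  sign(message: string): string {
    if (!this.running) {
      throw new Error('Signer is not running');
    }
    return this.store.write(message);
  }
}
```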