Skip to content

fix(sequencer): make lifecycle idempotent and restartable#24449

Open
spalladino wants to merge 3 commits into
merge-train/spartan-v5from
spl/sequencer-restart-idempotency
Open

fix(sequencer): make lifecycle idempotent and restartable#24449
spalladino wants to merge 3 commits into
merge-train/spartan-v5from
spl/sequencer-restart-idempotency

Conversation

@spalladino

Copy link
Copy Markdown
Contributor

Context

Sequencer.start() and PublisherManager.start() were not idempotent: a second start() orphaned the previous RunningPromise poll loop (leaving two loops racing), and a stopped-then-restarted sequencer never cleared the publishers' interrupted flag — so it would build and propose blocks but silently never publish to L1. This blocked merging sequential scenarios in tests (invalidate_block.parallel needed a forced stop between scenarios) and is a prerequisite for stop→drain→warp→restart test tooling.

Approach

  • Make start()/stop() idempotent across SequencerClient, Sequencer, and PublisherManager: a redundant start() no-ops instead of orphaning a loop, and start() while STOPPING is refused to close the start-races-stop race.
  • Add a working restart path: PublisherManager.start() now clears the interrupted flag (via the previously-unused restart() methods on the publishers and funder), and SequencerClient.start() starts the publisher manager before the sequencer loop so publishers are un-interrupted before the first publish.
  • Prevent a stale-slot tx across restart: the two fire-and-forget fallback sends (vote-and-prune / escape-hatch) are tracked, and stop() interrupts each wrapper's independent sleep and awaits it after the poll loop stops, so no dangling sleeper survives to be re-armed by a later restart.
  • stop() remains a complete terminal drain for production teardown; it is merely also re-startable once fully resolved.

Lifecycle calls are assumed to be serialized by the caller (all current callers are); this does not add a mutex for truly concurrent start/stop.

The sequencer's start/stop chain was not idempotent and had no working
restart: a second start() news up a fresh RunningPromise and orphans the
prior poll loop, and PublisherManager.start() after a stop() never cleared
the publishers' interrupted flag, so a restarted sequencer would build and
propose but every L1 publish would throw InterruptError.

Changes:
- Sequencer.start() is now idempotent (no-op if the poll loop is running),
  so a double start never orphans a second loop.
- Sequencer.stop() is idempotent and now drains the fire-and-forget fallback
  sends (tryVoteAndPruneWhenCannotBuild / tryVoteWhenEscapeHatchOpen): each
  send's wrapper publisher is tracked, interrupted on stop (waking its own
  interruptible sleep, independent of the pooled L1TxUtils), and awaited.
  This removes the dangling-sleeper that a later restart would otherwise
  re-arm into a stale-slot publish once interrupted is cleared.
- PublisherManager.start() calls restart() on every publisher and the funder
  to clear the interrupted flag (re-enabling L1 publishing after a stop), and
  only starts the funding loop when it is not already running. stop() is
  idempotent and drops the funding-loop reference.
- SequencerClient.start() starts the publisher manager before the sequencer's
  poll loop so publishers are un-interrupted before the first publish; stop()
  remains a complete terminal teardown.

stop() is preserved as a full drain and terminal shutdown; the added restart
path is layered on top without weakening it.

Tests: idempotent double-start/stop, restart clears the interrupted flag and
re-enables publishing, full terminal stop tears everything down, and no stale
fallback tx survives a stop across PublisherManager/Sequencer/RunningPromise.
Follow-up hardening from review of the lifecycle idempotency change:

- Sequencer.start() now refuses while state is STOPPING. Without this, a
  start() landing after stop() set STOPPING but before it finished could
  allocate a fresh poll loop that the in-flight stop() would then mark
  STOPPED, orphaning a running loop under a stopped state.
- Move the fallback-wrapper interrupt out of the pre-stopAll loop and into
  drainPendingFallbackSends(), which runs after runningPromise.stop(). The
  last in-flight work() iteration registers its fire-and-forget fallback send
  synchronously before returning, so only after the poll loop has stopped is
  pendingFallbackSends final; interrupting there catches that last send too.
  Interrupting before runningPromise.stop() would miss it, leaving a sleeper
  that a later restart could re-arm into a stale-slot publish.

Adds tests for both: start-while-stopping is refused, and a fallback send
registered after stop() began is still interrupted and drained.
…started

A bare double start() re-ran loadStateAndResumeMonitoring(), which spawns a
background monitor per pending nonce — and those monitors can send speed-up
and cancellation txs, so a duplicate monitor for the same nonce could race
the original. Guard start() with a `started` flag so a redundant start is a
no-op; stop() clears the flag so a genuine restart-after-stop still reloads
state and resumes monitoring of in-flight txs.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

S-do-not-merge Status: Do not merge this PR

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant