Server lifecycle: event transition table + CAS single-writer#129
Open
ymichael wants to merge 12 commits into
Step 1 of plans/server-lifecycle-transition-core.md: inventory all thread
status transition call sites, derive an event vocabulary, and encode the
observed (from, event, to) triples plus per-event supersession predicates
as pure data in @bb/domain. Behavior-neutral: cells that look wrong are
kept with `// observed:` comments; tightening is a separate follow-up. No
caller changes yet.
Event vocabulary (14, no payloads; the threads row stores no turn id and
nothing in table/predicates/writer consumes event data): turn.started,
turn.completed, turn.failed, turn.interrupted, runtime.exited,
turn.dispatched, reprovision.started, start.succeeded, command.failed,
provision.failed, workspace.lost, stop.completed, session.lost,
runtime.observed-active.
The full per-site inventory (19 transition call sites across 12 files,
turn/completed splitting into 3 events) and the 58-line status-guard sweep
classification (15 become predicates/table cells, 42 non-lifecycle, 1
suspicious + 3 suspicious call-site observations) live as the header
comment of packages/domain/test/thread-lifecycle.test.ts for review.
Notable findings:
- internal/events.ts:313 skips turn/started activation when
stopRequestedAt is set, contradicting plan decision point 3's parity
assumption; turn.started therefore carries notStopRequested.
- thread-schedule-sweep dispatch+activation has no stopRequestedAt guard.
- queued-messages dispatch never re-checks deletedAt/stopRequestedAt in
its transaction after claiming the message.
- thread-send applies error→active optimistically at dispatch.
- environment-level provision failure errors every live thread bound to
the environment regardless of its own status (created/idle/active).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
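To make the "pure data" shape concrete, here is a minimal sketch of a transition table with per-event supersession predicates and a pure evaluator. It is not the code in @bb/domain: the reduced status and event sets, the names `LIFECYCLE`, `EventSpec`, and `evaluate`, the specific cells, and the choice to check predicates before cells are all assumptions made for illustration.

```typescript
// Hypothetical, reduced shapes; the real table has 14 events.
type ThreadStatus = "created" | "provisioning" | "idle" | "active" | "error";
type ThreadEvent = "turn.started" | "turn.completed" | "turn.failed";
type Predicate = "notDeleted" | "notStopRequested";

interface ThreadRowFlags {
  status: ThreadStatus;
  deletedAt: number | null;
  stopRequestedAt: number | null;
}

// Row-level predicates: an event is superseded when any declared one fails.
const PREDICATES: Record<Predicate, (row: ThreadRowFlags) => boolean> = {
  notDeleted: (row) => row.deletedAt === null,
  notStopRequested: (row) => row.stopRequestedAt === null,
};

interface EventSpec {
  predicates: readonly Predicate[];
  // from-status -> to-status; a missing key means the transition is illegal.
  cells: Partial<Record<ThreadStatus, ThreadStatus>>;
}

// Illustrative cells only (assumed, not copied from the real table).
const LIFECYCLE: Record<ThreadEvent, EventSpec> = {
  "turn.started": {
    predicates: ["notDeleted", "notStopRequested"],
    cells: { idle: "active", error: "active" },
  },
  "turn.completed": {
    predicates: ["notDeleted"],
    // observed: a late completion flips a provisioning thread to idle
    cells: { active: "idle", provisioning: "idle" },
  },
  "turn.failed": {
    predicates: ["notDeleted", "notStopRequested"],
    cells: { active: "error" },
  },
};

type Evaluation =
  | { kind: "transition"; to: ThreadStatus }
  | { kind: "superseded"; failed: Predicate }
  | { kind: "illegal-transition" };

// Pure function: no I/O, so the table can be tested exhaustively.
function evaluate(row: ThreadRowFlags, event: ThreadEvent): Evaluation {
  const spec = LIFECYCLE[event];
  const failed = spec.predicates.find((name) => !PREDICATES[name](row));
  if (failed !== undefined) return { kind: "superseded", failed };
  const to = spec.cells[row.status];
  return to === undefined
    ? { kind: "illegal-transition" }
    : { kind: "transition", to };
}
```

Because callers pass only an event name, a questionable cell such as the `// observed:` one above is visible in one place rather than spread across call sites.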
Step 2 of plans/server-lifecycle-transition-core.md. Adds
applyThreadLifecycleEvent (own transaction, notifies status-changed only
when applied) and applyThreadLifecycleEventInTransaction (caller owns the
transaction and notification) to packages/db. In one transaction: load
row, run the pure domain evaluator (THREAD_LIFECYCLE + supersession
predicates), then UPDATE … WHERE id AND status = <loaded status>
RETURNING as a compare-and-set. latestAttentionAt logic is shared with
the old writer via statusTransitionNeedsAttention, byte-for-byte parity
covered by tests.
Returns a typed outcome instead of throwing:
{applied: true, thread} | {applied: false, reason:
"illegal-transition" | "superseded" | "not-found" | "cas-conflict",
detail}. requireThreadLifecycleEventApplied(outcome) throws a typed
ThreadLifecycleEventNotAppliedError for boundary callers where a no-op
is a real 4xx. Outcome logging stays at the server layer (db has no
logger).
The old transitionThreadStatus / transitionThreadStatusInTransaction API
is untouched but marked @deprecated; callers migrate in step 3.
Tests (in-memory SQLite + real migrate): applied paths (standalone and
in-transaction), illegal-transition / superseded (deletedAt,
stopRequestedAt) / not-found no-ops leaving rows untouched and
unnotified, sequential-stale second event, latestAttentionAt parity with
the old writer across four representative transitions, and a genuine
cas-conflict via a pass-through proxy that issues a real concurrent
UPDATE between the writer's load and its CAS update (better-sqlite3's
synchronous transactions make that interleave unreachable through the
public API alone).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
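The load / evaluate / compare-and-set sequence and the typed outcome can be sketched without a database. The following is an assumption-laden illustration, not the packages/db writer: `ThreadStore` is an in-memory stand-in for the threads table, `applyEvent` and `TABLE` are hypothetical names, `detail` is assumed to be a string, and the `betweenLoadAndCas` hook exists only to imitate the proxy interleave the tests use to reach the cas-conflict branch.

```typescript
type Status = "idle" | "active" | "error";
type LifecycleEvent = "turn.started" | "turn.completed";

interface ThreadRow {
  id: string;
  status: Status;
  stopRequestedAt: number | null;
}

type Outcome =
  | { applied: true; thread: ThreadRow }
  | {
      applied: false;
      reason: "illegal-transition" | "superseded" | "not-found" | "cas-conflict";
      detail: string;
    };

// In-memory stand-in for the threads table (assumption for the sketch).
class ThreadStore {
  private rows = new Map<string, ThreadRow>();

  insert(row: ThreadRow): void {
    this.rows.set(row.id, { ...row });
  }

  load(id: string): ThreadRow | undefined {
    const row = this.rows.get(id);
    return row ? { ...row } : undefined;
  }

  // Mimics: UPDATE … SET status = ? WHERE id = ? AND status = ? RETURNING
  updateStatusWhere(id: string, expected: Status, next: Status): ThreadRow | undefined {
    const row = this.rows.get(id);
    if (!row || row.status !== expected) return undefined;
    row.status = next;
    return { ...row };
  }
}

// Illustrative cells only.
const TABLE: Record<LifecycleEvent, Partial<Record<Status, Status>>> = {
  "turn.started": { idle: "active", error: "active" },
  "turn.completed": { active: "idle" },
};

function applyEvent(
  store: ThreadStore,
  id: string,
  event: LifecycleEvent,
  betweenLoadAndCas?: () => void, // test-only hook to simulate a concurrent write
): Outcome {
  const loaded = store.load(id);
  if (!loaded) return { applied: false, reason: "not-found", detail: `no thread ${id}` };
  if (event === "turn.started" && loaded.stopRequestedAt !== null) {
    return { applied: false, reason: "superseded", detail: "notStopRequested" };
  }
  const next = TABLE[event][loaded.status];
  if (next === undefined) {
    return { applied: false, reason: "illegal-transition", detail: `${loaded.status} + ${event}` };
  }
  betweenLoadAndCas?.();
  // Compare-and-set: only succeeds if status is still what we loaded.
  const updated = store.updateStatusWhere(id, loaded.status, next);
  return updated
    ? { applied: true, thread: updated }
    : { applied: false, reason: "cas-conflict", detail: `status moved from ${loaded.status}` };
}
```

Returning a discriminated union rather than throwing lets background callers log and move on, while boundary callers can still convert a non-applied outcome into an error.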
…vents
Step 3a of plans/server-lifecycle-transition-core.md: migrate the hottest
thread transition sites — daemon turn events, runtime exit, reprovision
dispatch, and stop-completion finalize — from target-status writes
(tryTransition / transitionThreadStatusInTransaction) to
applyThreadLifecycleEvent with the inventoried events: turn.started,
turn.completed, turn.failed, turn.interrupted, runtime.exited,
reprovision.started, stop.completed, session.lost.
Adds services/threads/lifecycle-outcome.ts, the one server-side wrapper
pair that logs every non-applied outcome ({threadId, event, reason,
detail}) so stale events become observable instead of silently swallowed.
Guards deleted because they are now the event's predicates or table cells:
- internal/events.ts turn/started: stopRequestedAt null guard
(turn.started carries notStopRequested) and the pre-start/idle/error
from-status guard (table cells).
- internal/events.ts provider_process_exited: stopRequestedAt null guard
(runtime.exited carries notStopRequested).
- internal/turn-completed-events.ts: the stopRequestedAt guard for failed
turns (turn.failed predicate) and the from-status guard for completed
turns (table cells).
Stronger caller-side guards kept (per the inventory's weakest-common-set
judgment):
- hasThreadStopBeforeTurnStarted / event-log staleness checks stay at the
callers (rows cannot express them).
- thread-turn-dispatch keeps the full idle-or-recoverable-errored guard:
the error arm requires no provider thread id, which is stronger than the
error->provisioning reprovision.started cell, and the composite cannot
be split without changing dispatch routing.
Behavior notes (race-window only): applyTurnCompletedEvent now gates its
pruning side effects and returned nextStatus on outcome.applied, so a
turn.interrupted arriving for an already-idle thread no longer re-runs
idle pruning that the old code ran even when the transition was swallowed;
pruning re-runs on the next applied idle transition.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
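A rough sketch of the two ideas in this commit: log every non-applied outcome, and gate side effects on `outcome.applied`. The names `reportOutcome`, `settleInterruptedTurn`, `LogFn`, and `ThreadSnapshot` are hypothetical, and the injected `apply` and `prune` callbacks stand in for the real writer and pruning code, whose signatures are not shown in this PR text.

```typescript
type NotAppliedReason = "illegal-transition" | "superseded" | "not-found" | "cas-conflict";

type Outcome<T> =
  | { applied: true; thread: T }
  | { applied: false; reason: NotAppliedReason; detail: string };

interface OutcomeLogEntry {
  threadId: string;
  event: string;
  reason: NotAppliedReason;
  detail: string;
}

type LogFn = (entry: OutcomeLogEntry) => void;

// Logs non-applied outcomes so stale events are observable, then passes the
// outcome through unchanged for the caller to branch on.
function reportOutcome<T>(
  log: LogFn,
  threadId: string,
  event: string,
  outcome: Outcome<T>,
): Outcome<T> {
  if (!outcome.applied) {
    log({ threadId, event, reason: outcome.reason, detail: outcome.detail });
  }
  return outcome;
}

interface ThreadSnapshot {
  id: string;
  status: string;
}

// Side effects (here: pruning) run only when the transition was applied.
function settleInterruptedTurn(
  threadId: string,
  apply: (threadId: string, event: string) => Outcome<ThreadSnapshot>,
  prune: (threadId: string) => void,
  log: LogFn,
): ThreadSnapshot | null {
  const outcome = reportOutcome(log, threadId, "turn.interrupted", apply(threadId, "turn.interrupted"));
  if (!outcome.applied) return null;
  prune(threadId);
  return outcome.thread;
}
```

With this shape, a `turn.interrupted` that arrives for an already-idle thread produces one log entry and no pruning, which matches the behavior note above.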
…ycle events
Step 3b of plans/server-lifecycle-transition-core.md: migrate the
thread-side provisioning transitions to the event API.
- failThreadProvisioning (thread-provisioning-environment.ts) sends
provision.failed instead of tryTransition(error).
- recordEnvironmentProvisioningFailureInTransaction
(environment-provisioning-internal.ts) sends provision.failed per live
thread; the status-changed notify is now gated on outcome.applied,
matching the old tryTransitionInTransaction shim's notify-on-success.
- markLiveThreadsErroredAfterDestroySuccess
(environment-cleanup-internal.ts) sends workspace.lost, same gating.
Environment.status writes in these files are untouched (step 5).
Stronger caller-side guards kept (weakest-common-set): the
provision-context staleness checks (status === "provisioning" plus
notDeleted/notArchived/notStopRequested) at failThreadProvisioning's call
sites stay: provision.failed's table cells are wider (any live status, for
environment-level failure) and only carry notDeleted, and those guards
also gate context cleanup and the error-event append. Likewise
listLiveEnvironmentThreads / listLiveThreadsInEnvironment filtering and
shouldPreserveThreadProvisionCancellationOutcome (provisioningId-scoped,
event-log) stay at the callers because they gate the event appends too.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
… transition API
Step 3c of plans/server-lifecycle-transition-core.md: migrate the
remaining transition sites and remove the target-status API entirely.
- thread-schedule-sweep: due-schedule start dispatch sends turn.dispatched
(the path's missing stopRequestedAt guard is a documented latent bug and
is preserved; turn.dispatched declares no supersession predicates).
- thread-send / parent-system-messages: dispatch activation sends
turn.dispatched through requireThreadLifecycleEventApplied, preserving the
old throwing-writer semantics at these boundaries (a stale dispatch still
aborts the transaction). The error→active pre-ack optimistic activation in
thread-send is preserved as the observed error cell.
- queued-messages: auto-send claim sends turn.dispatched; the
claim-without-re-check latent bug is preserved (no new in-tx re-check).
- thread-lifecycle: command.failed, start.succeeded, session.lost /
stop.completed (daemon-restart interruption batches), and
runtime.observed-active (reconnect revival). The
canActivateThreadAfterSuccessfulStart guard shrinks to
isThreadStartActivationStale: its row-flag and from-status checks are now
start.succeeded's predicates and table cells; only the event-log staleness
checks remain. nextStatusForInterruptedThread is replaced by the
reason→event mapping.
- Deleted: services/threads/thread-transitions.ts (tryTransition shims)
and its test (outcome semantics are covered by the db writer tests);
transitionThreadStatus / transitionThreadStatusInTransaction /
transitionThreadStatusRecord, InvalidThreadStatusTransitionError,
TransitionThreadStatusInTransactionArgs, and ALLOWED_TRANSITIONS (now
dead; THREAD_LIFECYCLE in @bb/domain is the single source of truth) from
packages/db, plus their exports.
- db tests: the transitionThreadStatus suite's read-state/attention tests
are rewritten against applyThreadLifecycleEvent; legality and no-op cases
are already covered by the domain table tests and the writer tests, and
the writer's latestAttentionAt test now asserts expected values directly
instead of comparing against the deleted old writer.
Stronger caller-side guards kept (weakest-common-set):
- thread-send keeps "dispatchKind === turn.submit || status === error":
the error arm is dispatch routing (errored threads re-activate
optimistically on thread.start dispatches, idle ones do not), stronger
than the turn.dispatched cells.
- queued-messages keeps the idle-only auto-send routing guard; it gates
the entire claim-and-send flow, not just the transition.
- finalizeStoppedThreadInTransaction keeps its active/pre-start routing:
active picks the turn-interruption path; idle/error finalizes stay
no-transition without emitting noise events.
- Reconnect revival keeps the targeted SQL filters
(status/deletedAt/stopRequestedAt) and the event-log blocked-revival check.
Behavior notes (race-window only): interruptActiveThreads no longer aborts
the whole batch transaction when one thread's status changed between
selection and apply. That thread's event no-ops and is logged, and the
rest proceed (previously an InvalidThreadStatusTransitionError threw away
the entire batch including the appended interruption events).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Step 4 of plans/server-lifecycle-transition-core.md: the three
module-level Sets in thread-lifecycle.ts (activeThreadStartRpcThreadIds,
activeThreadStartGeneratedTitleSyncThreadIds,
activeThreadStopRpcThreadIds) become a single InFlightRpcGuard with
claim/release/isHeld keyed threadId × kind, defined in the lifecycle
owner module. Process-local by design: durable cross-restart intent
already lives on the thread row (stopRequestedAt/deletedAt); the guard
only prevents duplicate concurrent RPCs in one process — no status-ladder
growth.
The generated-title-sync entry is a flag riding the in-flight
thread.start RPC rather than a dedupe of its own, but its lifetime is
exactly claim-at-dispatch / read-at-settlement / release-with-the-RPC
keyed threadId × kind, so it is modeled as the third kind
("thread.start.title-sync") instead of keeping a parallel one-off Set.
Behavior identical: the stop path's check-then-add was a single
synchronous block, so it merges into one claim() call; the start path
keeps its pre-dispatch isHeld checks (claiming there would mark the RPC
in flight while the command is still being built, which observers like
requestThreadStopForCurrentState route on).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
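A minimal sketch of a process-local guard with the claim/release/isHeld surface described above, keyed by threadId × kind. Only the method names and the "thread.start.title-sync" kind come from the commit message; the other two kind strings, the key encoding, and the `runStopRpc` helper are assumptions for illustration.

```typescript
// "thread.start" and "thread.stop" are assumed names for the other two kinds.
type InFlightKind = "thread.start" | "thread.start.title-sync" | "thread.stop";

class InFlightRpcGuard {
  private readonly held = new Set<string>();

  // JSON-encode the pair so no threadId can collide with another key.
  private key(threadId: string, kind: InFlightKind): string {
    return JSON.stringify([threadId, kind]);
  }

  // Atomic check-then-add: returns false if the pair is already held.
  claim(threadId: string, kind: InFlightKind): boolean {
    const key = this.key(threadId, kind);
    if (this.held.has(key)) return false;
    this.held.add(key);
    return true;
  }

  release(threadId: string, kind: InFlightKind): void {
    this.held.delete(this.key(threadId, kind));
  }

  isHeld(threadId: string, kind: InFlightKind): boolean {
    return this.held.has(this.key(threadId, kind));
  }
}

// Hypothetical usage: dedupe concurrent stop RPCs within one process.
async function runStopRpc(
  guard: InFlightRpcGuard,
  threadId: string,
  rpc: () => Promise<void>,
): Promise<"sent" | "duplicate"> {
  if (!guard.claim(threadId, "thread.stop")) return "duplicate";
  try {
    await rpc();
    return "sent";
  } finally {
    // Release on success and failure so a failed RPC can be retried.
    guard.release(threadId, "thread.stop");
  }
}
```

Because `claim` is synchronous, the check and the add cannot be interleaved by another task on the same event loop, which is what lets the stop path's former check-then-add collapse into one call.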
Behavior-neutral first pass derived from the write-site inventory in the test header: 9 events, byPath targets for settled-state restores, and row-level supersession predicates (managed, cleanup intent, workspace path, destroyAttemptId match). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
applyEnvironmentLifecycleEvent (+InTransaction, +requireApplied) loads the row, evaluates ENVIRONMENT_LIFECYCLE, and compare-and-sets the status in one immediate transaction; the destroy claim re-asserts the cross-table thread conditions inside the same UPDATE. Adds the pure recordProvisionedEnvironmentWorkspace metadata writer (the status half of applyProvisionedEnvironmentRecord becomes a provision.succeeded event) and listStaleDestroyingManagedEnvironments for the sweep. The Proxy-interleave CAS helper moves to a shared test helper. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
All environment status writes now report what happened (provision.requested/succeeded/failed/cancelled, destroy.dispatched/ succeeded/failed/lost, cleanup.completed) through the logged CAS writer; the direct setters (setEnvironmentStatus, applyProvisionedEnvironmentRecord, setEnvironmentRecordDestroyed, claimEnvironmentDestroy, restoreEnvironmentAfterDestroyAttemptFailure, recoverStaleDestroyingEnvironmentCleanup) are deleted along with the dead claimManagedEnvironmentReprovisionRecord, the unreachable restoreEnvironmentAfterCleanupCancellation, and a provably no-op provision request on a freshly created environment. Surviving status guards are flow/routing concerns and carry comments saying so. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Server-service-level race tests for the lifecycle-event writers (step 6 of
plans/server-lifecycle-transition-core.md), each driven through the real
settlement/ingestion entry points and asserting outcome plus persisted rows:
- thread-live-start-handoff: a thread.start that succeeds while its stop RPC
is still in flight does not reactivate the thread (stop intent preserved).
- environment-provisioning: a provision success settling after the thread was
tombstoned by project deletion finalizes the thread instead of activating
it, and routes the orphaned workspace into cleanup.
- internal-events-tool-calls: a redelivered turn/completed for a settled idle
thread is an illegal-transition no-op; the row is byte-for-byte untouched.
- thread-stop-retry: re-finalizing an already-settled stop (daemon reconnect
reconciliation) changes nothing and appends no duplicate events.
- environment-cleanup: a destroy success settling after the environment was
revived by a reprovision request is an illegal-transition no-op; the row
and the revived thread are untouched.
CAS-conflict interleaving stays covered by the db writer tests
(packages/db/test/data/{thread,environment}-lifecycle.test.ts), which reach
the compare-and-set branch via the proxy interleave harness; it is
unreachable through the service API under synchronous transactions.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Classifies the status guards left in services/environments after the lifecycle-event migration. Seven are lifecycle-relevant survivors; four already carried comments from the migration, and this adds the missing three (preflight precondition, cleanup-advance routing, provision dispatch routing) plus brief "not lifecycle" notes on the boundary-validation and stop-routing reads whose role was not obvious from context. The remaining grep hits are read-path/display/event-data checks that need no comment. No guard was provably subsumed by the writers' predicates, so none were deleted (behavior preserved). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The transition tables are pure data, so the state machines can be visualized without a state-machine dependency: renderLifecycleMermaid turns a table + per-event supersession predicates into a Mermaid stateDiagram-v2, and docs/lifecycle-diagrams.md commits the rendered thread and environment machines — GitHub renders Mermaid natively, so the diagrams are visible in the repo and change visibly in PR diffs whenever a transition changes. Path-dependent environment targets (settle to ready with a workspace on disk, error/destroyed without) render as two annotated edges. A file-snapshot test keeps the doc in sync with the tables; regenerate with vitest -u as documented in the file header. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
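As a rough illustration of how a data-only table maps onto Mermaid's `stateDiagram-v2` edge syntax (`from --> to: label`), here is a small renderer. It is not the PR's renderLifecycleMermaid: the `DiagramTable` shape, the `renderMermaidSketch` name, and the bracketed predicate label format are assumptions, and path-dependent targets are left out.

```typescript
interface DiagramEventSpec {
  predicates: readonly string[];
  cells: Record<string, string>; // from-status -> to-status
}

type DiagramTable = Record<string, DiagramEventSpec>;

// Emits one edge per (from, event, to) triple, labelled with the event name
// and, when present, its supersession predicates.
function renderMermaidSketch(table: DiagramTable): string {
  const lines: string[] = ["stateDiagram-v2"];
  for (const [event, spec] of Object.entries(table)) {
    const label =
      spec.predicates.length > 0
        ? `${event} [${spec.predicates.join(", ")}]`
        : event;
    for (const [from, to] of Object.entries(spec.cells)) {
      lines.push(`  ${from} --> ${to}: ${label}`);
    }
  }
  return lines.join("\n");
}

// Illustrative two-event table (cells are assumed, not the real ones).
const sampleTable: DiagramTable = {
  "turn.started": {
    predicates: ["notStopRequested"],
    cells: { idle: "active" },
  },
  "turn.completed": {
    predicates: [],
    cells: { active: "idle" },
  },
};

const sampleDiagram = renderMermaidSketch(sampleTable);
```

Since the output is deterministic for a given table, committing it and comparing against a snapshot is enough to make any transition change show up in a diff.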
What
Thread and environment status changes now flow through a single owner each: a pure (status, event) → status transition table in `@bb/domain` plus one CAS-guarded writer in `@bb/db` (`UPDATE … SET status WHERE id = ? AND status = ? RETURNING`). Callers stop choosing target statuses and instead report what happened (14 events: `turn.started`, `provision.failed`, `stop.completed`, …); each event declares its supersession predicates (`notDeleted`, `notStopRequested`, …) once in the table; illegal or superseded transitions return a typed no-op that gets logged: never thrown, never silent.
Replaces: ~21 transition call sites picking statuses ad hoc, the throwing `transitionThreadStatus` API + `tryTransition` catch-shims, `ALLOWED_TRANSITIONS`, 3 module-level in-flight Sets (now one `InFlightRpcGuard`), and 5 direct `.set({status})` sites on environments, including `applyProvisionedEnvironmentRecord`, which smuggled status through a metadata update.
Behavior-neutral by design: the tables encode observed behavior, including the questionable parts, marked `// observed:`. Tightening is a separate follow-up with its own QA.
Why
~30 race fix-commits in 2 months all had one shape: a late async result applied after the world moved on, each fixed with one more scattered guard (90 of them at peak). Status is now writable by exactly two functions; staleness is structurally impossible to forget and observable when it happens.
Review guide: where the judgment calls live
- Inventories: the header comments of `packages/domain/test/thread-lifecycle.test.ts` and `packages/domain/test/environment-lifecycle.test.ts`. Every transition site: from→to, assigned event, observed guards, and the classification of all 58+ status guards (became-predicate / non-lifecycle / suspicious).
- `// observed:` cells in `packages/domain/src/thread-lifecycle.ts` and `environment-lifecycle.ts`: current behaviors preserved verbatim that look wrong, e.g. `provisioning + turn.completed → idle` (a late turn-completion flips a revived thread to idle), and "a late provision success revives an in-flight destroy".
- Preserved latent bugs: thread-schedule-sweep dispatch has no `stopRequestedAt` guard; queued-message dispatch never re-checks `deletedAt`/`stopRequestedAt` after claiming; `thread-send` applies `error→active` before daemon ack.
Behavior deltas (race-window only, both deliberate)
- `turn.interrupted` for an already-idle thread no longer re-runs best-effort idle pruning.
- `interruptActiveThreads` no longer rolls back the whole batch when one thread's status moved between selection and apply; that thread no-ops with a log.
Validation
- `@bb/domain` 17, `@bb/db` 27, `@bb/server` 77 test files green; integration suite 22/22 (`--force`).
- `status:` inside `.set(` appears only in the two writers + creation defaults; old API zero hits outside inventory comments; transition-protecting guard survivors: threads 4, environments 7, each commented.
The 11 commits tell the story in order: inventory+table → writer → three thread migration chunks → in-flight guard → environment trio → race tests → guard audit.
🤖 Generated with Claude Code