Server lifecycle: event transition table + CAS single-writer by ymichael · Pull Request #129 · ymichael/bb

ymichael · 2026-06-12T23:23:49Z

What

Thread and environment status changes now flow through a single owner each: a pure (status, event) → status transition table in @bb/domain plus one CAS-guarded writer in @bb/db (UPDATE … SET status WHERE id = ? AND status = ? RETURNING). Callers stop choosing target statuses and instead report what happened (14 events: turn.started, provision.failed, stop.completed, …); each event declares its supersession predicates (notDeleted, notStopRequested, …) once in the table; illegal or superseded transitions return a typed no-op that gets logged — never thrown, never silent.
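To make the shape of the table concrete, here is a minimal sketch of a pure (status, event) → status evaluator with per-event supersession predicates. The statuses, the three events shown, and every cell are illustrative assumptions (only `turn.started` carrying `notStopRequested`, `provision.failed` carrying `notDeleted`, and the `provisioning + turn.completed → idle` cell are taken from this PR); `THREAD_LIFECYCLE_SKETCH` and `evaluate` are hypothetical names, not the @bb/domain exports.

```typescript
// Hypothetical, trimmed-down status and event sets; the real table covers 14 events.
type ThreadStatus = "created" | "provisioning" | "idle" | "active" | "error";
type ThreadEvent = "turn.started" | "turn.completed" | "provision.failed";

// Row fields that supersession predicates inspect (assumed shape).
interface ThreadRow {
  status: ThreadStatus;
  deletedAt: number | null;
  stopRequestedAt: number | null;
}

type Predicate = (row: ThreadRow) => boolean;
const notDeleted: Predicate = (row) => row.deletedAt === null;
const notStopRequested: Predicate = (row) => row.stopRequestedAt === null;

interface EventSpec {
  predicates: Record<string, Predicate>; // declared once per event
  cells: Partial<Record<ThreadStatus, ThreadStatus>>; // from-status -> to-status
}

const THREAD_LIFECYCLE_SKETCH: Record<ThreadEvent, EventSpec> = {
  "turn.started": {
    predicates: { notDeleted, notStopRequested },
    cells: { idle: "active", error: "active" },
  },
  "turn.completed": {
    predicates: { notDeleted },
    // observed: a late turn completion flips a revived thread to idle
    cells: { active: "idle", provisioning: "idle" },
  },
  "provision.failed": {
    predicates: { notDeleted },
    cells: { created: "error", provisioning: "error", idle: "error", active: "error" },
  },
};

type Evaluation =
  | { ok: true; to: ThreadStatus }
  | { ok: false; reason: "superseded" | "illegal-transition"; detail: string };

// Pure function: no I/O, no clock, no throwing.
function evaluate(row: ThreadRow, event: ThreadEvent): Evaluation {
  const spec = THREAD_LIFECYCLE_SKETCH[event];
  for (const [name, holds] of Object.entries(spec.predicates)) {
    if (!holds(row)) return { ok: false, reason: "superseded", detail: name };
  }
  const to = spec.cells[row.status];
  if (to === undefined) {
    return { ok: false, reason: "illegal-transition", detail: `${row.status} + ${event}` };
  }
  return { ok: true, to };
}
```

Because callers only name the event, a stale `turn.started` for a thread with `stopRequestedAt` set comes back as a `superseded` value rather than an exception, and the caller cannot pick a target status by mistake.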

Replaces: ~21 transition call sites picking statuses ad hoc, the throwing transitionThreadStatus API + tryTransition catch-shims, ALLOWED_TRANSITIONS, 3 module-level in-flight Sets (now one InFlightRpcGuard), and 5 direct .set({status}) sites on environments — including applyProvisionedEnvironment, which smuggled status through a metadata update.

Behavior-neutral by design: the tables encode observed behavior, including the questionable parts, marked // observed:. Tightening is a separate follow-up with its own QA.

Why

~30 race fix-commits in 2 months all had one shape: a late async result applied after the world moved on, each fixed with one more scattered guard (90 of them at peak). Status is now writable by exactly two functions; staleness is structurally impossible to forget and observable when it happens.

Review guide — where the judgment calls live

The inventories (the most important review artifact): header comments of packages/domain/test/thread-lifecycle.test.ts and packages/domain/test/environment-lifecycle.test.ts. Every transition site: from→to, assigned event, observed guards, and the classification of all 58+ status guards (became-predicate / non-lifecycle / suspicious).
The // observed: cells in packages/domain/src/thread-lifecycle.ts and environment-lifecycle.ts — current behaviors preserved verbatim that look wrong, e.g. provisioning + turn.completed → idle (a late turn-completion flips a revived thread to idle), and "a late provision success revives an in-flight destroy".
Four latent bugs found and deliberately NOT fixed (queued in the roadmap): generated-title renames silently dropped for fast-finishing threads; schedule dispatch path has no stopRequestedAt guard; queued-message dispatch never re-checks deletedAt/stopRequestedAt after claiming; thread-send applies error→active before daemon ack.
Stronger caller-side guards kept (where the inventory's weakest-common-set predicates are narrower than a specific caller's needs) — listed per-site in the migration commit bodies.

Behavior deltas (race-window only, both deliberate)

A duplicate turn.interrupted for an already-idle thread no longer re-runs best-effort idle pruning.
interruptActiveThreads no longer rolls back the whole batch when one thread's status moved between selection and apply — that thread no-ops with a log.

Validation

Workspace typecheck 29/29; @bb/domain 17, @bb/db 27, @bb/server 77 test files green; integration suite 22/22 (--force).
Race coverage at both layers: writer-level supersession/CAS-conflict tests (incl. a proxy-interleave harness for the conflict case, unreachable through the service API under better-sqlite3), and service-level tests through real paths (late start during in-flight stop, late provision for a tombstoned thread, redelivered stale turn-completion, double stop finalization, destroy completing after revive).
Exit criteria: `status:` inside `.set(` appears only in the two writers + creation defaults; old API zero hits outside inventory comments; transition-protecting guard survivors: threads 4, environments 7, each commented.
Not done: live dev-app smoke (stop mid-provisioning, delete mid-turn, daemon-kill mid-command).

The 11 commits tell the story in order: inventory+table → writer → three thread migration chunks → in-flight guard → environment trio → race tests → guard audit.

🤖 Generated with Claude Code

Step 1 of plans/server-lifecycle-transition-core.md: inventory all thread status transition call sites, derive an event vocabulary, and encode the observed (from, event, to) triples plus per-event supersession predicates as pure data in @bb/domain. Behavior-neutral: cells that look wrong are kept with `// observed:` comments; tightening is a separate follow-up. No caller changes yet.

Event vocabulary (14, no payloads: the threads row stores no turn id and nothing in table/predicates/writer consumes event data): turn.started, turn.completed, turn.failed, turn.interrupted, runtime.exited, turn.dispatched, reprovision.started, start.succeeded, command.failed, provision.failed, workspace.lost, stop.completed, session.lost, runtime.observed-active.

The full per-site inventory (19 transition call sites across 12 files, turn/completed splitting into 3 events) and the 58-line status-guard sweep classification (15 become predicates/table cells, 42 non-lifecycle, 1 suspicious + 3 suspicious call-site observations) live as the header comment of packages/domain/test/thread-lifecycle.test.ts for review.

Notable findings:
- internal/events.ts:313 skips turn/started activation when stopRequestedAt is set, contradicting plan decision point 3's parity assumption; turn.started therefore carries notStopRequested.
- thread-schedule-sweep dispatch+activation has no stopRequestedAt guard.
- queued-messages dispatch never re-checks deletedAt/stopRequestedAt in its transaction after claiming the message.
- thread-send applies error→active optimistically at dispatch.
- environment-level provision failure errors every live thread bound to the environment regardless of its own status (created/idle/active).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

@deprecated

Step 2 of plans/server-lifecycle-transition-core.md. Adds applyThreadLifecycleEvent (own transaction, notifies status-changed only when applied) and applyThreadLifecycleEventInTransaction (caller owns the transaction and notification) to packages/db.

In one transaction: load row, run the pure domain evaluator (THREAD_LIFECYCLE + supersession predicates), then UPDATE … WHERE id AND status = <loaded status> RETURNING as a compare-and-set. latestAttentionAt logic is shared with the old writer via statusTransitionNeedsAttention, byte-for-byte parity covered by tests.

Returns a typed outcome instead of throwing: {applied: true, thread} | {applied: false, reason: "illegal-transition" | "superseded" | "not-found" | "cas-conflict", detail}. requireThreadLifecycleEventApplied(outcome) throws a typed ThreadLifecycleEventNotAppliedError for boundary callers where a no-op is a real 4xx. Outcome logging stays at the server layer (db has no logger).

The old transitionThreadStatus / transitionThreadStatusInTransaction API is untouched but marked @deprecated; callers migrate in step 3.

Tests (in-memory SQLite + real migrate): applied paths (standalone and in-transaction), illegal-transition / superseded (deletedAt, stopRequestedAt) / not-found no-ops leaving rows untouched and unnotified, sequential-stale second event, latestAttentionAt parity with the old writer across four representative transitions, and a genuine cas-conflict via a pass-through proxy that issues a real concurrent UPDATE between the writer's load and its CAS update (better-sqlite3's synchronous transactions make that interleave unreachable through the public API alone).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
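The load, evaluate, compare-and-set sequence described in this commit can be sketched against an in-memory stand-in for the table. This is an illustration under stated assumptions, not the packages/db code: `ThreadStore`, `applyEventSketch`, `requireAppliedSketch`, and the two-event cell table are hypothetical, and only the outcome shape and the four reasons come from the commit message.

```typescript
type ThreadStatus = "idle" | "active" | "error";
type ThreadEvent = "turn.started" | "turn.completed";

interface ThreadRow {
  id: string;
  status: ThreadStatus;
  deletedAt: number | null;
}

type Outcome =
  | { applied: true; thread: ThreadRow }
  | {
      applied: false;
      reason: "illegal-transition" | "superseded" | "not-found" | "cas-conflict";
      detail: string;
    };

// Illustrative cells only; the real table lives in @bb/domain.
const CELLS: Record<ThreadEvent, Partial<Record<ThreadStatus, ThreadStatus>>> = {
  "turn.started": { idle: "active", error: "active" },
  "turn.completed": { active: "idle" },
};

// In-memory stand-in for the threads table (hypothetical, not the @bb/db API).
class ThreadStore {
  private readonly rows = new Map<string, ThreadRow>();

  insert(row: ThreadRow): void {
    this.rows.set(row.id, { ...row });
  }

  load(id: string): ThreadRow | undefined {
    const row = this.rows.get(id);
    return row ? { ...row } : undefined;
  }

  // Mirrors: UPDATE ... SET status = ? WHERE id = ? AND status = ? RETURNING
  compareAndSetStatus(id: string, expected: ThreadStatus, next: ThreadStatus): ThreadRow | undefined {
    const row = this.rows.get(id);
    if (!row || row.status !== expected) return undefined; // zero rows matched
    row.status = next;
    return { ...row };
  }
}

function applyEventSketch(store: ThreadStore, id: string, event: ThreadEvent): Outcome {
  const row = store.load(id);
  if (!row) return { applied: false, reason: "not-found", detail: id };
  if (row.deletedAt !== null) return { applied: false, reason: "superseded", detail: "notDeleted" };
  const to = CELLS[event][row.status];
  if (to === undefined) {
    return { applied: false, reason: "illegal-transition", detail: `${row.status} + ${event}` };
  }
  // The WHERE status = <loaded status> clause is what makes this a compare-and-set.
  const updated = store.compareAndSetStatus(id, row.status, to);
  if (!updated) return { applied: false, reason: "cas-conflict", detail: `expected ${row.status}` };
  return { applied: true, thread: updated };
}

// Boundary helper: turn a no-op into a thrown error where the caller must fail.
function requireAppliedSketch(outcome: Outcome): ThreadRow {
  if (!outcome.applied) {
    throw new Error(`lifecycle event not applied: ${outcome.reason} (${outcome.detail})`);
  }
  return outcome.thread;
}
```

Most callers would branch on `outcome.applied` and treat a no-op as information; only boundary callers, where a no-op should surface as a request error, would go through the throwing helper.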

…vents

Step 3a of plans/server-lifecycle-transition-core.md: migrate the hottest thread transition sites (daemon turn events, runtime exit, reprovision dispatch, and stop-completion finalize) from target-status writes (tryTransition / transitionThreadStatusInTransaction) to applyThreadLifecycleEvent with the inventoried events: turn.started, turn.completed, turn.failed, turn.interrupted, runtime.exited, reprovision.started, stop.completed, session.lost.

Adds services/threads/lifecycle-outcome.ts, the one server-side wrapper pair that logs every non-applied outcome ({threadId, event, reason, detail}) so stale events become observable instead of silently swallowed.

Guards deleted because they are now the event's predicates or table cells:
- internal/events.ts turn/started: stopRequestedAt null guard (turn.started carries notStopRequested) and the pre-start/idle/error from-status guard (table cells).
- internal/events.ts provider_process_exited: stopRequestedAt null guard (runtime.exited carries notStopRequested).
- internal/turn-completed-events.ts: the stopRequestedAt guard for failed turns (turn.failed predicate) and the from-status guard for completed turns (table cells).

Stronger caller-side guards kept (per the inventory's weakest-common-set judgment):
- hasThreadStopBeforeTurnStarted / event-log staleness checks stay at the callers (rows cannot express them).
- thread-turn-dispatch keeps the full idle-or-recoverable-errored guard: the error arm requires no provider thread id, which is stronger than the error->provisioning reprovision.started cell, and the composite cannot be split without changing dispatch routing.

Behavior notes (race-window only): applyTurnCompletedEvent now gates its pruning side effects and returned nextStatus on outcome.applied, so a turn.interrupted arriving for an already-idle thread no longer re-runs idle pruning that the old code ran even when the transition was swallowed; pruning re-runs on the next applied idle transition.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
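The logging wrapper and the "gate side effects on outcome.applied" pattern from this commit might look roughly like the following. This is a sketch: the `Logger` shape, `logIfNotApplied`, and `onTurnInterrupted` are hypothetical names, and only the logged fields ({threadId, event, reason, detail}) come from the commit message.

```typescript
type Outcome<T> =
  | { applied: true; thread: T }
  | { applied: false; reason: string; detail: string };

// Minimal structured-logger shape (hypothetical; the server's real logger may differ).
interface Logger {
  warn(fields: Record<string, unknown>, message: string): void;
}

// Logs every non-applied outcome and passes the outcome through unchanged.
function logIfNotApplied<T>(
  logger: Logger,
  threadId: string,
  event: string,
  outcome: Outcome<T>,
): Outcome<T> {
  if (!outcome.applied) {
    logger.warn(
      { threadId, event, reason: outcome.reason, detail: outcome.detail },
      "thread lifecycle event not applied",
    );
  }
  return outcome;
}

// Caller pattern: side effects run only when the transition actually applied.
function onTurnInterrupted(
  logger: Logger,
  threadId: string,
  apply: () => Outcome<{ status: string }>,
  pruneIdle: () => void,
): string | null {
  const outcome = logIfNotApplied(logger, threadId, "turn.interrupted", apply());
  if (!outcome.applied) return null; // stale event: logged, no pruning
  pruneIdle();
  return outcome.thread.status;
}
```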

…ycle events

Step 3b of plans/server-lifecycle-transition-core.md: migrate the thread-side provisioning transitions to the event API.
- failThreadProvisioning (thread-provisioning-environment.ts) sends provision.failed instead of tryTransition(error).
- recordEnvironmentProvisioningFailureInTransaction (environment-provisioning-internal.ts) sends provision.failed per live thread; the status-changed notify is now gated on outcome.applied, matching the old tryTransitionInTransaction shim's notify-on-success.
- markLiveThreadsErroredAfterDestroySuccess (environment-cleanup-internal.ts) sends workspace.lost, same gating.

Environment.status writes in these files are untouched (step 5).

Stronger caller-side guards kept (weakest-common-set): the provision-context staleness checks (status === "provisioning" plus notDeleted/notArchived/notStopRequested) at failThreadProvisioning's call sites stay. provision.failed's table cells are wider (any live status, for environment-level failure) and only carry notDeleted, and those guards also gate context cleanup and the error-event append. Likewise listLiveEnvironmentThreads / listLiveThreadsInEnvironment filtering and shouldPreserveThreadProvisionCancellationOutcome (provisioningId-scoped, event-log) stay at the callers because they gate the event appends too.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

… transition API

Step 3c of plans/server-lifecycle-transition-core.md: migrate the remaining transition sites and remove the target-status API entirely.
- thread-schedule-sweep: due-schedule start dispatch sends turn.dispatched (the path's missing stopRequestedAt guard is a documented latent bug and is preserved; turn.dispatched declares no supersession predicates).
- thread-send / parent-system-messages: dispatch activation sends turn.dispatched through requireThreadLifecycleEventApplied, preserving the old throwing-writer semantics at these boundaries (a stale dispatch still aborts the transaction). The error→active pre-ack optimistic activation in thread-send is preserved as the observed error cell.
- queued-messages: auto-send claim sends turn.dispatched; the claim-without-re-check latent bug is preserved (no new in-tx re-check).
- thread-lifecycle: command.failed, start.succeeded, session.lost / stop.completed (daemon-restart interruption batches), and runtime.observed-active (reconnect revival). The canActivateThreadAfterSuccessfulStart guard shrinks to isThreadStartActivationStale: its row-flag and from-status checks are now start.succeeded's predicates and table cells; only the event-log staleness checks remain. nextStatusForInterruptedThread is replaced by the reason→event mapping.
- Deleted: services/threads/thread-transitions.ts (tryTransition shims) and its test (outcome semantics are covered by the db writer tests); transitionThreadStatus / transitionThreadStatusInTransaction / transitionThreadStatusRecord, InvalidThreadStatusTransitionError, TransitionThreadStatusInTransactionArgs, and ALLOWED_TRANSITIONS (now dead, since THREAD_LIFECYCLE in @bb/domain is the single source of truth) from packages/db, plus their exports.
- db tests: the transitionThreadStatus suite's read-state/attention tests are rewritten against applyThreadLifecycleEvent; legality and no-op cases are already covered by the domain table tests and the writer tests, and the writer's latestAttentionAt test now asserts expected values directly instead of comparing against the deleted old writer.

Stronger caller-side guards kept (weakest-common-set):
- thread-send keeps "dispatchKind === turn.submit || status === error": the error arm is dispatch routing (errored threads re-activate optimistically on thread.start dispatches, idle ones do not), stronger than the turn.dispatched cells.
- queued-messages keeps the idle-only auto-send routing guard: it gates the entire claim-and-send flow, not just the transition.
- finalizeStoppedThreadInTransaction keeps its active/pre-start routing: active picks the turn-interruption path; idle/error finalizes stay no-transition without emitting noise events.
- Reconnect revival keeps the targeted SQL filters (status/deletedAt/stopRequestedAt) and the event-log blocked-revival check.

Behavior notes (race-window only): interruptActiveThreads no longer aborts the whole batch transaction when one thread's status changed between selection and apply; that thread's event no-ops and is logged, and the rest proceed (previously an InvalidThreadStatusTransitionError threw away the entire batch including the appended interruption events).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
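The batch behavior note can be illustrated with a small sketch: each thread's event is applied independently, and a thread whose status moved since selection is logged and skipped instead of aborting the batch. `applyToBatch` and its callback parameters are hypothetical; the real interruptActiveThreads also appends interruption events inside a transaction, which this sketch leaves out.

```typescript
type Outcome = { applied: true } | { applied: false; reason: string };

// Applies one event per selected thread; a no-op for one thread never aborts the rest.
function applyToBatch(
  threadIds: string[],
  applyEvent: (threadId: string) => Outcome,
  logNoOp: (threadId: string, reason: string) => void,
): { applied: string[]; skipped: string[] } {
  const applied: string[] = [];
  const skipped: string[] = [];
  for (const threadId of threadIds) {
    const outcome = applyEvent(threadId);
    if (outcome.applied) {
      applied.push(threadId);
    } else {
      // Status moved between selection and apply: log it and carry on.
      logNoOp(threadId, outcome.reason);
      skipped.push(threadId);
    }
  }
  return { applied, skipped };
}
```

With a throwing writer, the same loop would have needed a try/catch per thread to avoid losing the whole batch; returning a typed outcome makes the per-thread handling the default.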

Step 4 of plans/server-lifecycle-transition-core.md: the three module-level Sets in thread-lifecycle.ts (activeThreadStartRpcThreadIds, activeThreadStartGeneratedTitleSyncThreadIds, activeThreadStopRpcThreadIds) become a single InFlightRpcGuard with claim/release/isHeld keyed threadId × kind, defined in the lifecycle owner module.

Process-local by design: durable cross-restart intent already lives on the thread row (stopRequestedAt/deletedAt); the guard only prevents duplicate concurrent RPCs in one process, with no status-ladder growth.

The generated-title-sync entry is a flag riding the in-flight thread.start RPC rather than a dedupe of its own, but its lifetime is exactly claim-at-dispatch / read-at-settlement / release-with-the-RPC keyed threadId × kind, so it is modeled as the third kind ("thread.start.title-sync") instead of keeping a parallel one-off Set.

Behavior identical: the stop path's check-then-add was a single synchronous block, so it merges into one claim() call; the start path keeps its pre-dispatch isHeld checks (claiming there would mark the RPC in flight while the command is still being built, which observers like requestThreadStopForCurrentState route on).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
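A guard with claim/release/isHeld keyed threadId × kind could be as small as the sketch below. The class name follows the commit message, but the body is an assumption, as are the kind names "thread.start" and "thread.stop" (only "thread.start.title-sync" is named in the commit) and the `stopOnce` helper.

```typescript
// Kind names other than "thread.start.title-sync" are assumed for illustration.
type InFlightRpcKind = "thread.start" | "thread.start.title-sync" | "thread.stop";

// Process-local dedupe of concurrent RPCs; nothing here is persisted.
class InFlightRpcGuard {
  private readonly held = new Set<string>();

  private key(threadId: string, kind: InFlightRpcKind): string {
    return JSON.stringify([threadId, kind]);
  }

  // Returns true if the caller now owns the slot, false if it was already held.
  claim(threadId: string, kind: InFlightRpcKind): boolean {
    const key = this.key(threadId, kind);
    if (this.held.has(key)) return false;
    this.held.add(key);
    return true;
  }

  release(threadId: string, kind: InFlightRpcKind): void {
    this.held.delete(this.key(threadId, kind));
  }

  isHeld(threadId: string, kind: InFlightRpcKind): boolean {
    return this.held.has(this.key(threadId, kind));
  }
}

// Hypothetical stop path: check-then-add collapses into a single claim() call.
async function stopOnce(
  guard: InFlightRpcGuard,
  threadId: string,
  sendStopRpc: () => Promise<void>,
): Promise<boolean> {
  if (!guard.claim(threadId, "thread.stop")) return false; // duplicate, skip
  try {
    await sendStopRpc();
    return true;
  } finally {
    guard.release(threadId, "thread.stop");
  }
}
```

Because `claim` is synchronous, there is no await between the check and the add, which is the property the original check-then-add block relied on.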

Behavior-neutral first pass derived from the write-site inventory in the test header: 9 events, byPath targets for settled-state restores, and row-level supersession predicates (managed, cleanup intent, workspace path, destroyAttemptId match). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

applyEnvironmentLifecycleEvent (+InTransaction, +requireApplied) loads the row, evaluates ENVIRONMENT_LIFECYCLE, and compare-and-sets the status in one immediate transaction; the destroy claim re-asserts the cross-table thread conditions inside the same UPDATE. Adds the pure recordProvisionedEnvironmentWorkspace metadata writer (the status half of applyProvisionedEnvironmentRecord becomes a provision.succeeded event) and listStaleDestroyingManagedEnvironments for the sweep. The Proxy-interleave CAS helper moves to a shared test helper. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

All environment status writes now report what happened (provision.requested/succeeded/failed/cancelled, destroy.dispatched/succeeded/failed/lost, cleanup.completed) through the logged CAS writer; the direct setters (setEnvironmentStatus, applyProvisionedEnvironmentRecord, setEnvironmentRecordDestroyed, claimEnvironmentDestroy, restoreEnvironmentAfterDestroyAttemptFailure, recoverStaleDestroyingEnvironmentCleanup) are deleted along with the dead claimManagedEnvironmentReprovisionRecord, the unreachable restoreEnvironmentAfterCleanupCancellation, and a provably no-op provision request on a freshly created environment. Surviving status guards are flow/routing concerns and carry comments saying so.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Server-service-level race tests for the lifecycle-event writers (step 6 of plans/server-lifecycle-transition-core.md), each driven through the real settlement/ingestion entry points and asserting outcome plus persisted rows:
- thread-live-start-handoff: a thread.start that succeeds while its stop RPC is still in flight does not reactivate the thread (stop intent preserved).
- environment-provisioning: a provision success settling after the thread was tombstoned by project deletion finalizes the thread instead of activating it, and routes the orphaned workspace into cleanup.
- internal-events-tool-calls: a redelivered turn/completed for a settled idle thread is an illegal-transition no-op; the row is byte-for-byte untouched.
- thread-stop-retry: re-finalizing an already-settled stop (daemon reconnect reconciliation) changes nothing and appends no duplicate events.
- environment-cleanup: a destroy success settling after the environment was revived by a reprovision request is an illegal-transition no-op; the row and the revived thread are untouched.

CAS-conflict interleaving stays covered by the db writer tests (packages/db/test/data/{thread,environment}-lifecycle.test.ts), which reach the compare-and-set branch via the proxy interleave harness; it is unreachable through the service API under synchronous transactions.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
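The idea behind a proxy interleave harness can be sketched without a database: wrap the store in a pass-through `Proxy` that fires a competing write immediately before the writer's compare-and-set call, so the conflict branch is reached deterministically. Everything here (`StatusStore`, `createMemoryStore`, `interleaveBeforeUpdate`, `applyEvent`) is a hypothetical stand-in; the real helper wraps the db handle and issues a real concurrent UPDATE.

```typescript
interface StatusStore {
  load(id: string): string | undefined;
  updateStatusIfCurrent(id: string, expected: string, next: string): boolean;
}

// In-memory stand-in with a back door for the "concurrent" writer.
function createMemoryStore(initial: Record<string, string>) {
  const rows = new Map(Object.entries(initial));
  return {
    load: (id: string) => rows.get(id),
    updateStatusIfCurrent(id: string, expected: string, next: string) {
      if (rows.get(id) !== expected) return false;
      rows.set(id, next);
      return true;
    },
    forceStatus(id: string, status: string) {
      rows.set(id, status);
    },
  };
}

// Pass-through proxy: runs concurrentWrite once, right before the first CAS call.
function interleaveBeforeUpdate<T extends StatusStore>(store: T, concurrentWrite: () => void): T {
  let fired = false;
  return new Proxy(store, {
    get(target, prop) {
      const value = Reflect.get(target, prop, target);
      if (typeof value !== "function") return value;
      return (...args: unknown[]) => {
        if (prop === "updateStatusIfCurrent" && !fired) {
          fired = true;
          concurrentWrite(); // lands between the writer's load and its update
        }
        return value.apply(target, args);
      };
    },
  });
}

// Writer under test: load, then compare-and-set against the loaded status.
function applyEvent(store: StatusStore, id: string, to: string): "applied" | "not-found" | "cas-conflict" {
  const loaded = store.load(id);
  if (loaded === undefined) return "not-found";
  return store.updateStatusIfCurrent(id, loaded, to) ? "applied" : "cas-conflict";
}
```

A test would build the store, wrap it with a callback that calls `forceStatus`, and assert that `applyEvent` reports `"cas-conflict"` while the competing write is what ends up persisted.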

Classifies the status guards left in services/environments after the lifecycle-event migration. Seven are lifecycle-relevant survivors; four already carried comments from the migration, and this adds the missing three (preflight precondition, cleanup-advance routing, provision dispatch routing) plus brief "not lifecycle" notes on the boundary-validation and stop-routing reads whose role was not obvious from context. The remaining grep hits are read-path/display/event-data checks that need no comment. No guard was provably subsumed by the writers' predicates, so none were deleted (behavior preserved). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

The transition tables are pure data, so the state machines can be visualized without a state-machine dependency: renderLifecycleMermaid turns a table + per-event supersession predicates into a Mermaid stateDiagram-v2, and docs/lifecycle-diagrams.md commits the rendered thread and environment machines — GitHub renders Mermaid natively, so the diagrams are visible in the repo and change visibly in PR diffs whenever a transition changes. Path-dependent environment targets (settle to ready with a workspace on disk, error/destroyed without) render as two annotated edges. A file-snapshot test keeps the doc in sync with the tables; regenerate with vitest -u as documented in the file header. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
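Rendering a pure-data table as a Mermaid stateDiagram-v2 needs little more than a loop over its cells. The sketch below assumes a simplified table shape (predicate names as strings, one target per cell) and a hypothetical function name, `renderLifecycleMermaidSketch`; the real renderLifecycleMermaid also handles path-dependent environment targets, which are omitted here.

```typescript
// Simplified table shape assumed for illustration.
type LifecycleTable = Record<string, { predicates: string[]; cells: Record<string, string> }>;

const byKey = ([a]: [string, unknown], [b]: [string, unknown]): number =>
  a < b ? -1 : a > b ? 1 : 0;

// Emits one "from --> to: event (predicates)" edge per table cell, in sorted order
// so the output is stable and diffs cleanly when a transition changes.
function renderLifecycleMermaidSketch(table: LifecycleTable): string {
  const lines = ["stateDiagram-v2"];
  for (const [event, spec] of Object.entries(table).sort(byKey)) {
    const label =
      spec.predicates.length > 0 ? `${event} (${spec.predicates.join(", ")})` : event;
    for (const [from, to] of Object.entries(spec.cells).sort(byKey)) {
      lines.push(`  ${from} --> ${to}: ${label}`);
    }
  }
  return lines.join("\n");
}

const diagram = renderLifecycleMermaidSketch({
  "turn.started": { predicates: ["notStopRequested"], cells: { idle: "active", error: "active" } },
  "turn.completed": { predicates: [], cells: { active: "idle" } },
});
console.log(diagram);
```

Sorting events and from-statuses keeps the rendered text deterministic, which is what lets a file-snapshot test detect drift between the doc and the tables.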

ymichael and others added 12 commits June 12, 2026 16:21

ymichael wants to merge 12 commits into main from bb/repo-simplification-roadmap-thr_j5z6qkn6bc