Server lifecycle: event transition table + CAS single-writer #129

Open
ymichael wants to merge 12 commits into main from bb/repo-simplification-roadmap-thr_j5z6qkn6bc

@ymichael

Owner

What

Thread and environment status changes now each flow through a single owner: a pure (status, event) → status transition table in @bb/domain plus one CAS-guarded writer in @bb/db (UPDATE … SET status WHERE id = ? AND status = ? RETURNING). Callers stop choosing target statuses and instead report what happened (14 thread events: turn.started, provision.failed, stop.completed, …; 9 environment events). Each event declares its supersession predicates (notDeleted, notStopRequested, …) once in the table. Illegal or superseded transitions return a typed no-op that gets logged: never thrown, never silent.

Replaces: ~21 transition call sites picking statuses ad hoc, the throwing transitionThreadStatus API + tryTransition catch-shims, ALLOWED_TRANSITIONS, 3 module-level in-flight Sets (now one InFlightRpcGuard), and 5 direct .set({status}) sites on environments — including applyProvisionedEnvironment, which smuggled status through a metadata update.

Behavior-neutral by design: the tables encode observed behavior, including the questionable parts, marked // observed:. Tightening is a separate follow-up with its own QA.
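
For reviewers who want the shape before opening the real files, here is a minimal sketch of a pure transition table with per-event supersession predicates. The statuses, the three events shown, the `cells` field name, and the predicate-before-table check order are all assumptions for illustration; the actual THREAD_LIFECYCLE table in @bb/domain may be organized differently.

```typescript
// Hypothetical, trimmed-down vocabulary; the real tables live in @bb/domain.
type ThreadStatus = "created" | "provisioning" | "active" | "idle" | "error";
type ThreadEvent = "turn.started" | "turn.completed" | "provision.failed";
type PredicateName = "notDeleted" | "notStopRequested";

interface ThreadRow {
  status: ThreadStatus;
  deletedAt: number | null;
  stopRequestedAt: number | null;
}

// Row-level supersession predicates, declared once and referenced by name.
const PREDICATES: Record<PredicateName, (row: ThreadRow) => boolean> = {
  notDeleted: (row) => row.deletedAt === null,
  notStopRequested: (row) => row.stopRequestedAt === null,
};

interface EventSpec {
  predicates: readonly PredicateName[];
  // from-status -> to-status; a missing key means the transition is illegal.
  cells: Partial<Record<ThreadStatus, ThreadStatus>>;
}

// Illustrative cells only, not a copy of the real table.
const THREAD_TABLE: Record<ThreadEvent, EventSpec> = {
  "turn.started": {
    predicates: ["notDeleted", "notStopRequested"],
    cells: { idle: "active", error: "active" },
  },
  "turn.completed": {
    predicates: ["notDeleted"],
    // observed: a late turn completion flips a provisioning thread to idle
    cells: { active: "idle", provisioning: "idle" },
  },
  "provision.failed": {
    predicates: ["notDeleted"],
    cells: { created: "error", provisioning: "error", idle: "error", active: "error" },
  },
};

type Evaluation =
  | { ok: true; to: ThreadStatus }
  | { ok: false; reason: "illegal-transition" | "superseded"; detail: string };

// Pure evaluator: no I/O, never throws, returns a typed no-op instead.
function evaluate(row: ThreadRow, event: ThreadEvent): Evaluation {
  const spec = THREAD_TABLE[event];
  const failed = spec.predicates.find((name) => !PREDICATES[name](row));
  if (failed !== undefined) {
    return { ok: false, reason: "superseded", detail: `${event} blocked by ${failed}` };
  }
  const to = spec.cells[row.status];
  if (to === undefined) {
    return { ok: false, reason: "illegal-transition", detail: `${row.status} + ${event}` };
  }
  return { ok: true, to };
}
```

Because the evaluator is pure data plus one lookup, every cell can be unit-tested without a database, and the writer only has to turn an `ok` result into a compare-and-set.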

Why

~30 race fix-commits in 2 months all had one shape: a late async result applied after the world moved on, each fixed with one more scattered guard (90 of them at peak). Status is now writable by exactly two functions, so the staleness check is structurally impossible to forget and a stale event is observable when it happens.

Review guide — where the judgment calls live

  1. The inventories (the most important review artifact): header comments of packages/domain/test/thread-lifecycle.test.ts and packages/domain/test/environment-lifecycle.test.ts. Every transition site: from→to, assigned event, observed guards, and the classification of all 58+ status guards (became-predicate / non-lifecycle / suspicious).
  2. The // observed: cells in packages/domain/src/thread-lifecycle.ts and environment-lifecycle.ts — current behaviors preserved verbatim that look wrong, e.g. provisioning + turn.completed → idle (a late turn-completion flips a revived thread to idle), and "a late provision success revives an in-flight destroy".
  3. Four latent bugs found and deliberately NOT fixed (queued in the roadmap): generated-title renames silently dropped for fast-finishing threads; schedule dispatch path has no stopRequestedAt guard; queued-message dispatch never re-checks deletedAt/stopRequestedAt after claiming; thread-send applies error→active before daemon ack.
  4. Stronger caller-side guards kept (where the inventory's weakest-common-set predicates are narrower than a specific caller's needs) — listed per-site in the migration commit bodies.

Behavior deltas (race-window only, both deliberate)

  • A duplicate turn.interrupted for an already-idle thread no longer re-runs best-effort idle pruning.
  • interruptActiveThreads no longer rolls back the whole batch when one thread's status moved between selection and apply — that thread no-ops with a log.

Validation

  • Workspace typecheck 29/29; @bb/domain 17, @bb/db 27, @bb/server 77 test files green; integration suite 22/22 (--force).
  • Race coverage at both layers: writer-level supersession/CAS-conflict tests (incl. a proxy-interleave harness for the conflict case, unreachable through the service API under better-sqlite3), and service-level tests through real paths (late start during in-flight stop, late provision for a tombstoned thread, redelivered stale turn-completion, double stop finalization, destroy completing after revive).
  • Exit criteria: status: inside .set( appears only in the two writers + creation defaults; old API zero hits outside inventory comments; transition-protecting guard survivors: threads 4, environments 7, each commented.
  • Not done: live dev-app smoke (stop mid-provisioning, delete mid-turn, daemon-kill mid-command).

The first 11 commits tell the story in order: inventory+table → writer → three thread migration chunks → in-flight guard → environment trio → race tests → guard audit. The 12th adds Mermaid renderings of both tables (docs/lifecycle-diagrams.md).

🤖 Generated with Claude Code

ymichael and others added 12 commits June 12, 2026 16:21
Step 1 of plans/server-lifecycle-transition-core.md: inventory all thread
status transition call sites, derive an event vocabulary, and encode the
observed (from, event, to) triples plus per-event supersession predicates
as pure data in @bb/domain. Behavior-neutral: cells that look wrong are
kept with `// observed:` comments; tightening is a separate follow-up.
No caller changes yet.

Event vocabulary (14, no payloads — the threads row stores no turn id and
nothing in table/predicates/writer consumes event data):
turn.started, turn.completed, turn.failed, turn.interrupted,
runtime.exited, turn.dispatched, reprovision.started, start.succeeded,
command.failed, provision.failed, workspace.lost, stop.completed,
session.lost, runtime.observed-active.

The full per-site inventory (19 transition call sites across 12 files,
turn/completed splitting into 3 events) and the 58-line status-guard
sweep classification (15 become predicates/table cells, 42 non-lifecycle,
1 suspicious + 3 suspicious call-site observations) live as the header
comment of packages/domain/test/thread-lifecycle.test.ts for review.

Notable findings:
- internal/events.ts:313 skips turn/started activation when
  stopRequestedAt is set, contradicting plan decision point 3's parity
  assumption; turn.started therefore carries notStopRequested.
- thread-schedule-sweep dispatch+activation has no stopRequestedAt guard.
- queued-messages dispatch never re-checks deletedAt/stopRequestedAt
  in its transaction after claiming the message.
- thread-send applies error→active optimistically at dispatch.
- environment-level provision failure errors every live thread bound to
  the environment regardless of its own status (created/idle/active).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Step 2 of plans/server-lifecycle-transition-core.md. Adds
applyThreadLifecycleEvent (own transaction, notifies status-changed only
when applied) and applyThreadLifecycleEventInTransaction (caller owns the
transaction and notification) to packages/db. In one transaction: load
row, run the pure domain evaluator (THREAD_LIFECYCLE + supersession
predicates), then UPDATE … WHERE id AND status = <loaded status>
RETURNING as a compare-and-set. latestAttentionAt logic is shared with
the old writer via statusTransitionNeedsAttention, byte-for-byte parity
covered by tests.

Returns a typed outcome instead of throwing:
{applied: true, thread} | {applied: false, reason:
"illegal-transition" | "superseded" | "not-found" | "cas-conflict",
detail}. requireThreadLifecycleEventApplied(outcome) throws a typed
ThreadLifecycleEventNotAppliedError for boundary callers where a no-op
is a real 4xx. Outcome logging stays at the server layer (db has no
logger).

The old transitionThreadStatus / transitionThreadStatusInTransaction API
is untouched but marked @deprecated; callers migrate in step 3.

Tests (in-memory SQLite + real migrate): applied paths (standalone and
in-transaction), illegal-transition / superseded (deletedAt,
stopRequestedAt) / not-found no-ops leaving rows untouched and
unnotified, sequential-stale second event, latestAttentionAt parity with
the old writer across four representative transitions, and a genuine
cas-conflict via a pass-through proxy that issues a real concurrent
UPDATE between the writer's load and its CAS update (better-sqlite3's
synchronous transactions make that interleave unreachable through the
public API alone).
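
A rough sketch of the load / evaluate / compare-and-set flow and of a proxy-interleave test for the conflict branch follows. It uses an in-memory Map as a stand-in for the threads table so it runs without SQLite; `ThreadStore`, `applyEvent`, `interleaveAfterLoad`, and the `Decide` callback are hypothetical names, not the API in packages/db.

```typescript
type Status = "idle" | "active" | "error";
interface Row { id: string; status: Status }

type Outcome =
  | { applied: true; thread: Row }
  | {
      applied: false;
      reason: "illegal-transition" | "superseded" | "not-found" | "cas-conflict";
      detail: string;
    };

// Stand-in for the pure domain evaluator.
type Decide = (row: Row) =>
  | { ok: true; to: Status }
  | { ok: false; reason: "illegal-transition" | "superseded"; detail: string };

// In-memory stand-in for the threads table.
class ThreadStore {
  private rows = new Map<string, Row>();

  insert(row: Row): void {
    this.rows.set(row.id, { ...row });
  }

  load(id: string): Row | undefined {
    const row = this.rows.get(id);
    return row ? { ...row } : undefined;
  }

  // Mirrors: UPDATE ... SET status = ? WHERE id = ? AND status = ? RETURNING
  compareAndSetStatus(id: string, expected: Status, next: Status): Row | undefined {
    const row = this.rows.get(id);
    if (!row || row.status !== expected) return undefined;
    row.status = next;
    return { ...row };
  }
}

// Load, evaluate, then CAS against the status that was loaded. Never throws.
function applyEvent(store: ThreadStore, id: string, decide: Decide): Outcome {
  const row = store.load(id);
  if (!row) return { applied: false, reason: "not-found", detail: `no thread ${id}` };
  const decision = decide(row);
  if (!decision.ok) {
    return { applied: false, reason: decision.reason, detail: decision.detail };
  }
  const updated = store.compareAndSetStatus(id, row.status, decision.to);
  if (!updated) {
    return { applied: false, reason: "cas-conflict", detail: `status moved off ${row.status}` };
  }
  return { applied: true, thread: updated };
}

// Test helper: a pass-through Proxy that runs a concurrent write right after
// the writer's load, which is the interleave the public API cannot produce.
function interleaveAfterLoad(store: ThreadStore, concurrentWrite: () => void): ThreadStore {
  return new Proxy(store, {
    get(target, prop) {
      if (prop === "load") {
        return (id: string) => {
          const row = target.load(id);
          concurrentWrite();
          return row;
        };
      }
      const value = Reflect.get(target, prop);
      return typeof value === "function" ? value.bind(target) : value;
    },
  });
}
```

In a test, wrapping the store with `interleaveAfterLoad(store, () => store.compareAndSetStatus(id, "idle", "error"))` makes the writer's own CAS miss, so it should report `cas-conflict` and leave the concurrent write in place.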

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…vents

Step 3a of plans/server-lifecycle-transition-core.md: migrate the hottest
thread transition sites — daemon turn events, runtime exit, reprovision
dispatch, and stop-completion finalize — from target-status writes
(tryTransition / transitionThreadStatusInTransaction) to
applyThreadLifecycleEvent with the inventoried events: turn.started,
turn.completed, turn.failed, turn.interrupted, runtime.exited,
reprovision.started, stop.completed, session.lost.

Adds services/threads/lifecycle-outcome.ts, the one server-side wrapper
pair that logs every non-applied outcome ({threadId, event, reason,
detail}) so stale events become observable instead of silently swallowed.
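
As an illustration of that logging wrapper, a minimal sketch is below. The `Logger` interface, the `info` level, the message text, and the function name `logLifecycleOutcome` are assumptions; only the logged fields ({threadId, event, reason, detail}) come from this commit message. The real file has a pair of wrappers; one is shown.

```typescript
type NotAppliedReason = "illegal-transition" | "superseded" | "not-found" | "cas-conflict";

type LifecycleOutcome<T> =
  | { applied: true; thread: T }
  | { applied: false; reason: NotAppliedReason; detail: string };

// Hypothetical structured-logger shape; the server's real logger may differ.
interface Logger {
  info(fields: Record<string, unknown>, message: string): void;
}

// Logs every non-applied outcome and passes the outcome through unchanged,
// so callers can still branch on outcome.applied.
function logLifecycleOutcome<T>(
  logger: Logger,
  threadId: string,
  event: string,
  outcome: LifecycleOutcome<T>,
): LifecycleOutcome<T> {
  if (!outcome.applied) {
    logger.info(
      { threadId, event, reason: outcome.reason, detail: outcome.detail },
      "thread lifecycle event not applied",
    );
  }
  return outcome;
}
```

Returning the outcome unchanged keeps the wrapper usable inline: a caller can gate side effects such as pruning on `outcome.applied` while the no-op case is still recorded.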

Guards deleted because they are now the event's predicates or table cells:
- internal/events.ts turn/started: stopRequestedAt null guard
  (turn.started carries notStopRequested) and the pre-start/idle/error
  from-status guard (table cells).
- internal/events.ts provider_process_exited: stopRequestedAt null guard
  (runtime.exited carries notStopRequested).
- internal/turn-completed-events.ts: the stopRequestedAt guard for failed
  turns (turn.failed predicate) and the from-status guard for completed
  turns (table cells).

Stronger caller-side guards kept (per the inventory's weakest-common-set
judgment):
- hasThreadStopBeforeTurnStarted / event-log staleness checks stay at the
  callers (rows cannot express them).
- thread-turn-dispatch keeps the full idle-or-recoverable-errored guard:
  the error arm requires no provider thread id, which is stronger than the
  error->provisioning reprovision.started cell, and the composite cannot
  be split without changing dispatch routing.

Behavior notes (race-window only): applyTurnCompletedEvent now gates its
pruning side effects and returned nextStatus on outcome.applied, so a
turn.interrupted arriving for an already-idle thread no longer re-runs
idle pruning that the old code ran even when the transition was swallowed;
pruning re-runs on the next applied idle transition.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ycle events

Step 3b of plans/server-lifecycle-transition-core.md: migrate the
thread-side provisioning transitions to the event API.

- failThreadProvisioning (thread-provisioning-environment.ts) sends
  provision.failed instead of tryTransition(error).
- recordEnvironmentProvisioningFailureInTransaction
  (environment-provisioning-internal.ts) sends provision.failed per live
  thread; the status-changed notify is now gated on outcome.applied,
  matching the old tryTransitionInTransaction shim's notify-on-success.
- markLiveThreadsErroredAfterDestroySuccess
  (environment-cleanup-internal.ts) sends workspace.lost, same gating.

Environment.status writes in these files are untouched (step 5).

Stronger caller-side guards kept (weakest-common-set): the
provision-context staleness checks (status === "provisioning" plus
notDeleted/notArchived/notStopRequested) at failThreadProvisioning's call
sites stay — provision.failed's table cells are wider (any live status,
for environment-level failure) and only carry notDeleted, and those guards
also gate context cleanup and the error-event append. Likewise
listLiveEnvironmentThreads / listLiveThreadsInEnvironment filtering and
shouldPreserveThreadProvisionCancellationOutcome (provisioningId-scoped,
event-log) stay at the callers because they gate the event appends too.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
… transition API

Step 3c of plans/server-lifecycle-transition-core.md: migrate the
remaining transition sites and remove the target-status API entirely.

- thread-schedule-sweep: due-schedule start dispatch sends turn.dispatched
  (the path's missing stopRequestedAt guard is a documented latent bug and
  is preserved — turn.dispatched declares no supersession predicates).
- thread-send / parent-system-messages: dispatch activation sends
  turn.dispatched through requireThreadLifecycleEventApplied, preserving
  the old throwing-writer semantics at these boundaries (a stale dispatch
  still aborts the transaction). The error→active pre-ack optimistic
  activation in thread-send is preserved as the observed error cell.
- queued-messages: auto-send claim sends turn.dispatched; the
  claim-without-re-check latent bug is preserved (no new in-tx re-check).
- thread-lifecycle: command.failed, start.succeeded, session.lost /
  stop.completed (daemon-restart interruption batches), and
  runtime.observed-active (reconnect revival). The
  canActivateThreadAfterSuccessfulStart guard shrinks to
  isThreadStartActivationStale — its row-flag and from-status checks are
  now start.succeeded's predicates and table cells; only the event-log
  staleness checks remain. nextStatusForInterruptedThread is replaced by
  the reason→event mapping.
- Deleted: services/threads/thread-transitions.ts (tryTransition shims)
  and its test (outcome semantics are covered by the db writer tests);
  transitionThreadStatus / transitionThreadStatusInTransaction /
  transitionThreadStatusRecord, InvalidThreadStatusTransitionError,
  TransitionThreadStatusInTransactionArgs, and ALLOWED_TRANSITIONS (now
  dead — THREAD_LIFECYCLE in @bb/domain is the single source of truth)
  from packages/db, plus their exports.
- db tests: the transitionThreadStatus suite's read-state/attention tests
  are rewritten against applyThreadLifecycleEvent; legality and no-op
  cases are already covered by the domain table tests and the writer
  tests, and the writer's latestAttentionAt test now asserts expected
  values directly instead of comparing against the deleted old writer.

Stronger caller-side guards kept (weakest-common-set):
- thread-send keeps "dispatchKind === turn.submit || status === error":
  the error arm is dispatch routing (errored threads re-activate
  optimistically on thread.start dispatches, idle ones do not), stronger
  than the turn.dispatched cells.
- queued-messages keeps the idle-only auto-send routing guard — it gates
  the entire claim-and-send flow, not just the transition.
- finalizeStoppedThreadInTransaction keeps its active/pre-start routing:
  active picks the turn-interruption path; idle/error finalizes stay
  no-transition without emitting noise events.
- Reconnect revival keeps the targeted SQL filters (status/deletedAt/
  stopRequestedAt) and the event-log blocked-revival check.

Behavior notes (race-window only): interruptActiveThreads no longer
aborts the whole batch transaction when one thread's status changed
between selection and apply — that thread's event no-ops and is logged,
the rest proceed (previously an InvalidThreadStatusTransitionError threw
away the entire batch including the appended interruption events).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Step 4 of plans/server-lifecycle-transition-core.md: the three
module-level Sets in thread-lifecycle.ts (activeThreadStartRpcThreadIds,
activeThreadStartGeneratedTitleSyncThreadIds,
activeThreadStopRpcThreadIds) become a single InFlightRpcGuard with
claim/release/isHeld keyed threadId × kind, defined in the lifecycle
owner module. Process-local by design: durable cross-restart intent
already lives on the thread row (stopRequestedAt/deletedAt); the guard
only prevents duplicate concurrent RPCs in one process — no status-ladder
growth.

The generated-title-sync entry is a flag riding the in-flight
thread.start RPC rather than a dedupe of its own, but its lifetime is
exactly claim-at-dispatch / read-at-settlement / release-with-the-RPC
keyed threadId × kind, so it is modeled as the third kind
("thread.start.title-sync") instead of keeping a parallel one-off Set.

Behavior identical: the stop path's check-then-add was a single
synchronous block, so it merges into one claim() call; the start path
keeps its pre-dispatch isHeld checks (claiming there would mark the RPC
in flight while the command is still being built, which observers like
requestThreadStopForCurrentState route on).
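
A minimal sketch of a process-local guard with the claim/release/isHeld surface described above. The kind names "thread.start" and "thread.stop" and the `stopOnce` helper are assumptions for illustration; only "thread.start.title-sync" is named in this commit.

```typescript
// Hypothetical kind names, apart from "thread.start.title-sync".
type RpcKind = "thread.start" | "thread.start.title-sync" | "thread.stop";

class InFlightRpcGuard {
  // One Set of thread ids per kind; process-local, nothing is persisted.
  private readonly held = new Map<RpcKind, Set<string>>();

  // Returns true if the caller now owns the slot, false if it was already held.
  claim(threadId: string, kind: RpcKind): boolean {
    const ids = this.held.get(kind) ?? new Set<string>();
    if (ids.has(threadId)) return false;
    ids.add(threadId);
    this.held.set(kind, ids);
    return true;
  }

  release(threadId: string, kind: RpcKind): void {
    this.held.get(kind)?.delete(threadId);
  }

  isHeld(threadId: string, kind: RpcKind): boolean {
    return this.held.get(kind)?.has(threadId) ?? false;
  }
}

// Usage sketch: the old check-then-add becomes one claim(), released in finally.
async function stopOnce(
  guard: InFlightRpcGuard,
  threadId: string,
  sendStop: () => Promise<void>,
): Promise<boolean> {
  if (!guard.claim(threadId, "thread.stop")) return false; // duplicate, skip
  try {
    await sendStop();
    return true;
  } finally {
    guard.release(threadId, "thread.stop");
  }
}
```

Keying by threadId × kind means a held stop does not block a start for the same thread, which matches the start path reading `isHeld` before it decides to claim.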

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Behavior-neutral first pass derived from the write-site inventory in the
test header: 9 events, byPath targets for settled-state restores, and
row-level supersession predicates (managed, cleanup intent, workspace
path, destroyAttemptId match).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
applyEnvironmentLifecycleEvent (+InTransaction, +requireApplied) loads
the row, evaluates ENVIRONMENT_LIFECYCLE, and compare-and-sets the
status in one immediate transaction; the destroy claim re-asserts the
cross-table thread conditions inside the same UPDATE. Adds the pure
recordProvisionedEnvironmentWorkspace metadata writer (the status half
of applyProvisionedEnvironmentRecord becomes a provision.succeeded
event) and listStaleDestroyingManagedEnvironments for the sweep. The
Proxy-interleave CAS helper moves to a shared test helper.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
All environment status writes now report what happened
(provision.requested/succeeded/failed/cancelled, destroy.dispatched/
succeeded/failed/lost, cleanup.completed) through the logged CAS
writer; the direct setters (setEnvironmentStatus,
applyProvisionedEnvironmentRecord, setEnvironmentRecordDestroyed,
claimEnvironmentDestroy, restoreEnvironmentAfterDestroyAttemptFailure,
recoverStaleDestroyingEnvironmentCleanup) are deleted along with the
dead claimManagedEnvironmentReprovisionRecord, the unreachable
restoreEnvironmentAfterCleanupCancellation, and a provably no-op
provision request on a freshly created environment. Surviving status
guards are flow/routing concerns and carry comments saying so.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Server-service-level race tests for the lifecycle-event writers (step 6 of
plans/server-lifecycle-transition-core.md), each driven through the real
settlement/ingestion entry points and asserting outcome plus persisted rows:

- thread-live-start-handoff: a thread.start that succeeds while its stop RPC
  is still in flight does not reactivate the thread (stop intent preserved).
- environment-provisioning: a provision success settling after the thread was
  tombstoned by project deletion finalizes the thread instead of activating
  it, and routes the orphaned workspace into cleanup.
- internal-events-tool-calls: a redelivered turn/completed for a settled idle
  thread is an illegal-transition no-op; the row is byte-for-byte untouched.
- thread-stop-retry: re-finalizing an already-settled stop (daemon reconnect
  reconciliation) changes nothing and appends no duplicate events.
- environment-cleanup: a destroy success settling after the environment was
  revived by a reprovision request is an illegal-transition no-op; the row
  and the revived thread are untouched.

CAS-conflict interleaving stays covered by the db writer tests
(packages/db/test/data/{thread,environment}-lifecycle.test.ts), which reach
the compare-and-set branch via the proxy interleave harness; it is
unreachable through the service API under synchronous transactions.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Classifies the status guards left in services/environments after the
lifecycle-event migration. Seven are lifecycle-relevant survivors; four
already carried comments from the migration, and this adds the missing
three (preflight precondition, cleanup-advance routing, provision dispatch
routing) plus brief "not lifecycle" notes on the boundary-validation and
stop-routing reads whose role was not obvious from context. The remaining
grep hits are read-path/display/event-data checks that need no comment.
No guard was provably subsumed by the writers' predicates, so none were
deleted (behavior preserved).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The transition tables are pure data, so the state machines can be
visualized without a state-machine dependency: renderLifecycleMermaid
turns a table + per-event supersession predicates into a Mermaid
stateDiagram-v2, and docs/lifecycle-diagrams.md commits the rendered
thread and environment machines — GitHub renders Mermaid natively, so
the diagrams are visible in the repo and change visibly in PR diffs
whenever a transition changes.

Path-dependent environment targets (settle to ready with a workspace on
disk, error/destroyed without) render as two annotated edges. A
file-snapshot test keeps the doc in sync with the tables; regenerate
with vitest -u as documented in the file header.
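
To show why pure-data tables make this cheap, here is a small sketch of a renderer that emits Mermaid stateDiagram-v2 text. The `Table` shape, the edge-label format, and the name `renderMermaidSketch` are assumptions; the real renderLifecycleMermaid also handles path-dependent targets, which this sketch omits.

```typescript
// Hypothetical table shape: event name -> predicates and from->to cells.
type Table = Record<string, { predicates: readonly string[]; cells: Record<string, string> }>;

// Emits one edge per (from, event, to) triple, labelled with the event and
// any supersession predicates that guard it.
function renderMermaidSketch(table: Table): string {
  const lines = ["stateDiagram-v2"];
  for (const [event, spec] of Object.entries(table)) {
    const guard = spec.predicates.length > 0 ? ` (${spec.predicates.join(", ")})` : "";
    for (const [from, to] of Object.entries(spec.cells)) {
      lines.push(`  ${from} --> ${to}: ${event}${guard}`);
    }
  }
  return lines.join("\n");
}
```

Since the output is plain text derived from the table, a snapshot test over it fails whenever a cell or predicate changes, which is what keeps a committed diagram in step with the code.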

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>