Skip to content

Prevent shape removal and request crashes during server restart#4666

Draft
robacourt wants to merge 2 commits into
mainfrom
rob/subquery-restore-bug-6
Draft

Prevent shape removal and request crashes during server restart#4666
robacourt wants to merge 2 commits into
mainfrom
rob/subquery-restore-bug-6

Conversation

@robacourt

@robacourt robacourt commented Jun 30, 2026

Copy link
Copy Markdown
Contributor

Summary

Fixes bug 6 from the restart-aware oracle property-test triage (#4648 / bugs.md): a healthy subquery shape returning a 409 (must-refetch) after a StackSupervisor restart.

Three targeted, defensive fixes — each with a deterministic regression test that fails without the fix and passes with it.

Problem

Bug 6 is a shutdown race. During a restart's stack teardown:

  1. Cleanup cascade. A dependency consumer's inline GenServer.call into its materializer (notify_materializer_of_new_changes/3) races the materializer's death and exits with :noproc. The consumer crashes with a non-shutdown reason, which routes through handle_writer_termination and removes the shape from disk. Because the ShapeLogCollector is already gone mid-shutdown, the removal half-completes — the handle is deleted from SQLite but cleanup fails — so after the restart the handle is missing and a fresh poll on that optimized: true shape gets a 409.

  2. Transient ETS-absent window. During the restart, ShapeStatusOwner frees the per-stack ETS tables and the new owner recreates them. A long-poll held by Bandit can wake up in that window and read a now-missing table, raising ArgumentError and crashing the request handler.

Solution

  • consumer.exnotify_materializer_of_new_changes/3 catches the :exit (:noproc, and transient :normal/:shutdown) from its inline materializer call. The pending monitored :DOWN then drives a clean stop instead of the abnormal-shutdown path that removes the shape.
  • shape_status.exvalidate_shape_handle/3 rescues ArgumentError -> :error, so a request in the table-recreation window falls back to the live SQLite store instead of raising.
  • api.excheck_for_disk_updates/1 rescues ArgumentError -> :no_change for the same window on the long-poll path.

Test Plan

Each fix has a deterministic unit test, verified to fail on the pre-fix code:

  • consumer_test.exs — "dependency consumer survives a :noproc from its materializer without removing the shape". Without the fix: consumer crashes {:noproc, …} and logs "Removing shape …".
  • shape_status_test.exs — "validate_shape_handle/3 returns :error when the meta table is absent". Without the fix: raises ArgumentError.
  • api_test.exs — "disk-update check survives ETS tables vanishing during a restart". Without the fix: ArgumentError crashes serve.

Generated with Claude Code

Bug 6 surfaced as a 409 (must-refetch) on a healthy subquery shape after a
StackSupervisor restart. Three targeted, defensive fixes, each with a
deterministic regression test that fails without the fix:

- consumer.ex: notify_materializer_of_new_changes/3 now catches the :exit
  (:noproc / transient :normal/:shutdown) from its inline materializer call.
  Previously, when the materializer died mid-call during shutdown the
  consumer crashed with a non-shutdown reason, routed through
  handle_writer_termination, and removed the shape from disk — leaving it
  half-removed (the SLC was already gone) and 409ing after restart. The
  pending :DOWN now drives a clean stop instead.
- shape_status.ex: validate_shape_handle/3 rescues ArgumentError -> :error,
  so a request that wakes in the window where ShapeStatusOwner has freed the
  shape_meta_table (and not yet recreated it) falls back to SQLite rather
  than crashing.
- api.ex: check_for_disk_updates/1 rescues ArgumentError -> :no_change for
  the same transient-ETS-absent window on the long-poll path.

Note: a deterministic *oracle* repro was investigated but the candidate
turned out to reproduce the bug-3 long-poll readiness race (passes once
LONG_POLL_TIMEOUT is raised, with or without this fix), not bug 6's
cleanup cascade. Bug 6 is a tight shutdown race; these lower-level tests
pin each fix precisely instead.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@codecov

codecov Bot commented Jun 30, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 60.00%. Comparing base (044a8d5) to head (4791606).
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4666      +/-   ##
==========================================
- Coverage   60.16%   60.00%   -0.16%     
==========================================
  Files         410      395      -15     
  Lines       44348    43747     -601     
  Branches    12581    12582       +1     
==========================================
- Hits        26681    26251     -430     
+ Misses      17588    17418     -170     
+ Partials       79       78       -1     
Flag Coverage Δ
electric-telemetry ?
elixir ?
packages/agents 72.64% <ø> (ø)
packages/agents-mcp 77.70% <ø> (ø)
packages/agents-mobile 80.67% <ø> (ø)
packages/agents-runtime 83.73% <ø> (ø)
packages/agents-server 75.47% <ø> (ø)
packages/agents-server-ui 8.32% <ø> (ø)
packages/electric-ax 51.06% <ø> (ø)
packages/experimental 87.73% <ø> (ø)
packages/react-hooks 86.48% <ø> (ø)
packages/start 82.83% <ø> (ø)
packages/typescript-client 91.83% <ø> (+0.11%) ⬆️
packages/y-electric 56.05% <ø> (ø)
typescript 60.00% <ø> (+<0.01%) ⬆️
unit-tests 60.00% <ø> (-0.16%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Comment thread .changeset/restore-shutdown-shape-removal.md Outdated
@robacourt robacourt requested a review from alco June 30, 2026 15:38
@robacourt robacourt self-assigned this Jun 30, 2026
@robacourt robacourt changed the title Prevent shape removal and request crashes during server restart (bug 6) Prevent shape removal and request crashes during server restart Jun 30, 2026
@claude

claude Bot commented Jun 30, 2026

Copy link
Copy Markdown

Claude Code Review

Summary

Three small, defensive fixes that stop a subquery shape from being spuriously removed (and held HTTP requests from crashing) during a StackSupervisor restart, addressing bug 6 from the restart-aware oracle triage. The changes are narrowly scoped, each carries a deterministic regression test, and a changeset is included. Overall this is a clean, well-reasoned PR.

What's Working Well

  • Root-cause-driven, not symptom-masking. The consumer.ex catch correctly defers the death decision to the existing handle_materializer_down/2 path via the already-established monitor (consumer.ex:217). I traced the flow: swallowing the inline GenServer.call exit prevents the consumer from crashing with a non-shutdown reason, which would otherwise route through terminate/2 -> ShapeCleaner.handle_writer_termination/3 (the reason-fallback clause at shape_cleaner.ex:121) -> remove_shape_async. The pending :DOWN instead hits handle_materializer_down/2, which for :shutdown/{:shutdown, _} does a clean {:stop, reason, state} with no removal. The fix matches the stated mechanism.
  • The :error/:no_change fallbacks are genuinely safe. validate_shape_handle/3 returning :error makes shape_cache.ex:90-92 fall back to fetch_handle_by_shape against the live SQLite store — verified the caller actually does this. check_for_disk_updates/1 returning :no_change just lets the out-of-bounds long-poll path (api.ex:901) return its normal 400.
  • Tests are precise and isolate each fix. The consumer test patches only Materializer.new_changes to exit({:noproc, …}) and asserts both that remove_shape is never called and that the consumer stays alive — exactly the regression. The api test drives the real :out_of_bounds_timeout -> check_for_disk_updates path and forces the raise on the third resolve_shape_handle.
  • Excellent inline comments explaining the restart race window for each rescue/catch, and a changeset is present (@core/sync-service, matching repo convention).

Issues Found

Critical (Must Fix)

None.

Important (Should Fix)

None.

Suggestions (Nice to Have)

1. The catch enumerates exit reasons but omits :killed — a brutal-kill window remains.

File: packages/sync-service/lib/electric/shapes/consumer.ex:1048-1051

During supervisor teardown, if the materializer doesn't terminate within its shutdown timeout it is brutally killed, and the inline GenServer.call then exits with {:killed, {GenServer, :call, …}}. That is not matched by any of the four clauses, so the consumer would still crash with a non-shutdown reason and route into the shape-removal path — the very thing this PR is preventing, just via a narrower window.

Given the design intent here is "let the monitored :DOWN decide what to do, never crash the consumer on the inline notify call," a broader catch is arguably more correct than enumerating reasons, since any swallowed exit is always backstopped by the :DOWN arriving and being handled by handle_materializer_down/2. Consider either adding :killed/{:killed, _} or collapsing to a logged catch-all:

catch
  :exit, reason ->
    Logger.debug(fn ->
      "Materializer notify call exited (#{inspect(reason)}); " <>
        "deferring to monitored :DOWN for #{state.shape_handle}"
    end)
    :ok
end

Not blocking — :noproc is by far the dominant race and is covered — but worth a thought since the enumerated list invites exactly this kind of gap.

2. rescue ArgumentError -> … is broader than the documented cause.

Files: shape_status.ex:294-301, api.ex:973-979

Both rescues are scoped to the whole function body, so any unrelated ArgumentError (now or introduced later inside resolve_shape_handle/its callees) would be silently reclassified as :error/:no_change rather than surfacing. In practice the only raising operation today is the missing-ETS-table lookup, so the risk is low and the fallback is safe — but a brief note, or narrowing the rescue to the lookup call, would guard against future drift. Optional.

Issue Conformance

No GitHub issue is linked on the PR (only the internal #4648 / bugs.md triage reference in the description). Per project convention this is a minor warning, but the PR description is unusually thorough — it documents the exact race, the failure mode, and a per-fix test plan — so requirements traceability is well covered in practice. The implementation matches the described solution for all three fixes with no scope creep.

One note: CI shows an unrelated TypeScript failure (test/wake-registry.test.ts, hook timeout) reported by codecov; it is in a different package and appears unrelated to these Elixir changes, but worth confirming it's a known flake before merge.

Previous Review Status

N/A — first review of this PR.


Review iteration: 1 | 2026-06-30

@robacourt robacourt marked this pull request as draft June 30, 2026 15:54
@robacourt robacourt removed the request for review from alco June 30, 2026 15:54
@robacourt

Copy link
Copy Markdown
Contributor Author

This has been converted to a draft for now as it's not ready for review yet

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant