Prevent shape removal and request crashes during server restart#4666
Prevent shape removal and request crashes during server restart#4666robacourt wants to merge 2 commits into
Conversation
Bug 6 surfaced as a 409 (must-refetch) on a healthy subquery shape after a StackSupervisor restart. Three targeted, defensive fixes, each with a deterministic regression test that fails without the fix: - consumer.ex: notify_materializer_of_new_changes/3 now catches the :exit (:noproc / transient :normal/:shutdown) from its inline materializer call. Previously, when the materializer died mid-call during shutdown the consumer crashed with a non-shutdown reason, routed through handle_writer_termination, and removed the shape from disk — leaving it half-removed (the SLC was already gone) and 409ing after restart. The pending :DOWN now drives a clean stop instead. - shape_status.ex: validate_shape_handle/3 rescues ArgumentError -> :error, so a request that wakes in the window where ShapeStatusOwner has freed the shape_meta_table (and not yet recreated it) falls back to SQLite rather than crashing. - api.ex: check_for_disk_updates/1 rescues ArgumentError -> :no_change for the same transient-ETS-absent window on the long-poll path. Note: a deterministic *oracle* repro was investigated but the candidate turned out to reproduce the bug-3 long-poll readiness race (passes once LONG_POLL_TIMEOUT is raised, with or without this fix), not bug 6's cleanup cascade. Bug 6 is a tight shutdown race; these lower-level tests pin each fix precisely instead. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #4666 +/- ##
==========================================
- Coverage 60.16% 60.00% -0.16%
==========================================
Files 410 395 -15
Lines 44348 43747 -601
Branches 12581 12582 +1
==========================================
- Hits 26681 26251 -430
+ Misses 17588 17418 -170
+ Partials 79 78 -1
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
Claude Code ReviewSummaryThree small, defensive fixes that stop a subquery shape from being spuriously removed (and held HTTP requests from crashing) during a What's Working Well
Issues FoundCritical (Must Fix)None. Important (Should Fix)None. Suggestions (Nice to Have)1. The File: During supervisor teardown, if the materializer doesn't terminate within its shutdown timeout it is brutally killed, and the inline Given the design intent here is "let the monitored catch
:exit, reason ->
Logger.debug(fn ->
"Materializer notify call exited (#{inspect(reason)}); " <>
"deferring to monitored :DOWN for #{state.shape_handle}"
end)
:ok
endNot blocking — 2. Files: Both rescues are scoped to the whole function body, so any unrelated Issue ConformanceNo GitHub issue is linked on the PR (only the internal One note: CI shows an unrelated TypeScript failure ( Previous Review StatusN/A — first review of this PR. Review iteration: 1 | 2026-06-30 |
|
This has been converted to a draft for now as it's not ready for review yet |
Summary
Fixes bug 6 from the restart-aware oracle property-test triage (#4648 /
bugs.md): a healthy subquery shape returning a409 (must-refetch)after aStackSupervisorrestart.Three targeted, defensive fixes — each with a deterministic regression test that fails without the fix and passes with it.
Problem
Bug 6 is a shutdown race. During a restart's stack teardown:
Cleanup cascade. A dependency consumer's inline
GenServer.callinto its materializer (notify_materializer_of_new_changes/3) races the materializer's death and exits with:noproc. The consumer crashes with a non-shutdown reason, which routes throughhandle_writer_terminationand removes the shape from disk. Because theShapeLogCollectoris already gone mid-shutdown, the removal half-completes — the handle is deleted from SQLite but cleanup fails — so after the restart the handle is missing and a fresh poll on thatoptimized: trueshape gets a 409.Transient ETS-absent window. During the restart,
ShapeStatusOwnerfrees the per-stack ETS tables and the new owner recreates them. A long-poll held by Bandit can wake up in that window and read a now-missing table, raisingArgumentErrorand crashing the request handler.Solution
consumer.ex—notify_materializer_of_new_changes/3catches the:exit(:noproc, and transient:normal/:shutdown) from its inline materializer call. The pending monitored:DOWNthen drives a clean stop instead of the abnormal-shutdown path that removes the shape.shape_status.ex—validate_shape_handle/3rescuesArgumentError -> :error, so a request in the table-recreation window falls back to the live SQLite store instead of raising.api.ex—check_for_disk_updates/1rescuesArgumentError -> :no_changefor the same window on the long-poll path.Test Plan
Each fix has a deterministic unit test, verified to fail on the pre-fix code:
consumer_test.exs— "dependency consumer survives a :noproc from its materializer without removing the shape". Without the fix: consumer crashes{:noproc, …}and logs "Removing shape …".shape_status_test.exs— "validate_shape_handle/3 returns :error when the meta table is absent". Without the fix: raisesArgumentError.api_test.exs— "disk-update check survives ETS tables vanishing during a restart". Without the fix:ArgumentErrorcrashesserve.Generated with Claude Code