Fix silent reconciler skips on lock contention#2332
Draft
ideaship wants to merge 4 commits into
Draft
Conversation
`osism sync inventory` waits for a terminal marker in the reconciler task's output stream. If no further output arrives before `--task-timeout`, the command logs an error but currently returns no failure status, so callers can treat the timed-out command as successful. This timeout is normally five minutes, but callers can override it. The deploy that exposed the reconciler lock race used `--task-timeout 1800` and therefore waited the full 30 minutes before reaching this path. Return a non-zero result when the output wait times out so automation can detect the failure, following the exit-code convention established by #2313. This does not fix the missing terminal marker itself; later commits address the task-side causes. AI-assisted: Codex/GPT 5.5 Signed-off-by: Roger Luethi <luethi@osism.tech>
`osism sync inventory` dispatches an explicit reconciler task and waits for that task's output stream to publish a terminal marker. During a deploy, seven reconciler tasks were dispatched in a short burst. Five failed to acquire the reconciler lock within 20 seconds, returned `None` as successful Celery tasks, and never published a terminal marker. The CLI waiting on one of those tasks consequently waited until its configured `--task-timeout 1800` expired 30 minutes later. Failing immediately on contention would stop the wait but would not fulfil the requested sync: the current lock holder may have started before the caller's inventory change existed. Retry the same Celery task up to five times at fixed five-second intervals so the waiting caller continues following the correct task ID and the requested reconciliation can still run. Publish a non-terminal status before each retry. Besides explaining the delay, each status resets `fetch_task_output`'s inactivity deadline. After retry exhaustion, publish a non-zero terminal result and re-raise the Celery failure. `osism sync inventory --no-wait` returns immediately and dispatches the same task with output publication disabled. Retry these background tasks as well so contention does not silently discard the requested sync. They publish no retry or terminal stream messages because no caller is waiting, but remain Celery failures if retries are exhausted. AI-assisted: Codex/GPT 5.5 Signed-off-by: Roger Luethi <luethi@osism.tech>
The lock-timeout path is not the only way an awaited reconciler task can finish without publishing the terminal marker consumed by `fetch_task_output`. Failures starting `/run.sh`, decoding its output, or publishing output can bypass `finish_task_output`. The existing 60-second process wait can also raise `TimeoutExpired`, and `check_task_lock_and_exit()` can raise `SystemExit` before a lock exists. In each case the CLI otherwise waits until its configured task timeout, which was 30 minutes in the deploy that exposed this class of failure. Run explicit reconciliation through a guarded helper. For every non-retry failure, publish a best-effort explanatory message and non-zero terminal result. Kill and reap a subprocess that exceeds its process wait, and always release an acquired lock. Preserve normal Celery `Retry` and retry-exhaustion control flow so a contention retry is not reported as an ordinary task failure. Terminal publication remains best-effort because Redis may itself be the failing component. AI-assisted: Codex/GPT 5.5 Signed-off-by: Roger Luethi <luethi@osism.tech>
Explicit `osism sync inventory` tasks and periodic reconciler tasks use different lock keys even though both execute `/run.sh`. They can therefore run concurrently and rewrite the same generated inventory files. Contention retries for explicit tasks do not prevent this because the periodic task's different lock is always independent. Use one execution lock key for both task types. Explicit sync requests continue retrying when the shared lock is held so their requested reconciliation eventually runs. Periodic tasks continue coalescing by skipping a run when they cannot acquire the lock; a later scheduled run will reconcile the changes. Keep the existing 60-second auto-release time and all other execution behavior unchanged. Extending the lease for reconciliations that run longer than 60 seconds is outside this commit's scope. AI-assisted: Codex/GPT 5.5 Signed-off-by: Roger Luethi <luethi@osism.tech>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fix explicit inventory reconciliations that silently skip when they cannot
acquire the reconciler lock, leaving
osism sync inventorywaiting for outputthat will never arrive.
Also make related reconciler failure paths publish terminal output, return a
non-zero CLI result when the output wait itself times out, and prevent explicit
and periodic reconciler processes from rewriting the inventory concurrently.
Problem
osism sync inventorydispatches an explicitreconciler.runtask and waitsfor a terminal marker in that task's Redis output stream.
Previously,
reconciler.runattempted to acquire its lock for 20 seconds. Ifthe lock remained held, the task skipped its entire body, published no terminal
marker, returned
None, and was recorded by Celery as successful. The waitingCLI could not distinguish that silent skip from a still-running task, so it
remained blocked until its configured task-output timeout.
This occurred during a deploy when seven reconciler tasks were dispatched in a
short burst. Five tasks silently skipped after approximately 20 seconds. The
CLI was waiting on one of those tasks with
--task-timeout 1800, so the deployblocked for the full 30 minutes before the output wait timed out and aborted
the deploy.
Simply failing after the lock timeout is insufficient. The current lock holder
may have started before the caller's inventory changes existed, so the explicit
sync request must eventually execute its own reconciliation.
Changes
intervals when it cannot acquire the lock.
correct output stream.
the output inactivity deadline is refreshed.
Celery failure state.
handling, process timeout, and task-lock failures publish terminal output
when possible.
/run.shafter its existing 60-second process wait expires,and always release an acquired lock.
Explicit tasks retry on contention; periodic tasks continue to coalesce by
skipping a contended run.
Scope limits
Terminal output publication remains best-effort. If Redis itself is
unavailable, the task cannot publish a terminal marker.
This change preserves the existing 60-second lock auto-release time. Extending
the lease for reconciliations that run longer than 60 seconds is outside this
PR's scope.
Testing
exhaustion, and no-wait behavior.