Skip to content

Fix silent reconciler skips on lock contention#2332

Draft
ideaship wants to merge 4 commits into
mainfrom
fix/silent-skip-on-lock-acquire-timeout
Draft

Fix silent reconciler skips on lock contention#2332
ideaship wants to merge 4 commits into
mainfrom
fix/silent-skip-on-lock-acquire-timeout

Conversation

@ideaship

@ideaship ideaship commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Summary

Fix explicit inventory reconciliations that silently skip when they cannot
acquire the reconciler lock, leaving osism sync inventory waiting for output
that will never arrive.

Also make related reconciler failure paths publish terminal output, return a
non-zero CLI result when the output wait itself times out, and prevent explicit
and periodic reconciler processes from rewriting the inventory concurrently.

Problem

osism sync inventory dispatches an explicit reconciler.run task and waits
for a terminal marker in that task's Redis output stream.

Previously, reconciler.run attempted to acquire its lock for 20 seconds. If
the lock remained held, the task skipped its entire body, published no terminal
marker, returned None, and was recorded by Celery as successful. The waiting
CLI could not distinguish that silent skip from a still-running task, so it
remained blocked until its configured task-output timeout.

This occurred during a deploy when seven reconciler tasks were dispatched in a
short burst. Five tasks silently skipped after approximately 20 seconds. The
CLI was waiting on one of those tasks with --task-timeout 1800, so the deploy
blocked for the full 30 minutes before the output wait timed out and aborted
the deploy.

Simply failing after the lock timeout is insufficient. The current lock holder
may have started before the caller's inventory changes existed, so the explicit
sync request must eventually execute its own reconciliation.

Changes

  • Return a non-zero CLI result when waiting for reconciler output times out.
  • Retry an explicit reconciler task up to five times at fixed five-second
    intervals when it cannot acquire the lock.
  • Requeue the same Celery task ID so the waiting CLI continues following the
    correct output stream.
  • Publish non-terminal retry status messages so users see the contention and
    the output inactivity deadline is refreshed.
  • Publish a non-zero terminal result after retry exhaustion and preserve the
    Celery failure state.
  • Guard explicit reconciliation failure paths so subprocess startup, output
    handling, process timeout, and task-lock failures publish terminal output
    when possible.
  • Kill and reap /run.sh after its existing 60-second process wait expires,
    and always release an acquired lock.
  • Use one execution lock key for explicit and periodic reconciler tasks.
    Explicit tasks retry on contention; periodic tasks continue to coalesce by
    skipping a contended run.

Scope limits

Terminal output publication remains best-effort. If Redis itself is
unavailable, the task cannot publish a terminal marker.

This change preserves the existing 60-second lock auto-release time. Extending
the lease for reconciliations that run longer than 60 seconds is outside this
PR's scope.

Testing

  • Unit tests cover CLI output-wait timeout failure.
  • Unit tests cover contention retries, status publication failures, retry
    exhaustion, and no-wait behavior.
  • Unit tests cover subprocess and task failure cleanup paths.
  • Unit tests verify explicit and periodic tasks use the same execution lock.
  • Full test suite, Black, and flake8 pass.

ideaship added 4 commits June 9, 2026 07:44
`osism sync inventory` waits for a terminal marker in the reconciler
task's output stream. If no further output arrives before
`--task-timeout`, the command logs an error but currently returns no
failure status, so callers can treat the timed-out command as
successful.

This timeout is normally five minutes, but callers can override it. The
deploy that exposed the reconciler lock race used `--task-timeout 1800`
and therefore waited the full 30 minutes before reaching this path.

Return a non-zero result when the output wait times out so automation
can detect the failure, following the exit-code convention established
by #2313. This does not fix the missing terminal marker itself; later
commits address the task-side causes.

AI-assisted: Codex/GPT 5.5
Signed-off-by: Roger Luethi <luethi@osism.tech>
`osism sync inventory` dispatches an explicit reconciler task and waits
for that task's output stream to publish a terminal marker. During a
deploy, seven reconciler tasks were dispatched in a short burst. Five
failed to acquire the reconciler lock within 20 seconds, returned `None`
as successful Celery tasks, and never published a terminal marker. The
CLI waiting on one of those tasks consequently waited until its
configured `--task-timeout 1800` expired 30 minutes later.

Failing immediately on contention would stop the wait but would not
fulfil the requested sync: the current lock holder may have started
before the caller's inventory change existed. Retry the same Celery task
up to five times at fixed five-second intervals so the waiting caller
continues following the correct task ID and the requested reconciliation
can still run.

Publish a non-terminal status before each retry. Besides explaining the
delay, each status resets `fetch_task_output`'s inactivity deadline.
After retry exhaustion, publish a non-zero terminal result and re-raise
the Celery failure.

`osism sync inventory --no-wait` returns immediately and dispatches the
same task with output publication disabled. Retry these background tasks
as well so contention does not silently discard the requested sync. They
publish no retry or terminal stream messages because no caller is
waiting, but remain Celery failures if retries are exhausted.

AI-assisted: Codex/GPT 5.5
Signed-off-by: Roger Luethi <luethi@osism.tech>
The lock-timeout path is not the only way an awaited reconciler task can
finish without publishing the terminal marker consumed by
`fetch_task_output`. Failures starting `/run.sh`, decoding its output,
or publishing output can bypass `finish_task_output`. The existing
60-second process wait can also raise `TimeoutExpired`, and
`check_task_lock_and_exit()` can raise `SystemExit` before a lock
exists. In each case the CLI otherwise waits until its configured task
timeout, which was 30 minutes in the deploy that exposed this class of
failure.

Run explicit reconciliation through a guarded helper. For every
non-retry failure, publish a best-effort explanatory message and
non-zero terminal result. Kill and reap a subprocess that exceeds its
process wait, and always release an acquired lock.

Preserve normal Celery `Retry` and retry-exhaustion control flow so a
contention retry is not reported as an ordinary task failure. Terminal
publication remains best-effort because Redis may itself be the failing
component.

AI-assisted: Codex/GPT 5.5
Signed-off-by: Roger Luethi <luethi@osism.tech>
Explicit `osism sync inventory` tasks and periodic reconciler tasks use
different lock keys even though both execute `/run.sh`. They can
therefore run concurrently and rewrite the same generated inventory
files. Contention retries for explicit tasks do not prevent this because
the periodic task's different lock is always independent.

Use one execution lock key for both task types. Explicit sync requests
continue retrying when the shared lock is held so their requested
reconciliation eventually runs. Periodic tasks continue coalescing by
skipping a run when they cannot acquire the lock; a later scheduled run
will reconcile the changes.

Keep the existing 60-second auto-release time and all other execution
behavior unchanged. Extending the lease for reconciliations that run
longer than 60 seconds is outside this commit's scope.

AI-assisted: Codex/GPT 5.5
Signed-off-by: Roger Luethi <luethi@osism.tech>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: Ready

Development

Successfully merging this pull request may close these issues.

2 participants