fix: tolerate slow guest exec-server bind on boot (issue #3) #14

Closed
ZhiXiao-Lin wants to merge 1 commit into main from fix/exec-server-startup-race

@ZhiXiao-Lin

Contributor

Issue

Closes the v2.0.7 recurrence of #3: `WARN Exec socket appeared but heartbeat failed, exec will not be available`, and `run -it` PTY `Connection refused` on a cold boot.

Root cause (verified against v2.0.7 code)

Guest-init binds its exec (vsock 4089) and PTY (vsock 4090) servers only late in boot — after the virtio-fs pivot, network bring-up, and the container spawn (guest/init/src/main.rs Steps 8/8.5). It cannot start them earlier: namespace::spawn_isolated forks the container and runs non-async-signal-safe code (tracing/alloc) in the child before exec, which is only safe because the parent is single-threaded at fork time. Spawning the server threads first would risk a malloc/log-lock deadlock in the forked child.

On a slow or loaded host that late bind can land past the host's fixed 10 s exec-readiness budget (runtime/src/vm/ready.rs), producing the false warning. For run -it the same slow bind made the PTY attach race the server.

Note: a v2.0.1 fix kept the VM alive after container exit to dodge this race, but it was reverted in v2.0.4 (it broke clean container-exit / exit-code semantics). The readiness probe already early-returns when the VM exits, so a fast-exiting container no longer warns — the residual failure is purely a slow-but-alive guest exceeding 10 s.

Fix

  • Raise wait_for_exec_ready budget 10 s → 30 s. Stays cheap for healthy boxes (returns the instant the heartbeat passes) and still bails out immediately when the VM exits, so a fast-exiting container never stalls. Because boot blocks on this wait before the run -it PTY attach, and the guest brings exec+PTY up back-to-back, the existing 10 s PTY connect retry then succeeds — no PTY-budget change needed.
  • Reword the two readiness warnings: exec/attach connect on demand, so a timed-out probe no longer claims exec is unavailable.
  • Change the dead /sbin/init BOX_EXEC_EXEC default to /bin/sh (the runtime always sets the var, and /sbin/init is absent on Alpine, which was the original #3 symptom), and correct a stale resolve_oci_entrypoint doc comment.
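
The budgeted wait described in the first bullet can be sketched roughly as follows. This is a minimal illustration, not the code in runtime/src/vm/ready.rs: `wait_with_budget`, `WaitOutcome`, and the two closures are hypothetical names standing in for the real heartbeat probe and the has_exited check.

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Outcome of the readiness wait (illustrative names, not the project's types).
#[derive(Debug, PartialEq)]
enum WaitOutcome {
    Ready,
    VmExited,
    TimedOut,
}

/// Poll until the heartbeat passes, the VM exits, or `budget` elapses.
/// `heartbeat_ok` and `vm_exited` are assumed stand-ins for the real probes.
fn wait_with_budget(
    budget: Duration,
    poll_interval: Duration,
    mut heartbeat_ok: impl FnMut() -> bool,
    mut vm_exited: impl FnMut() -> bool,
) -> WaitOutcome {
    let deadline = Instant::now() + budget;
    loop {
        // A fast-exiting container returns here without waiting out the budget.
        if vm_exited() {
            return WaitOutcome::VmExited;
        }
        // A healthy guest returns as soon as one probe passes.
        if heartbeat_ok() {
            return WaitOutcome::Ready;
        }
        if Instant::now() >= deadline {
            return WaitOutcome::TimedOut;
        }
        thread::sleep(poll_interval);
    }
}

fn main() {
    // Simulated guest whose heartbeat passes on the third probe.
    let mut probes = 0;
    let outcome = wait_with_budget(
        Duration::from_secs(30),
        Duration::from_millis(10),
        || {
            probes += 1;
            probes >= 3
        },
        || false,
    );
    println!("{outcome:?} after {probes} probes");
}
```

With this shape, raising the budget only changes how long a slow-but-alive guest is tolerated. The healthy and fast-exit paths return at the same point as before.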

Verification

  • cargo fmt / clippy clean; runtime + guest-init unit tests pass (macOS).
  • Built and smoke-tested on a real Linux/KVM host (x86_64):
    • Fast-exit echo box runs and exits 0 with no spurious warning (the has_exited early-return fires).
    • Long-lived box logs Exec server heartbeat passed and exec returns output — exec path unaffected.
  • Two independent adversarial reviews of the diff: fix-is-sound.

Scope / caveat

This addresses the timing race on an otherwise-healthy guest. A hard boot failure where the guest never binds the server (e.g. --network bridge mode when guest eth0 setup fails and aborts guest-init) is a separate fault and will surface after the wait rather than be masked. The reporter's exact environment for the v2.0.7 "+1" isn't fully known, so it's worth confirming with them after this lands.

On a cold first run on a slow/loaded host, guest-init binds its exec
(vsock 4089) and PTY (vsock 4090) servers only late in boot — after the
virtio-fs pivot, network bring-up, and the container spawn. It cannot
start them earlier: spawn_isolated forks the container and runs
non-async-signal-safe code (tracing/alloc) before exec, which is only
safe because the parent is single-threaded at fork time; starting the
server threads first would risk a malloc/log-lock deadlock in the child.

That late bind could land past the host's fixed 10s exec-readiness
budget, producing the false "Exec socket appeared but heartbeat failed,
exec will not be available" warning. For `run -it` the same slow bind
made the PTY attach race the server ("Connection refused").

Raise the wait_for_exec_ready budget to 30s. It stays cheap for healthy
boxes (returns the moment the heartbeat passes) and already bails out the
instant the VM exits, so a fast-exiting container never stalls for the
full budget. Because boot blocks on this exec wait before the `run -it`
PTY attach, and the guest brings exec+PTY up back-to-back, the existing
10s PTY connect retry then succeeds — no PTY-budget change needed.

Also: reword the two readiness warnings (exec/attach connect on demand,
so a timed-out probe no longer claims exec is unavailable); fix the dead
`/sbin/init` BOX_EXEC_EXEC default to `/bin/sh` (the runtime always sets
the var, and /sbin/init is absent on Alpine — the original #3 symptom);
and correct a stale resolve_oci_entrypoint doc comment.

Scope: addresses the timing race on a healthy guest. A hard boot failure
where the guest never binds (e.g. bridge-mode eth0 setup failing) is a
separate fault and surfaces after the wait rather than being masked.
@ZhiXiao-Lin

Contributor Author

Superseded by #15, which replaces this interim 10s→30s budget bump with the proper structural fix: event-driven, VM-liveness-bounded readiness (no fixed budget to outrun) + early-bind vsock servers + a real PID 1 reaper. Closing in favor of #15.

ZhiXiao-Lin pushed a commit that referenced this pull request Jun 11, 2026
…sue #3)

Restructures the exec/PTY readiness path so boot waits for a real readiness
EVENT bounded by VM liveness, instead of guessing a fixed timeout — replacing
the interim 10s→30s band-aid.

P1 — bind early, serve late. Split exec_server/pty_server into
bind_*()->Listener (pure socket/bind/listen syscalls) and serve_*(listener)
(the accept loop). run_init now binds both vsock listeners on the main thread
right after the filesystem mounts (Step 2.6), BEFORE the slow network bring-up
and the container fork, then spawns the accept loops after the fork (Step 8).
Binding adds no thread, so the single-threaded-at-fork invariant that keeps
spawn_isolated safe is preserved. The listen backlog is filled from boot, so a
host connect QUEUES instead of being refused — this removes the `run -it` PTY
"Connection refused". CLOEXEC keeps the forked container from inheriting the
listeners.
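
A minimal sketch of the bind-early, serve-late split, assuming a loopback TCP socket as a stand-in for the vsock listeners (the standard library has no vsock type). `bind_echo`, `serve_echo`, and `connect_before_serve` are hypothetical names, not the functions in guest-init.

```rust
use std::io::{Read, Write};
use std::net::{TcpListener, TcpStream};
use std::thread;
use std::time::Duration;

/// Phase 1: socket/bind/listen only. No thread is created here.
fn bind_echo() -> std::io::Result<TcpListener> {
    TcpListener::bind("127.0.0.1:0")
}

/// Phase 2: the accept loop, started later on its own thread.
fn serve_echo(listener: TcpListener) {
    for stream in listener.incoming() {
        let Ok(mut stream) = stream else { continue };
        let mut buf = [0u8; 4];
        if stream.read_exact(&mut buf).is_ok() && &buf == b"ping" {
            let _ = stream.write_all(b"pong");
        }
    }
}

/// Connect while nothing is accepting yet, then start serving and read the reply.
fn connect_before_serve() -> std::io::Result<String> {
    let listener = bind_echo()?;
    let addr = listener.local_addr()?;

    // No accept loop exists yet: the kernel queues the connection in the
    // listen backlog instead of refusing it.
    let mut client = TcpStream::connect(addr)?;
    client.write_all(b"ping")?;

    // Stand-in for the slow boot work that happens between bind and serve.
    thread::sleep(Duration::from_millis(50));
    thread::spawn(move || serve_echo(listener));

    let mut reply = [0u8; 4];
    client.read_exact(&mut reply)?;
    Ok(String::from_utf8_lossy(&reply).into_owned())
}

fn main() -> std::io::Result<()> {
    println!("reply: {}", connect_before_serve()?);
    Ok(())
}
```

The client connects and writes before any accept loop exists, and still gets its reply once serving starts. That is the property that turns a refused connect into a queued one.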

P2 — event-driven, liveness-bounded readiness. Early binding makes the host
`connect` succeed immediately, so heartbeat()'s (timeout-less) read would block
until the guest's accept loop runs. wait_for_exec_ready is rewritten to bound
each connect+heartbeat attempt (tokio timeout), return at once when the VM
exits (has_exited, zombie-aware — fast-exit containers never stall), and treat
a large absolute cap purely as a backstop against a wedged-but-alive guest. A
healthy guest passes the heartbeat the moment its accept loop runs, however
late in a slow cold boot — so the false "heartbeat failed" warning is gone
without a fixed budget to outrun.
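
The readiness loop described above might look roughly like this. It is a synchronous sketch under stated assumptions: socket timeouts from the standard library replace the tokio timeout, a loopback TCP listener replaces vsock, and `heartbeat_once`, `wait_for_ready`, `Readiness`, and `slow_guest_demo` are hypothetical names.

```rust
use std::io::{Read, Write};
use std::net::{SocketAddr, TcpListener, TcpStream};
use std::thread;
use std::time::{Duration, Instant};

#[derive(Debug, PartialEq)]
enum Readiness {
    Passed,
    VmExited,
    CapReached,
}

/// One bounded attempt: connect, send a ping, wait at most `limit` for the pong.
/// With an early-bound listener the connect succeeds at once, so the read
/// timeout is what keeps this from blocking indefinitely.
fn heartbeat_once(addr: SocketAddr, limit: Duration) -> bool {
    let Ok(mut stream) = TcpStream::connect_timeout(&addr, limit) else {
        return false;
    };
    if stream.set_read_timeout(Some(limit)).is_err() {
        return false;
    }
    if stream.write_all(b"ping").is_err() {
        return false;
    }
    let mut reply = [0u8; 4];
    stream.read_exact(&mut reply).is_ok() && &reply == b"pong"
}

/// Retry bounded attempts until one passes, the VM exits, or the cap is hit.
/// The cap is only a backstop for a wedged-but-alive guest.
fn wait_for_ready(
    addr: SocketAddr,
    limit: Duration,
    cap: Duration,
    mut vm_exited: impl FnMut() -> bool,
) -> Readiness {
    let started = Instant::now();
    loop {
        if vm_exited() {
            return Readiness::VmExited;
        }
        if heartbeat_once(addr, limit) {
            return Readiness::Passed;
        }
        if started.elapsed() >= cap {
            return Readiness::CapReached;
        }
        // Avoid a tight spin when an attempt fails immediately.
        thread::sleep(Duration::from_millis(10));
    }
}

/// Simulated slow guest: the listener is bound now, the accept loop starts late.
fn slow_guest_demo() -> std::io::Result<Readiness> {
    let listener = TcpListener::bind("127.0.0.1:0")?;
    let addr = listener.local_addr()?;
    thread::spawn(move || {
        thread::sleep(Duration::from_millis(250));
        for stream in listener.incoming() {
            let Ok(mut stream) = stream else { continue };
            let mut buf = [0u8; 4];
            if stream.read_exact(&mut buf).is_ok() {
                // Earlier timed-out attempts are already closed; ignore errors.
                let _ = stream.write_all(b"pong");
            }
        }
    });
    Ok(wait_for_ready(
        addr,
        Duration::from_millis(100),
        Duration::from_secs(10),
        || false,
    ))
}

fn main() -> std::io::Result<()> {
    println!("{:?}", slow_guest_demo()?);
    Ok(())
}
```

The first few attempts time out individually while the accept loop is still asleep. The wait as a whole only ends on a passed heartbeat, a VM exit, or the backstop cap.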

Also folds in the issue-#3 cleanups: dead `/sbin/init` BOX_EXEC_EXEC default →
`/bin/sh`, and the stale resolve_oci_entrypoint doc comment.

Deferred: an explicit guest→host "ready" beacon on a new vsock port was
considered but NOT wired — port_forward uses add_vsock_port(listen=true) with a
guest connect-out, which contradicts the assumed listen=false direction for
guest→host, and that is only verifiable on KVM. The liveness-bounded heartbeat
achieves the same correctness without guessing cross-process vsock semantics.
Supersedes the interim 30s fix (PR #14).
ZhiXiao-Lin added a commit that referenced this pull request Jun 11, 2026
… reaper (issue #3) (#15)

* refactor(init): early-bind vsock servers + event-driven readiness (issue #3)

* fix(init): real PID1 reaper — reap orphans without stealing exec/PTY exit codes

guest-init runs as PID 1 but only waited on the container pid, so reparented
grandchildren and the sidecar were never reaped and accumulated as zombies for
the VM's lifetime. The earlier code couldn't just waitpid(-1): that races with
the exec/PTY handlers, which waitpid their own children to read the real exit
code — a stolen child makes the handler see ECHILD and report a bogus exit 0
(exec_server.rs). That tension is exactly why a prior fix narrowed the loop to
waitpid(container_pid), trading the zombie leak for correct exec codes.

Resolve both with a small reaper registry:
- New `reaper` module: handlers mark their child pid MANAGED across the spawn
  (the lock is held across fork, closing the spawn/register race for fast-exiting
  commands like `exec -- false`); an RAII guard unregisters on every return path.
- The supervision loop now peeks exited children non-destructively with
  `waitid(WNOWAIT)` and routes: the container -> reap + propagate exit code (VM
  lifecycle, unchanged); MANAGED children -> left for their handler to reap (real
  exit codes preserved); everything else (orphans + sidecar) -> reaped here.
- exec one-shot + streaming spawns and the PTY fork register their children;
  their existing waitpid/try_wait paths are unchanged.

Fixes the zombie leak and makes the long-standing "reaped by the zombie-reaper
loop" comments true again, with no regression to exec/PTY or container exit
codes. Unit-tested (reaper registry); needs KVM verification of exec exit codes
+ orphan reaping. Builds on P1+P2 (issue #3).
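
The registry and routing decision can be sketched as below. This is an assumption-laden illustration, not the `reaper` module itself: the `waitid(WNOWAIT)` peek and the actual fork are left out (they need libc), pids are plain integers, and `Registry`, `ManagedGuard`, `spawn_managed`, and `Route` are hypothetical names.

```rust
use std::collections::HashSet;
use std::sync::{Arc, Mutex};

type Pid = u32;

/// What the supervision loop should do with a child it has peeked at.
#[derive(Debug, PartialEq)]
enum Route {
    Container,       // reap here and propagate the exit code
    LeaveForHandler, // a handler owns this pid and will reap it itself
    ReapHere,        // orphan or sidecar
}

#[derive(Clone, Default)]
struct Registry {
    managed: Arc<Mutex<HashSet<Pid>>>,
}

/// RAII guard: unregisters the pid on every return path of the handler.
struct ManagedGuard {
    registry: Registry,
    pid: Pid,
}

impl Drop for ManagedGuard {
    fn drop(&mut self) {
        self.registry.managed.lock().unwrap().remove(&self.pid);
    }
}

impl Registry {
    /// Run `spawn` while holding the lock, then register the pid it returns.
    /// Because `route` takes the same lock, it cannot observe the child
    /// between spawn and registration, even if the child exits at once.
    fn spawn_managed(&self, spawn: impl FnOnce() -> Pid) -> ManagedGuard {
        let mut managed = self.managed.lock().unwrap();
        let pid = spawn();
        managed.insert(pid);
        ManagedGuard { registry: self.clone(), pid }
    }

    /// Decide who reaps an exited child that was peeked at without reaping.
    fn route(&self, exited: Pid, container: Pid) -> Route {
        if exited == container {
            Route::Container
        } else if self.managed.lock().unwrap().contains(&exited) {
            Route::LeaveForHandler
        } else {
            Route::ReapHere
        }
    }
}

fn main() {
    let registry = Registry::default();
    let container: Pid = 100;

    // Pretend pid 201 is the child of an exec handler.
    let guard = registry.spawn_managed(|| 201);
    println!("while managed: {:?}", registry.route(201, container));
    drop(guard);
    println!("after guard drop: {:?}", registry.route(201, container));
    println!("unknown pid: {:?}", registry.route(999, container));
}
```

Holding the lock across the spawn is what closes the race for a command that exits before its pid would otherwise be registered.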

---------

Co-authored-by: Roy Lin <roylin@a3s.box>
ZhiXiao-Lin added a commit that referenced this pull request Jun 11, 2026
* refactor(init): early-bind vsock servers + event-driven readiness (issue #3)

* fix(init): real PID1 reaper — reap orphans without stealing exec/PTY exit codes

* docs: P2 deferred-main-spawn design (GO-WITH-CONDITIONS)

Adversarial mapping of the #15+#18 base resolved both crux uncertainties:
console logs come free via process-wide fd inheritance (Stdio::inherit, not the
exec path's piped), and the multi-threaded fork hazard is avoided by spawning the
deferred main via Command::spawn (not spawn_isolated's raw fork; the VM already
isolates). Conditions: single spawn-main (CAS) + atomic late container-pid handoff
to the reaper. Includes risk-ranked blockers + a 7-phase plan whose Phase 0 is a
single KVM prototype that de-risks the whole feature.
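
The two conditions (a single spawn-main gated by CAS, and an atomic container-pid handoff) could be sketched as follows. This is a guess at the shape under stated assumptions, not the design document's code: `MainSlot`, `spawn_once`, and `container_pid` are hypothetical names, and 0 is assumed to mean "no pid published yet".

```rust
use std::sync::atomic::{AtomicBool, AtomicI32, Ordering};
use std::sync::Arc;
use std::thread;

/// Shared slot for the deferred main process (illustrative only).
struct MainSlot {
    spawned: AtomicBool,
    container_pid: AtomicI32, // 0 = not published yet (assumption)
}

impl MainSlot {
    fn new() -> Self {
        Self {
            spawned: AtomicBool::new(false),
            container_pid: AtomicI32::new(0),
        }
    }

    /// Only the caller that wins the compare-and-swap runs `spawn`.
    /// Everyone else gets `None`, so main is spawned at most once.
    fn spawn_once(&self, spawn: impl FnOnce() -> i32) -> Option<i32> {
        if self
            .spawned
            .compare_exchange(false, true, Ordering::AcqRel, Ordering::Acquire)
            .is_err()
        {
            return None;
        }
        let pid = spawn();
        // Publish the pid so a reaper loop can start treating it as the container.
        self.container_pid.store(pid, Ordering::Release);
        Some(pid)
    }

    /// What a reaper loop would read to learn the late container pid.
    fn container_pid(&self) -> Option<i32> {
        match self.container_pid.load(Ordering::Acquire) {
            0 => None,
            pid => Some(pid),
        }
    }
}

fn main() {
    let slot = Arc::new(MainSlot::new());

    // Several racing requests to spawn main; exactly one should win.
    let handles: Vec<_> = (0..4)
        .map(|i| {
            let slot = Arc::clone(&slot);
            thread::spawn(move || slot.spawn_once(|| 1000 + i).is_some())
        })
        .collect();

    let winners = handles
        .into_iter()
        .map(|h| h.join().unwrap())
        .filter(|won| *won)
        .count();
    println!("winners: {winners}, container pid: {:?}", slot.container_pid());
}
```

A real implementation would also have to handle a child that exits in the window between the spawn and the store. This sketch does not attempt that.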

---------

Co-authored-by: Roy Lin <roylin@a3s.box>