fix: tolerate slow guest exec-server bind on boot (issue #3) by ZhiXiao-Lin · Pull Request #14 · AI45Lab/Box

ZhiXiao-Lin · 2026-06-11T02:07:13Z

Issue

Closes the v2.0.7 recurrence of #3: the `WARN Exec socket appeared but heartbeat failed, exec will not be available` message, and the `run -it` PTY `Connection refused` error, on a cold boot.

Root cause (verified against v2.0.7 code)

Guest-init binds its exec (vsock 4089) and PTY (vsock 4090) servers only late in boot — after the virtio-fs pivot, network bring-up, and the container spawn (guest/init/src/main.rs Steps 8/8.5). It cannot start them earlier: namespace::spawn_isolated forks the container and runs non-async-signal-safe code (tracing/alloc) in the child before exec, which is only safe because the parent is single-threaded at fork time. Spawning the server threads first would risk a malloc/log-lock deadlock in the forked child.

On a slow or loaded host that late bind can land past the host's fixed 10 s exec-readiness budget (runtime/src/vm/ready.rs), producing the false warning. For run -it the same slow bind made the PTY attach race the server.

Note: a v2.0.1 fix kept the VM alive after container exit to dodge this race, but it was reverted in v2.0.4 (it broke clean container-exit / exit-code semantics). The readiness probe already early-returns when the VM exits, so a fast-exiting container no longer warns — the residual failure is purely a slow-but-alive guest exceeding 10 s.

Fix

- Raise the wait_for_exec_ready budget from 10 s to 30 s. It stays cheap for healthy boxes (it returns the instant the heartbeat passes) and still bails out immediately when the VM exits, so a fast-exiting container never stalls. Because boot blocks on this wait before the run -it PTY attach, and the guest brings exec and PTY up back-to-back, the existing 10 s PTY connect retry then succeeds, so no PTY-budget change is needed.
- Reword the two readiness warnings: exec/attach connect on demand, so a timed-out probe no longer claims exec is unavailable.
- Change the dead /sbin/init BOX_EXEC_EXEC default to /bin/sh (the runtime always sets the var, and /sbin/init is absent on Alpine, which was the original #3 symptom), and correct a stale resolve_oci_entrypoint doc comment.
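
As a rough illustration of the budgeted wait described in the first item, the loop below polls a heartbeat until it passes, the VM exits, or the budget runs out. It is a minimal sketch, not the runtime's actual `wait_for_exec_ready`: the `Readiness` enum, the `wait_for_ready` name, and the closure-based probes are all assumptions made for the example.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Outcome of a readiness wait (hypothetical type, not the runtime's own).
#[derive(Debug, PartialEq)]
enum Readiness {
    Ready,
    VmExited,
    TimedOut,
}

/// Poll `heartbeat` until it passes, the VM exits, or `budget` elapses.
/// `heartbeat` and `has_exited` are stand-ins for the real probes.
fn wait_for_ready(
    budget: Duration,
    poll: Duration,
    mut heartbeat: impl FnMut() -> bool,
    mut has_exited: impl FnMut() -> bool,
) -> Readiness {
    let deadline = Instant::now() + budget;
    loop {
        // A VM that has already exited can never become ready: return at once.
        if has_exited() {
            return Readiness::VmExited;
        }
        // A healthy guest returns here on the first passing heartbeat,
        // so a larger budget costs nothing in the common case.
        if heartbeat() {
            return Readiness::Ready;
        }
        if Instant::now() >= deadline {
            return Readiness::TimedOut;
        }
        sleep(poll);
    }
}
```

With this shape, raising the budget from 10 s to 30 s only changes how long a slow-but-alive guest is tolerated; the `Ready` and `VmExited` paths return just as quickly as before.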

Verification

- cargo fmt / clippy clean; runtime + guest-init unit tests pass (macOS).
- Built and smoke-tested on a real Linux/KVM host (x86_64):
  - Fast-exit echo box runs and exits 0 with no spurious warning (the has_exited early-return fires).
  - Long-lived box logs "Exec server heartbeat passed" and exec returns output, so the exec path is unaffected.
- Two independent adversarial reviews of the diff: fix-is-sound.

Scope / caveat

This addresses the timing race on an otherwise-healthy guest. A hard boot failure where the guest never binds the server (e.g. --network bridge mode when guest eth0 setup fails and aborts guest-init) is a separate fault and will surface after the wait rather than be masked. The reporter's exact environment for the v2.0.7 "+1" isn't fully known, so it's worth confirming with them after this lands.

On a cold first run on a slow/loaded host, guest-init binds its exec (vsock 4089) and PTY (vsock 4090) servers only late in boot: after the virtio-fs pivot, network bring-up, and the container spawn. It cannot start them earlier: spawn_isolated forks the container and runs non-async-signal-safe code (tracing/alloc) before exec, which is only safe because the parent is single-threaded at fork time; starting the server threads first would risk a malloc/log-lock deadlock in the child.

That late bind could land past the host's fixed 10s exec-readiness budget, producing the false "Exec socket appeared but heartbeat failed, exec will not be available" warning. For `run -it` the same slow bind made the PTY attach race the server ("Connection refused").

Raise the wait_for_exec_ready budget to 30s. It stays cheap for healthy boxes (returns the moment the heartbeat passes) and already bails out the instant the VM exits, so a fast-exiting container never stalls for the full budget. Because boot blocks on this exec wait before the `run -it` PTY attach, and the guest brings exec+PTY up back-to-back, the existing 10s PTY connect retry then succeeds, so no PTY-budget change is needed.

Also: reword the two readiness warnings (exec/attach connect on demand, so a timed-out probe no longer claims exec is unavailable); fix the dead `/sbin/init` BOX_EXEC_EXEC default to `/bin/sh` (the runtime always sets the var, and /sbin/init is absent on Alpine, which was the original #3 symptom); and correct a stale resolve_oci_entrypoint doc comment.

Scope: addresses the timing race on a healthy guest. A hard boot failure where the guest never binds (e.g. bridge-mode eth0 setup failing) is a separate fault and surfaces after the wait rather than being masked.

ZhiXiao-Lin · 2026-06-11T04:18:00Z

Superseded by #15, which replaces this interim 10s→30s budget bump with the proper structural fix: event-driven, VM-liveness-bounded readiness (no fixed budget to outrun) + early-bind vsock servers + a real PID 1 reaper. Closing in favor of #15.

…sue #3)

Restructures the exec/PTY readiness path so boot waits for a real readiness EVENT bounded by VM liveness, instead of guessing a fixed timeout, replacing the interim 10s→30s band-aid.

P1: bind early, serve late. Split exec_server/pty_server into bind_*()->Listener (pure socket/bind/listen syscalls) and serve_*(listener) (the accept loop). run_init now binds both vsock listeners on the main thread right after the filesystem mounts (Step 2.6), BEFORE the slow network bring-up and the container fork, then spawns the accept loops after the fork (Step 8). Binding adds no thread, so the single-threaded-at-fork invariant that keeps spawn_isolated safe is preserved. The listen backlog is filled from boot, so a host connect QUEUES instead of being refused; this removes the `run -it` PTY "Connection refused". CLOEXEC keeps the forked container from inheriting the listeners.

P2: event-driven, liveness-bounded readiness. Early binding makes the host `connect` succeed immediately, so heartbeat()'s (timeout-less) read would block until the guest's accept loop runs. wait_for_exec_ready is rewritten to bound each connect+heartbeat attempt (tokio timeout), return at once when the VM exits (has_exited, zombie-aware, so fast-exit containers never stall), and treat a large absolute cap purely as a backstop against a wedged-but-alive guest. A healthy guest passes the heartbeat the moment its accept loop runs, however late in a slow cold boot, so the false "heartbeat failed" warning is gone without a fixed budget to outrun.

Also folds in the issue-#3 cleanups: dead `/sbin/init` BOX_EXEC_EXEC default → `/bin/sh`, and the stale resolve_oci_entrypoint doc comment.

Deferred: an explicit guest→host "ready" beacon on a new vsock port was considered but NOT wired: port_forward uses add_vsock_port(listen=true) with a guest connect-out, which contradicts the assumed listen=false direction for guest→host, and that is only verifiable on KVM. The liveness-bounded heartbeat achieves the same correctness without guessing cross-process vsock semantics.

Supersedes the interim 30s fix (PR #14).
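
The bind-early/serve-late split and the per-attempt bound can be sketched with the standard library alone. This is an illustration under stated assumptions, not the project's code: loopback TCP stands in for vsock (std has no vsock type), a socket read timeout stands in for the tokio timeout, and `bind_server`, `serve_server`, `spawn_serve`, and `heartbeat` are hypothetical names that only mirror the `bind_*`/`serve_*` split described above.

```rust
use std::io::{self, Read, Write};
use std::net::{SocketAddr, TcpListener, TcpStream};
use std::thread;
use std::time::Duration;

/// Bind and listen only; no thread is created here, so the caller
/// stays single-threaded until it chooses to start serving.
fn bind_server() -> io::Result<TcpListener> {
    TcpListener::bind("127.0.0.1:0")
}

/// Accept loop. Connections that queued in the backlog before this
/// started are accepted first, in order.
fn serve_server(listener: TcpListener) {
    for conn in listener.incoming() {
        let mut stream = match conn {
            Ok(s) => s,
            Err(_) => continue,
        };
        // Do not let one silent peer wedge the whole loop.
        let _ = stream.set_read_timeout(Some(Duration::from_secs(1)));
        let mut buf = [0u8; 4];
        // An abandoned or malformed connection fails here and is skipped.
        if stream.read_exact(&mut buf).is_ok() && &buf == b"ping" {
            let _ = stream.write_all(b"pong");
        }
    }
}

/// Start the accept loop once the fork-sensitive work is done.
fn spawn_serve(listener: TcpListener) -> thread::JoinHandle<()> {
    thread::spawn(move || serve_server(listener))
}

/// One bounded connect + heartbeat attempt. The connect succeeds as soon
/// as the listener is bound; the read is what waits for the accept loop,
/// so the read must carry the bound.
fn heartbeat(addr: SocketAddr, limit: Duration) -> io::Result<bool> {
    let mut stream = TcpStream::connect_timeout(&addr, limit)?;
    stream.set_read_timeout(Some(limit))?;
    stream.write_all(b"ping")?;
    let mut buf = [0u8; 4];
    stream.read_exact(&mut buf)?;
    Ok(&buf == b"pong")
}
```

Calling `heartbeat` between `bind_server` and `spawn_serve` shows the behaviour P2 has to handle: the connect is queued rather than refused, and the attempt ends with a timeout error instead of blocking indefinitely. Once the accept loop is running, the same call returns `Ok(true)`.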

… reaper (issue #3) (#15)

* refactor(init): early-bind vsock servers + event-driven readiness (issue #3)

* fix(init): real PID1 reaper - reap orphans without stealing exec/PTY exit codes

guest-init runs as PID 1 but only waited on the container pid, so reparented grandchildren and the sidecar were never reaped and accumulated as zombies for the VM's lifetime. The earlier code couldn't just waitpid(-1): that races with the exec/PTY handlers, which waitpid their own children to read the real exit code. A stolen child makes the handler see ECHILD and report a bogus exit 0 (exec_server.rs). That tension is exactly why a prior fix narrowed the loop to waitpid(container_pid), trading the zombie leak for correct exec codes.

Resolve both with a small reaper registry:
- New `reaper` module: handlers mark their child pid MANAGED across the spawn (the lock is held across fork, closing the spawn/register race for fast-exiting commands like `exec -- false`); an RAII guard unregisters on every return path.
- The supervision loop now peeks exited children non-destructively with `waitid(WNOWAIT)` and routes: the container -> reap + propagate exit code (VM lifecycle, unchanged); MANAGED children -> left for their handler to reap (real exit codes preserved); everything else (orphans + sidecar) -> reaped here.
- exec one-shot + streaming spawns and the PTY fork register their children; their existing waitpid/try_wait paths are unchanged.

Fixes the zombie leak and makes the long-standing "reaped by the zombie-reaper loop" comments true again, with no regression to exec/PTY or container exit codes.

Unit-tested (reaper registry); needs KVM verification of exec exit codes + orphan reaping. Builds on P1+P2 (issue #3).

Co-authored-by: Roy Lin <roylin@a3s.box>
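
The registry idea in the reaper commit can be sketched as follows. This is a simplified model under assumptions, not the actual `reaper` module: `Registry`, `ManagedGuard`, `spawn_managed`, and `Route` are hypothetical names, the "spawn" is a closure returning a pid rather than a real fork, and the `waitid(WNOWAIT)` peek is left out because it needs a syscall binding outside the standard library.

```rust
use std::collections::HashSet;
use std::sync::{Arc, Mutex};

/// Where the supervision loop sends an exited child (hypothetical enum).
#[derive(Debug, PartialEq)]
enum Route {
    Container, // reap here and propagate the exit code
    Managed,   // leave for the exec/PTY handler to reap
    Orphan,    // reap here and discard
}

/// Shared set of pids that handlers will reap themselves.
#[derive(Clone, Default)]
struct Registry {
    managed: Arc<Mutex<HashSet<u32>>>,
}

/// RAII guard: unregisters the pid when dropped, on every return path.
struct ManagedGuard {
    registry: Registry,
    pid: u32,
}

impl Registry {
    /// Run `spawn` while holding the lock, then record the child's pid.
    /// Holding the lock across the spawn means the supervision loop cannot
    /// observe the child before it is marked as managed.
    fn spawn_managed(&self, spawn: impl FnOnce() -> u32) -> ManagedGuard {
        let mut set = self.managed.lock().unwrap();
        let pid = spawn();
        set.insert(pid);
        ManagedGuard { registry: self.clone(), pid }
    }

    /// Decide who is responsible for reaping an exited `pid`.
    fn route(&self, pid: u32, container_pid: u32) -> Route {
        if pid == container_pid {
            Route::Container
        } else if self.managed.lock().unwrap().contains(&pid) {
            Route::Managed
        } else {
            Route::Orphan
        }
    }
}

impl Drop for ManagedGuard {
    fn drop(&mut self) {
        // Tolerate a poisoned lock rather than panicking inside drop.
        if let Ok(mut set) = self.registry.managed.lock() {
            set.remove(&self.pid);
        }
    }
}
```

A handler would hold the guard for as long as it intends to call waitpid on its own child; once the guard is dropped, the same pid routes as an orphan and the supervision loop is free to reap it.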

* refactor(init): early-bind vsock servers + event-driven readiness (issue #3)

* fix(init): real PID1 reaper - reap orphans without stealing exec/PTY exit codes

* docs: P2 deferred-main-spawn design (GO-WITH-CONDITIONS)

Adversarial mapping of the #15+#18 base resolved both crux uncertainties: console logs come free via process-wide fd inheritance (Stdio::inherit, not the exec path's piped), and the multi-threaded fork hazard is avoided by spawning the deferred main via Command::spawn (not spawn_isolated's raw fork; the VM already isolates). Conditions: single spawn-main (CAS) + atomic late container-pid handoff to the reaper. Includes risk-ranked blockers + a 7-phase plan whose Phase 0 is a single KVM prototype that de-risks the whole feature.

Co-authored-by: Roy Lin <roylin@a3s.box>
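
The two conditions named in the design note (a single spawn-main guarded by CAS, and an atomic late handoff of the container pid) could look roughly like this. It is a speculative sketch of the pattern only: `DeferredMain`, `spawn_once`, and the use of 0 as "not spawned yet" are assumptions for the example and are not taken from the design document.

```rust
use std::sync::atomic::{AtomicBool, AtomicI32, Ordering};

/// Hypothetical shared state for a deferred main process.
struct DeferredMain {
    spawned: AtomicBool,
    container_pid: AtomicI32, // 0 means "not spawned yet"
}

impl DeferredMain {
    const fn new() -> Self {
        DeferredMain {
            spawned: AtomicBool::new(false),
            container_pid: AtomicI32::new(0),
        }
    }

    /// Run `spawn` at most once. The compare-exchange lets exactly one
    /// caller win; every later caller gets `None` without spawning.
    fn spawn_once(&self, spawn: impl FnOnce() -> i32) -> Option<i32> {
        if self
            .spawned
            .compare_exchange(false, true, Ordering::AcqRel, Ordering::Acquire)
            .is_err()
        {
            return None;
        }
        let pid = spawn();
        // Publish the pid so a reaper thread can start treating it as the container.
        self.container_pid.store(pid, Ordering::Release);
        Some(pid)
    }

    /// What a reaper would read: `None` until the handoff has happened.
    fn container_pid(&self) -> Option<i32> {
        match self.container_pid.load(Ordering::Acquire) {
            0 => None,
            pid => Some(pid),
        }
    }
}
```

Note that there is a window between winning the CAS and storing the pid in which a reaper sees `None`; a real design would need to decide how exits observed in that window are routed.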

This was referenced Jun 11, 2026

WARN Exec socket appeared but heartbeat failed, exec will not be available #3

Open

refactor(init): event-driven readiness + early-bind vsock + real PID1 reaper (issue #3) #15

Merged

ZhiXiao-Lin closed this Jun 11, 2026

ZhiXiao-Lin wants to merge 1 commit into main from fix/exec-server-startup-race
