refactor(init): event-driven readiness + early-bind vsock + real PID1 reaper (issue #3) by ZhiXiao-Lin · Pull Request #15 · AI45Lab/Box

ZhiXiao-Lin · 2026-06-11T04:17:34Z

The proper, structural fix for #3 (the recurring WARN Exec socket appeared but heartbeat failed / run -it Connection refused), plus the PID-1 reaper that the same area needs. Supersedes #14 (the interim 10s→30s budget bump): this removes that band-aid and replaces it with an event-driven, VM-liveness-bounded wait.

Why

On a slow/loaded cold boot, guest-init binds its exec (vsock 4089) and PTY (4090) servers only late in boot (after virtio-fs pivot, network bring-up, container spawn), so the bind could land past the host's fixed readiness budget → false warning + -it race. It can't bind earlier naively: spawn_isolated runs non-async-signal-safe code before exec, safe only because the parent is single-threaded at fork. Separately, guest-init (as PID 1) only waited on the container pid, so reparented orphans + the sidecar were never reaped.

What (3 parts, 2 commits)

P1: bind early, serve late. Split the exec/PTY servers into bind() (pure socket/bind/listen syscalls, run on the main thread before the network bring-up and the fork; this is fork-safe, and the listen backlog exists from then on, so a host connect queues instead of being refused) and serve() (the accept loop, spawned after the fork). Removes the -it "Connection refused".
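The bind/serve split can be illustrated with a minimal sketch. It assumes a loopback TCP socket as a stand-in for the vsock listeners, and the names `bind_early`, `serve_late`, and `connect_before_serve` are hypothetical, not the functions in this PR. It only shows that a client can connect and write before any accept loop exists, because the kernel queues the connection in the listen backlog.

```rust
use std::io::{Read, Write};
use std::net::{SocketAddr, TcpListener, TcpStream};
use std::thread;

/// Phase 1 (hypothetical name): only socket/bind/listen happen here.
/// No thread is spawned, so the process stays single-threaded.
fn bind_early() -> std::io::Result<TcpListener> {
    // Assumption: loopback TCP stands in for the guest's vsock listener.
    TcpListener::bind("127.0.0.1:0")
}

/// Phase 2 (hypothetical name): the accept loop. It echoes 4 bytes per
/// connection and returns after `max_conns` connections.
fn serve_late(listener: TcpListener, max_conns: usize) -> std::io::Result<()> {
    for stream in listener.incoming().take(max_conns) {
        let mut stream = stream?;
        let mut buf = [0u8; 4];
        stream.read_exact(&mut buf)?;
        stream.write_all(&buf)?;
    }
    Ok(())
}

/// Connects and writes BEFORE the accept loop starts, then starts serving.
fn connect_before_serve() -> std::io::Result<[u8; 4]> {
    let listener = bind_early()?;
    let addr: SocketAddr = listener.local_addr()?;

    // No accept loop is running yet; the connection waits in the backlog.
    let mut client = TcpStream::connect(addr)?;
    client.write_all(b"ping")?;

    // Stand-in for "after the fork": only now is the accept loop spawned.
    let server = thread::spawn(move || serve_late(listener, 1));

    let mut reply = [0u8; 4];
    client.read_exact(&mut reply)?;
    server.join().expect("server thread panicked")?;
    Ok(reply)
}
```

Calling `connect_before_serve()` returns the echoed bytes even though the client connected first. The client's read simply blocks until the accept loop runs, which is the behavior P2 then has to bound.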

P2: event-driven, liveness-bounded readiness. Early binding makes the host connect succeed immediately, so heartbeat()'s (timeout-less) read would block until the guest accepts. wait_for_exec_ready is therefore rewritten to bound each connect+heartbeat attempt, return at once when the VM exits (has_exited, so fast-exit containers never stall), and use a large cap only as a backstop. A healthy guest passes the heartbeat the moment its accept loop runs, however slow the boot, so there is no fixed budget to outrun. Also folded in: the dead /sbin/init default becomes /bin/sh, and a stale doc comment is fixed.
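The shape of a liveness-bounded wait can be sketched as follows. This is not the PR's wait_for_exec_ready (which uses a tokio timeout per attempt); it is a synchronous, std-only illustration in which `wait_for_ready`, `Readiness`, and the closure parameters are hypothetical names. The `probe` closure is assumed to honor the per-attempt timeout it is given.

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Possible outcomes of the wait (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
enum Readiness {
    Ready,       // one connect+heartbeat attempt succeeded
    VmExited,    // the VM is gone, so waiting longer is pointless
    BackstopHit, // guest is alive but never answered within the large cap
}

/// Retries a bounded probe until it succeeds, the VM exits, or the
/// backstop elapses. `probe` receives the per-attempt timeout.
fn wait_for_ready<P, E>(
    mut probe: P,
    mut has_exited: E,
    attempt_timeout: Duration,
    backstop: Duration,
    retry_delay: Duration,
) -> Readiness
where
    P: FnMut(Duration) -> bool,
    E: FnMut() -> bool,
{
    let start = Instant::now();
    loop {
        // Liveness check first: a fast-exit container must not stall the caller.
        if has_exited() {
            return Readiness::VmExited;
        }
        // Each attempt is bounded, so a blocked heartbeat read cannot hang the loop.
        if probe(attempt_timeout) {
            return Readiness::Ready;
        }
        // The cap is only a backstop against a wedged-but-alive guest.
        if start.elapsed() >= backstop {
            return Readiness::BackstopHit;
        }
        thread::sleep(retry_delay);
    }
}
```

With this structure the number of failed attempts does not matter: a guest that boots slowly is reported ready on the first attempt that succeeds, and only VM exit or the backstop ends the wait early.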

P3: real PID 1 reaper. New reaper registry: handlers mark their child pid MANAGED across the spawn (the lock is held across fork, which closes the spawn/register race for fast-exiting commands); the supervision loop peeks exited children non-destructively with waitid(WNOWAIT) and routes them: container → reap + propagate exit code; MANAGED → left for the handler (real exec/PTY codes preserved); orphans + sidecar → reaped here. Fixes the zombie leak without the waitpid(-1) race that would corrupt exec codes (a handler that hits ECHILD reports exit 0).
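The routing decision can be sketched in isolation. The following is a std-only illustration of the registry idea, not the PR's `reaper` module: `Registry`, `ManagedGuard`, `spawn_managed`, and `Route` are hypothetical names, pids are plain integers, and the actual waitid(WNOWAIT) peek and reaping syscalls are left out.

```rust
use std::collections::HashSet;
use std::sync::Mutex;

type Pid = i32;

/// What the supervision loop should do with an exited child (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
enum Route {
    Container,       // reap and propagate the exit code
    LeaveForHandler, // MANAGED: the exec/PTY handler reaps it itself
    ReapHere,        // orphan or sidecar: reap in the supervision loop
}

/// Set of pids currently owned by a handler.
#[derive(Default)]
struct Registry {
    managed: Mutex<HashSet<Pid>>,
}

/// RAII guard: the pid is unregistered on every return path of the handler.
struct ManagedGuard<'a> {
    registry: &'a Registry,
    pid: Pid,
}

impl Drop for ManagedGuard<'_> {
    fn drop(&mut self) {
        // Skip silently if the lock is poisoned; panicking in drop would abort.
        if let Ok(mut set) = self.registry.managed.lock() {
            set.remove(&self.pid);
        }
    }
}

impl Registry {
    /// Runs `spawn` while holding the lock, so the pid is registered before
    /// the supervision loop can look it up, even if the child exits at once.
    fn spawn_managed<F: FnOnce() -> Pid>(&self, spawn: F) -> ManagedGuard<'_> {
        let mut set = self.managed.lock().unwrap();
        let pid = spawn();
        set.insert(pid);
        ManagedGuard { registry: self, pid }
    }

    /// Classifies a child that was peeked without being reaped.
    fn route(&self, exited: Pid, container: Pid) -> Route {
        if exited == container {
            Route::Container
        } else if self.managed.lock().unwrap().contains(&exited) {
            Route::LeaveForHandler
        } else {
            Route::ReapHere
        }
    }
}
```

A handler would hold the guard for as long as it intends to wait on its child. Once the guard is dropped, the same pid routes to `ReapHere`, so a child abandoned by its handler is still collected by the supervision loop.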

Testing (real KVM + unit)

Unit: ~2300 workspace lib tests pass (0 fail), incl. new reaper registry tests.
Real-VM core_smoke: 12/14 pass (incl. interactive_pty, foreground_exit_code, lifecycle, pause/kill/wait, ports, compose, volumes); the 2 failures (network_connect_disconnect_before_start, restart_policy_monitor_recovers_dead_box) fail identically on the clean baseline — pre-existing, unrelated to this change. host_smoke command matrix passes.
Issue #3 (WARN Exec socket appeared but heartbeat failed, exec will not be available) before/after: with a 15s boot delay, the old code emits the heartbeat failed warning at 10s and exec fails (Connection reset); this branch waits out the 17s boot, no warning, exec works.
P3 exit codes (the risk): container exit 7→7; exec exit 42→42, false→1, true→0, exit 3→3 — no regression. Reaper keeps 0 zombies.

Honest caveat

I could not empirically reproduce the P3 zombie leak: shell-created orphans got reaped within the container tree (busybox ash reaps its subshell bg jobs) on both old and new code. P3 is architecturally correct (old code provably never waitpid(-1)s, so a genuine double-fork daemon's reparented grandchild + the sidecar would leak) and verified to not regress exit codes, but its practical impact is lower than the analysis implied. P2's fix is the high-value, verified one.

Built/verified on x86_64 Linux/KVM; the suite needs A3S_REGISTRY_MIRRORS (Docker Hub egress was blocked in the test env) and a3s-box-shim built in the same profile.

* refactor(init): early-bind vsock servers + event-driven readiness (issue #3)

Restructures the exec/PTY readiness path so boot waits for a real readiness EVENT bounded by VM liveness, instead of guessing a fixed timeout, replacing the interim 10s→30s band-aid.

P1: bind early, serve late. Split exec_server/pty_server into bind_*()->Listener (pure socket/bind/listen syscalls) and serve_*(listener) (the accept loop). run_init now binds both vsock listeners on the main thread right after the filesystem mounts (Step 2.6), BEFORE the slow network bring-up and the container fork, then spawns the accept loops after the fork (Step 8). Binding adds no thread, so the single-threaded-at-fork invariant that keeps spawn_isolated safe is preserved. The listen backlog is filled from boot, so a host connect QUEUES instead of being refused; this removes the `run -it` PTY "Connection refused". CLOEXEC keeps the forked container from inheriting the listeners.

P2: event-driven, liveness-bounded readiness. Early binding makes the host `connect` succeed immediately, so heartbeat()'s (timeout-less) read would block until the guest's accept loop runs. wait_for_exec_ready is rewritten to bound each connect+heartbeat attempt (tokio timeout), return at once when the VM exits (has_exited, zombie-aware, so fast-exit containers never stall), and treat a large absolute cap purely as a backstop against a wedged-but-alive guest. A healthy guest passes the heartbeat the moment its accept loop runs, however late in a slow cold boot, so the false "heartbeat failed" warning is gone without a fixed budget to outrun.

Also folds in the issue-#3 cleanups: dead `/sbin/init` BOX_EXEC_EXEC default → `/bin/sh`, and the stale resolve_oci_entrypoint doc comment.

Deferred: an explicit guest→host "ready" beacon on a new vsock port was considered but NOT wired. port_forward uses add_vsock_port(listen=true) with a guest connect-out, which contradicts the assumed listen=false direction for guest→host, and that is only verifiable on KVM. The liveness-bounded heartbeat achieves the same correctness without guessing cross-process vsock semantics.

Supersedes the interim 30s fix (PR #14).

* fix(init): real PID1 reaper - reap orphans without stealing exec/PTY exit codes

guest-init runs as PID 1 but only waited on the container pid, so reparented grandchildren and the sidecar were never reaped and accumulated as zombies for the VM's lifetime. The earlier code couldn't just waitpid(-1): that races with the exec/PTY handlers, which waitpid their own children to read the real exit code. A stolen child makes the handler see ECHILD and report a bogus exit 0 (exec_server.rs). That tension is exactly why a prior fix narrowed the loop to waitpid(container_pid), trading the zombie leak for correct exec codes.

Resolve both with a small reaper registry:

- New `reaper` module: handlers mark their child pid MANAGED across the spawn (the lock is held across fork, closing the spawn/register race for fast-exiting commands like `exec -- false`); an RAII guard unregisters on every return path.
- The supervision loop now peeks exited children non-destructively with `waitid(WNOWAIT)` and routes: the container -> reap + propagate exit code (VM lifecycle, unchanged); MANAGED children -> left for their handler to reap (real exit codes preserved); everything else (orphans + sidecar) -> reaped here.
- exec one-shot + streaming spawns and the PTY fork register their children; their existing waitpid/try_wait paths are unchanged.

Fixes the zombie leak and makes the long-standing "reaped by the zombie-reaper loop" comments true again, with no regression to exec/PTY or container exit codes.

Unit-tested (reaper registry); needs KVM verification of exec exit codes + orphan reaping. Builds on P1+P2 (issue #3).

* docs: P2 deferred-main-spawn design (GO-WITH-CONDITIONS)

Adversarial mapping of the #15+#18 base resolved both crux uncertainties: console logs come free via process-wide fd inheritance (Stdio::inherit, not the exec path's piped), and the multi-threaded fork hazard is avoided by spawning the deferred main via Command::spawn (not spawn_isolated's raw fork; the VM already isolates). Conditions: single spawn-main (CAS) + atomic late container-pid handoff to the reaper. Includes risk-ranked blockers + a 7-phase plan whose Phase 0 is a single KVM prototype that de-risks the whole feature.

---------

Co-authored-by: Roy Lin <roylin@a3s.box>

Roy Lin added 2 commits June 11, 2026 16:31

ZhiXiao-Lin force-pushed the refactor/init-readiness branch from e8b1824 to f1db2cf Compare June 11, 2026 08:31

ZhiXiao-Lin merged commit dad756e into main Jun 11, 2026
7 checks passed

ZhiXiao-Lin deleted the refactor/init-readiness branch June 11, 2026 08:38

ZhiXiao-Lin mentioned this pull request Jun 11, 2026

docs: P2 deferred-main-spawn design (GO-WITH-CONDITIONS) #21

Merged

ZhiXiao-Lin merged 2 commits into main from refactor/init-readiness
