refactor(init): event-driven readiness + early-bind vsock + real PID1 reaper (issue #3) #15

Merged: ZhiXiao-Lin merged 2 commits into main from refactor/init-readiness on Jun 11, 2026
@ZhiXiao-Lin

The proper, structural fix for #3 (the recurring WARN Exec socket appeared but heartbeat failed / run -it Connection refused), plus the PID-1 reaper that the same area needs. Supersedes #14 (the interim 10s→30s budget bump): this removes that band-aid and replaces it with an event-driven, VM-liveness-bounded wait.

Why

On a slow/loaded cold boot, guest-init binds its exec (vsock 4089) and PTY (4090) servers only late in boot (after virtio-fs pivot, network bring-up, container spawn), so the bind could land past the host's fixed readiness budget → false warning + -it race. It can't bind earlier naively: spawn_isolated runs non-async-signal-safe code before exec, safe only because the parent is single-threaded at fork. Separately, guest-init (as PID 1) only waited on the container pid, so reparented orphans + the sidecar were never reaped.

What (3 parts, in 2 commits)

P1 — bind early, serve late. Split exec/PTY servers into bind() (pure socket/bind/listen syscalls, on the main thread before the network + fork — fork-safe, fills the listen backlog so a host connect queues instead of being refused) and serve() (accept loop, spawned after the fork). Removes the -it "Connection refused".
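
A minimal sketch of the bind/serve split described in P1, not the PR's actual code. It assumes a loopback `TcpListener` as a stand-in for the vsock listener, and `bind_exec`, `serve_exec`, and `early_connect_is_queued` are hypothetical names chosen for illustration. It shows a client connecting before any accept loop exists and being answered once the loop starts.

```rust
use std::io::{Read, Write};
use std::net::{TcpListener, TcpStream};
use std::thread;

/// Bind phase: only socket/bind/listen happen here, and no thread is created,
/// so it can run on the main thread before any fork.
/// Assumption: loopback TCP stands in for the vsock listener.
fn bind_exec() -> std::io::Result<TcpListener> {
    TcpListener::bind("127.0.0.1:0")
}

/// Serve phase: the accept loop, started only after the slow boot work.
/// Handles `max_conns` connections and answers a one-byte heartbeat on each.
fn serve_exec(listener: TcpListener, max_conns: usize) -> thread::JoinHandle<()> {
    thread::spawn(move || {
        for stream in listener.incoming().take(max_conns) {
            if let Ok(mut s) = stream {
                let mut ping = [0u8; 1];
                if s.read_exact(&mut ping).is_ok() {
                    let _ = s.write_all(b"k");
                }
            }
        }
    })
}

/// Connects while no accept loop is running, then starts the loop and reads
/// the heartbeat reply from the connection that was queued in the backlog.
fn early_connect_is_queued() -> std::io::Result<u8> {
    let listener = bind_exec()?;
    let addr = listener.local_addr()?;

    // No accept loop yet: the kernel completes the handshake and parks the
    // connection in the listen backlog instead of refusing it.
    let mut host = TcpStream::connect(addr)?;
    host.write_all(b"p")?;

    // "Serve late": the accept loop starts only now.
    let server = serve_exec(listener, 1);
    let mut reply = [0u8; 1];
    host.read_exact(&mut reply)?;
    server.join().expect("serve thread panicked");
    Ok(reply[0])
}
```

Calling `early_connect_is_queued()` returns the heartbeat byte even though the connect happened before the accept loop existed. That is the property P1 relies on, here shown for TCP only.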

P2 — event-driven, liveness-bounded readiness. With early binding, the host connect succeeds immediately and heartbeat()'s (timeout-less) read blocks until the guest accepts. wait_for_exec_ready is therefore rewritten to bound each connect+heartbeat attempt, return at once when the VM exits (has_exited, so fast-exit containers never stall), and use a large cap only as a backstop. A healthy guest passes the heartbeat the moment its accept loop runs, however slow the boot, so there is no fixed budget to outrun. Also: the dead /sbin/init default is replaced with /bin/sh, and a stale doc comment is fixed.
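
The control flow of P2 can be sketched roughly as follows. This is a synchronous, std-only approximation rather than the real `wait_for_exec_ready`: the `Readiness` enum, the `wait_for_ready` signature, and the 1 ms retry pause are all assumptions made for illustration, and the caller-supplied `attempt` closure is expected to bound itself by the timeout it is given.

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Possible outcomes of the readiness wait (hypothetical names).
#[derive(Debug, PartialEq, Eq)]
enum Readiness {
    /// A bounded connect+heartbeat attempt succeeded.
    Ready,
    /// The VM exited before becoming ready, so there is nothing to wait for.
    VmExited,
    /// The large absolute cap expired while the VM was still alive.
    BackstopHit,
}

/// Retries bounded attempts until one succeeds, the VM exits, or the
/// backstop elapses. `attempt` receives the per-attempt timeout and must
/// return within roughly that time.
fn wait_for_ready(
    mut attempt: impl FnMut(Duration) -> bool,
    mut vm_exited: impl FnMut() -> bool,
    per_attempt: Duration,
    backstop: Duration,
) -> Readiness {
    let start = Instant::now();
    loop {
        // Liveness check first: a fast-exit container must never stall us.
        if vm_exited() {
            return Readiness::VmExited;
        }
        if attempt(per_attempt) {
            return Readiness::Ready;
        }
        // The cap is only a backstop against a wedged-but-alive guest.
        if start.elapsed() >= backstop {
            return Readiness::BackstopHit;
        }
        // Assumed short pause so an instantly failing attempt does not spin.
        thread::sleep(Duration::from_millis(1));
    }
}
```

In this shape, a slow boot only increases the number of loop iterations. The wait ends on the readiness event itself or on VM exit, not on a tuned budget.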

P3 — real PID 1 reaper. New reaper registry: handlers mark their child pid MANAGED across the spawn (lock held across fork closes the spawn/register race for fast-exiting commands); the supervision loop peeks exited children non-destructively with waitid(WNOWAIT) and routes — container → reap+propagate exit code; MANAGED → left for the handler (real exec/PTY codes preserved); orphans + sidecar → reaped here. Fixes the zombie leak without the waitpid(-1) race that would corrupt exec codes (an ECHILD'd handler reports exit 0).
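
The routing rule in P3 can be modeled as a pure function. The sketch below is an illustration under assumptions, not the PR's code: `Route` and `route_exited_child` are hypothetical names, pids are plain `u32` values, and the non-destructive peek itself (`waitid` with `WNOWAIT`) is left out because it needs a libc binding.

```rust
use std::collections::HashSet;

/// What the supervision loop does with a child that was seen to have exited.
#[derive(Debug, PartialEq, Eq)]
enum Route {
    /// The container: reap it and propagate its exit code.
    ReapAndPropagate,
    /// A MANAGED child: leave it so its exec/PTY handler reads the real code.
    LeaveForHandler,
    /// Anything else (reparented orphan, sidecar): reap it in the loop.
    ReapHere,
}

/// Decides the route for an exited child pid. `managed` is the set of pids
/// that handlers have registered as MANAGED.
fn route_exited_child(pid: u32, container_pid: u32, managed: &HashSet<u32>) -> Route {
    if pid == container_pid {
        Route::ReapAndPropagate
    } else if managed.contains(&pid) {
        Route::LeaveForHandler
    } else {
        Route::ReapHere
    }
}
```

Keeping the decision separate from the wait syscall makes it easy to unit-test without forking real processes.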

Testing (real KVM + unit)

  • Unit: ~2300 workspace lib tests pass (0 fail), incl. new reaper registry tests.
  • Real-VM core_smoke: 12/14 pass (incl. interactive_pty, foreground_exit_code, lifecycle, pause/kill/wait, ports, compose, volumes); the 2 failures (network_connect_disconnect_before_start, restart_policy_monitor_recovers_dead_box) fail identically on the clean baseline — pre-existing, unrelated to this change. host_smoke command matrix passes.
  • Issue #3 (WARN Exec socket appeared but heartbeat failed, exec will not be available) before/after: with a 15s boot delay, the old code emits the heartbeat failed warning at 10s and exec fails (Connection reset); this branch waits out the 17s boot, emits no warning, and exec works.
  • P3 exit codes (the risk): container exit 7→7; exec exit 42→42, false→1, true→0, exit 3→3 — no regression. Reaper keeps 0 zombies.

Honest caveat

I could not empirically reproduce the P3 zombie leak: shell-created orphans got reaped within the container tree (busybox ash reaps its subshell bg jobs) on both old and new code. P3 is architecturally correct (old code provably never waitpid(-1)s, so a genuine double-fork daemon's reparented grandchild + the sidecar would leak) and verified to not regress exit codes, but its practical impact is lower than the analysis implied. P2's fix is the high-value, verified one.

Built/verified on x86_64 Linux/KVM; the suite needs A3S_REGISTRY_MIRRORS (Docker Hub egress was blocked in the test env) and a3s-box-shim built in the same profile.

Roy Lin added 2 commits June 11, 2026 16:31
…sue #3)

Restructures the exec/PTY readiness path so boot waits for a real readiness
EVENT bounded by VM liveness, instead of guessing a fixed timeout — replacing
the interim 10s→30s band-aid.

P1 — bind early, serve late. Split exec_server/pty_server into
bind_*()->Listener (pure socket/bind/listen syscalls) and serve_*(listener)
(the accept loop). run_init now binds both vsock listeners on the main thread
right after the filesystem mounts (Step 2.6), BEFORE the slow network bring-up
and the container fork, then spawns the accept loops after the fork (Step 8).
Binding adds no thread, so the single-threaded-at-fork invariant that keeps
spawn_isolated safe is preserved. The listen backlog is filled from boot, so a
host connect QUEUES instead of being refused — this removes the `run -it` PTY
"Connection refused". CLOEXEC keeps the forked container from inheriting the
listeners.

P2 — event-driven, liveness-bounded readiness. Early binding makes the host
`connect` succeed immediately, so heartbeat()'s (timeout-less) read would block
until the guest's accept loop runs. wait_for_exec_ready is rewritten to bound
each connect+heartbeat attempt (tokio timeout), return at once when the VM
exits (has_exited, zombie-aware — fast-exit containers never stall), and treat
a large absolute cap purely as a backstop against a wedged-but-alive guest. A
healthy guest passes the heartbeat the moment its accept loop runs, however
late in a slow cold boot — so the false "heartbeat failed" warning is gone
without a fixed budget to outrun.

Also folds in the issue-#3 cleanups: dead `/sbin/init` BOX_EXEC_EXEC default →
`/bin/sh`, and the stale resolve_oci_entrypoint doc comment.

Deferred: an explicit guest→host "ready" beacon on a new vsock port was
considered but NOT wired — port_forward uses add_vsock_port(listen=true) with a
guest connect-out, which contradicts the assumed listen=false direction for
guest→host, and that is only verifiable on KVM. The liveness-bounded heartbeat
achieves the same correctness without guessing cross-process vsock semantics.
Supersedes the interim 30s fix (PR #14).
…exit codes

guest-init runs as PID 1 but only waited on the container pid, so reparented
grandchildren and the sidecar were never reaped and accumulated as zombies for
the VM's lifetime. The earlier code couldn't just waitpid(-1): that races with
the exec/PTY handlers, which waitpid their own children to read the real exit
code — a stolen child makes the handler see ECHILD and report a bogus exit 0
(exec_server.rs). That tension is exactly why a prior fix narrowed the loop to
waitpid(container_pid), trading the zombie leak for correct exec codes.

Resolve both with a small reaper registry:
- New `reaper` module: handlers mark their child pid MANAGED across the spawn
  (the lock is held across fork, closing the spawn/register race for fast-exiting
  commands like `exec -- false`); an RAII guard unregisters on every return path.
- The supervision loop now peeks exited children non-destructively with
  `waitid(WNOWAIT)` and routes: the container -> reap + propagate exit code (VM
  lifecycle, unchanged); MANAGED children -> left for their handler to reap (real
  exit codes preserved); everything else (orphans + sidecar) -> reaped here.
- exec one-shot + streaming spawns and the PTY fork register their children;
  their existing waitpid/try_wait paths are unchanged.
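
A rough sketch of the registry idea in the list above, under stated assumptions: `Registry`, `ManagedGuard`, `spawn_managed`, and `is_managed` are hypothetical names, and the spawn is modeled as a closure returning a pid instead of a real fork. It illustrates two points from the commit message: the lock is held across the spawn so the pid is registered before the supervision loop can look it up, and an RAII guard unregisters the pid on every return path.

```rust
use std::collections::HashSet;
use std::sync::{Arc, Mutex};

/// Shared set of pids currently owned by an exec/PTY handler.
#[derive(Clone, Default)]
struct Registry {
    managed: Arc<Mutex<HashSet<u32>>>,
}

/// Unregisters its pid when dropped, whatever path the handler returns on.
struct ManagedGuard {
    registry: Registry,
    pid: u32,
}

impl Registry {
    /// Runs `spawn` while holding the registry lock, then records the pid.
    /// A reaper consulting `is_managed` blocks until the pid is registered,
    /// which closes the window for a child that exits immediately.
    fn spawn_managed<E>(
        &self,
        spawn: impl FnOnce() -> Result<u32, E>,
    ) -> Result<ManagedGuard, E> {
        let mut set = self.managed.lock().unwrap();
        let pid = spawn()?; // on error nothing is registered
        set.insert(pid);
        Ok(ManagedGuard { registry: self.clone(), pid })
    }

    /// True while a handler still owns `pid`.
    fn is_managed(&self, pid: u32) -> bool {
        self.managed.lock().unwrap().contains(&pid)
    }
}

impl Drop for ManagedGuard {
    fn drop(&mut self) {
        // Avoid panicking in drop if the mutex was poisoned.
        if let Ok(mut set) = self.registry.managed.lock() {
            set.remove(&self.pid);
        }
    }
}
```

A handler would hold the guard for as long as it intends to wait on the child itself. Once the guard is dropped, the pid is treated like any other unmanaged child.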

Fixes the zombie leak and makes the long-standing "reaped by the zombie-reaper
loop" comments true again, with no regression to exec/PTY or container exit
codes. Unit-tested (reaper registry); needs KVM verification of exec exit codes
+ orphan reaping. Builds on P1+P2 (issue #3).
@ZhiXiao-Lin ZhiXiao-Lin force-pushed the refactor/init-readiness branch from e8b1824 to f1db2cf Compare June 11, 2026 08:31
@ZhiXiao-Lin ZhiXiao-Lin merged commit dad756e into main Jun 11, 2026
7 checks passed
@ZhiXiao-Lin ZhiXiao-Lin deleted the refactor/init-readiness branch June 11, 2026 08:38
ZhiXiao-Lin added a commit that referenced this pull request Jun 11, 2026
* refactor(init): early-bind vsock servers + event-driven readiness (issue #3)

* fix(init): real PID1 reaper — reap orphans without stealing exec/PTY exit codes

* docs: P2 deferred-main-spawn design (GO-WITH-CONDITIONS)

Adversarial mapping of the #15+#18 base resolved both crux uncertainties:
console logs come free via process-wide fd inheritance (Stdio::inherit, not the
exec path's piped), and the multi-threaded fork hazard is avoided by spawning the
deferred main via Command::spawn (not spawn_isolated's raw fork; the VM already
isolates). Conditions: single spawn-main (CAS) + atomic late container-pid handoff
to the reaper. Includes risk-ranked blockers + a 7-phase plan whose Phase 0 is a
single KVM prototype that de-risks the whole feature.

---------

Co-authored-by: Roy Lin <roylin@a3s.box>