docs: P2 deferred-main-spawn design (GO-WITH-CONDITIONS) #19
Closed
ZhiXiao-Lin wants to merge 3 commits into
added 3 commits
June 11, 2026 10:47
…sue #3)

Restructures the exec/PTY readiness path so boot waits for a real readiness EVENT bounded by VM liveness, instead of guessing a fixed timeout. This replaces the interim 10s→30s band-aid.

P1: bind early, serve late. Split exec_server/pty_server into bind_*()->Listener (pure socket/bind/listen syscalls) and serve_*(listener) (the accept loop). run_init now binds both vsock listeners on the main thread right after the filesystem mounts (Step 2.6), BEFORE the slow network bring-up and the container fork, then spawns the accept loops after the fork (Step 8). Binding adds no thread, so the single-threaded-at-fork invariant that keeps spawn_isolated safe is preserved. The listen backlog is filled from boot, so a host connect QUEUES instead of being refused; this removes the `run -it` PTY "Connection refused". CLOEXEC keeps the forked container from inheriting the listeners.

P2: event-driven, liveness-bounded readiness. Early binding makes the host `connect` succeed immediately, so heartbeat()'s (timeout-less) read would block until the guest's accept loop runs. wait_for_exec_ready is rewritten to bound each connect+heartbeat attempt (tokio timeout), return at once when the VM exits (has_exited, zombie-aware, so fast-exit containers never stall), and treat a large absolute cap purely as a backstop against a wedged-but-alive guest. A healthy guest passes the heartbeat the moment its accept loop runs, however late in a slow cold boot, so the false "heartbeat failed" warning is gone without a fixed budget to outrun.

Also folds in the issue-#3 cleanups: dead `/sbin/init` BOX_EXEC_EXEC default → `/bin/sh`, and the stale resolve_oci_entrypoint doc comment.

Deferred: an explicit guest→host "ready" beacon on a new vsock port was considered but NOT wired. port_forward uses add_vsock_port(listen=true) with a guest connect-out, which contradicts the assumed listen=false direction for guest→host, and that is only verifiable on KVM. The liveness-bounded heartbeat achieves the same correctness without guessing cross-process vsock semantics.

Supersedes the interim 30s fix (PR #14).
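The "bind early, serve late" split and the bounded heartbeat read can be sketched as follows. This is a minimal illustration, not the PR's code: it uses a loopback `TcpListener` as a stand-in for the vsock listener, and the names `bind_exec`, `serve_exec`, `connect_before_serve` and the `ping`/`pong` payload are hypothetical. It shows that a connect made before the accept loop exists is queued in the backlog, and that the reply only arrives once the accept loop is started.

```rust
use std::io::{Read, Write};
use std::net::{TcpListener, TcpStream};
use std::thread;
use std::time::Duration;

/// Bind phase: socket/bind/listen only. No thread is created here.
/// (Assumption: loopback TCP stands in for the guest's vsock listener.)
fn bind_exec() -> std::io::Result<TcpListener> {
    // Port 0 lets the OS pick a free port.
    TcpListener::bind("127.0.0.1:0")
}

/// Serve phase: the accept loop, meant to run on its own thread later.
/// Handles `max_conns` connections, answering "ping" with "pong".
fn serve_exec(listener: TcpListener, max_conns: usize) {
    for stream in listener.incoming().take(max_conns) {
        if let Ok(mut s) = stream {
            let mut buf = [0u8; 4];
            if s.read_exact(&mut buf).is_ok() && &buf == b"ping" {
                let _ = s.write_all(b"pong");
            }
        }
    }
}

/// Connects BEFORE the accept loop runs, then starts the loop and reads
/// the heartbeat reply with a bounded read.
fn connect_before_serve() -> std::io::Result<bool> {
    let listener = bind_exec()?;
    let addr = listener.local_addr()?;

    // No accept loop exists yet, but the listener is bound, so the
    // connection is queued in the backlog instead of being refused.
    let mut early = TcpStream::connect_timeout(&addr, Duration::from_secs(2))?;
    // Bound the read: without a timeout it would block until serve runs.
    early.set_read_timeout(Some(Duration::from_secs(5)))?;
    early.write_all(b"ping")?;

    // "Serve late": only now does the accept loop start.
    let server = thread::spawn(move || serve_exec(listener, 1));

    let mut reply = [0u8; 4];
    early.read_exact(&mut reply)?;
    server.join().expect("serve thread panicked");
    Ok(&reply == b"pong")
}

fn main() {
    let ok = connect_before_serve().expect("demo failed");
    println!("heartbeat ok: {}", ok);
}
```

The real readiness wait additionally checks VM liveness between attempts and keeps an absolute cap as a backstop; those parts are omitted here.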
…exit codes

guest-init runs as PID 1 but only waited on the container pid, so reparented grandchildren and the sidecar were never reaped and accumulated as zombies for the VM's lifetime.

The earlier code couldn't just waitpid(-1): that races with the exec/PTY handlers, which waitpid their own children to read the real exit code. A stolen child makes the handler see ECHILD and report a bogus exit 0 (exec_server.rs). That tension is exactly why a prior fix narrowed the loop to waitpid(container_pid), trading the zombie leak for correct exec codes.

Resolve both with a small reaper registry:

- New `reaper` module: handlers mark their child pid MANAGED across the spawn (the lock is held across fork, closing the spawn/register race for fast-exiting commands like `exec -- false`); an RAII guard unregisters on every return path.
- The supervision loop now peeks exited children non-destructively with `waitid(WNOWAIT)` and routes:
  - the container -> reap + propagate exit code (VM lifecycle, unchanged);
  - MANAGED children -> left for their handler to reap (real exit codes preserved);
  - everything else (orphans + sidecar) -> reaped here.
- exec one-shot + streaming spawns and the PTY fork register their children; their existing waitpid/try_wait paths are unchanged.

Fixes the zombie leak and makes the long-standing "reaped by the zombie-reaper loop" comments true again, with no regression to exec/PTY or container exit codes.

Unit-tested (reaper registry); needs KVM verification of exec exit codes + orphan reaping. Builds on P1+P2 (issue #3).
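The registry and routing rule described in this commit can be sketched roughly as below. All names here (`Registry`, `ManagedGuard`, `Route`, `spawn_managed`, `route`) are hypothetical and are not taken from the `reaper` module. The sketch covers only the bookkeeping: the actual peek with `waitid(WNOWAIT)` and the reaping calls are left out, and the spawn is modeled as a closure returning a pid.

```rust
use std::collections::HashSet;
use std::sync::{Arc, Mutex};

/// What the supervision loop should do with an exited child it peeked.
#[derive(Debug, PartialEq, Eq)]
enum Route {
    Container,       // reap and propagate the exit code
    LeaveForHandler, // MANAGED: the exec/PTY handler reaps it
    ReapHere,        // orphan or sidecar: reap in the loop
}

/// Shared set of pids that are owned by an exec/PTY handler.
#[derive(Clone, Default)]
struct Registry {
    managed: Arc<Mutex<HashSet<i32>>>,
}

/// RAII guard: unregisters the pid on every return path of the handler.
struct ManagedGuard {
    registry: Registry,
    pid: i32,
}

impl Registry {
    /// Runs `spawn` while holding the lock, so the supervision loop cannot
    /// classify the child before it has been registered as MANAGED.
    fn spawn_managed<F: FnOnce() -> i32>(&self, spawn: F) -> ManagedGuard {
        let mut set = self.managed.lock().unwrap();
        let pid = spawn();
        set.insert(pid);
        ManagedGuard { registry: self.clone(), pid }
    }

    /// Classifies an exited child; takes the same lock as `spawn_managed`.
    fn route(&self, pid: i32, container_pid: i32) -> Route {
        if pid == container_pid {
            Route::Container
        } else if self.managed.lock().unwrap().contains(&pid) {
            Route::LeaveForHandler
        } else {
            Route::ReapHere
        }
    }
}

impl Drop for ManagedGuard {
    fn drop(&mut self) {
        self.registry.managed.lock().unwrap().remove(&self.pid);
    }
}

fn main() {
    let reg = Registry::default();
    let container_pid = 1;
    // Pretend a handler spawned child 101 (a fake pid for illustration).
    let guard = reg.spawn_managed(|| 101);
    println!("{:?}", reg.route(101, container_pid));
    println!("{:?}", reg.route(7, container_pid));
    drop(guard);
    println!("{:?}", reg.route(101, container_pid));
}
```

Holding the lock across the spawn is the point of the design: a fast-exiting child cannot be seen as unmanaged, because `route` blocks on the same lock until registration is complete.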
Adversarial mapping of the #15+#18 base resolved both crux uncertainties: console logs come free via process-wide fd inheritance (Stdio::inherit, not the exec path's piped), and the multi-threaded fork hazard is avoided by spawning the deferred main via Command::spawn (not spawn_isolated's raw fork; the VM already isolates). Conditions: single spawn-main (CAS) + atomic late container-pid handoff to the reaper. Includes risk-ranked blockers + a 7-phase plan whose Phase 0 is a single KVM prototype that de-risks the whole feature.
e8b1824 to f1db2cf
Design-only (no implementation yet) for P2: give a pooled sandbox (#18) FULL `box` semantics by making the command the VM's real container main (exit code via `/.a3s_exit_code`, stdout/stderr in the `json-file` logs), instead of the keepalive+exec MVP's exec-stream output. Stacks on #15 (reaper/readiness) + #18 (pool).

Produced by an adversarial mapping workflow over the real #15+#18 base (5 parallel readers → synthesis). Verdict: GO-WITH-CONDITIONS. Both crux uncertainties are resolved:

- Console logs reach `container.json` only because `spawn_isolated` leaves stdout/stderr at `Stdio::inherit`. A deferred main spawned with `Stdio::inherit` (not the exec path's `Stdio::piped`) inherits the same fds → logs work. No fd-stashing.
- The multi-threaded fork hazard is avoided by spawning the deferred main via `Command::spawn()` (safe, async-signal-safe-only child), NOT `spawn_isolated`'s raw `fork()` (heavy allocating child → deadlock risk in multi-threaded PID 1). The VM already isolates, so no `unshare()`.

Conditions: single spawn-main (CAS on container-pid) + atomic late container-pid handoff to the reaper (the issue-#3 class). Full design, risk-ranked blockers, and a 7-phase plan are in `docs/p2-deferred-main-spawn-design.md`.

Phase 0 is a single KVM prototype that simultaneously proves no fork deadlock + console-log inheritance + exit-code propagation. It de-risks the whole feature before any non-throwaway code lands.
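One way the "single spawn-main (CAS on container-pid)" condition could look is sketched below. This is an assumption-laden illustration, not the design document's code: `ContainerPid`, `try_claim`, `publish`, `spawn_main` and the sentinel values are hypothetical. The child is started with `std::process::Command` and `Stdio::inherit`, so it writes to the same stdout/stderr fds as the parent.

```rust
use std::process::{Child, Command, Stdio};
use std::sync::atomic::{AtomicI32, Ordering};

const UNSET: i32 = 0; // no main spawned yet
const CLAIMED: i32 = -1; // a spawner won the race, pid not yet known

/// Hypothetical shared slot holding the container-main pid.
struct ContainerPid(AtomicI32);

impl ContainerPid {
    fn new() -> Self {
        Self(AtomicI32::new(UNSET))
    }

    /// CAS UNSET -> CLAIMED: exactly one caller can win.
    fn try_claim(&self) -> bool {
        self.0
            .compare_exchange(UNSET, CLAIMED, Ordering::AcqRel, Ordering::Acquire)
            .is_ok()
    }

    /// Late handoff: publish the real pid for the reaper to read.
    fn publish(&self, pid: i32) {
        self.0.store(pid, Ordering::Release);
    }

    /// Returns the pid only once a real one has been published.
    fn get(&self) -> Option<i32> {
        let v = self.0.load(Ordering::Acquire);
        if v > 0 { Some(v) } else { None }
    }
}

/// Spawns the deferred main at most once. Returns Ok(None) if another
/// caller already claimed the slot.
fn spawn_main(slot: &ContainerPid, program: &str, args: &[&str]) -> std::io::Result<Option<Child>> {
    if !slot.try_claim() {
        return Ok(None);
    }
    let spawned = Command::new(program)
        .args(args)
        .stdin(Stdio::null())
        .stdout(Stdio::inherit()) // same fds as the parent, so console logs flow
        .stderr(Stdio::inherit())
        .spawn();
    match spawned {
        Ok(child) => {
            slot.publish(child.id() as i32);
            Ok(Some(child))
        }
        Err(e) => {
            // Release the claim so a later attempt can retry.
            slot.0.store(UNSET, Ordering::Release);
            Err(e)
        }
    }
}

fn main() -> std::io::Result<()> {
    let slot = ContainerPid::new();
    if let Some(mut child) = spawn_main(&slot, "sh", &["-c", "exit 3"])? {
        println!("published pid: {:?}", slot.get());
        println!("exit status: {}", child.wait()?);
    }
    // A second attempt loses the CAS and does not spawn.
    assert!(spawn_main(&slot, "sh", &["-c", "exit 0"])?.is_none());
    Ok(())
}
```

Note that this sketch leaves a window between `spawn()` returning and `publish()` in which a reaper could observe the child without knowing it is the container. Closing that window is what the "atomic late container-pid handoff" condition refers to, and it is not solved here.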