118 changes: 118 additions & 0 deletions docs/p2-deferred-main-spawn-design.md
@@ -0,0 +1,118 @@
# Design: P2 — Deferred-Main-Spawn (full box semantics for pooled sandboxes)

Status: **GO-WITH-CONDITIONS** (design + prototype-first). Builds on
`refactor/init-readiness` (PR #15: early-bind + event-driven readiness + PID1
reaper) and `feat/p1-template-pool` (PR #18: the warm-sandbox pool controller).
Derived from an adversarial mapping of the real #15+#18 base.

## 1. Goal

The pool MVP (PR #18) runs a command in a warm VM via the **exec stream**, so its
output comes back over the exec protocol, **not** the json-file `logs`. P2 gives a
pooled sandbox **full `box` semantics** — the command becomes the VM's real
**container main**, so its exit code flows through the normal `<box>/upper/.a3s_exit_code`
path and its stdout/stderr land in `<box>/logs/container.json` exactly like a
normal `box run`. The VM still skips cold boot (it was pre-warmed), but now behaves
like a first-class box.

## 2. Verdict & the two crux realizations

**GO-WITH-CONDITIONS** — both hard problems are tractable on the #15 base:

1. **Console logs are "free" via process-wide fd inheritance.** The shim wires the
libkrun split console at boot (`shim/main.rs` `add_split_console`, fds kept alive
via `mem::forget`); inside the guest, PID 1 holds fds 1/2 = `/dev/console`.
guest-init routes *its own* logs to `/dev/kmsg` to keep the console clean. The
boot main reaches `container.json` today **only** because `namespace::spawn_isolated`
leaves stdout/stderr at the default `Stdio::inherit`. So a deferred main spawned
with `Stdio::inherit` inherits PID 1's console fds and its output flows to
`console.log`/`console.err.log` → the log processor tags it into `container.json`.
**No fd-stashing, dup-to-100, or env-passing of fd numbers is required** — there
is one shared fd table. (The exec path's `Stdio::piped()` is exactly why exec
output does *not* reach the logs today.)

2. **The multi-threaded fork hazard is avoidable.** Do **not** spawn the deferred
main via `spawn_isolated`'s raw `fork()` — its child runs heavy allocating code
(tracing, `fs::metadata`, user resolution) before exec and can deadlock on an
allocator/tracing lock held by another thread of the (deeply multi-threaded)
PID 1. Instead spawn via `std::process::Command::spawn()` — the same clone/exec
primitive the exec server already uses safely — whose child runs **only** the
registered async-signal-safe `pre_exec` hook before `execvp`. The VM itself
provides isolation, so the deferred main needs **no** `unshare()`/PID-namespace,
removing the second fork entirely.
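
The stdio distinction behind realization 1 can be illustrated with plain `std::process::Command`. This is a minimal sketch, not the guest-init code: `run_piped` and `run_inherited` are hypothetical names, and `/bin/sh` is assumed to exist. With `Stdio::piped()` the child's bytes end up in buffers owned by the caller; with `Stdio::inherit()` the child writes to whatever fds 1/2 the parent holds, which for PID 1 in the guest would be the console.

```rust
use std::io;
use std::process::{Command, ExitStatus, Stdio};

/// Exec-path style: stdout/stderr are pipes owned by the parent, so the
/// child's output never reaches the parent's own fd 1/2.
/// (Hypothetical helper, for illustration only.)
pub fn run_piped(script: &str) -> io::Result<(Vec<u8>, Vec<u8>, ExitStatus)> {
    let out = Command::new("/bin/sh")
        .arg("-c")
        .arg(script)
        .stdin(Stdio::null())
        .stdout(Stdio::piped())
        .stderr(Stdio::piped())
        .output()?;
    Ok((out.stdout, out.stderr, out.status))
}

/// Deferred-main style: the child shares the parent's fd 1/2, so its output
/// goes wherever the parent's output goes. Nothing is captured here.
/// (Hypothetical helper, for illustration only.)
pub fn run_inherited(script: &str) -> io::Result<ExitStatus> {
    Command::new("/bin/sh")
        .arg("-c")
        .arg(script)
        .stdin(Stdio::null())
        .stdout(Stdio::inherit())
        .stderr(Stdio::inherit())
        .status()
}
```

In the piped variant the caller must forward the captured bytes itself, which is what the exec protocol does; in the inherited variant there is nothing to forward, which is why the design expects the output to reach the console log files without extra plumbing.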

**Conditions:** exactly one spawn-main (CAS on the container-pid sentinel); the
late container-pid is handed to the reaper atomically (register MANAGED → publish
→ drop guard) so it can't be reaped as an orphan (the issue-#3 class of bug).
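
The "exactly one spawn-main" condition could be enforced with a compare-and-swap on the container-pid cell. The sketch below is an assumption-laden illustration: the type `ContainerPid`, the `MAIN_PENDING` value `-2`, and the rollback in `release_claim` are all hypothetical; only the `-1` sentinel and the `-1 → pending → pid` progression come from this design.

```rust
use std::sync::atomic::{AtomicI32, Ordering};

/// Sentinel from the design: no container main has been spawned yet.
const NO_MAIN: i32 = -1;
/// Hypothetical marker: a spawn-main request holds the slot but has no pid yet.
const MAIN_PENDING: i32 = -2;

/// Hypothetical wrapper around the container-pid cell.
pub struct ContainerPid(AtomicI32);

impl ContainerPid {
    pub const fn new() -> Self {
        Self(AtomicI32::new(NO_MAIN))
    }

    /// Step 1: claim the slot. Only the first caller moves -1 -> pending.
    pub fn try_claim(&self) -> bool {
        self.0
            .compare_exchange(NO_MAIN, MAIN_PENDING, Ordering::AcqRel, Ordering::Acquire)
            .is_ok()
    }

    /// Step 2: publish the real pid. Succeeds only for the claim holder.
    pub fn publish(&self, pid: i32) -> bool {
        self.0
            .compare_exchange(MAIN_PENDING, pid, Ordering::AcqRel, Ordering::Acquire)
            .is_ok()
    }

    /// Assumption, not stated in the design: a failed spawn gives the slot back.
    pub fn release_claim(&self) -> bool {
        self.0
            .compare_exchange(MAIN_PENDING, NO_MAIN, Ordering::AcqRel, Ordering::Acquire)
            .is_ok()
    }

    /// The published pid, or `None` while the slot is idle or pending.
    pub fn get(&self) -> Option<i32> {
        let value = self.0.load(Ordering::Acquire);
        if value > 0 { Some(value) } else { None }
    }
}
```

A second concurrent spawn-main would see `try_claim()` return `false` and could answer "main already spawned" without touching the slot.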

## 3. Design (end-to-end)

Deferred-main is a **replacement** for the keepalive (it IS the VM's main), not a
companion.

- **Boot IDLE** — `BOX_DEFERRED_MAIN=1` makes `run_init` skip `spawn_isolated`;
the container-pid stays at the sentinel `-1` and PID 1 stays alive (the existing
`wait_for_children` loop already keeps running). The early-bind (Step 2.6) and accept-loop
(Step 8) are **unchanged**, so host readiness passes while the VM is IDLE: the heartbeat handler
is a pure protocol handshake with **no** container-pid dependency, so
`BoxState::Ready` already means "exec server live", which is the de-facto contract
today.
- **spawn-main control frame** — the host sends `spawn-main:<json ExecRequest>` on
the exec vsock. The handler funnels the request through a **safe** spawn path:
`std::process::Command` + `Command::spawn()` under `reaper::spawn_managed`, with a
`build_command` variant that leaves stdout/stderr at `Stdio::inherit()` (so the
child inherits PID 1's console fds). Security (seccomp, user resolution, binary
stat) is built in the **parent** before spawn, mirroring `apply_security_before_exec`.
- **Reaper / exit-code handoff** — make the supervision loop's `container_pid` an
`AtomicI32` that is read on every tick. The ordering is the crux: `spawn_managed` (holding
the registry lock across the spawn closes the fork/registration race) → `set_container_pid(pid)` **while still
MANAGED** → drop the guard. The `is_managed` branch covers the pre-publish window;
the `pid == container_pid` branch then reaps the main, persists `/.a3s_exit_code`
(overlay upper), and calls `process::exit(code)`, which halts the VM. The handler replies
`spawn-main-ack` only **after** a successful spawn+publish (so a fork failure is
reported, not lost).
- **Pool integration (#18)** — add `Request::SpawnMain` to `pool.rs`; the daemon
sends spawn-main instead of `vm.exec_command`, waits for VM exit (the existing
teardown owns lifecycle), and reads exit code from `<box>/upper/.a3s_exit_code`
and logs from `<box>/logs/container.json`. `Request::Run` stays for back-compat.
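
The handoff order (register MANAGED → publish → drop guard) can be modeled without real processes. The following is a simplified model under stated assumptions, not the actual `reaper` module: the `Reaper` struct, `Verdict` enum, and `classify` method are hypothetical, and the spawn closure returns a bare pid instead of a `Child`.

```rust
use std::collections::BTreeSet;
use std::sync::atomic::{AtomicI32, Ordering};
use std::sync::Mutex;

/// Hypothetical model of the reaper state: managed pids plus the container pid.
pub struct Reaper {
    managed: Mutex<BTreeSet<i32>>,
    container_pid: AtomicI32,
}

/// Unregisters its pid from the managed set when dropped.
pub struct ManagedGuard<'a> {
    reaper: &'a Reaper,
    pid: i32,
}

impl ManagedGuard<'_> {
    pub fn pid(&self) -> i32 {
        self.pid
    }
}

impl Drop for ManagedGuard<'_> {
    fn drop(&mut self) {
        self.reaper.managed.lock().unwrap().remove(&self.pid);
    }
}

/// What the supervision loop would do with a pid it observes.
#[derive(Debug, PartialEq, Eq)]
pub enum Verdict {
    LeaveToOwner,
    ContainerMain,
    Orphan,
}

impl Reaper {
    pub fn new() -> Self {
        Self { managed: Mutex::new(BTreeSet::new()), container_pid: AtomicI32::new(-1) }
    }

    /// Holds the lock across the spawn, so the loop cannot observe the pid
    /// before it is registered as managed.
    pub fn spawn_managed<E>(
        &self,
        spawn: impl FnOnce() -> Result<i32, E>,
    ) -> Result<ManagedGuard<'_>, E> {
        let mut managed = self.managed.lock().unwrap();
        let pid = spawn()?;
        managed.insert(pid);
        Ok(ManagedGuard { reaper: self, pid })
    }

    pub fn set_container_pid(&self, pid: i32) {
        self.container_pid.store(pid, Ordering::Release);
    }

    /// Managed wins over container-pid, which covers the pre-publish window.
    pub fn classify(&self, pid: i32) -> Verdict {
        if self.managed.lock().unwrap().contains(&pid) {
            Verdict::LeaveToOwner
        } else if pid == self.container_pid.load(Ordering::Acquire) {
            Verdict::ContainerMain
        } else {
            Verdict::Orphan
        }
    }
}
```

In this model there is no instant at which the new pid classifies as `Orphan`: it is managed from the moment the spawn returns, and it is already published as the container pid by the time the guard drops.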

## 4. Risk-ranked blockers (with mitigations)

| sev | blocker | mitigation |
|-----|---------|-----------|
| HIGH | multi-threaded fork deadlock (raw `fork()` + allocating child) | spawn via `Command::spawn()`, not `spawn_isolated`; build seccomp/user/stat in the parent; no `unshare()` (VM isolates) |
| HIGH | late container-pid race → reaped as orphan, exit code lost (issue-#3 class) | `AtomicI32` container-pid read each tick; `spawn_managed` → publish-while-MANAGED → drop guard; guest unit test: spawn-main an immediate-exit cmd, assert the real exit code is recorded rather than 0 |
| HIGH | console logs broken if deferred main reuses the exec path's `Stdio::piped()` | `build_command` variant with `Stdio::inherit()` stdout/stderr; integration test: spawn-main `echo`, assert the line appears stream-tagged in `container.json` |
| MED | readiness-contract drift (Ready = "no container yet") reopens connection-refused races | scope P2 to the **pool path only** (daemon explicitly drives spawn-main then waits); leave normal `box run` on eager boot-spawn; IDLE-Ready is pool-internal |
| MED | multiple spawn-main frames → two mains race to set container-pid / write exit code | CAS container-pid `-1 → pending → pid`; a second concurrent spawn-main gets "main already spawned" |
| LOW | PTY-server `pre_exec` uses `set_var` (not async-signal-safe) | base the deferred spawner on the **exec** path (`Command::spawn`), never the PTY raw-fork path |

## 5. Phased plan (smallest verifiable steps)

0. **PROTOTYPE (throwaway, KVM)** — from inside an exec-server connection thread
(i.e. multi-threaded PID 1), `Command::spawn` `/bin/sh -c 'echo OUT; echo ERR 1>&2;
exit 7'` with `Stdio::inherit`. Verify **at once**: (a) no fork deadlock under
concurrent exec load, (b) OUT/ERR land in `container.json` correctly stream-tagged,
(c) exit 7 propagates via `/.a3s_exit_code`. This one prototype de-risks the whole
feature. **Must run on real KVM** (fork/allocator-lock + virtio-console don't
reproduce on the macOS dev stub).
1. IDLE boot behind `BOX_DEFERRED_MAIN=1`; assert host readiness still reaches Ready
with no container.
2. Convert supervision-loop `container_pid` to `AtomicI32` (set once at boot as
today, so no behavior change); regression-test that the eager boot main is still reaped correctly.
3. Safe deferred spawner: `Stdio::inherit` `build_command` variant under
`spawn_managed`; CAS sentinel→pending→pid; publish-before-drop; ack on success.
4. Host spawn-main control frame: `send_spawn_main_control_frame` +
`EXEC_CONTROL_SPAWN_MAIN`; `wait_main_exit` (poll `/.a3s_exit_code`) +
`collect_logs` (read `container.json`).
5. Pool integration: `Request::SpawnMain` in `pool.rs`; full e2e — pool spawn-main a
real image entrypoint, assert exit code + json-file logs + clean teardown.
6. Full KVM matrix: issue-#3 before/after readiness, fast-exit (`false`) exit-code,
large-output log flushing, concurrent spawn-main rejection. Docs per repo rule.
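
Step 4's `wait_main_exit` is described only as polling `/.a3s_exit_code`. One possible shape is sketched below under explicit assumptions: the signature, the timeout and poll-interval parameters, and the file format (a decimal integer, optionally followed by a newline) are all hypothetical and not confirmed by this design.

```rust
use std::fs;
use std::io;
use std::path::Path;
use std::thread;
use std::time::{Duration, Instant};

/// Poll `exit_file` until it holds a parseable exit code or `timeout` elapses.
/// Returns `Ok(None)` on timeout. The decimal-text file format is an assumption.
pub fn wait_main_exit(
    exit_file: &Path,
    timeout: Duration,
    poll_interval: Duration,
) -> io::Result<Option<i32>> {
    let deadline = Instant::now() + timeout;
    loop {
        match fs::read_to_string(exit_file) {
            Ok(text) => {
                if let Ok(code) = text.trim().parse::<i32>() {
                    return Ok(Some(code));
                }
                // Empty or partially written file: treat as "not ready yet".
            }
            // The main has not exited yet, so the file does not exist.
            Err(e) if e.kind() == io::ErrorKind::NotFound => {}
            Err(e) => return Err(e),
        }
        if Instant::now() >= deadline {
            return Ok(None);
        }
        thread::sleep(poll_interval);
    }
}
```

A caller in the pool daemon would pass the box's `upper/.a3s_exit_code` path and treat `Ok(None)` as a teardown-with-unknown-exit case; how that case should be reported is left open here.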

## 6. Prototype-first

The single experiment that de-risks everything: a multi-threaded spawn-main that
simultaneously proves **(a)** safe fork via `Command::spawn()` under concurrent exec
load (no deadlock), **(b)** inherited-stdio output → `container.json` correctly
stream-tagged, **(c)** real exit code in `/.a3s_exit_code`. On real KVM only.
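
A host-runnable approximation of that experiment is sketched below. It only exercises the shape of the test (an inherited-stdio spawn while other threads spawn piped children); it cannot stand in for the KVM run, because the allocator-lock and virtio-console behavior are exactly what the design says do not reproduce off-target. The function name and parameters are hypothetical, and `/bin/sh` is assumed to exist.

```rust
use std::io;
use std::process::{Command, Stdio};
use std::thread;

/// Spawn an inherited-stdio "main" while `load_threads` threads each run
/// `execs_per_thread` piped children. Returns (successful load execs, main exit code).
/// (Hypothetical harness, for illustration only.)
pub fn prototype_spawn_under_load(
    load_threads: usize,
    execs_per_thread: usize,
) -> io::Result<(usize, Option<i32>)> {
    // Concurrent exec-style load: piped children spawned from worker threads.
    let workers: Vec<_> = (0..load_threads)
        .map(|_| {
            thread::spawn(move || {
                (0..execs_per_thread)
                    .filter(|_| {
                        Command::new("/bin/sh")
                            .args(["-c", "echo load"])
                            .stdin(Stdio::null())
                            .output()
                            .map(|out| out.status.success())
                            .unwrap_or(false)
                    })
                    .count()
            })
        })
        .collect();

    // The "deferred main": inherits the parent's stdout/stderr and exits 7.
    let status = Command::new("/bin/sh")
        .args(["-c", "echo OUT; echo ERR 1>&2; exit 7"])
        .stdin(Stdio::null())
        .stdout(Stdio::inherit())
        .stderr(Stdio::inherit())
        .status()?;

    let mut succeeded = 0;
    for worker in workers {
        succeeded += worker.join().expect("load thread panicked");
    }
    Ok((succeeded, status.code()))
}
```

On the real target, the harness would additionally check `container.json` for the tagged `OUT`/`ERR` lines and `/.a3s_exit_code` for the exit code; those checks are omitted because their formats are not specified here.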
110 changes: 76 additions & 34 deletions src/guest/init/src/exec_server.rs
@@ -67,52 +67,88 @@ const EXEC_CONTROL_FLUSH: &[u8] = b"flush";
/// match the host's `EXEC_FLUSH_ACK` in `runtime/src/grpc/exec.rs`.
const EXEC_FLUSH_ACK: &[u8] = b"flush-ack";

/// Run the exec server, listening on vsock port 4089.
/// A bound, listening exec-server socket — produced by [`bind_exec_server`] and
/// consumed by [`serve_exec_server`].
///
/// On Linux, binds to `AF_VSOCK` with `VMADDR_CID_ANY`.
/// On non-Linux platforms, this is a no-op (development stub).
pub fn run_exec_server() -> Result<(), Box<dyn std::error::Error>> {
info!("Starting exec server on vsock port {}", EXEC_VSOCK_PORT);
/// Splitting bind from serve lets guest-init bind the exec vsock port EARLY on
/// the main thread (pure socket/bind/listen syscalls — no thread spawn, so the
/// later single-threaded container `fork()` stays fork-safe) while the accept
/// loop runs afterwards in its own thread. Binding early fills the listen
/// backlog from the start of boot, so a host connect QUEUES instead of being
/// refused while the slower boot steps (network, container spawn) finish — this
/// removes the "Connection refused" / heartbeat race of issue #3. On non-Linux
/// this is an inert placeholder so callers stay platform-agnostic.
#[cfg(target_os = "linux")]
pub struct ExecListener(std::os::fd::OwnedFd);
#[cfg(not(target_os = "linux"))]
pub struct ExecListener;

/// Bind + listen the exec vsock socket (port 4089). Pure socket syscalls, safe
/// to call on the main thread before the container fork.
pub fn bind_exec_server() -> Result<ExecListener, Box<dyn std::error::Error>> {
#[cfg(target_os = "linux")]
{
run_vsock_server()?;
use nix::sys::socket::{
bind, listen, socket, AddressFamily, Backlog, SockFlag, SockType, VsockAddr,
};
use std::os::fd::AsRawFd;

let sock_fd = socket(
AddressFamily::Vsock,
SockType::Stream,
SockFlag::empty(),
None,
)?;

// Set CLOEXEC manually since SOCK_CLOEXEC isn't available in nix 0.29 on
// macOS — and so the forked container never inherits the listening socket.
unsafe {
libc::fcntl(sock_fd.as_raw_fd(), libc::F_SETFD, libc::FD_CLOEXEC);
}

let addr = VsockAddr::new(libc::VMADDR_CID_ANY, EXEC_VSOCK_PORT);
bind(sock_fd.as_raw_fd(), &addr)?;
listen(&sock_fd, Backlog::new(4)?)?;

info!("Exec server listening on vsock port {}", EXEC_VSOCK_PORT);
Ok(ExecListener(sock_fd))
}

#[cfg(not(target_os = "linux"))]
{
info!("Exec server not available on non-Linux platform (development mode)");
Ok(ExecListener)
}

Ok(())
}

/// Linux vsock server implementation.
#[cfg(target_os = "linux")]
fn run_vsock_server() -> Result<(), Box<dyn std::error::Error>> {
use nix::sys::socket::{
accept, bind, listen, socket, AddressFamily, Backlog, SockFlag, SockType, VsockAddr,
};
use std::os::fd::{AsRawFd, FromRawFd, OwnedFd};
use tracing::error;

let sock_fd = socket(
AddressFamily::Vsock,
SockType::Stream,
SockFlag::empty(),
None,
)?;
/// Run the exec accept loop on an already-bound listener. Intended to run on its
/// own thread for the VM's lifetime; never returns under normal operation.
pub fn serve_exec_server(listener: ExecListener) -> Result<(), Box<dyn std::error::Error>> {
#[cfg(target_os = "linux")]
{
run_accept_loop(listener.0)
}

// Set CLOEXEC manually since SOCK_CLOEXEC isn't available in nix 0.29 on macOS
unsafe {
libc::fcntl(sock_fd.as_raw_fd(), libc::F_SETFD, libc::FD_CLOEXEC);
#[cfg(not(target_os = "linux"))]
{
let _ = listener;
Ok(())
}
}

let addr = VsockAddr::new(libc::VMADDR_CID_ANY, EXEC_VSOCK_PORT);
bind(sock_fd.as_raw_fd(), &addr)?;
listen(&sock_fd, Backlog::new(4)?)?;
/// Bind then serve in one call. Kept for callers that don't need the early-bind
/// split (e.g. tests); guest-init's boot path uses `bind_*` + `serve_*` directly.
pub fn run_exec_server() -> Result<(), Box<dyn std::error::Error>> {
info!("Starting exec server on vsock port {}", EXEC_VSOCK_PORT);
serve_exec_server(bind_exec_server()?)
}

info!("Exec server listening on vsock port {}", EXEC_VSOCK_PORT);
/// The exec server accept loop.
#[cfg(target_os = "linux")]
fn run_accept_loop(sock_fd: std::os::fd::OwnedFd) -> Result<(), Box<dyn std::error::Error>> {
use nix::sys::socket::accept;
use std::os::fd::{AsRawFd, FromRawFd, OwnedFd};
use tracing::error;

loop {
match accept(sock_fd.as_raw_fd()) {
@@ -715,8 +751,12 @@ fn execute_command(
Err(output) => return output,
};

let mut child = match command.spawn() {
Ok(child) => child,
// Spawn under the reaper registry: the pid is marked MANAGED before the PID 1
// supervision loop can see it, so the loop leaves this child for us to reap
// (and read its real exit code) instead of stealing it. The guard unregisters
// the pid when this function returns (all paths).
let (mut child, _reap_guard) = match crate::reaper::spawn_managed(|| command.spawn()) {
Ok(pair) => pair,
Err(e) => {
return ExecOutput {
stdout: vec![],
@@ -1007,8 +1047,10 @@ fn execute_command_streaming(
}
};

let mut child = match command.spawn() {
Ok(child) => child,
// Spawn under the reaper registry (see one-shot path) so PID 1 leaves this
// streaming child for us to reap; the guard unregisters on return.
let (mut child, _reap_guard) = match crate::reaper::spawn_managed(|| command.spawn()) {
Ok(pair) => pair,
Err(e) => {
let output = ExecOutput {
stdout: vec![],
1 change: 1 addition & 0 deletions src/guest/init/src/lib.rs
@@ -14,6 +14,7 @@ pub mod namespace;
pub mod network;
pub mod port_forward;
pub mod pty_server;
pub mod reaper;
pub mod user;

pub use namespace::{spawn_isolated, NamespaceConfig, NamespaceError};