25 changes: 25 additions & 0 deletions CHANGELOG.md
@@ -4,6 +4,31 @@ All notable changes to A3S Box will be documented in this file.

## [Unreleased]

### Fixed
- **Slow-boot exec/PTY readiness race (`WARN Exec socket appeared but heartbeat
failed`, and `run -it` PTY `Connection refused`) — issue #3.** The guest binds
its exec (vsock 4089) and PTY (vsock 4090) servers only late in boot — after
the virtio-fs pivot, network bring-up, and the container spawn — and it cannot
start them earlier without forking the container while multi-threaded (which
would risk a deadlock in the forked child). On a cold first run on a slow or
loaded host that bind could land past the host's fixed **10 s** readiness
budget, producing a false "heartbeat failed" warning. The host now waits up to
**30 s** for the exec heartbeat in `wait_for_exec_ready`. This also fixes
`run -it`: boot blocks on that exec-readiness wait *before* attaching the PTY,
and the guest brings the exec and PTY servers up back-to-back, so once the exec
heartbeat passes the PTY server is already listening and the existing 10 s PTY
connect retry succeeds. The wait stays cheap for healthy boxes (it returns the
moment the heartbeat passes) and still bails out immediately when the VM exits,
so a fast-exiting container never stalls for the full budget. The two readiness
warnings were also corrected — exec/attach connect on demand, so a timed-out
probe no longer claims "exec will not be available". Note: this addresses the
*timing* race on an otherwise-healthy guest; a hard boot failure where the
guest never binds the server (e.g. `--network` bridge mode when guest eth0
setup fails) is a separate fault and will surface after the wait rather than be
masked.
- Guest-init's defensive `BOX_EXEC_EXEC` default is now `/bin/sh` instead of
  `/sbin/init`, which does not exist on Alpine; this matches the runtime's real
  fallback.

## [2.0.7] — 2026-06-06

### Added
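
The changelog entry above relies on an existing PTY connect retry that tolerates `Connection refused` while the guest server is still coming up. The diff does not show that retry, so the following is only a minimal sketch of the general pattern, under stated assumptions: `connect_with_retry`, its parameters, and the choice to retry only on `ConnectionRefused` are hypothetical and are not taken from the A3S Box source.

```rust
use std::io;
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical helper (not from the A3S Box source): call `connect`
/// repeatedly until it succeeds, fails with a non-retryable error, or the
/// time budget is used up.
///
/// Assumption: only `ConnectionRefused` means "server not listening yet";
/// every other error is returned to the caller immediately.
pub fn connect_with_retry<T, F>(
    budget: Duration,
    interval: Duration,
    mut connect: F,
) -> io::Result<T>
where
    F: FnMut() -> io::Result<T>,
{
    let deadline = Instant::now() + budget;
    loop {
        match connect() {
            Ok(conn) => return Ok(conn),
            Err(e) if e.kind() == io::ErrorKind::ConnectionRefused => {
                // The server may simply not be bound yet: wait and retry,
                // unless the budget is already exhausted.
                if Instant::now() >= deadline {
                    return Err(e);
                }
                thread::sleep(interval);
            }
            // Any other failure is treated as permanent.
            Err(e) => return Err(e),
        }
    }
}
```

With a 10 s budget, a call such as `connect_with_retry(Duration::from_secs(10), Duration::from_millis(200), || try_connect())` would keep trying while the listener is absent, where `try_connect` stands in for whatever connect call the runtime actually uses.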
7 changes: 5 additions & 2 deletions src/guest/init/src/main.rs
@@ -43,8 +43,11 @@ impl ExecConfig {
/// - BOX_EXEC_ENV_*: container environment variables
/// - BOX_EXEC_WORKDIR: working directory (defaults to "/")
fn from_env() -> Self {
- let executable =
- std::env::var("BOX_EXEC_EXEC").unwrap_or_else(|_| "/sbin/init".to_string());
+ // The runtime always sets BOX_EXEC_EXEC when guest-init is PID 1
+ // (runtime/src/vm/spec.rs), so this default is only a defensive fallback.
+ // Use /bin/sh — universal across distros — never /sbin/init, which does
+ // not exist on Alpine and was the original cause of issue #3.
+ let executable = std::env::var("BOX_EXEC_EXEC").unwrap_or_else(|_| "/bin/sh".to_string());

// Parse args from individual env vars (BOX_EXEC_ARGC + BOX_EXEC_ARG_0..N)
let args: Vec<String> = match std::env::var("BOX_EXEC_ARGC")
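
The hunk above is cut off before the argument parsing finishes, so the following is a hedged sketch of how the `BOX_EXEC_EXEC` default and the `BOX_EXEC_ARGC` + `BOX_EXEC_ARG_0..N` scheme could fit together. The function name `exec_from_lookup`, the injectable `lookup` closure, and the handling of a missing or unparseable `BOX_EXEC_ARGC` (treated as zero arguments) are assumptions made for illustration, not the actual guest-init code.

```rust
/// Hypothetical sketch (not the real `ExecConfig::from_env`): resolve the
/// executable and its arguments through a caller-supplied lookup function,
/// so the logic can be exercised without touching the process environment.
///
/// Assumptions: a missing or non-numeric BOX_EXEC_ARGC means "no arguments",
/// and a missing BOX_EXEC_ARG_i is skipped rather than treated as an error.
pub fn exec_from_lookup<F>(lookup: F) -> (String, Vec<String>)
where
    F: Fn(&str) -> Option<String>,
{
    // Defensive fallback only: the runtime is expected to set BOX_EXEC_EXEC.
    let executable = lookup("BOX_EXEC_EXEC").unwrap_or_else(|| "/bin/sh".to_string());

    // Number of BOX_EXEC_ARG_<i> variables to read.
    let argc = lookup("BOX_EXEC_ARGC")
        .and_then(|s| s.parse::<usize>().ok())
        .unwrap_or(0);

    // Collect BOX_EXEC_ARG_0 .. BOX_EXEC_ARG_{argc-1} in order.
    let args = (0..argc)
        .filter_map(|i| lookup(&format!("BOX_EXEC_ARG_{}", i)))
        .collect();

    (executable, args)
}
```

Against the real environment this would be called as `exec_from_lookup(|key| std::env::var(key).ok())`. Passing the lookup in as a closure is a design choice for testability only.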
28 changes: 22 additions & 6 deletions src/runtime/src/vm/ready.rs
@@ -41,7 +41,17 @@ impl VmManager {
&mut self,
exec_socket_path: &std::path::Path,
) -> Result<()> {
- const MAX_WAIT_MS: u64 = 10000;
+ // The guest binds the exec server only late in boot (after virtio-fs
+ // pivot, passt network bring-up, and the container spawn — guest-init
+ // cannot start it earlier without forking the container while
+ // multi-threaded, which is unsafe). A cold first run on a slow/loaded
+ // host can push that past the old 10s budget, which surfaced as a false
+ // "heartbeat failed" warning and, for `run -it`, a PTY connect that gave
+ // up before the server came up (issue #3). Wait longer — this is cheap
+ // for healthy boxes (they return as soon as the heartbeat passes) and the
+ // loop already bails out the moment the VM exits, so a fast-exiting
+ // container never stalls for the full budget.
+ const MAX_WAIT_MS: u64 = 30000;
const POLL_INTERVAL_MS: u64 = 200;

tracing::debug!(
@@ -54,7 +64,10 @@ impl VmManager {
// Phase 1: Wait for socket file to appear
loop {
if start.elapsed().as_millis() >= MAX_WAIT_MS as u128 {
- tracing::warn!("Exec socket did not appear, exec will not be available");
+ tracing::warn!(
+ timeout_ms = MAX_WAIT_MS,
+ "Exec socket did not appear within timeout; exec/attach will connect on demand if the guest exposes it"
+ );
return Ok(());
}

@@ -82,9 +95,9 @@ impl VmManager {
// container shuts the VM down. The shim becomes a zombie the moment
// the VM halts, so use has_exited (zombie-aware) rather than
// is_running — without this, a container that exits before its first
- // heartbeat stalls the whole boot for MAX_WAIT_MS (~10s), which hit
- // every short-lived `run` that lost the heartbeat race and every
- // monitor restart of a fast-exiting container.
+ // heartbeat stalls the whole boot for the full MAX_WAIT_MS budget,
+ // which hit every short-lived `run` that lost the heartbeat race and
+ // every monitor restart of a fast-exiting container.
if let Some(ref handler) = *self.handler.read().await {
if handler.has_exited() {
tracing::debug!("VM exited before exec server became ready");
@@ -114,7 +127,10 @@ impl VmManager {
tokio::time::sleep(tokio::time::Duration::from_millis(POLL_INTERVAL_MS)).await;
}

- tracing::warn!("Exec socket appeared but heartbeat failed, exec will not be available");
+ tracing::warn!(
+ timeout_ms = MAX_WAIT_MS,
+ "Exec server did not pass a heartbeat within timeout; exec/attach connect on demand and may still succeed once the guest finishes starting"
+ );
Ok(())
}
}
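
The readiness wait in `ready.rs` combines three exits: return as soon as the heartbeat passes, bail out as soon as the VM has exited, and give up once the budget elapses. The real function is async and uses `tokio::time::sleep`; the sketch below is a simplified synchronous version meant only to show the control flow. The names `Readiness` and `wait_for_ready`, the closure parameters, and the order of the two checks inside the loop are assumptions, not the actual `wait_for_exec_ready` implementation.

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical outcome type for the sketch below.
#[derive(Debug, PartialEq, Eq)]
pub enum Readiness {
    /// The heartbeat probe succeeded.
    Ready,
    /// The VM exited before the server became ready.
    VmExited,
    /// The budget elapsed without a successful heartbeat.
    TimedOut,
}

/// Simplified, blocking sketch of a bounded readiness wait.
///
/// `heartbeat_ok` and `vm_exited` are stand-ins for the real probes.
/// A healthy box returns on the first successful heartbeat, so raising
/// `max_wait` only costs time in the case that would otherwise have failed.
pub fn wait_for_ready<P, E>(
    max_wait: Duration,
    poll_interval: Duration,
    mut heartbeat_ok: P,
    mut vm_exited: E,
) -> Readiness
where
    P: FnMut() -> bool,
    E: FnMut() -> bool,
{
    let start = Instant::now();
    while start.elapsed() < max_wait {
        // Check for exit first so a fast-exiting container never waits
        // out the full budget.
        if vm_exited() {
            return Readiness::VmExited;
        }
        if heartbeat_ok() {
            return Readiness::Ready;
        }
        thread::sleep(poll_interval);
    }
    Readiness::TimedOut
}
```

With the values from the diff this would be called with `Duration::from_millis(30000)` and `Duration::from_millis(200)`. In the real code both a timeout and an early VM exit still return `Ok(())`, with the timeout path logging a warning.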
3 changes: 2 additions & 1 deletion src/runtime/src/vm/spec.rs
@@ -342,7 +342,8 @@ impl VmManager {
/// - If `entrypoint_override` is set, it replaces the OCI ENTRYPOINT
/// - If ENTRYPOINT is set: executable = ENTRYPOINT[0], args = ENTRYPOINT[1:] + CMD
/// - If only CMD is set: executable = CMD[0], args = CMD[1:]
- /// - If neither: fall back to `/sbin/init`
+ /// - If neither: fall back to `/bin/sh` (universal across distros; `/sbin/init`
+ /// does not exist on Alpine, which was the original cause of issue #3)
/// - If `cmd_override` is non-empty, it replaces the OCI CMD
///
/// Paths are used as-is since the OCI image is always extracted at rootfs root.
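
The resolution rules in the doc comment above can be sketched as a small pure function. The function body is not part of this diff, so everything here is an assumption made for illustration: the name `resolve_exec`, the parameter types, and in particular the choice to keep the image CMD when `entrypoint_override` is set (the doc comment does not say whether an override also discards CMD).

```rust
/// Hypothetical sketch of the ENTRYPOINT/CMD resolution described in the
/// doc comment. Returns `(executable, args)`.
///
/// Assumptions: overrides are applied before the ENTRYPOINT/CMD rules, an
/// empty `cmd_override` means "no override", and an entrypoint override
/// does not clear the image CMD.
pub fn resolve_exec(
    oci_entrypoint: &[String],
    oci_cmd: &[String],
    entrypoint_override: Option<&[String]>,
    cmd_override: &[String],
) -> (String, Vec<String>) {
    // An entrypoint override replaces the OCI ENTRYPOINT.
    let entrypoint = entrypoint_override.unwrap_or(oci_entrypoint);
    // A non-empty cmd override replaces the OCI CMD.
    let cmd = if cmd_override.is_empty() {
        oci_cmd
    } else {
        cmd_override
    };

    if let Some((exe, rest)) = entrypoint.split_first() {
        // ENTRYPOINT set: executable = ENTRYPOINT[0], args = ENTRYPOINT[1:] + CMD.
        let mut args = rest.to_vec();
        args.extend_from_slice(cmd);
        (exe.clone(), args)
    } else if let Some((exe, rest)) = cmd.split_first() {
        // Only CMD set: executable = CMD[0], args = CMD[1:].
        (exe.clone(), rest.to_vec())
    } else {
        // Neither set: fall back to /bin/sh.
        ("/bin/sh".to_string(), Vec::new())
    }
}
```

For an image with no ENTRYPOINT and no CMD, this sketch returns `/bin/sh` with no arguments, which is the fallback this change switches to.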