perf: CoW memory dedup (KSM) + boot-floor trim + reflink rootfs copy #16
Merged
This was referenced Jun 11, 2026
ZhiXiao-Lin pushed a commit that referenced this pull request on Jun 11, 2026
Design-only proposal: a second snapshot-capable backend (Firecracker/CH) behind the existing backend trait for high-fan-out agent sandboxes, since libkrun has no snapshot. Maps forkd (snapshot+UFFD+diff-chaining) and clone (shadow-clone+KSM) onto a3s-box, reusing the overlay rootfs CoW, warm pool, and — the key advantage — the existing guest vsock exec server as the post-fork re-injection hook. Phased plan: P0 KSM/boot-trim (done, #16) → P1 per-image template pool + re-inject handshake (doable on libkrun, de-risks) → P2 real snapshot-fork backend → P3 diff-chaining + balloon.
ZhiXiao-Lin added a commit that referenced this pull request on Jun 11, 2026, carrying the same message with the trailer Co-authored-by: Roy Lin <roylin@a3s.box>
added 3 commits
June 11, 2026 16:39
Mark the shim's anonymous memory — including libkrun's guest RAM — KSM-mergeable via prctl(PR_SET_MEMORY_MERGE) (Linux 6.4+) when A3S_BOX_KSM=1. With KSM enabled on the host, identical pages across same-image microVMs (kernel text, common runtime/libs) are deduplicated by ksmd, so N warm VMs of one image cost far less host RAM than N× their size — the memory-density half of a CoW-fork model, without any libkrun change. Best-effort: no-op when unset or on pre-6.4 kernels. Tier 2 of the CoW optimization plan (libkrun has no VM snapshot/fork; KSM is the VMM-agnostic, host-side path to clone-style '100 VMs ~ 10' memory density).
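The opt-in described in this commit can be sketched roughly as follows. This is a minimal illustration, not the shim's actual code: the function names are hypothetical, the constants 67 and 68 are the PR_SET_MEMORY_MERGE / PR_GET_MEMORY_MERGE values from linux/prctl.h, and the exact-match test on "1" is an assumption about how A3S_BOX_KSM is parsed. Linux only.

```rust
use std::io;
use std::os::raw::{c_int, c_ulong};

// Values from <linux/prctl.h>; both options were added in Linux 6.4.
const PR_SET_MEMORY_MERGE: c_int = 67;
const PR_GET_MEMORY_MERGE: c_int = 68;

unsafe extern "C" {
    // prctl(2) is variadic in libc; unused arguments are passed as 0.
    fn prctl(option: c_int, ...) -> c_int;
}

/// Hypothetical gate: treat only the exact value "1" as opting in.
pub fn ksm_requested(value: Option<&str>) -> bool {
    matches!(value, Some("1"))
}

/// Ask the kernel to make this process's eligible anonymous memory
/// (current and future mappings) a candidate for KSM merging.
pub fn enable_memory_merge() -> io::Result<()> {
    let (one, zero): (c_ulong, c_ulong) = (1, 0);
    let rc = unsafe { prctl(PR_SET_MEMORY_MERGE, one, zero, zero, zero) };
    if rc == 0 {
        Ok(())
    } else {
        // Pre-6.4 kernels reject the option; callers may ignore the error.
        Err(io::Error::last_os_error())
    }
}

/// Query whether the merge flag is currently set for this process.
pub fn memory_merge_enabled() -> io::Result<bool> {
    let zero: c_ulong = 0;
    match unsafe { prctl(PR_GET_MEMORY_MERGE, zero, zero, zero, zero) } {
        0 => Ok(false),
        1 => Ok(true),
        _ => Err(io::Error::last_os_error()),
    }
}

/// Best-effort wrapper: a no-op when the variable is unset, and a silent
/// failure (returns false) when the kernel does not support the option.
pub fn maybe_enable_ksm() -> bool {
    let var = std::env::var("A3S_BOX_KSM").ok();
    ksm_requested(var.as_deref()) && enable_memory_merge().is_ok()
}
```

Calling maybe_enable_ksm() early in the shim, before guest RAM is mapped, would be one plausible placement; the flag only has an effect once ksmd is running on the host.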
…lback Tier 1 of the CoW/boot-latency work.
- ready.rs wait_for_vm_running: the fixed 1000 ms "stabilize" sleep ran on EVERY boot before the readiness probe even started. Replace it with a 250 ms has_exited poll, which still catches an immediate launch failure (bad config makes libkrun exit in milliseconds) and fails loudly, but shaves ~750 ms off every boot. Later crashes are still caught by wait_for_exec_ready's has_exited checks.
- layer_cache.rs copy_dir_recursive: the CopyProvider fallback (used when overlayfs is unavailable) did a full per-file byte copy. Add copy_file_cow, which prefers a FICLONE reflink so a new box's rootfs shares blocks CoW with the cached image on btrfs/XFS(reflink)/bcachefs (instant, no extra disk), falling back to a byte copy on ext4/cross-device. Permissions preserved like fs::copy.
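The bounded early-exit poll from the first bullet might look something like the sketch below. The function name, the step interval, and the closure standing in for has_exited are assumptions for illustration; the real wait_for_vm_running is not shown in this PR page.

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Poll `has_exited` for at most `window`, sleeping `step` between checks.
/// Returns true as soon as an exit is observed, false if the process is
/// still alive when the window closes.
pub fn exited_within<F>(window: Duration, step: Duration, mut has_exited: F) -> bool
where
    F: FnMut() -> bool,
{
    let deadline = Instant::now() + window;
    loop {
        if has_exited() {
            return true; // immediate launch failure: report it loudly upstream
        }
        let now = Instant::now();
        if now >= deadline {
            return false; // survived the window; hand over to the readiness probe
        }
        // Never sleep past the deadline.
        thread::sleep(step.min(deadline - now));
    }
}
```

With a 250 ms window, a caller would treat a true result as a launch failure and otherwise proceed to the readiness probe, which keeps its own exit checks for later crashes.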
Tests the reflink-preferring copy helper on the fallback path (any FS): content and permission bits survive whether FICLONE reflinks or fs::copy is used, and the destination is truncated/overwritten.
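A reflink-preferring copy with the behaviour these tests describe could be sketched as below. This is not the PR's copy_file_cow: the name copy_file_cow_sketch is hypothetical, the FICLONE request number 0x40049409 is the value on x86_64 and aarch64 Linux (some architectures encode ioctl numbers differently), and falling back on any ioctl error is a simplification.

```rust
use std::fs::{self, File, OpenOptions};
use std::io;
use std::os::raw::{c_int, c_ulong};
use std::os::unix::io::AsRawFd;
use std::path::Path;

// FICLONE = _IOW(0x94, 9, int); this encoding assumes x86_64/aarch64 Linux.
const FICLONE: c_ulong = 0x4004_9409;

unsafe extern "C" {
    fn ioctl(fd: c_int, request: c_ulong, ...) -> c_int;
}

/// Copy `src` to `dst`, sharing blocks copy-on-write when the filesystem
/// supports it. Returns Ok(true) if a reflink was made, Ok(false) if the
/// byte-copy fallback was used.
pub fn copy_file_cow_sketch(src: &Path, dst: &Path) -> io::Result<bool> {
    let reflinked = {
        let src_file = File::open(src)?;
        // Truncate so an existing destination is fully replaced.
        let dst_file = OpenOptions::new()
            .write(true)
            .create(true)
            .truncate(true)
            .open(dst)?;
        let rc = unsafe { ioctl(dst_file.as_raw_fd(), FICLONE, src_file.as_raw_fd()) };
        rc == 0
    }; // both files are closed here, before any fallback copy

    if reflinked {
        // FICLONE shares data only; carry the permission bits over explicitly.
        fs::set_permissions(dst, fs::metadata(src)?.permissions())?;
        Ok(true)
    } else {
        // ext4, cross-device, or unsupported: fs::copy also copies permissions.
        fs::copy(src, dst)?;
        Ok(false)
    }
}
```

Either path leaves the destination with the source's content and permission bits, which is the property the commit's tests check regardless of the underlying filesystem.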
58e52b5 to 8cb8423
First, pragmatic wave of the CoW optimization plan for the AI-agent-sandbox use case (many short-lived microVMs). Inspired by forkd (Firecracker snapshot+UFFD fork) and clone (KVM "shadow clone" + KSM + balloon), but those need a snapshot-capable VMM, and libkrun has no snapshot/restore/fork (verified: 75 krun_* exports, 0 for snapshot; guest RAM is anonymous-private, not memfd; krun_start_enter takes over the process). So this wave delivers the wins that don't require forking libkrun. (True snapshot-fork is a separate strategic epic: a second snapshot-capable backend behind the existing backend trait.)

Tier 2: KSM page-merging (the "100 VMs ≈ 10" memory density)

shim: opt-in prctl(PR_SET_MEMORY_MERGE) (Linux 6.4+) gated by A3S_BOX_KSM=1, marking libkrun's guest RAM KSM-mergeable. With KSM enabled on the host, ksmd dedups identical pages (kernel text, common libs) across same-image microVMs: VMM-agnostic, host-side, one line.

Measured on Linux/KVM, 6× same-image (ml:t) VMs: A3S_BOX_KSM=1 → ~3.2× less memory (69% saved); 6,862 unique pages backed 60,579 sharers (~8.8:1). Per-VM marginal cost collapses. Best-effort: no-op when unset or on pre-6.4 kernels.
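The quoted ratios follow from simple arithmetic, sketched below together with a helper for reading KSM counters from sysfs. Mapping "unique pages" to pages_shared and "sharers" to pages_sharing is an assumption (the PR does not name the counters it read), and the function names are hypothetical.

```rust
use std::fs;
use std::io;

/// Sharers per unique merged page, e.g. 60,579 / 6,862.
pub fn sharers_per_unique_page(unique_pages: u64, sharers: u64) -> f64 {
    sharers as f64 / unique_pages as f64
}

/// "N times less memory" implied by a saved fraction: 1 / (1 - saved).
pub fn reduction_factor(saved_fraction: f64) -> f64 {
    1.0 / (1.0 - saved_fraction)
}

/// Read one numeric counter from /sys/kernel/mm/ksm/, such as
/// "pages_shared" or "pages_sharing" (assumed here to correspond to the
/// "unique pages" and "sharers" figures in the measurement).
pub fn read_ksm_counter(name: &str) -> io::Result<u64> {
    let raw = fs::read_to_string(format!("/sys/kernel/mm/ksm/{}", name))?;
    raw.trim()
        .parse::<u64>()
        .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))
}
```

Plugging in the measured values gives 60,579 / 6,862 ≈ 8.83 sharers per unique page, and 1 / (1 - 0.69) ≈ 3.23, which matches the rounded ~8.8:1 and ~3.2× figures.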
Tier 1: boot-floor trim + FS-level CoW

- wait_for_vm_running: the fixed 1000 ms "stabilize" sleep ran on every boot before the readiness probe started. Now a 250 ms has_exited poll: it still fails loudly on an immediate launch crash (libkrun exits in ms on bad config), and later crashes are still caught by wait_for_exec_ready. Measured: avg boot-to-ready 1475 ms → 1139 ms (~336 ms / 23% faster) on a fast host; approaches the full ~750 ms on warmer guests; never slower.
- layer_cache.rs: the CopyProvider fallback (when overlayfs is unavailable) did a full per-file byte copy. Added copy_file_cow, which prefers a FICLONE reflink so a new box's rootfs shares blocks CoW with the cached image on btrfs/XFS(reflink)/bcachefs (instant, no extra disk), falling back to a byte copy on ext4/cross-device; permissions preserved. (Overlay is the default on Linux, so this only affects the fallback path; it was not exercised on the overlay test host, but is compile-checked with a safe fallback.)

Notes / follow-ups

- /sys/kernel/mm/ksm/run + per-image warm-pool wiring are natural next steps.
- KSM has CPU (ksmd) and side-channel tradeoffs: fine for same-tenant agent sandboxes; scope per-tenant for multi-tenant.
- ksm/run=1.
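Since the prctl opt-in only pays off when ksmd is running, a host-side check of /sys/kernel/mm/ksm/run is a natural companion. The sketch below is an illustration only: the type and function names are hypothetical, and the 0/1/2 meanings follow the kernel's KSM documentation (0 stops ksmd but keeps merged pages, 1 runs it, 2 stops it and unmerges).

```rust
use std::fs;

/// States of /sys/kernel/mm/ksm/run as documented by the kernel.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum KsmRun {
    Stopped, // "0": ksmd stopped, merged pages kept
    Running, // "1": ksmd scanning and merging
    Unmerge, // "2": ksmd stopped, merged pages broken apart
}

/// Parse the contents of the `run` file; unknown values yield None.
pub fn parse_ksm_run(raw: &str) -> Option<KsmRun> {
    match raw.trim() {
        "0" => Some(KsmRun::Stopped),
        "1" => Some(KsmRun::Running),
        "2" => Some(KsmRun::Unmerge),
        _ => None,
    }
}

/// True only when the host exposes KSM and ksmd is running. A missing
/// file (KSM not built in) or an unreadable one counts as "not running".
pub fn host_ksm_running() -> bool {
    fs::read_to_string("/sys/kernel/mm/ksm/run")
        .ok()
        .and_then(|s| parse_ksm_run(&s))
        == Some(KsmRun::Running)
}
```

A shim could log a hint when A3S_BOX_KSM=1 is set but host_ksm_running() is false, rather than writing to sysfs itself, which requires root.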