perf: CoW memory dedup (KSM) + boot-floor trim + reflink rootfs copy #16

Merged
ZhiXiao-Lin merged 3 commits into main from feat/cow-ksm-memory-merge
Jun 11, 2026

@ZhiXiao-Lin

First, pragmatic wave of the CoW optimization plan for the AI-agent-sandbox use case (many short-lived microVMs). Inspired by forkd (Firecracker snapshot+UFFD fork) and clone (KVM "shadow clone" + KSM + balloon) — but those need a snapshot-capable VMM, and libkrun has no snapshot/restore/fork (verified: 75 krun_* exports, 0 for snapshot; guest RAM is anonymous-private, not memfd; krun_start_enter takes over the process). So this wave delivers the wins that don't require forking libkrun. (True snapshot-fork is a separate strategic epic — a second snapshot-capable backend behind the existing backend trait.)

Tier 2 — KSM page-merging (the "100 VMs ≈ 10" memory density)

shim: opt-in prctl(PR_SET_MEMORY_MERGE) (Linux 6.4+) gated by A3S_BOX_KSM=1, marking libkrun's guest RAM KSM-mergeable. With KSM enabled on the host, ksmd dedups identical pages (kernel text, common libs) across same-image microVMs — VMM-agnostic, host-side, one line.
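A minimal sketch of what such an opt-in could look like, assuming the shim calls `prctl` through a hand-written FFI declaration rather than a binding crate. The helper names `ksm_requested`, `enable_ksm_if_requested`, and `try_enable_memory_merge` are hypothetical and not taken from the PR; `PR_SET_MEMORY_MERGE` is 67 in `<linux/prctl.h>`.

```rust
use std::os::raw::{c_int, c_ulong};

/// PR_SET_MEMORY_MERGE from <linux/prctl.h> (available since Linux 6.4).
#[allow(dead_code)]
const PR_SET_MEMORY_MERGE: c_int = 67;

#[cfg(target_os = "linux")]
extern "C" {
    // libc prototype: int prctl(int option, ...);
    fn prctl(option: c_int, ...) -> c_int;
}

/// Hypothetical gate: opt in only when the variable is exactly "1".
pub fn ksm_requested(value: Option<&str>) -> bool {
    value == Some("1")
}

/// Best-effort: returns true only if the kernel accepted the request.
/// Unset variable, non-Linux hosts, and pre-6.4 kernels all yield false.
pub fn enable_ksm_if_requested() -> bool {
    let value = std::env::var("A3S_BOX_KSM").ok();
    if !ksm_requested(value.as_deref()) {
        return false;
    }
    try_enable_memory_merge()
}

#[cfg(target_os = "linux")]
fn try_enable_memory_merge() -> bool {
    let one: c_ulong = 1;
    let zero: c_ulong = 0;
    // SAFETY: only integer arguments are passed; arg3..arg5 must be zero.
    // Older kernels reject the unknown option with EINVAL, reported as -1.
    let rc = unsafe { prctl(PR_SET_MEMORY_MERGE, one, zero, zero, zero) };
    rc == 0
}

#[cfg(not(target_os = "linux"))]
fn try_enable_memory_merge() -> bool {
    false
}
```

The call would have to happen in the shim process before libkrun allocates guest RAM, so that the later anonymous mappings inherit the merge flag; a failed call is simply ignored, which matches the best-effort behaviour described above.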

Measured on Linux/KVM, 6× same-image (ml:t) VMs:

| | Sum PSS (real physical RAM) | KSM pages_sharing |
|---|---|---|
| KSM off | 355 MB | 0 |
| A3S_BOX_KSM=1 | 108 MB | 60,579 pages (~236 MB deduped) |

~3.3× less memory (355 MB → 108 MB, ~70% saved); 6,862 unique pages backed 60,579 sharers (~8.8:1), so the per-VM marginal cost collapses. Best-effort: no-op when unset or on pre-6.4 kernels.
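The derived figures can be recomputed from the raw measurements. The sketch below assumes 4 KiB pages (the x86_64 default) and that the "unique pages" count corresponds to KSM's `pages_shared` counter; the function names are illustrative only.

```rust
/// Assumption: 4 KiB base pages, as on the x86_64 test host.
const PAGE_SIZE_BYTES: u64 = 4096;

/// Memory covered by a number of pages, in MiB.
pub fn pages_to_mib(pages: u64) -> f64 {
    (pages * PAGE_SIZE_BYTES) as f64 / (1024.0 * 1024.0)
}

/// How many times smaller the footprint became.
pub fn reduction_factor(before_mb: f64, after_mb: f64) -> f64 {
    before_mb / after_mb
}

/// Fraction of the original footprint that was saved.
pub fn fraction_saved(before_mb: f64, after_mb: f64) -> f64 {
    (before_mb - after_mb) / before_mb
}

/// Average number of sharers per unique KSM page.
pub fn sharers_per_unique(pages_sharing: u64, pages_shared: u64) -> f64 {
    pages_sharing as f64 / pages_shared as f64
}
```

Feeding in 355 MB, 108 MB, 60,579 sharing pages, and 6,862 unique pages yields the reduction factor, saved fraction, deduplicated size, and sharing ratio quoted in this section.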

Tier 1 — boot-floor trim + FS-level CoW

  • wait_for_vm_running: the fixed 1000 ms "stabilize" sleep ran on every boot before the readiness probe started. It is now a 250 ms has_exited poll, which still fails loudly on an immediate launch crash (libkrun exits in milliseconds on a bad config); later crashes are still caught by wait_for_exec_ready. Measured: average boot-to-ready 1475 ms → 1139 ms (~336 ms, 23% faster) on a fast host. The saving approaches the full ~750 ms (the 1000 ms sleep minus the 250 ms poll) on warmer guests, and boot is never slower.
  • layer_cache.rs: the CopyProvider fallback (when overlayfs is unavailable) did a full per-file byte copy. Added copy_file_cow — prefers a FICLONE reflink so a new box's rootfs shares blocks CoW with the cached image on btrfs/XFS(reflink)/bcachefs (instant, no extra disk), falling back to a byte copy on ext4/cross-device; permissions preserved. (Overlay is the default on Linux, so this only affects the fallback path — not exercised on the overlay test host, but compile-checked + safe fallback.)
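The bounded has_exited poll from the first bullet might look roughly like the sketch below. The signature, the closure-based `has_exited` check, the error type, and the polling interval are all assumptions for illustration; the real `wait_for_vm_running` in ready.rs may be structured differently.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the launch check: watch the VM process for
/// `window` (250 ms in the PR) and fail as soon as it is seen to have exited.
pub fn wait_for_vm_running<F>(
    mut has_exited: F,
    window: Duration,
    interval: Duration,
) -> Result<(), String>
where
    F: FnMut() -> bool,
{
    let deadline = Instant::now() + window;
    loop {
        // An early exit means the launch failed (for example, bad config).
        if has_exited() {
            return Err("VM process exited during launch".to_string());
        }
        let now = Instant::now();
        if now >= deadline {
            // Still alive after the window: hand over to the readiness probe.
            return Ok(());
        }
        // Never sleep past the deadline.
        sleep(interval.min(deadline - now));
    }
}
```

A caller would pass something like `|| child.try_wait().map(|s| s.is_some()).unwrap_or(true)` for a `std::process::Child`, with a 250 ms window. Unlike a fixed sleep, the crash path returns as soon as the exit is observed rather than after the full window.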

Notes / follow-ups

  • KSM is opt-in via env; a config field + auto-managing /sys/kernel/mm/ksm/run + per-image warm-pool wiring are natural next steps. KSM has CPU (ksmd) and side-channel tradeoffs — fine for same-tenant agent sandboxes; scope per-tenant for multi-tenant.
  • Verified on x86_64 Linux/KVM (kernel 6.8). KSM POC needs host ksm/run=1.
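Because the opt-in only has an effect when ksmd is actually running, a host-side pre-check could be sketched as follows. The function names are hypothetical; the sysfs path and its values (0 = stopped, 1 = running, 2 = stop and unmerge) are the standard kernel KSM interface.

```rust
use std::fs;

/// Parse the contents of /sys/kernel/mm/ksm/run into its numeric state.
pub fn parse_ksm_run(raw: &str) -> Option<u32> {
    raw.trim().parse().ok()
}

/// True only when the host reports ksmd as running (state 1).
/// A missing or unreadable file (no KSM support, non-Linux) yields false.
pub fn host_ksm_running() -> bool {
    fs::read_to_string("/sys/kernel/mm/ksm/run")
        .ok()
        .and_then(|s| parse_ksm_run(&s))
        == Some(1)
}
```

A shim could log a warning when `A3S_BOX_KSM=1` is set but `host_ksm_running()` is false, since pages would be marked mergeable without ever being scanned.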

ZhiXiao-Lin pushed a commit that referenced this pull request Jun 11, 2026
Design-only proposal: a second snapshot-capable backend (Firecracker/CH) behind
the existing backend trait for high-fan-out agent sandboxes, since libkrun has no
snapshot. Maps forkd (snapshot+UFFD+diff-chaining) and clone (shadow-clone+KSM)
onto a3s-box, reusing the overlay rootfs CoW, warm pool, and — the key advantage —
the existing guest vsock exec server as the post-fork re-injection hook. Phased
plan: P0 KSM/boot-trim (done, #16) → P1 per-image template pool + re-inject
handshake (doable on libkrun, de-risks) → P2 real snapshot-fork backend → P3
diff-chaining + balloon.

Co-authored-by: Roy Lin <roylin@a3s.box>
Roy Lin added 3 commits June 11, 2026 16:39
Mark the shim's anonymous memory — including libkrun's guest RAM — KSM-mergeable
via prctl(PR_SET_MEMORY_MERGE) (Linux 6.4+) when A3S_BOX_KSM=1. With KSM enabled
on the host, identical pages across same-image microVMs (kernel text, common
runtime/libs) are deduplicated by ksmd, so N warm VMs of one image cost far less
host RAM than N× their size — the memory-density half of a CoW-fork model,
without any libkrun change. Best-effort: no-op when unset or on pre-6.4 kernels.

Tier 2 of the CoW optimization plan (libkrun has no VM snapshot/fork; KSM is the
VMM-agnostic, host-side path to clone-style '100 VMs ~ 10' memory density).
…lback

Tier 1 of the CoW/boot-latency work.

- ready.rs wait_for_vm_running: the fixed 1000 ms "stabilize" sleep ran on EVERY
  boot before the readiness probe even started. Replace it with a 250 ms
  has_exited poll — still catches an immediate launch failure (bad config makes
  libkrun exit in milliseconds) and fails loudly, but shaves ~750 ms off every
  boot. Later crashes are still caught by wait_for_exec_ready's has_exited checks.

- layer_cache.rs copy_dir_recursive: the CopyProvider fallback (used when
  overlayfs is unavailable) did a full per-file byte copy. Add copy_file_cow,
  which prefers a FICLONE reflink so a new box's rootfs shares blocks CoW with the
  cached image on btrfs/XFS(reflink)/bcachefs (instant, no extra disk), falling
  back to a byte copy on ext4/cross-device. Permissions preserved like fs::copy.
Tests the reflink-preferring copy helper on the fallback path (any FS): content
and permission bits survive whether FICLONE reflinks or fs::copy is used, and the
destination is truncated/overwritten.
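For reference, a reflink-preferring copy of this kind can be sketched as below. This is not the PR's `copy_file_cow` but an illustration under stated assumptions: the `ioctl` is declared by hand with the glibc prototype, and the `FICLONE` constant is the value `_IOW(0x94, 9, int)` takes on x86_64 and aarch64 (it differs on some other architectures). `try_reflink` is a hypothetical helper name.

```rust
use std::fs::{self, File};
use std::io;
use std::path::Path;

#[cfg(target_os = "linux")]
fn try_reflink(src: &File, dst: &File) -> bool {
    use std::os::raw::{c_int, c_ulong};
    use std::os::unix::io::AsRawFd;

    // Assumption: FICLONE encoding for x86_64/aarch64.
    const FICLONE: c_ulong = 0x4004_9409;
    extern "C" {
        fn ioctl(fd: c_int, request: c_ulong, ...) -> c_int;
    }
    // SAFETY: both descriptors stay open for the duration of the call.
    // Fails (non-zero) on filesystems without reflink support or across devices.
    unsafe { ioctl(dst.as_raw_fd(), FICLONE, src.as_raw_fd()) == 0 }
}

#[cfg(not(target_os = "linux"))]
fn try_reflink(_src: &File, _dst: &File) -> bool {
    false
}

/// Copy `src` to `dst`, sharing blocks copy-on-write when the filesystem
/// allows it and falling back to a plain byte copy otherwise.
pub fn copy_file_cow(src: &Path, dst: &Path) -> io::Result<()> {
    let src_file = File::open(src)?;
    let dst_file = File::create(dst)?; // creates or truncates the destination
    if try_reflink(&src_file, &dst_file) {
        // A reflink clones data only, so carry the permission bits over.
        fs::set_permissions(dst, src_file.metadata()?.permissions())?;
        return Ok(());
    }
    drop(dst_file);
    // fs::copy overwrites the destination and copies permission bits.
    fs::copy(src, dst)?;
    Ok(())
}
```

Either path leaves the destination with the source's content and permission bits, which is the property the test described above checks; on ext4 or tmpfs the ioctl fails and the byte-copy branch runs.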
ZhiXiao-Lin force-pushed the feat/cow-ksm-memory-merge branch from 58e52b5 to 8cb8423 on June 11, 2026 08:40
ZhiXiao-Lin merged commit 0f224a4 into main on Jun 11, 2026
7 checks passed
ZhiXiao-Lin deleted the feat/cow-ksm-memory-merge branch on June 11, 2026 08:46