Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
30 changes: 15 additions & 15 deletions config/config.go
Original file line number Diff line number Diff line change
Expand Up @@ -63,11 +63,11 @@ type SystemUserConfig struct {

// AutoLifecycleConfig controls automatic sandbox lifecycle transitions
type AutoLifecycleConfig struct {
Enabled bool
PauseAfterIdleSec int // auto-pause after N seconds of inactivity (default: 60)
StopAfterPausedSec int // auto-stop after N seconds of being paused (default: 900)
DeleteAfterStoppedSec int // auto-delete after N seconds of being stopped (default: 604800)
CheckIntervalSec int // how often the manager scans (default: 30)
Enabled bool
SnapshotAfterIdleSec int // auto-snapshot after N seconds of inactivity (default: 60)
DeleteAfterSnapshottedSec int // auto-delete after N seconds of being snapshotted (default: 604800)
CheckIntervalSec int // how often the manager scans (default: 30)
Concurrency int // max concurrent snapshot/delete operations (default: 10)
}

// Config holds all application configuration
Expand Down Expand Up @@ -214,11 +214,11 @@ const (
DefaultRedisPassword = ""
DefaultRedisDB = 0
// Auto-lifecycle defaults
DefaultAutoLifecycleEnabled = true
DefaultAutoLifecyclePauseAfterIdleSec = 60 // 1 minute
DefaultAutoLifecycleStopAfterPausedSec = 300 // 5 minutes
DefaultAutoLifecycleDeleteAfterStoppedSec = 604800 // 1 week
DefaultAutoLifecycleCheckIntervalSec = 30 // 30 seconds
DefaultAutoLifecycleEnabled = true
DefaultAutoLifecycleSnapshotAfterIdleSec = 60 // 1 minute
DefaultAutoLifecycleDeleteAfterSnapshottedSec = 604800 // 1 week
DefaultAutoLifecycleCheckIntervalSec = 30 // 30 seconds
DefaultAutoLifecycleConcurrency = 10
// Monitor defaults
DefaultMonitorEnabled = true
// Pagination defaults
Expand Down Expand Up @@ -317,11 +317,11 @@ func New() *Config {
MaxAgeSec: getEnvInt("CORS_MAX_AGE_SEC", DefaultCORSMaxAgeSec),
},
AutoLifecycle: AutoLifecycleConfig{
Enabled: getEnvBool("AUTO_LIFECYCLE_ENABLED", DefaultAutoLifecycleEnabled),
PauseAfterIdleSec: getEnvInt("AUTO_LIFECYCLE_PAUSE_AFTER_IDLE_SEC", DefaultAutoLifecyclePauseAfterIdleSec),
StopAfterPausedSec: getEnvInt("AUTO_LIFECYCLE_STOP_AFTER_PAUSED_SEC", DefaultAutoLifecycleStopAfterPausedSec),
DeleteAfterStoppedSec: getEnvInt("AUTO_LIFECYCLE_DELETE_AFTER_STOPPED_SEC", DefaultAutoLifecycleDeleteAfterStoppedSec),
CheckIntervalSec: getEnvInt("AUTO_LIFECYCLE_CHECK_INTERVAL_SEC", DefaultAutoLifecycleCheckIntervalSec),
Enabled: getEnvBool("AUTO_LIFECYCLE_ENABLED", DefaultAutoLifecycleEnabled),
SnapshotAfterIdleSec: getEnvInt("AUTO_LIFECYCLE_SNAPSHOT_AFTER_IDLE_SEC", DefaultAutoLifecycleSnapshotAfterIdleSec),
DeleteAfterSnapshottedSec: getEnvInt("AUTO_LIFECYCLE_DELETE_AFTER_SNAPSHOTTED_SEC", DefaultAutoLifecycleDeleteAfterSnapshottedSec),
CheckIntervalSec: getEnvInt("AUTO_LIFECYCLE_CHECK_INTERVAL_SEC", DefaultAutoLifecycleCheckIntervalSec),
Concurrency: getEnvInt("AUTO_LIFECYCLE_CONCURRENCY", DefaultAutoLifecycleConcurrency),
},
Monitor: MonitorConfig{
Enabled: getEnvBool("MONITOR_ENABLED", DefaultMonitorEnabled),
Expand Down
216 changes: 216 additions & 0 deletions docs/snapshot-restore-scale-security-review.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,216 @@
# Snapshot/Restore Scale and Security Review

Date: 2026-06-19
Branch reviewed: `feat/ch-snap-restore`
Scope: local working-tree changes in `voidrun`

## Executive Summary

The snapshot/restore redesign is moving in a useful direction for startup latency and fleet efficiency, but it is not yet ready to be called optimized for scale and security.

The strongest positives are:

- `singleflight` deduplication for concurrent auto-restore calls
- persisted network metadata (`macAddress`, `netnsName`, `tapName`) to make restore deterministic
- bounded lifecycle concurrency for snapshot/delete sweeps

The main blockers are:

1. Restored VMs lose part of the host-side confinement that fresh boots still have.
2. The new DNS firewall rules weaken network isolation because they are inserted before the private-range drops.
3. Auto-restore work is tied to the first caller's request context, which can cause shared restores to fail under load.
4. The public API contract was not updated to match the lifecycle rewrite.
5. The new memory settings may reduce VM density, and there is no evidence in this branch that the trade-off was measured.
6. The repo is not currently green under `go test ./...`.

Verdict: good prototype progress, but not yet production-ready from a scale/security standpoint.

## What Changed

This branch replaces the old `start/stop/pause/resume` flow with a `snapshot/restore` model and updates the service layer to auto-restore snapshotted sandboxes on demand.

Major themes in the diff:

- lifecycle state model changes from `running/paused/stopped` to `running/snapshotted/killed/deleted`
- runtime snapshot creation and restore support added in `runtime/lifecycle.go`
- sandbox service updated to auto-restore via `singleflight`
- lifecycle manager updated to auto-snapshot idle sandboxes and auto-delete old snapshotted sandboxes
- network namespace setup updated to allow DNS only to configured nameservers
- router changed from `/start`, `/stop`, `/pause`, `/resume` to `/snapshot`, `/restore`

## Findings

### 1. High: restore path drops Landlock confinement

Fresh boots still enable both seccomp and Landlock, but the production restore path only re-enables seccomp. That means a restored Cloud Hypervisor process can end up with broader filesystem access than a newly created VM.

Why it matters:

- security posture becomes inconsistent by lifecycle state
- a sandbox that was safe at create-time becomes less isolated after restore
- this is the kind of regression that can be missed in functional testing but matters in a multi-tenant environment

Evidence:

- `runtime/lifecycle.go` fresh create path appends `--seccomp` and `--landlock`
- `runtime/lifecycle.go` restore path appends only `--seccomp`

Recommended fix:

- make restore use the same Landlock policy builder as create
- avoid maintaining two separate security configurations for the same VMM role
- add an automated test that asserts restore and create both include the same confinement flags

### 2. High: DNS allow rules are ordered before the private-range drops

The new rules allow DNS to configured nameservers before the branch drops traffic to metadata and RFC1918 ranges. If a configured nameserver lives in link-local or private space, that allow rule wins.

Why it matters:

- it weakens the current "deny internal networks from the guest" model
- metadata or internal resolver access could be reintroduced through configuration
- the new test already shows the rule order is opposite of the intended policy

Evidence:

- `runtime/network.go` inserts DNS `ACCEPT` rules before the `169.254.169.254`, `10/8`, `172.16/12`, and `192.168/16` drops
- `runtime/network_test.go` fails with `DNS rules should be AFTER the drops`

Recommended fix:

- move DNS allow rules after the metadata/private-network drops, or
- explicitly reject private/link-local nameserver addresses at config validation time
- keep the regression test and require it to pass before merge

### 3. Medium: shared auto-restore is coupled to a caller request context

The `singleflight` dedupe is a good idea, but the shared restore still runs inside the first caller's request context. If that caller disconnects or times out, the restore can be canceled and rolled back for every concurrent waiter.

Why it matters:

- burst traffic to the same sandbox can fail together
- tail latency becomes sensitive to client disconnects and gateway timeouts
- this turns a scale optimization into a reliability hazard under load

Evidence:

- `service/sandbox.go` calls `s.restoreGroup.Do(id, func() { return s.Restore(ctx, orgID, id) })`
- `service/sandbox.go` then uses that same `ctx` in `waitForAgent()`

Recommended fix:

- decouple the restore worker from the first request by using a fresh bounded internal context
- let callers wait on the shared work result, but do not let one caller cancel the whole restore
- consider a per-sandbox in-flight state machine if restore behavior keeps growing

### 4. Medium: API docs and route contract drifted apart

The router now exposes `/snapshot` and `/restore`, but the OpenAPI spec still documents `/start`, `/stop`, `/pause`, and `/resume`. The schema enum also still advertises old states.

Why it matters:

- generated SDKs and external clients will be wrong
- support and product teams can share outdated lifecycle behavior
- integration breakage is likely even if the server code works

Evidence:

- `server/server.go` registers `/snapshot` and `/restore`
- `openapi.yml` still documents `/sandboxes/{id}/start`, `/stop`, `/pause`, `/resume`
- `openapi.yml` still lists lifecycle states including `stopped` and `paused`, not `snapshotted`

Recommended fix:

- update `openapi.yml` in the same change set as route changes
- regenerate any downstream clients after the spec is corrected
- add a lightweight check that route names and OpenAPI paths stay in sync

### 5. Medium: memory settings may reduce density, with no proof of the trade-off

The branch changes memory configuration from shared memory mode to private memory mode on both the API and CLI paths.

Why it matters:

- memory sharing is often important for VM density when many guests share the same base image
- disabling it may be the right compatibility decision for snapshots, but it can reduce host efficiency
- the branch does not include benchmark evidence showing the fleet-level impact is acceptable

Evidence:

- `runtime/lifecycle.go` changes `Shared: true` to `Shared: false`
- `runtime/lifecycle.go` changes CLI memory flags from `size=%dM,shared=on,mergeable=off` to `size=%dM`

Recommended fix:

- document why shared memory had to be disabled
- run before/after density and memory-pressure measurements
- if the change is required for restore correctness, call that out explicitly in docs and rollout notes

### 6. Medium: current branch is not test-clean

The branch currently fails `go test ./...`.

Why it matters:

- merge confidence is lower when a lifecycle rewrite is not validated end to end
- one failure is directly tied to the new network policy behavior
- another failure comes from a helper program that no longer matches current interfaces

Observed failures:

- `runtime/network_test.go` fails because DNS rules are ordered before the deny rules
- `cmd/test-sandbox/main.go` does not compile against the current repository APIs

Recommended fix:

- make the full Go test suite green before merge
- either update `cmd/test-sandbox/main.go` to current interfaces or exclude it from normal package builds if it is only a local experiment

## Scale Assessment

### Improvements

- `singleflight` is the right direction for preventing restore stampedes
- lifecycle manager concurrency caps are a good guardrail for bulk snapshot/delete work
- storing MAC and NetNS metadata should reduce restore-time recomputation and edge cases

### Remaining scale concerns

- restore cancellation is still fragile because it depends on request-scoped context
- restore readiness still relies on tight polling loops and serial post-restore steps
- memory density impact is unknown after disabling shared guest memory
- API contract drift increases rollout cost across SDKs and automation

Overall scale verdict: improved architecture, but not yet proven or hardened for high-concurrency production use.

## Security Assessment

### Improvements

- DNS is now restricted to configured nameservers instead of broad outbound UDP/TCP allowances
- sandbox network metadata is persisted, reducing restore-time guessing
- Cloud Hypervisor lifecycle handling appears more explicit than the earlier warm-start model

### Remaining security concerns

- restore path loses Landlock parity with fresh create
- DNS rule order weakens isolation if nameservers are internal or link-local
- configuration should validate nameservers against forbidden ranges instead of relying only on iptables ordering
- route/spec drift makes it easier for external callers to rely on outdated lifecycle assumptions

Overall security verdict: not ready to claim secure-by-default until restore confinement and firewall ordering are fixed.

## Recommended Next Steps

1. Fix restore-path security parity by reusing the same Landlock policy generation as create.
2. Reorder DNS firewall rules or reject unsafe nameserver addresses during config validation.
3. Decouple `singleflight` restore execution from request-scoped cancellation.
4. Update `openapi.yml` and any generated clients to the new lifecycle model.
5. Benchmark memory density and restore latency before and after the shared-memory change.
6. Get `go test ./...` green and keep the new network regression test in CI.

## Merge Recommendation

Do not merge as-is if the goal is a production-ready scale/security improvement.

This branch is close enough to keep iterating on, but it should clear the restore confinement issue, the firewall ordering issue, and the current test failures before being treated as ready to share as a completed solution rather than an in-progress design.
2 changes: 1 addition & 1 deletion go.mod
Original file line number Diff line number Diff line change
Expand Up @@ -14,6 +14,7 @@ require (
github.com/vishvananda/netlink v1.3.1
go.mongodb.org/mongo-driver v1.16.1
golang.org/x/crypto v0.46.0
golang.org/x/sync v0.19.0
)

require (
Expand Down Expand Up @@ -60,7 +61,6 @@ require (
go.uber.org/mock v0.6.0 // indirect
golang.org/x/arch v0.23.0 // indirect
golang.org/x/net v0.48.0 // indirect
golang.org/x/sync v0.19.0 // indirect
golang.org/x/text v0.33.0 // indirect
google.golang.org/protobuf v1.36.11 // indirect
)
Expand Down
46 changes: 8 additions & 38 deletions handler/sandbox.go
Original file line number Diff line number Diff line change
Expand Up @@ -126,62 +126,32 @@ func (h *SandboxHandler) Delete(c *gin.Context) error {
return nil
}

func (h *SandboxHandler) Start(c *gin.Context) error {
func (h *SandboxHandler) Snapshot(c *gin.Context) error {
id := c.Param("id")

orgID, err := util.GetOrgIDFromContext(c)
if err != nil {
return err
}

if err := h.sandboxService.Start(c.Request.Context(), orgID, id); err != nil {
return util.ErrInternal("Start failed", err)
if err := h.sandboxService.Snapshot(c.Request.Context(), orgID, id); err != nil {
return util.ErrInternal("Snapshot failed", err)
}
c.JSON(http.StatusOK, model.NewSuccessResponse("Sandbox started", nil))
c.JSON(http.StatusOK, model.NewSuccessResponse("Sandbox snapshotted", nil))
return nil
}

func (h *SandboxHandler) Stop(c *gin.Context) error {
func (h *SandboxHandler) Restore(c *gin.Context) error {
id := c.Param("id")

orgID, err := util.GetOrgIDFromContext(c)
if err != nil {
return err
}

if err := h.sandboxService.Stop(c.Request.Context(), orgID, id); err != nil {
return util.ErrInternal("Stop failed", err)
if err := h.sandboxService.Restore(c.Request.Context(), orgID, id); err != nil {
return util.ErrInternal("Restore failed", err)
}
c.JSON(http.StatusOK, model.NewSuccessResponse("Sandbox stopped", nil))
return nil
}

func (h *SandboxHandler) Pause(c *gin.Context) error {
id := c.Param("id")

orgID, err := util.GetOrgIDFromContext(c)
if err != nil {
return err
}

if err := h.sandboxService.Pause(c.Request.Context(), orgID, id); err != nil {
return util.ErrInternal("Pause failed", err)
}
c.JSON(http.StatusOK, model.NewSuccessResponse("Sandbox paused", nil))
return nil
}

func (h *SandboxHandler) Resume(c *gin.Context) error {
id := c.Param("id")

orgID, err := util.GetOrgIDFromContext(c)
if err != nil {
return err
}

if err := h.sandboxService.Resume(c.Request.Context(), orgID, id); err != nil {
return util.ErrInternal("Resume failed", err)
}
c.JSON(http.StatusOK, model.NewSuccessResponse("Sandbox resumed", nil))
c.JSON(http.StatusOK, model.NewSuccessResponse("Sandbox restored", nil))
return nil
}
6 changes: 3 additions & 3 deletions mcp/tools.go
Original file line number Diff line number Diff line change
Expand Up @@ -37,7 +37,7 @@ func toolCreateSandbox() mcp.Tool {
mcp.Description("Unique name for the sandbox (DNS-1123 subdomain format: lowercase alphanumeric and hyphens)"),
),
mcp.WithString("image",
mcp.Description("Image name in name or name:ver form (e.g. code, max, docker). Defaults to code if omitted."),
mcp.Description("Image name in name or name:ver form (e.g. code, docker-lite, max, docker). Defaults to code if omitted."),
),
mcp.WithNumber("cpu",
mcp.Description("Number of vCPUs (1-8). Defaults to 1."),
Expand All @@ -58,7 +58,7 @@ func toolCreateSandbox() mcp.Tool {
mcp.Description("Environment variables for the sandbox (string map)."),
),
mcp.WithBoolean("autoSleep",
mcp.Description("If true, auto-pause the VM after idle time."),
mcp.Description("If true, auto-snapshot the VM after idle time."),
),
mcp.WithString("region",
mcp.Description("Target region when supported by your account."),
Expand Down Expand Up @@ -109,7 +109,7 @@ func toolDeleteSandbox() mcp.Tool {
func toolExecuteCommand() mcp.Tool {
return mcp.NewTool(
"execute_command",
mcp.WithDescription("Execute a shell command in a sandbox and return the output. The sandbox must be running (it will be auto-resumed if paused)."),
mcp.WithDescription("Execute a shell command in a sandbox and return the output. The sandbox must be running (it will be auto-restored if snapshotted)."),
mcp.WithString("id",
mcp.Required(),
mcp.Description("The sandbox ID"),
Expand Down
4 changes: 2 additions & 2 deletions model/sandbox.go
Original file line number Diff line number Diff line change
Expand Up @@ -18,8 +18,7 @@ type Sandbox struct {
Status string `bson:"status" json:"status"`
AutoSleep bool `bson:"autoSleep" json:"autoSleep"`
LastActivityAt *time.Time `bson:"lastActivityAt,omitempty" json:"-"`
PausedAt *time.Time `bson:"pausedAt,omitempty" json:"-"`
StoppedAt *time.Time `bson:"stoppedAt,omitempty" json:"-"`
SnapshottedAt *time.Time `bson:"snapshottedAt,omitempty" json:"-"`
CreatedAt time.Time `bson:"createdAt" json:"createdAt"`
CreatedBy primitive.ObjectID `bson:"createdBy" json:"createdBy"`
OrgID primitive.ObjectID `bson:"orgId" json:"orgId"`
Expand All @@ -28,6 +27,7 @@ type Sandbox struct {
RefID string `bson:"refId,omitempty" json:"refId,omitempty"`
TapName string `bson:"tapName,omitempty" json:"-"`
NetNSName string `bson:"netnsName,omitempty" json:"-"`
MacAddress string `bson:"macAddress,omitempty" json:"-"`
TapDeleted bool `bson:"tapDeleted,omitempty" json:"-"`
BillingCompleted bool `bson:"billingCompleted,omitempty" json:"-"`
}
Expand Down
Loading