feat: cross-node migration (nodeName affinity + migration state machine)#11

Open: tonicmuroq wants to merge 2 commits into main from feat/cross-node-migration


This PR implements the operator side of cross-node migrate(vmname, node): the control plane patches CocoonSet.spec.nodeName, and the operator does the rest.

What

  • Placement (buildAgentPod): the main agent (slot 0) gets a required hostname nodeAffinity from spec.nodeName instead of a hard NodeName bind — it lands on the target only if it fits and the node is schedulable, else stays Pending (respects capacity/cordon, no OOM). Sub-agents keep their hard-bind to the main's node.
  • Migration state machine (reconcileMigration): a pure observation function over durable state (spec.nodeName, the pod, the epoch :hibernate snapshot) — set internal hibernate annotation → wait for snapshot → delete old pod → recreate on target with restore-from-hibernate → wait for the restored VMID → drop the snapshot. Idempotent and crash-recoverable; runs before applyUnsuspend so its hibernate annotation isn't cleared mid-flight. Ordering gates: old pod deleted only after the snapshot lands; snapshot dropped only after the new VM has a fresh VMID. Surfaces CocoonSetPhaseMigrating. Scoped to the main agent (one VM per CocoonSet).
  • Imports the regenerated CocoonSet CRD.
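
The placement rule in the first bullet can be sketched roughly as follows. This is an illustration only, not the code in this PR: the types are simplified local stand-ins for the corresponding k8s.io/api/core/v1 types, placementFor is a hypothetical helper name, and leaving the main agent unconstrained when spec.nodeName is empty is an assumption.

```go
package main

// Simplified stand-ins for the corev1 node-affinity types (assumption:
// the real operator uses k8s.io/api/core/v1 directly).
type NodeSelectorRequirement struct {
	Key      string
	Operator string
	Values   []string
}

type NodeSelectorTerm struct {
	MatchExpressions []NodeSelectorRequirement
}

// PodPlacement captures the two ways a pod can be pinned to a node.
type PodPlacement struct {
	// NodeName is a hard bind: the scheduler is bypassed entirely.
	NodeName string
	// RequiredNodeTerms models requiredDuringSchedulingIgnoredDuringExecution:
	// the scheduler still checks capacity and cordon before placing the pod.
	RequiredNodeTerms []NodeSelectorTerm
}

// hostnameLabel is the well-known node label carrying the node's hostname.
const hostnameLabel = "kubernetes.io/hostname"

// placementFor is a hypothetical helper: slot 0 (the main agent) gets a
// required hostname affinity from spec.nodeName, and sub-agents are
// hard-bound to the node where the main agent runs.
func placementFor(slot int, specNodeName, mainNode string) PodPlacement {
	if slot != 0 {
		return PodPlacement{NodeName: mainNode}
	}
	if specNodeName == "" {
		// Assumption: no target requested, so no constraint is added.
		return PodPlacement{}
	}
	return PodPlacement{RequiredNodeTerms: []NodeSelectorTerm{{
		MatchExpressions: []NodeSelectorRequirement{{
			Key:      hostnameLabel,
			Operator: "In",
			Values:   []string{specNodeName},
		}},
	}}}
}
```

Because the main agent goes through the scheduler, a target that is cordoned or short on capacity leaves the pod Pending instead of forcing it onto the node.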

Dependency

Depends on cocoonstack/cocoon-common#3 (spec.nodeName + Migrating phase). go.mod pins the branch commit via pseudo-version; bump to the cocoon-common release tag after #3 merges.

Tests

migrate_test.go (7 transitions incl. both ordering gates), pods_test.go (3 affinity cases); full suite + make lint clean on linux + darwin.

Not in scope

Control-plane migrate API + IP backfill + involuntary-eviction reconcile (simular-pro-vm-service); end-to-end + crash-injection tests (need a cluster).

buildAgentPod gives the main agent (slot 0) a required hostname nodeAffinity
from spec.NodeName instead of a hard NodeName bind, so a migrate target that
won't fit stays Pending rather than OOM-ing, and cordon is respected.
Sub-agents keep their hard-bind to the main's node. Imports the updated
CocoonSet CRD and bumps cocoon-common for the NodeName field.

Reconcile drives the main agent across nodes when spec.nodeName drifts from
where the pod runs. reconcileMigration is a pure observation function over
durable state (spec.nodeName, the pod, the epoch :hibernate snapshot): set the
internal hibernate annotation -> wait for the snapshot -> delete the old pod ->
recreate on the target with restore-from-hibernate -> wait for the restored
VMID -> drop the snapshot. Idempotent and crash-recoverable; runs before
applyUnsuspend so its hibernate annotation isn't cleared mid-flight. Ordering
gates: old pod deleted only after the snapshot lands; snapshot dropped only
after the new VM has a fresh VMID. Surfaces CocoonSetPhaseMigrating. Scoped to
the main agent (one VM per CocoonSet). Bumps cocoon-common for the phase enum.
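
As a rough illustration of what "a pure observation function over durable state" means, the sketch below derives the next step from observed facts alone, with no in-memory progress tracking. It is not the reconcileMigration in this PR: the Observed struct, the Action names, and nextMigrationAction are all hypothetical, and PodNode is assumed to be the node the existing pod is pinned to, so that a Pending pod on the target is not mistaken for the old pod.

```go
package main

// Observed is a hypothetical summary of the durable state the text lists.
type Observed struct {
	TargetNode         string // spec.nodeName
	PodExists          bool
	PodNode            string // node the existing pod is pinned to (assumption)
	HibernateRequested bool   // internal hibernate annotation is set
	SnapshotExists     bool   // the :hibernate snapshot is present
	RestoredVMID       string // VMID of the restored VM, "" until it appears
}

// Action names the single step to take on this reconcile pass.
type Action string

const (
	ActionNone             Action = "none"
	ActionSetHibernate     Action = "set-hibernate-annotation"
	ActionWaitSnapshot     Action = "wait-for-snapshot"
	ActionDeleteOldPod     Action = "delete-old-pod"
	ActionRecreateOnTarget Action = "recreate-on-target"
	ActionWaitVMID         Action = "wait-for-restored-vmid"
	ActionDropSnapshot     Action = "drop-snapshot"
)

// nextMigrationAction recomputes the step from scratch on every call, so a
// crash between steps resumes at the same point after restart.
func nextMigrationAction(o Observed) Action {
	if o.TargetNode == "" {
		return ActionNone
	}
	switch {
	case o.PodExists && o.PodNode != o.TargetNode:
		// Gate 1: the old pod is deleted only after the snapshot lands.
		if o.SnapshotExists {
			return ActionDeleteOldPod
		}
		if o.HibernateRequested {
			return ActionWaitSnapshot
		}
		return ActionSetHibernate
	case !o.PodExists:
		if o.SnapshotExists {
			return ActionRecreateOnTarget
		}
		return ActionNone // nothing to restore; not a migration
	default: // pod is on the target
		if !o.SnapshotExists {
			return ActionNone // migration finished, or none was running
		}
		// Gate 2: the snapshot is dropped only after the new VM has a VMID.
		if o.RestoredVMID == "" {
			return ActionWaitVMID
		}
		return ActionDropSnapshot
	}
}
```

Calling the function twice with the same observation returns the same action, which is the property that makes each step safe to retry.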