feat(stack): add obol node for k3s multi-node clusters#637

Open
bussyjd wants to merge 1 commit into main from feat/obol-node-multinode

@bussyjd bussyjd commented Jun 12, 2026

Contributor

Adds a first-class CLI for adding worker nodes to a k3s-backed stack. This is the multi-node path where the master schedules pods on another node: it fails on a k3d/Docker Desktop master but works on a native k3s master over the LAN.

What

  • obol node token — prints the k3s agent-join one-liner (K3S_URL + K3S_TOKEN, version-pinned to the running server). --json for machine use, --server-url override.
  • obol node list — nodes with their obol.tech/accelerator label.
  • Both guard on the backend: on k3d they fail fast (a Docker-Desktop master cannot accept remote node joins — its flannel overlay is not routable off-host).
  • tls-san plumbing: k3s-config.yaml now templates {{NODE_IP}}/{{NODE_HOSTNAME}} (substituted in backend_k3s.Init) so workers can join by LAN IP or hostname. (k3s already auto-adds the primary node IP, so this mainly adds an explicit hostname SAN + cert determinism.)
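
As a rough illustration of the first and third bullets, here is a minimal sketch of how the join one-liner and the backend guard could be assembled. The names `joinInfo`, `joinCommand`, `requireK3s`, and `shellQuote` are hypothetical and not taken from this PR, and the exact output format of `obol node token` may differ. The only parts assumed from upstream are the documented `get.k3s.io` installer variables `INSTALL_K3S_VERSION`, `K3S_URL`, and `K3S_TOKEN`.

```go
package main

import (
	"fmt"
	"strings"
)

// joinInfo is a hypothetical container for what the one-liner needs.
type joinInfo struct {
	ServerURL string // e.g. https://192.168.1.10:6443
	Token     string // the server's node token
	Version   string // k3s version of the running server
}

// shellQuote single-quotes s so it survives copy-paste into a POSIX shell.
func shellQuote(s string) string {
	return "'" + strings.ReplaceAll(s, "'", `'\''`) + "'"
}

// requireK3s rejects any backend other than native k3s (assumed names: "k3s", "k3d").
func requireK3s(backend string) error {
	if backend != "k3s" {
		return fmt.Errorf("node commands need the k3s backend, got %q", backend)
	}
	return nil
}

// joinCommand renders an agent install one-liner pinned to the server version.
func joinCommand(j joinInfo) string {
	return fmt.Sprintf(
		"curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=%s K3S_URL=%s K3S_TOKEN=%s sh -",
		shellQuote(j.Version), shellQuote(j.ServerURL), shellQuote(j.Token),
	)
}

func main() {
	if err := requireK3s("k3s"); err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(joinCommand(joinInfo{
		ServerURL: "https://192.168.1.10:6443", // placeholder LAN address
		Token:     "example-token",             // placeholder, not a real token
		Version:   "v1.30.1+k3s1",              // placeholder version
	}))
}
```

Quoting each value is a choice made for this sketch so that a `+` in the version or unusual characters in the token cannot be reinterpreted by the shell. The real command may print the values unquoted.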

Why

On Docker Desktop the k3d server's flannel VTEP is a VM-internal 172.x IP, unroutable from the LAN — a joined agent registers Ready but cross-node pod networking hangs. On a native k3s master it all works. This makes that path ergonomic.

Tests / validation

  • 7 new table-driven tests (internal/stack/node_test.go, backend_k3s_init_test.go); go build ./... + full internal/stack suite green; backend guard verified live.
  • Validated end-to-end on two LAN boxes: a native k3s master scheduling cross-node pods (CoreDNS + flannel VXLAN) and a 4-bit QLoRA fine-tune scheduled onto a remote RTX 2060, with the adapter persisted to a node-local local-path PVC.

Faithful to the existing Backend interface and func xCommand(cfg) *cli.Command pattern.

obol node token prints the k3s agent-join one-liner (K3S_URL + K3S_TOKEN,
version-pinned to the server); obol node list shows nodes with their
accelerator label. Both guard against the k3d backend — a Docker-Desktop
master's flannel overlay isn't routable off-host, so remote node joins
only work on a native k3s master.

Also templates {{NODE_IP}}/{{NODE_HOSTNAME}} into k3s-config.yaml's
tls-san (via backend_k3s Init) so workers can join the server by LAN IP
or hostname; k3s already auto-adds the primary node IP, so this mainly
buys an explicit hostname SAN + a deterministic cert.

Validated live on two LAN boxes: a native k3s master scheduling
cross-node pods (CoreDNS + flannel) and a 4-bit QLoRA fine-tune onto a
remote RTX 2060, adapter persisted to a node-local local-path PVC.

@LuuOW LuuOW left a comment

Technical audit: code implementation verified for system consistency.
