Skip to content

Fix tc class/filter leaks on the bridge and reconcile them periodically#289

Open
hiroTamada wants to merge 2 commits into
mainfrom
hypeship/fix-tc-class-leak
Open

Fix tc class/filter leaks on the bridge and reconcile them periodically#289
hiroTamada wants to merge 2 commits into
mainfrom
hypeship/fix-tc-class-leak

Conversation

@hiroTamada

@hiroTamada hiroTamada commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

Summary

Per-VM HTB classes and basic filters leak on the bridge on busy hosts. Since the bridge filter list is evaluated linearly for every VM-egress packet, accumulated leaks degrade network throughput for every VM on the host (observed: ~1,880 stale class/filter pairs vs ~6 live VMs on one production host).

Leak mechanisms fixed

  1. Wrong-ID deletes. Cleanup paths that have no persisted class ID (periodic TAP reaper, orphaned-TAP cleanup, recreate-existing-TAP, TAP-create failure) fell back to the hash-derived class ID. Whenever the original tc class add had been collision-probed to hash+N, these paths deleted the wrong ID and leaked the real class and its filter — a feedback loop, since more stale classes cause more probing. removeVMClass now resolves the class from the bridge filter table via the TAP's ifindex (meta(rt_iif eq N) is the authoritative tap→class link). When neither a persisted ID nor a filter resolves, it deletes nothing and defers to the reconciler instead of guessing.

  2. Silent delete failures. removeVMClass ran every tc delete best-effort with no logging. Failures are now logged and counted via hypeman_network_tc_cleanup_failures_total{operation}.

  3. Reconcile never ran and missed filters. CleanupOrphanedClasses only ran at process startup and iterated classes only, so dangling filters (class deleted, filter left, or TAP gone) were invisible to it. It now:

    • runs every TAP GC tick (60s), not just at startup
    • holds the tc mutex so it can't observe a half-applied rate limit
    • sweeps filters not anchored to a live TAP ifindex, then classes not referenced by a live TAP's filter/derived ID or a stored allocation classid
    • bails out if filters exist but no rt_iif ematch parses, so an unexpected tc output format can't cause a sweep of live state
    • counts removals via hypeman_network_tc_orphans_swept_total{kind}
  4. Double-release noise. The persisted classid file is cleared after an instance-scoped release, so the routine stop→delete double release no longer re-deletes the class.

Out-of-band recovery

scripts/sweep-stale-tc.sh applies the same sweep logic standalone (dry-run by default, --apply to delete) to recover already-polluted hosts immediately, without waiting for a deploy/restart. Candidates are snapshotted before keep-sets so anything allocated mid-run is never considered stale.

Test plan

  • go build / go vet on lib/network and lib/instances
  • go test ./lib/network/ (includes new parseBridgeFilters unit tests)
  • bash -n + dry-run of scripts/sweep-stale-tc.sh
  • Full make test-linux integration suite — not run locally (requires KVM + HTB kernel modules not available in my environment); needs CI or a metal host
  • Validate on a polluted staging/prod host: run the sweep script dry-run, compare counts against tc class show dev vmbr0 | wc -l, then --apply and confirm hypeman_network_tc_class_collisions_total and softirq CPU drop

🤖 Generated with Claude Code


Note

Medium Risk
Changes Linux bridge tc cleanup and periodic sweeps that delete kernel state; safeguards exist but mis-parsing or races could still affect live VMs’ rate limits.

Overview
Fixes leaked bridge HTB classes and tc filters that slow every VM’s egress by growing the per-packet filter walk.

Release and delete paths now resolve the real class via bridge filters (meta(rt_iif eq …)), avoid deleting another VM’s class when collision-probed IDs differ from the hash, log failed tc deletes, and clear persisted classid after instance release so stop→delete does not repeat cleanup.

Periodic reconciliation: the 60s TAP GC tick also runs CleanupOrphanedClasses (under tcMu), sweeping dangling filters then orphan classes using live TAP ifindexes, allocations, and parse safety checks. New metrics track cleanup failures and swept orphans; sweep-stale-tc.sh offers the same logic for one-off host recovery (dry-run by default).

Reviewed by Cursor Bugbot for commit 4ea5f81. Bugbot is set up for automated code reviews on this repo. Configure here.

Per-VM HTB classes and basic filters leaked on busy hosts, and the
filter list is walked linearly for every VM-egress packet, degrading
all VM network throughput as leaks accumulate.

Root causes fixed:
- Cleanup paths with no persisted class ID fell back to the hash-derived
  ID, deleting the wrong class/filter whenever the original add had been
  collision-probed to a different ID. The class is now resolved from the
  bridge filter table via the TAP's ifindex (the authoritative tap->class
  link); when nothing resolves, cleanup defers to the reconciler instead
  of guessing.
- removeVMClass swallowed every delete error; failures are now logged
  and counted (hypeman_network_tc_cleanup_failures_total).
- CleanupOrphanedClasses only ran at process startup and never swept
  dangling filters. It now runs every TAP GC tick under the tc mutex,
  sweeps filters not anchored to a live TAP, and counts removals
  (hypeman_network_tc_orphans_swept_total). It bails if no rt_iif
  matches parse, so an unexpected tc output format can't trigger a
  sweep of live state.
- The persisted classid is cleared after instance release so the
  routine stop-then-delete double release doesn't re-delete the class.

scripts/sweep-stale-tc.sh applies the same sweep out-of-band (dry-run
by default) to recover already-polluted hosts without a restart.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@hiroTamada hiroTamada marked this pull request as ready for review June 12, 2026 15:33
Comment thread lib/network/bridge_linux.go
@firetiger-agent

Copy link
Copy Markdown

Created a monitoring plan for this PR.

What this PR does: Stops VM network throughput from degrading on busy hosts by fixing stale tc class/filter leaks on the hypeman bridge and running periodic reconciliation every 60s.

Intended effect:

  • hypeman_network_tc_orphans_swept_total: new metric, baseline 0 pre-deploy; confirmed working if it increments on previously-polluted hosts then trends toward 0 as the backlog drains
  • hypeman_network_tc_cleanup_failures_total: new metric, baseline 0 pre-deploy; confirmed healthy if it stays near 0 at steady state (non-zero means a tc delete failed and is now visible for the first time)
  • Hypeman spawn failure rate (kernel_invocation_spawn_total{backend_type="hypeman", success=false}): baseline 9–180/hr; confirmed if it stays in this range post-deploy

Risks:

  • Lock contention on polluted hosts — reconciler holds tcMu while issuing hundreds of tc deletes on hosts with ~1,880 stale entries; signal: kernel_invocation_spawn_total{success=false} spike above 500/hr in the 1–2 min after first GC tick post-deploy; alert if sustained >2h
  • Reconciler disarmed by filter parse mismatch — if tc filter show output format doesn't match expected format, the reconciler bails every tick; signal: hypeman_network_tc_cleanup_failures_total{operation="filter_parse"} incrementing, or WARN log "no rt_iif matches parsed from tc filter output"; alert if non-zero
  • Live class incorrectly swept — if ListAllocations returns stale data, a running VM's class could be deleted, halting its egress; signal: hypeman_network_tc_cleanup_failures_total{operation="filter_list"} non-zero or kernel_invocation_spawn_total{success=false} elevation; alert if >500 failed spawns/hr post-deploy

Status updates will be posted automatically on this PR as monitoring progresses.

View monitor

A persisted classID can go stale (saveClassID failing after a
collision-probed add), in which case deleting by it removes the wrong
class — or worse, one belonging to another live VM if the stale ID
collides. Resolve the class from the TAP's own filter when one exists,
and skip flowid-matched deletes whose anchor is a live interface.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 4ea5f81. Configure here.

"bridge", bridgeName, "tap", tapName, "class", fullClassID,
"error", err, "output", strings.TrimSpace(string(output)))
m.recordTCCleanupFailure(ctx, "class_del")
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filter list failure skips safety

Medium Severity

When listBridgeFilters fails in removeVMClass, cleanup still proceeds to delete the HTB class using only the persisted classID. That bypasses the new filter-table checks (including classClaimedByOther) meant to avoid removing another VM’s class after collision probing, reintroducing the wrong-ID delete this change is meant to fix.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 4ea5f81. Configure here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant