perf(durable-streams-rust): serve live-tail SSE from an epoll reactor (flat per-connection memory)#4662
Merged
balegas merged 3 commits intoJun 30, 2026
Conversation
Hand each live-tail SSE subscriber from its connection task to a fixed pool of N=available_parallelism() epoll reactor threads, each owning a generation-tracked slab. Per-subscriber resident memory collapses from a parked connection-task future to a compact slab entry, so it stops scaling with the number of active connections. Linux only; non-Linux keeps the existing inline hand-off path, and cold catch-up stays on it too. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01A8Pz3PafV7mTwWmwv545Rh
Close subscriber sockets still queued in the intake at shutdown (and reject registrations after shutdown begins), so neither the fd nor the connection-limiter permit leaks — a held permit made drain() wait out its full grace period. Also handle write()==0 by closing the peer instead of reading a stale errno (which risked a spurious EAGAIN re-arm / EINTR spin). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01A8Pz3PafV7mTwWmwv545Rh
98dd213 to
3754d64
Compare
msfstef
approved these changes
Jun 30, 2026
…-06-30 report Update the Benchmarks section to the current reactor build: write peak 860k → ~928k append/s, add the SSE live-tail reactor results (p99 ~0.5–2.5 ms across 64–2048 connections, ~27 MiB shared fan-out for 1000 subscribers), and point to results-2026-06-30/REPORT.md in ds-bench for the full matrix. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01A8Pz3PafV7mTwWmwv545Rh
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Why
Durable Streams has to hold connections to millions of users across millions of streams. The cost that matters is therefore the per-connection memory held while a subscriber sits idle waiting for the next append — and it must be decoupled from the number of active connections (and from the number of streams), with a constant number of runtime tasks.
Before this change every live SSE subscriber was a parked async connection task. Even while idle, each one pins a full connection-state future (sized to the largest request handler) plus its read buffer for the lifetime of the stream. At fan-out / high-connection scale that parked future is the dominant resident cost, and it grows linearly with the number of connected subscribers — exactly the axis we need to keep flat.
Approach — an epoll reactor
Serve live-tail SSE from a fixed pool of N =
available_parallelism()reactor threads, each owning one epoll instance + an eventfd + a generation-tracked slab of subscribers. A connection task that produces a live-tail SSE response hands its socket (and its connection-limiter permit) to a reactor and returns — freeing the task future entirely. A subscriber then costs only:Resulting memory model:
Append → wakeup routing stays O(subscribers of that stream):
publish_durable_tailwalks only the stream's own subscriber list and signals the relevant shard eventfds — no global scans, and streams with no subscribers carry no list at all.Scope & safety
EPOLLOUTarmed only while backpressured;EAGAIN/partial-write handling; slab generation guard against ABA on reused slots; range reads taken under one consistent(file, file_base)snapshot so compaction can't tear them; the connection permit travels with the subscriber, so the connection stays counted and graceful drain still works; 15s keepalive + 60s lifetime cap match the inline path.Results (local)
Validation on GKE
Confirmed on a real cluster (
c4d-standard-16-lssdserver, 4-CPU limit; ds-bench SSE fan-out, 1 stream, subscriber sweep) — modified vs the prior server, pod working-set memory / delivery p99 at 1000 subscribers:So on real hardware the reactor cuts SSE fan-out pod memory by ~22% (cache off) / ~33% (cache on) at 1000 subscribers, with equal-or-better delivery latency and unchanged throughput (~75–80k ev/s) — matching the local per-subscriber slope above. Conformance suite green.
Design doc:
docs/superpowers/specs/2026-06-29-sse-reactor-flat-userspace-design.md.🤖 Generated with Claude Code