
perf(durable-streams-rust): serve live-tail SSE from an epoll reactor (flat per-connection memory)#4662

Merged
balegas merged 3 commits into sse-fanout-per-subscriber-memory from sse-reactor-flat-userspace on Jun 30, 2026

@balegas balegas commented Jun 29, 2026

Contributor

Why

Durable Streams has to hold connections to millions of users across millions of streams. The cost that matters is therefore the memory held per connection while a subscriber sits idle waiting for the next append. That per-connection cost must stay flat as the number of active connections (and the number of streams) grows, and the number of runtime tasks must stay constant.

Before this change every live SSE subscriber was a parked async connection task. Even while idle, each one pins a full connection-state future (sized to the largest request handler) plus its read buffer for the lifetime of the stream. At fan-out / high-connection scale that parked future is the dominant resident cost, and it grows linearly with the number of connected subscribers — exactly the axis we need to keep flat.

Approach — an epoll reactor

Serve live-tail SSE from a fixed pool of N = available_parallelism() reactor threads, each owning one epoll instance + an eventfd + a generation-tracked slab of subscribers. A connection task that produces a live-tail SSE response hands its socket (and its connection-limiter permit) to a reactor and returns — freeing the task future entirely. A subscriber then costs only:

  • a compact slab entry (~tens of bytes), and
  • the kernel socket.
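To make the "generation-tracked slab" idea concrete, here is a minimal sketch under stated assumptions. The names `Slab`, `Slot`, and `Key` are hypothetical and are not taken from this PR's code. The sketch only shows the core mechanism: each slot carries a generation counter that is bumped on removal, so a handle issued to an earlier occupant of a reused slot no longer resolves.

```rust
/// Handle returned to callers: slot index plus the generation it was issued under.
/// (Hypothetical type, for illustration only.)
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Key {
    index: u32,
    generation: u32,
}

struct Slot<T> {
    generation: u32,
    value: Option<T>,
}

/// Minimal generation-tracked slab; a sketch, not the PR's actual type.
pub struct Slab<T> {
    slots: Vec<Slot<T>>,
    free: Vec<u32>,
}

impl<T> Slab<T> {
    pub fn new() -> Self {
        Slab { slots: Vec::new(), free: Vec::new() }
    }

    /// Store a value, reusing a freed slot when one is available.
    pub fn insert(&mut self, value: T) -> Key {
        if let Some(index) = self.free.pop() {
            let slot = &mut self.slots[index as usize];
            slot.value = Some(value);
            Key { index, generation: slot.generation }
        } else {
            let index = self.slots.len() as u32;
            self.slots.push(Slot { generation: 0, value: Some(value) });
            Key { index, generation: 0 }
        }
    }

    /// Look up a value; a key from an earlier occupant of the slot yields None.
    pub fn get(&self, key: Key) -> Option<&T> {
        let slot = self.slots.get(key.index as usize)?;
        if slot.generation != key.generation {
            return None;
        }
        slot.value.as_ref()
    }

    /// Remove a value and bump the generation so outstanding keys go stale.
    pub fn remove(&mut self, key: Key) -> Option<T> {
        let slot = self.slots.get_mut(key.index as usize)?;
        if slot.generation != key.generation {
            return None;
        }
        let value = slot.value.take()?;
        slot.generation = slot.generation.wrapping_add(1);
        self.free.push(key.index);
        Some(value)
    }
}
```

One plausible use, stated as an assumption rather than a description of this PR: a reactor could pack the index and generation into the 64-bit user data of each epoll registration, so a late event for a closed subscriber fails the generation check instead of touching whoever reused the slot.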

Resulting memory model:

  • tasks = O(cores) — constant, independent of streams and connections.
  • memory = O(streams)·per-stream + O(connections)·slab-entry, with the two axes decoupled: idle streams cost nothing extra (no reactor thread is even spawned until the first SSE subscriber registers), and a connection never carries per-stream-sized state.

Append → wakeup routing stays O(subscribers of that stream): publish_durable_tail walks only the stream's own subscriber list and signals the relevant shard eventfds — no global scans, and streams with no subscribers carry no list at all.
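The routing property described above can be sketched as follows. This is an illustration under assumptions, not the body of `publish_durable_tail`: the `Routing`, `SubscriberRef`, `StreamId`, and `ShardId` names are hypothetical, and the sketch returns the set of shards to wake instead of writing to real eventfds.

```rust
use std::collections::{BTreeSet, HashMap};

/// Hypothetical identifiers; the PR does not show its real types.
type StreamId = u64;
type ShardId = usize;

/// Where a subscriber lives: which reactor shard, and which slab slot there.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct SubscriberRef {
    pub shard: ShardId,
    pub slot: u32,
}

#[derive(Default)]
pub struct Routing {
    /// Only streams with at least one live subscriber have an entry.
    by_stream: HashMap<StreamId, Vec<SubscriberRef>>,
}

impl Routing {
    pub fn subscribe(&mut self, stream: StreamId, sub: SubscriberRef) {
        self.by_stream.entry(stream).or_default().push(sub);
    }

    /// Drop a subscriber; the stream's list is removed once it becomes empty.
    pub fn unsubscribe(&mut self, stream: StreamId, sub: SubscriberRef) {
        let now_empty = match self.by_stream.get_mut(&stream) {
            Some(list) => {
                list.retain(|s| *s != sub);
                list.is_empty()
            }
            None => false,
        };
        if now_empty {
            self.by_stream.remove(&stream);
        }
    }

    /// Shards to wake for an append to `stream`. Walks only that stream's
    /// list, so the cost is O(subscribers of that stream). A real reactor
    /// would signal each returned shard (for example via its eventfd).
    pub fn shards_to_wake(&self, stream: StreamId) -> BTreeSet<ShardId> {
        match self.by_stream.get(&stream) {
            Some(list) => list.iter().map(|s| s.shard).collect(),
            None => BTreeSet::new(),
        }
    }

    /// Number of streams that currently carry a subscriber list.
    pub fn tracked_streams(&self) -> usize {
        self.by_stream.len()
    }
}
```

Collecting into a set means a shard that hosts many subscribers of the same stream is signalled once per append, not once per subscriber.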

Scope & safety

  • Linux only (epoll). Non-Linux builds keep the existing inline hand-off path unchanged.
  • Only the live-tail case runs on the reactor (root stream, tiering off, start at/after the live file base). Cold catch-up / fork / tiered reads stay on the proven inline path.
  • Byte-identical SSE framing, shared with the inline path, so the wire output is exactly what the conformance suite already validates.
  • Correctness: level-triggered EPOLLOUT armed only while backpressured; EAGAIN/partial-write handling; slab generation guard against ABA on reused slots; range reads taken under one consistent (file, file_base) snapshot so compaction can't tear them; the connection permit travels with the subscriber, so the connection stays counted and graceful drain still works; 15s keepalive + 60s lifetime cap match the inline path.
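The write-side rules in the correctness bullet (partial writes, EAGAIN, and arming EPOLLOUT only while backpressured) can be sketched against the standard `std::io::Write` trait. The `Flush` enum and `flush_pending` function are hypothetical names for illustration. The sketch maps EAGAIN to `io::ErrorKind::WouldBlock` and EINTR to `io::ErrorKind::Interrupted`, and leaves the actual epoll re-arming to the caller.

```rust
use std::io::{self, Write};

/// What the caller should do after a flush attempt (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum Flush {
    /// Everything pending was written; EPOLLOUT interest can be dropped.
    Drained,
    /// The socket would block; keep the remainder and arm EPOLLOUT.
    Backpressured,
    /// write() returned 0; treat the peer as gone and close the subscriber.
    Closed,
}

/// Write as much of `pending` as the socket accepts, keeping unwritten bytes.
pub fn flush_pending<W: Write>(sock: &mut W, pending: &mut Vec<u8>) -> io::Result<Flush> {
    let mut written = 0;
    let outcome = loop {
        if written == pending.len() {
            break Ok(Flush::Drained);
        }
        match sock.write(&pending[written..]) {
            // Zero bytes accepted for a non-empty buffer: close instead of
            // consulting an error code that was never set.
            Ok(0) => break Ok(Flush::Closed),
            // Partial or full write: advance and try the rest.
            Ok(n) => written += n,
            // Interrupted by a signal: retry immediately.
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            // Kernel buffer full: stop and wait for writability.
            Err(e) if e.kind() == io::ErrorKind::WouldBlock => break Ok(Flush::Backpressured),
            Err(e) => break Err(e),
        }
    };
    // Discard only the bytes that were actually accepted.
    pending.drain(..written);
    outcome
}
```

With level-triggered epoll, leaving EPOLLOUT armed on a socket that has nothing queued would wake the reactor continuously, which is why the sketch distinguishes `Drained` from `Backpressured`.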

Results (local)

  • Per-subscriber resident memory: ~7.3 → ~0.64 KiB/sub (~11×) — controlled cgroup harness, server-only RSS, 0→1000 subscribers, identical build/config. 1000 live subscribers now add ~0.6 MB total instead of ~7 MB; the curve is essentially flat.
  • Conformance: clean. The full SSE suite passes; the only failures are 3 pre-existing long-poll timing flakes that fail identically on the base branch.

Validation on GKE

Confirmed on a real cluster (c4d-standard-16-lssd server, 4-CPU limit; ds-bench SSE fan-out, 1 stream, subscriber sweep) — modified vs the prior server, pod working-set memory / delivery p99 at 1000 subscribers:

| config | pod mem peak/p50, old → new (MB) | delivery p99, old → new (ms) |
| --- | --- | --- |
| wal (cache off) | 27/23 → 22/18 | 5.48 → 5.17 |
| wal (cache on) | 26/21 → 15/14 | 5.21 → 4.20 |

So on real hardware the reactor cuts p50 SSE fan-out pod memory by ~22% (cache off) / ~33% (cache on) at 1000 subscribers, with equal-or-better delivery latency and unchanged throughput (~75–80k ev/s), matching the local per-subscriber slope above. Conformance suite green.

Design doc: docs/superpowers/specs/2026-06-29-sse-reactor-flat-userspace-design.md.

Stacked on #4661 (the inline Body::Sse hand-off). Base is set to that branch so this diff is reactor-only; review/merge #4661 first.

🤖 Generated with Claude Code

balegas and others added 2 commits June 30, 2026 00:08
Hand each live-tail SSE subscriber from its connection task to a fixed pool
of N=available_parallelism() epoll reactor threads, each owning a
generation-tracked slab. Per-subscriber resident memory collapses from a
parked connection-task future to a compact slab entry, so it stops scaling
with the number of active connections. Linux only; non-Linux keeps the
existing inline hand-off path, and cold catch-up stays on it too.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01A8Pz3PafV7mTwWmwv545Rh

Close subscriber sockets still queued in the intake at shutdown (and reject
registrations after shutdown begins), so neither the fd nor the
connection-limiter permit leaks — a held permit made drain() wait out its
full grace period. Also handle write()==0 by closing the peer instead of
reading a stale errno (which risked a spurious EAGAIN re-arm / EINTR spin).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01A8Pz3PafV7mTwWmwv545Rh
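The shutdown fix in the commit above can be illustrated with a small sketch. It assumes a hypothetical `Intake` queue between connection tasks and a reactor thread, and it assumes that dropping a queued subscriber is what closes its socket and releases its connection-limiter permit. Neither name nor structure is taken from the PR's code.

```rust
use std::collections::VecDeque;
use std::sync::Mutex;

/// Hypothetical intake queue between connection tasks and one reactor thread.
pub struct Intake<S> {
    state: Mutex<State<S>>,
}

struct State<S> {
    queue: VecDeque<S>,
    shutting_down: bool,
}

impl<S> Intake<S> {
    pub fn new() -> Self {
        Intake {
            state: Mutex::new(State { queue: VecDeque::new(), shutting_down: false }),
        }
    }

    /// Queue a subscriber, or hand it back once shutdown has begun so the
    /// caller can drop it (and with it the socket and permit).
    pub fn register(&self, sub: S) -> Result<(), S> {
        let mut st = self.state.lock().unwrap();
        if st.shutting_down {
            return Err(sub);
        }
        st.queue.push_back(sub);
        Ok(())
    }

    /// Begin shutdown and drop everything still queued; returns how many
    /// subscribers were dropped. Dropping `S` is assumed to close its socket
    /// and release its permit.
    pub fn shutdown(&self) -> usize {
        let drained: Vec<S> = {
            let mut st = self.state.lock().unwrap();
            st.shutting_down = true;
            st.queue.drain(..).collect()
        };
        // `drained` is dropped when this function returns, outside the lock.
        drained.len()
    }
}
```

Setting the flag and draining the queue under the same lock closes the window in which a registration could slip in after the drain and hold its permit until the grace period expires.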
@balegas balegas force-pushed the sse-reactor-flat-userspace branch from 98dd213 to 3754d64 Compare June 29, 2026 23:09
…-06-30 report

Update the Benchmarks section to the current reactor build: write peak
860k → ~928k append/s, add the SSE live-tail reactor results (p99 ~0.5–2.5 ms
across 64–2048 connections, ~27 MiB shared fan-out for 1000 subscribers), and
point to results-2026-06-30/REPORT.md in ds-bench for the full matrix.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01A8Pz3PafV7mTwWmwv545Rh
@balegas balegas merged commit 4ca377a into sse-fanout-per-subscriber-memory Jun 30, 2026
14 checks passed
@balegas balegas deleted the sse-reactor-flat-userspace branch June 30, 2026 09:28