perf(server-rust): cut SSE fan-out per-subscriber memory ~60%#4661
Merged
Conversation
Each live SSE subscriber used to spawn a producer task, allocate an mpsc channel, and keep the whole connection state machine resident while parked (~14 KiB/sub). Now: - SSE is produced inline via a new pull-based `Body::Sse` / `EventSource` (no per-subscriber producer task, no channel), and - the connection is handed off to a small dedicated streaming task so the large `conn_loop` future is freed while a subscriber is parked. An idle subscriber's resident footprint collapses to roughly a cursor over the shared stream tail. Measured (isolated Linux/cgroup, wal): ~13.6 -> ~5 KiB/sub (~60%); at 2000 subs ~30 -> ~13.5 MiB, on a 3 MiB idle baseline. Verified: 87 unit tests, conformance 326/326 across wal/cache/memory modes, single Connection header on the wire, keep-alive reuse intact. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… (flat per-connection memory) (#4662) ## Why Durable Streams has to hold connections to **millions of users across millions of streams**. The cost that matters is therefore the *per-connection* memory held while a subscriber sits idle waiting for the next append — and it must be **decoupled from the number of active connections** (and from the number of streams), with a **constant** number of runtime tasks. Before this change every live SSE subscriber was a parked async connection task. Even while idle, each one pins a full connection-state future (sized to the largest request handler) plus its read buffer for the lifetime of the stream. At fan-out / high-connection scale that parked future is the dominant resident cost, and it grows linearly with the number of connected subscribers — exactly the axis we need to keep flat. ## Approach — an epoll reactor Serve live-tail SSE from a fixed pool of **N = `available_parallelism()` reactor threads**, each owning one epoll instance + an eventfd + a generation-tracked slab of subscribers. A connection task that produces a live-tail SSE response hands its socket (and its connection-limiter permit) to a reactor and returns — freeing the task future entirely. A subscriber then costs only: - a compact slab entry (~tens of bytes), and - the kernel socket. Resulting memory model: - **tasks = O(cores)** — constant, independent of streams and connections. - **memory = O(streams)·per-stream + O(connections)·slab-entry**, with the two axes **decoupled**: idle streams cost nothing extra (no reactor thread is even spawned until the first SSE subscriber registers), and a connection never carries per-stream-sized state. Append → wakeup routing stays **O(subscribers of that stream)**: `publish_durable_tail` walks only the stream's own subscriber list and signals the relevant shard eventfds — no global scans, and streams with no subscribers carry no list at all. ## Scope & safety - **Linux only** (epoll). Non-Linux builds keep the existing inline hand-off path unchanged. - Only the **live-tail** case runs on the reactor (root stream, tiering off, start at/after the live file base). Cold catch-up / fork / tiered reads stay on the proven inline path. - **Byte-identical SSE framing**, shared with the inline path, so the wire output is exactly what the conformance suite already validates. - Correctness: level-triggered `EPOLLOUT` armed only while backpressured; `EAGAIN`/partial-write handling; slab generation guard against ABA on reused slots; range reads taken under one consistent `(file, file_base)` snapshot so compaction can't tear them; the connection permit travels with the subscriber, so the connection stays counted and graceful drain still works; 15s keepalive + 60s lifetime cap match the inline path. ## Results (local) - **Per-subscriber resident memory: ~7.3 → ~0.64 KiB/sub (~11×)** — controlled cgroup harness, server-only RSS, 0→1000 subscribers, identical build/config. 1000 live subscribers now add ~0.6 MB total instead of ~7 MB; the curve is essentially flat. - **Conformance: clean** — the full SSE suite passes. The only failures are 3 pre-existing long-poll timing flakes that the base branch fails identically. ## Validation on GKE Confirmed on a real cluster (`c4d-standard-16-lssd` server, 4-CPU limit; ds-bench SSE fan-out, 1 stream, subscriber sweep) — modified vs the prior server, pod working-set memory / delivery p99 at **1000 subscribers**: | config | pod mem peak/p50 (old → new) | p99 (old → new) | |---|---|---| | wal (cache off) | 27/23 → **22/18 MB** | 5.48 → **5.17 ms** | | wal (cache on) | 26/21 → **15/14 MB** | 5.21 → **4.20 ms** | So on real hardware the reactor cuts SSE fan-out pod memory by **~22% (cache off)** / **~33% (cache on)** at 1000 subscribers, with equal-or-better delivery latency and unchanged throughput (~75–80k ev/s) — matching the local per-subscriber slope above. Conformance suite green. Design doc: `docs/superpowers/specs/2026-06-29-sse-reactor-flat-userspace-design.md`. > Stacked on #4661 (the inline `Body::Sse` hand-off). Base is set to that branch so this diff is reactor-only; review/merge #4661 first. 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #4661 +/- ##
==========================================
- Coverage 60.03% 60.00% -0.03%
==========================================
Files 395 395
Lines 43747 43747
Branches 12579 12581 +2
==========================================
- Hits 26262 26249 -13
- Misses 17407 17420 +13
Partials 78 78
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
msfstef
approved these changes
Jun 30, 2026
✅ Deploy Preview for electric-next ready!
To edit notification comments on pull requests, go to your Netlify project configuration. |
✅ Deploy Preview for electric-next ready!
To edit notification comments on pull requests, go to your Netlify project configuration. |
The epoll SSE reactor (and the inline fallback) unilaterally closes the
socket when a live SSE stream ends — on stream closure or at
SSE_MAX_DURATION. The response nevertheless advertised
`Connection: keep-alive`, so a pooling HTTP client (undici, used by the
conformance suite) would return the socket to its pool and pipeline the
next request onto it. The server's `close()` then saw those unread
request bytes in the receive buffer and sent a RST instead of a FIN,
discarding the still-in-flight SSE response (the final data + close
frames) that were already queued in the kernel send buffer. The client
surfaced this as `UND_ERR_SOCKET` ("other side closed").
This raced in the conformance suite's stream-closure SSE tests
(sse-live-reader-receives-final-append-on-close, sse-closed-stream-no-cursor)
and, via the same connection-reuse corruption, the long-poll
"wait for new data" test — all intermittently, ~1 run in 19.
SSE responses are single-use here, so advertise `Connection: close`. The
client no longer reuses the socket, so the server's close is a clean FIN
that delivers the full response. Verified with 100 consecutive clean
full-suite conformance runs (50 memory, 25 wal, 25 wal-read-offload) on
Linux, where it previously failed ~1 in 19.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
The Rust durable-streams server added linear memory per SSE subscriber in the fan-out scenario (1 stream, 1 writer, many subscribers), while the ursula reference keeps it flat — see ds-bench results. This trims the per-subscriber footprint by ~60%.
Root cause
With the socket-buffer cap already in place, the residual growth was userspace heap per connection. Each live SSE subscriber:
conn_loopstate machine future resident while parked (up toSSE_MAX_DURATION),≈ 14 KiB/subscriber.
Fix
Body::Sse/EventSource— no per-subscriber producer task, no channel. All caught-up subscribers still share the one resident tail chunk.conn_loopfuture is freed while a subscriber is parked. The connection permit moves with it (stays counted); the socket closes when the stream ends (the client reconnects from its last offset).An idle subscriber's resident state collapses to roughly a cursor over the shared stream tail.
Results
Isolated server-process RSS on Linux/cgroup (the ds-bench methodology), wal mode:
This keeps our low idle baseline (vs ursula's ~15 MiB mimalloc floor) while bringing the slope into the same regime, so we're comparable at scale instead of worse.
Allocators evaluated (and rejected)
mimalloc and jemalloc were both measured — both are worse for our allocation shape (mimalloc: ~6× higher absolute, ~3× worse slope), because our per-connection state is kilobyte-sized rather than cursor-sized. glibc stays the default.
Verification
cargo test --release: 87 passedConnection: keep-aliveheader on the wire; keep-alive reuse intact🤖 Generated with Claude Code