Skip to content

perf(server-rust): cut SSE fan-out per-subscriber memory ~60%#4661

Merged
balegas merged 5 commits into
mainfrom
sse-fanout-per-subscriber-memory
Jun 30, 2026
Merged

perf(server-rust): cut SSE fan-out per-subscriber memory ~60%#4661
balegas merged 5 commits into
mainfrom
sse-fanout-per-subscriber-memory

Conversation

@balegas

@balegas balegas commented Jun 29, 2026

Copy link
Copy Markdown
Contributor

What

The Rust durable-streams server added linear memory per SSE subscriber in the fan-out scenario (1 stream, 1 writer, many subscribers), while the ursula reference keeps it flat — see ds-bench results. This trims the per-subscriber footprint by ~60%.

Root cause

With the socket-buffer cap already in place, the residual growth was userspace heap per connection. Each live SSE subscriber:

  • spawned a producer task + allocated an mpsc channel, and
  • kept the whole conn_loop state machine future resident while parked (up to SSE_MAX_DURATION),
  • plus a pinned 4 KiB read buffer.

14 KiB/subscriber.

Fix

  • Inline SSE production via a new pull-based Body::Sse / EventSource — no per-subscriber producer task, no channel. All caught-up subscribers still share the one resident tail chunk.
  • Hand off to a dedicated streaming task when a request is SSE, so the large conn_loop future is freed while a subscriber is parked. The connection permit moves with it (stays counted); the socket closes when the stream ends (the client reconnects from its last offset).

An idle subscriber's resident state collapses to roughly a cursor over the shared stream tail.

Results

Isolated server-process RSS on Linux/cgroup (the ds-bench methodology), wal mode:

per-subscriber slope @ 2000 subs idle baseline
before ~13.6 KiB/sub ~30 MiB 3 MiB
after ~5 KiB/sub ~13.5 MiB 3 MiB

This keeps our low idle baseline (vs ursula's ~15 MiB mimalloc floor) while bringing the slope into the same regime, so we're comparable at scale instead of worse.

Note: the official ds-bench SSE suite (scripts/run-sse.sh) is GKE-only; the numbers above are from a faithful local Docker reproduction using the same cgroup working-set metric. Re-running the real suite on GKE is recommended to confirm the slope drop at scale.

Allocators evaluated (and rejected)

mimalloc and jemalloc were both measured — both are worse for our allocation shape (mimalloc: ~6× higher absolute, ~3× worse slope), because our per-connection state is kilobyte-sized rather than cursor-sized. glibc stays the default.

Verification

  • cargo test --release: 87 passed
  • Conformance suite: 326 passed across wal / tail-cache-on / memory (Linux) modes
  • Single correct Connection: keep-alive header on the wire; keep-alive reuse intact

🤖 Generated with Claude Code

Each live SSE subscriber used to spawn a producer task, allocate an mpsc
channel, and keep the whole connection state machine resident while parked
(~14 KiB/sub). Now:

- SSE is produced inline via a new pull-based `Body::Sse` / `EventSource`
  (no per-subscriber producer task, no channel), and
- the connection is handed off to a small dedicated streaming task so the
  large `conn_loop` future is freed while a subscriber is parked.

An idle subscriber's resident footprint collapses to roughly a cursor over
the shared stream tail. Measured (isolated Linux/cgroup, wal): ~13.6 -> ~5
KiB/sub (~60%); at 2000 subs ~30 -> ~13.5 MiB, on a 3 MiB idle baseline.
Verified: 87 unit tests, conformance 326/326 across wal/cache/memory modes,
single Connection header on the wire, keep-alive reuse intact.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… (flat per-connection memory) (#4662)

## Why

Durable Streams has to hold connections to **millions of users across
millions of streams**. The cost that matters is therefore the
*per-connection* memory held while a subscriber sits idle waiting for
the next append — and it must be **decoupled from the number of active
connections** (and from the number of streams), with a **constant**
number of runtime tasks.

Before this change every live SSE subscriber was a parked async
connection task. Even while idle, each one pins a full connection-state
future (sized to the largest request handler) plus its read buffer for
the lifetime of the stream. At fan-out / high-connection scale that
parked future is the dominant resident cost, and it grows linearly with
the number of connected subscribers — exactly the axis we need to keep
flat.

## Approach — an epoll reactor

Serve live-tail SSE from a fixed pool of **N = `available_parallelism()`
reactor threads**, each owning one epoll instance + an eventfd + a
generation-tracked slab of subscribers. A connection task that produces
a live-tail SSE response hands its socket (and its connection-limiter
permit) to a reactor and returns — freeing the task future entirely. A
subscriber then costs only:

- a compact slab entry (~tens of bytes), and
- the kernel socket.

Resulting memory model:

- **tasks = O(cores)** — constant, independent of streams and
connections.
- **memory = O(streams)·per-stream + O(connections)·slab-entry**, with
the two axes **decoupled**: idle streams cost nothing extra (no reactor
thread is even spawned until the first SSE subscriber registers), and a
connection never carries per-stream-sized state.

Append → wakeup routing stays **O(subscribers of that stream)**:
`publish_durable_tail` walks only the stream's own subscriber list and
signals the relevant shard eventfds — no global scans, and streams with
no subscribers carry no list at all.

## Scope & safety

- **Linux only** (epoll). Non-Linux builds keep the existing inline
hand-off path unchanged.
- Only the **live-tail** case runs on the reactor (root stream, tiering
off, start at/after the live file base). Cold catch-up / fork / tiered
reads stay on the proven inline path.
- **Byte-identical SSE framing**, shared with the inline path, so the
wire output is exactly what the conformance suite already validates.
- Correctness: level-triggered `EPOLLOUT` armed only while
backpressured; `EAGAIN`/partial-write handling; slab generation guard
against ABA on reused slots; range reads taken under one consistent
`(file, file_base)` snapshot so compaction can't tear them; the
connection permit travels with the subscriber, so the connection stays
counted and graceful drain still works; 15s keepalive + 60s lifetime cap
match the inline path.

## Results (local)

- **Per-subscriber resident memory: ~7.3 → ~0.64 KiB/sub (~11×)** —
controlled cgroup harness, server-only RSS, 0→1000 subscribers,
identical build/config. 1000 live subscribers now add ~0.6 MB total
instead of ~7 MB; the curve is essentially flat.
- **Conformance: clean** — the full SSE suite passes. The only failures
are 3 pre-existing long-poll timing flakes that the base branch fails
identically.

## Validation on GKE

Confirmed on a real cluster (`c4d-standard-16-lssd` server, 4-CPU limit;
ds-bench SSE fan-out, 1 stream, subscriber sweep) — modified vs the
prior server, pod working-set memory / delivery p99 at **1000
subscribers**:

| config | pod mem peak/p50 (old → new) | p99 (old → new) |
|---|---|---|
| wal (cache off) | 27/23 → **22/18 MB** | 5.48 → **5.17 ms** |
| wal (cache on)  | 26/21 → **15/14 MB** | 5.21 → **4.20 ms** |

So on real hardware the reactor cuts SSE fan-out pod memory by **~22%
(cache off)** / **~33% (cache on)** at 1000 subscribers, with
equal-or-better delivery latency and unchanged throughput (~75–80k ev/s)
— matching the local per-subscriber slope above. Conformance suite
green.

Design doc:
`docs/superpowers/specs/2026-06-29-sse-reactor-flat-userspace-design.md`.

> Stacked on #4661 (the inline `Body::Sse` hand-off). Base is set to
that branch so this diff is reactor-only; review/merge #4661 first.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@codecov

codecov Bot commented Jun 30, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 60.00%. Comparing base (b66ebf7) to head (5f6c482).
⚠️ Report is 4 commits behind head on main.
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4661      +/-   ##
==========================================
- Coverage   60.03%   60.00%   -0.03%     
==========================================
  Files         395      395              
  Lines       43747    43747              
  Branches    12579    12581       +2     
==========================================
- Hits        26262    26249      -13     
- Misses      17407    17420      +13     
  Partials       78       78              
Flag Coverage Δ
packages/agents 72.64% <ø> (ø)
packages/agents-mcp 77.70% <ø> (ø)
packages/agents-mobile 80.67% <ø> (ø)
packages/agents-runtime 83.72% <ø> (ø)
packages/agents-server 75.47% <ø> (-0.18%) ⬇️
packages/agents-server-ui 8.32% <ø> (ø)
packages/electric-ax 51.06% <ø> (ø)
packages/experimental 87.73% <ø> (ø)
packages/react-hooks 86.48% <ø> (ø)
packages/start 82.83% <ø> (ø)
packages/typescript-client 91.83% <ø> (ø)
packages/y-electric 56.05% <ø> (ø)
typescript 60.00% <ø> (-0.03%) ⬇️
unit-tests 60.00% <ø> (-0.03%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@netlify

netlify Bot commented Jun 30, 2026

Copy link
Copy Markdown

Deploy Preview for electric-next ready!

Name Link
🔨 Latest commit 67683cc
🔍 Latest deploy log https://app.netlify.com/projects/electric-next/deploys/6a439fe519a29a00084542d7
😎 Deploy Preview https://deploy-preview-4661--electric-next.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify project configuration.

@netlify

netlify Bot commented Jun 30, 2026

Copy link
Copy Markdown

Deploy Preview for electric-next ready!

Name Link
🔨 Latest commit 332b296
🔍 Latest deploy log https://app.netlify.com/projects/electric-next/deploys/6a439febdc69690007321523
😎 Deploy Preview https://deploy-preview-4661--electric-next.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify project configuration.

The epoll SSE reactor (and the inline fallback) unilaterally closes the
socket when a live SSE stream ends — on stream closure or at
SSE_MAX_DURATION. The response nevertheless advertised
`Connection: keep-alive`, so a pooling HTTP client (undici, used by the
conformance suite) would return the socket to its pool and pipeline the
next request onto it. The server's `close()` then saw those unread
request bytes in the receive buffer and sent a RST instead of a FIN,
discarding the still-in-flight SSE response (the final data + close
frames) that were already queued in the kernel send buffer. The client
surfaced this as `UND_ERR_SOCKET` ("other side closed").

This raced in the conformance suite's stream-closure SSE tests
(sse-live-reader-receives-final-append-on-close, sse-closed-stream-no-cursor)
and, via the same connection-reuse corruption, the long-poll
"wait for new data" test — all intermittently, ~1 run in 19.

SSE responses are single-use here, so advertise `Connection: close`. The
client no longer reuses the socket, so the server's close is a clean FIN
that delivers the full response. Verified with 100 consecutive clean
full-suite conformance runs (50 memory, 25 wal, 25 wal-read-offload) on
Linux, where it previously failed ~1 in 19.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@balegas balegas merged commit 3c6e2ce into main Jun 30, 2026
64 of 65 checks passed
@balegas balegas deleted the sse-fanout-per-subscriber-memory branch June 30, 2026 21:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants