Skip to content

fix: reserve framed UploadPart real peak to stop OOM under concurrency#86

Merged
ServerSideHannes merged 5 commits into
mainfrom
fix/upload-reservation-oom
Jun 22, 2026
Merged

fix: reserve framed UploadPart real peak to stop OOM under concurrency#86
ServerSideHannes merged 5 commits into
mainfrom
fix/upload-reservation-oom

Conversation

@ServerSideHannes

@ServerSideHannes ServerSideHannes commented Jun 22, 2026

Copy link
Copy Markdown
Owner

Problem

Pods kept getting OOMKilled (exit 137) on 2026.6.6 even after PR #85 made ES snapshot parts stream (framed). Logs confirmed the framed path engages (16MB signed parts, upload_path="framed") but RSS still climbed to ~380MB against the 512Mi limit while the governor believed only 64MB was in use.

Root cause

The memory governor under-counts. estimate_memory_footprint(PUT, 16MB) reserved only the internal-part size (8MB), but the framed path's real peak per request is ~3× that — three buffers are live at once: the plaintext frame, the AES-GCM ciphertext output, and the accumulated part bytearray (plus the transport body). Measured with tracemalloc:

content internal part old reservation real peak
16MB (ES) 8MB 8MB 24.5MB (3.06×)
512MB 25.6MB 25.6MB 56.1MB (2.19×)

With the budget undercounted ~3×, the 64MB governor admitted ~8 concurrent 16MB parts → OOM.

Fix (root cause: make the governor honest)

  • crypto.streaming_upload_peak() — the framed path's real peak (2·part + 2·frame).
  • estimate_memory_footprint() reserves that instead of the bare part size, so the limiter can no longer admit more than fits.

No deploy-config change. Confirmed from logs that 100% of traffic is 16MB ES parts; their honest reservation (32MB) fits the existing 64MB budget with 2 concurrent uploads (~50MB real + base RSS, well under 512Mi). The governor budget is managed in GitOps and is left untouched.

Things I tried and rejected

  • Bumping the governor limit (64→256): unnecessary — at the fixed 64MB budget the honest reservation already stops the OOM. Reverted.
  • Removing the encrypt copy (append-in-place): re-measured, barely helps (24.5→24.5MB). The three concurrent buffers are inherent, so reducing copies is not the lever.

Why earlier tests missed it

  1. The reservation unit tests asserted the under-count as correct (footprint == memory_bounded_part_size(cl)) — they encoded the bug as the spec.
  2. The memory/OOM tests used UNSIGNED-PAYLOAD and part sizes >32MB, which always took the already-safe path; none drove a 16MB signed part and compared reserved-vs-actual memory.

New test_upload_reservation_vs_real.py drives the real _stream_and_upload_framed under tracemalloc and asserts reserved >= measured peak (and that the old part-size reservation was provably insufficient). It fails if the reservation drifts below reality or the path starts allocating more.

All 436 unit tests pass.

The memory governor reserved only the internal-part size (8MB for a 16MB
ES snapshot part) while the framed upload path's real peak is ~3x that
(accumulated ciphertext + AES-GCM encrypt transient + held plaintext frame
+ aiobotocore's body copy). The governor therefore admitted ~3x too many
concurrent uploads and the pod was OOMKilled despite the streaming fix.

- crypto.streaming_upload_peak(): honest peak the framed path holds.
- estimate_memory_footprint() reserves it instead of the bare part size.
- raise governor budget 64->256MB under the 512Mi pod limit (honest
  reservations no longer fit a 64MB budget for barman-scale parts).
- regression test drives the REAL handler method under tracemalloc and
  asserts the reservation bounds measured peak; fixes the unit tests that
  had encoded the under-count as correct.
All real UploadPart traffic is 16MB ES snapshot parts (confirmed from
logs); their honest reservation (32MB) fits the existing 64MB governor
budget with room for 2 concurrent uploads, so no deploy-config change is
needed. The governor under-count was the whole OOM; reserving the real
peak fixes it at the fixed budget.
- set S3PROXY_BACKPRESSURE_TIMEOUT=1 in unit conftest so the 6 rejection
  tests that wait out the timeout don't burn 30s each (~180s saved)
- mixed_workload: acquire a handful of small files instead of 512, avoiding
  gc.collect churn from per-release memory reclaim

Pre-existing test_many_small_files_fit_in_budget (~25s, 1000 gc-heavy
releases) left as-is.
…eout

Replace the S3PROXY_BACKPRESSURE_TIMEOUT=1 env in shared conftest (fragile:
read once at import, still slept 1s/test) with a unit-only autouse fixture
that patches concurrency.BACKPRESSURE_TIMEOUT=0 at runtime. Rejection tests
fill the budget permanently, so they reject with zero wall-clock wait; no
import-order dependency and no effect on integration subprocesses.
A single request's honest framed-upload peak can exceed the whole governor
budget when the budget is deliberately tight (the 16MB integration OOM test
streams 30MB uploads). Hard-rejecting refused requests the proxy can stream
fine. Clamp to the budget so such a request runs exclusively (concurrency 1)
-- the budget bounds concurrency, not one request's peak. Verified against
real minio: 20MB uploads succeed at a 16MB budget via backpressure.
@ServerSideHannes ServerSideHannes merged commit fcd307d into main Jun 22, 2026
4 checks passed
@ServerSideHannes ServerSideHannes deleted the fix/upload-reservation-oom branch June 22, 2026 11:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant