fix: reserve framed UploadPart real peak to stop OOM under concurrency by ServerSideHannes · Pull Request #86 · ServerSideHannes/s3proxy-python

ServerSideHannes · 2026-06-22T09:35:32Z

Problem

Pods kept getting OOMKilled (exit 137) on 2026.6.6 even after PR #85 made ES snapshot parts stream (framed). Logs confirmed the framed path engages (16MB signed parts, upload_path="framed") but RSS still climbed to ~380MB against the 512Mi limit while the governor believed only 64MB was in use.

Root cause

The memory governor under-counts. estimate_memory_footprint(PUT, 16MB) reserved only the internal-part size (8MB), but the framed path's real peak per request is ~3× that — three buffers are live at once: the plaintext frame, the AES-GCM ciphertext output, and the accumulated part bytearray (plus the transport body). Measured with tracemalloc:

content	internal part	old reservation	real peak
16MB (ES)	8MB	8MB	24.5MB (3.06×)
512MB	25.6MB	25.6MB	56.1MB (2.19×)

With the budget undercounted ~3×, the 64MB governor admitted ~8 concurrent 16MB parts → OOM.

Fix (root cause: make the governor honest)

crypto.streaming_upload_peak() — the framed path's real peak (2·part + 2·frame).
estimate_memory_footprint() reserves that instead of the bare part size, so the limiter can no longer admit more than fits.

No deploy-config change. Confirmed from logs that 100% of traffic is 16MB ES parts; their honest reservation (32MB) fits the existing 64MB budget with 2 concurrent uploads (~50MB real + base RSS, well under 512Mi). The governor budget is managed in GitOps and is left untouched.

Things I tried and rejected

Bumping the governor limit (64→256): unnecessary — at the fixed 64MB budget the honest reservation already stops the OOM. Reverted.
Removing the encrypt copy (append-in-place): re-measured, barely helps (24.5→24.5MB). The three concurrent buffers are inherent, so reducing copies is not the lever.

Why earlier tests missed it

The reservation unit tests asserted the under-count as correct (footprint == memory_bounded_part_size(cl)) — they encoded the bug as the spec.
The memory/OOM tests used UNSIGNED-PAYLOAD and part sizes >32MB, which always took the already-safe path; none drove a 16MB signed part and compared reserved-vs-actual memory.

New test_upload_reservation_vs_real.py drives the real _stream_and_upload_framed under tracemalloc and asserts reserved >= measured peak (and that the old part-size reservation was provably insufficient). It fails if the reservation drifts below reality or the path starts allocating more.

All 436 unit tests pass.

The memory governor reserved only the internal-part size (8MB for a 16MB ES snapshot part) while the framed upload path's real peak is ~3x that (accumulated ciphertext + AES-GCM encrypt transient + held plaintext frame + aiobotocore's body copy). The governor therefore admitted ~3x too many concurrent uploads and the pod was OOMKilled despite the streaming fix. - crypto.streaming_upload_peak(): honest peak the framed path holds. - estimate_memory_footprint() reserves it instead of the bare part size. - raise governor budget 64->256MB under the 512Mi pod limit (honest reservations no longer fit a 64MB budget for barman-scale parts). - regression test drives the REAL handler method under tracemalloc and asserts the reservation bounds measured peak; fixes the unit tests that had encoded the under-count as correct.

All real UploadPart traffic is 16MB ES snapshot parts (confirmed from logs); their honest reservation (32MB) fits the existing 64MB governor budget with room for 2 concurrent uploads, so no deploy-config change is needed. The governor under-count was the whole OOM; reserving the real peak fixes it at the fixed budget.

- set S3PROXY_BACKPRESSURE_TIMEOUT=1 in unit conftest so the 6 rejection tests that wait out the timeout don't burn 30s each (~180s saved) - mixed_workload: acquire a handful of small files instead of 512, avoiding gc.collect churn from per-release memory reclaim Pre-existing test_many_small_files_fit_in_budget (~25s, 1000 gc-heavy releases) left as-is.

…eout Replace the S3PROXY_BACKPRESSURE_TIMEOUT=1 env in shared conftest (fragile: read once at import, still slept 1s/test) with a unit-only autouse fixture that patches concurrency.BACKPRESSURE_TIMEOUT=0 at runtime. Rejection tests fill the budget permanently, so they reject with zero wall-clock wait; no import-order dependency and no effect on integration subprocesses.

A single request's honest framed-upload peak can exceed the whole governor budget when the budget is deliberately tight (the 16MB integration OOM test streams 30MB uploads). Hard-rejecting refused requests the proxy can stream fine. Clamp to the budget so such a request runs exclusively (concurrency 1) -- the budget bounds concurrency, not one request's peak. Verified against real minio: 20MB uploads succeed at a 16MB budget via backpressure.

ServerSideHannes added 5 commits June 22, 2026 11:35

ServerSideHannes merged commit fcd307d into main Jun 22, 2026
4 checks passed

ServerSideHannes deleted the fix/upload-reservation-oom branch June 22, 2026 11:18

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: reserve framed UploadPart real peak to stop OOM under concurrency#86

fix: reserve framed UploadPart real peak to stop OOM under concurrency#86
ServerSideHannes merged 5 commits into
mainfrom
fix/upload-reservation-oom

ServerSideHannes commented Jun 22, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

ServerSideHannes commented Jun 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Root cause

Fix (root cause: make the governor honest)

Things I tried and rejected

Why earlier tests missed it

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

ServerSideHannes commented Jun 22, 2026 •

edited

Loading