fix: reserve framed UploadPart real peak to stop OOM under concurrency#86
Merged
Conversation
The memory governor reserved only the internal-part size (8MB for a 16MB ES snapshot part) while the framed upload path's real peak is ~3x that (accumulated ciphertext + AES-GCM encrypt transient + held plaintext frame + aiobotocore's body copy). The governor therefore admitted ~3x too many concurrent uploads and the pod was OOMKilled despite the streaming fix. - crypto.streaming_upload_peak(): honest peak the framed path holds. - estimate_memory_footprint() reserves it instead of the bare part size. - raise governor budget 64->256MB under the 512Mi pod limit (honest reservations no longer fit a 64MB budget for barman-scale parts). - regression test drives the REAL handler method under tracemalloc and asserts the reservation bounds measured peak; fixes the unit tests that had encoded the under-count as correct.
All real UploadPart traffic is 16MB ES snapshot parts (confirmed from logs); their honest reservation (32MB) fits the existing 64MB governor budget with room for 2 concurrent uploads, so no deploy-config change is needed. The governor under-count was the whole OOM; reserving the real peak fixes it at the fixed budget.
- set S3PROXY_BACKPRESSURE_TIMEOUT=1 in unit conftest so the 6 rejection tests that wait out the timeout don't burn 30s each (~180s saved) - mixed_workload: acquire a handful of small files instead of 512, avoiding gc.collect churn from per-release memory reclaim Pre-existing test_many_small_files_fit_in_budget (~25s, 1000 gc-heavy releases) left as-is.
…eout Replace the S3PROXY_BACKPRESSURE_TIMEOUT=1 env in shared conftest (fragile: read once at import, still slept 1s/test) with a unit-only autouse fixture that patches concurrency.BACKPRESSURE_TIMEOUT=0 at runtime. Rejection tests fill the budget permanently, so they reject with zero wall-clock wait; no import-order dependency and no effect on integration subprocesses.
A single request's honest framed-upload peak can exceed the whole governor budget when the budget is deliberately tight (the 16MB integration OOM test streams 30MB uploads). Hard-rejecting refused requests the proxy can stream fine. Clamp to the budget so such a request runs exclusively (concurrency 1) -- the budget bounds concurrency, not one request's peak. Verified against real minio: 20MB uploads succeed at a 16MB budget via backpressure.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
Pods kept getting OOMKilled (exit 137) on
2026.6.6even after PR #85 made ES snapshot parts stream (framed). Logs confirmed the framed path engages (16MB signed parts,upload_path="framed") but RSS still climbed to ~380MB against the 512Mi limit while the governor believed only 64MB was in use.Root cause
The memory governor under-counts.
estimate_memory_footprint(PUT, 16MB)reserved only the internal-part size (8MB), but the framed path's real peak per request is ~3× that — three buffers are live at once: the plaintext frame, the AES-GCM ciphertext output, and the accumulated part bytearray (plus the transport body). Measured withtracemalloc:With the budget undercounted ~3×, the 64MB governor admitted ~8 concurrent 16MB parts → OOM.
Fix (root cause: make the governor honest)
crypto.streaming_upload_peak()— the framed path's real peak (2·part + 2·frame).estimate_memory_footprint()reserves that instead of the bare part size, so the limiter can no longer admit more than fits.No deploy-config change. Confirmed from logs that 100% of traffic is 16MB ES parts; their honest reservation (32MB) fits the existing 64MB budget with 2 concurrent uploads (~50MB real + base RSS, well under 512Mi). The governor budget is managed in GitOps and is left untouched.
Things I tried and rejected
Why earlier tests missed it
footprint == memory_bounded_part_size(cl)) — they encoded the bug as the spec.UNSIGNED-PAYLOADand part sizes >32MB, which always took the already-safe path; none drove a 16MB signed part and compared reserved-vs-actual memory.New
test_upload_reservation_vs_real.pydrives the real_stream_and_upload_framedundertracemallocand assertsreserved >= measured peak(and that the old part-size reservation was provably insufficient). It fails if the reservation drifts below reality or the path starts allocating more.All 436 unit tests pass.