fix: parallelize per-object HEAD on list-objects to stop recursive-list stalls#88
Open
ServerSideHannes wants to merge 2 commits into
Open
fix: parallelize per-object HEAD on list-objects to stop recursive-list stalls#88ServerSideHannes wants to merge 2 commits into
ServerSideHannes wants to merge 2 commits into
Conversation
…st stalls ListObjects resolved the SSE plaintext size/etag with one sequential head_object per key. A recursive (empty-delimiter) list of up to max-keys objects stacked into a multi-second stall that tripped client timeouts, hanging ClickHouse/Postgres remote backups at the S3 list step. Run the HEADs concurrently, bounded by LIST_HEAD_CONCURRENCY (50), preserving output order and the per-object fallback to the listed size/etag.
…mports except A, B: is Py2 syntax (catch A, bind to name B) and fails to parse under Py3 — it broke 'import s3proxy', so the whole pytest suite couldn't collect (conftest imports s3proxy.concurrency). Parenthesize the 8 sites across utils, concurrency, request_handler and dashboard/* to catch both types as intended. Lets the list-objects self-check import the real module directly (drops the sibling-stubbing workaround).
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
ClickHouse remote backups (
web-production-platform) hang at the S3 list step. The backup tool issues a recursive list (list-type=2&delimiter=) against the bucket via this proxy; the proxy never returns, the client aborts at ~60s (HAProxyCH--, timing…/-1/60034), and the backup pod sits idle.clickhouse-backup list remotealso times out. Folder-level lists (delimiter=/) return fast — that split is the tell.Root cause
_process_list_objectsins3proxy/handlers/buckets.pyissues onehead_objectper key, sequentially, to recover the SSE plaintext size/etag:A recursive list returns up to
max-keys(1000) objects per page → up to 1000 serial HEAD round-trips → latency grows linearly and trips the 60s client timeout. Delimited lists return few objects → fast. Borderline timing is why most hourly backups pass and some hang as the bucket grows.Fix
Run the per-object HEADs concurrently, bounded by
LIST_HEAD_CONCURRENCY = 50(half the client max_pool_connections=100). Output order and the per-object fallback (listed size/etag on HEAD failure) are preserved — purely a latency change, identical output.Also in this PR: unbreak the package import
The suite could not even collect: 8 sites used Py2
except A, B:syntax (utils.py,concurrency.py,request_handler.py,dashboard/collectors.py,dashboard/stats_store.py) which does not parse under Py3, soimport s3proxyfailed and CI was red on everything. Parenthesized them to catch both types as intended. This means a freshimport s3proxywas broken onmain— worth understanding how that shipped.Verification
tests/unit/test_list_objects_parallel.py(new) proves HEADs run concurrently (max_inflight > 1, capped at the limit), order preserved, internal keys dropped, failed HEAD falls back to listed size/etag.test_dashboard.py::test_session_cookie_authenticates_api(Queue ... bound to a different event loop) — is a pre-existing, unrelated dashboard test-infra bug; it fails identically in isolation and was only hidden by the import break. Not touched here; should be a separate fix.