Retry transient full-scan upload failures (502/503/504/408, dropped connections)#232
Merged
Martin Torp (mtorp) merged 4 commits intoJun 10, 2026
Merged
Conversation
|
🚀 Preview package published! Install with: pip install --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple socketsecurity==2.4.8.dev6Docker image: |
5d25c74 to
320dcb8
Compare
…onnections) A full-scan upload can fail transiently at the gateway/connection level - an HTTP 502/503/504/408, a dropped or reset connection, or a client-side timeout - without the server having created the scan. The CLI previously made exactly one attempt, so an entire run (including a completed reachability analysis) died on a single transient failure even though a retried upload almost always succeeds. create_full_scan now retries the fullscans POST up to 3 total attempts with increasing waits (~10s, then ~30s, plus jitter) on transient failures only: APIBadGateway (502), APIConnectionError, APITimeout, and catch-all APIFailure whose embedded original_status_code is 408/503/504. Dedicated 4xx classes, catch-all 400s, and error payloads are never retried. In these failure modes the server never finished reading the request body, so no scan was created and a retry does not duplicate one; in the rare case where a gateway timeout races a request the server later completes, the extra scan is benign and superseded by the retry (as if the CLI had run twice). The retry loop lives inside the existing try/finally so the brotli-compressed .socket.facts.json.br temp files survive until every attempt has finished; fullscans.post rebuilds its lazy file loaders from the plain paths on every call, so re-invoking it per attempt is safe. Assisted-by: Claude Code:claude-opus-4-8
320dcb8 to
d32806b
Compare
Oskar Haarklou Veileborg (BarrensZeppelin)
approved these changes
Jun 10, 2026
Oskar Haarklou Veileborg (BarrensZeppelin)
left a comment
Member
There was a problem hiding this comment.
LGTM
Address review feedback on the upload retry: - The retry decision now delegates to APIFailure.is_transient_error() (socketdev>=3.3.0, SocketDev/socket-sdk-python#93), which classifies by the HTTP status code the SDK records when raising. The CLI no longer encodes the SDK's exception hierarchy or parses status codes out of message text, so SDK restructuring can't silently break the classification. - The backoff schedule is now the single source of truth for the loop: FULL_SCAN_UPLOAD_BACKOFF_SCHEDULE_SECONDS = (10.0, 30.0, None), where each entry is the wait before the next attempt and the final None re-raises instead of retrying. FULL_SCAN_UPLOAD_MAX_ATTEMPTS is computed from its length. Note: uv.lock is intentionally not regenerated yet - socketdev 3.3.0 must be released to PyPI first (blocked on socket-sdk-python#93).
Martin Torp (mtorp)
added a commit
to SocketDev/socket-sdk-python
that referenced
this pull request
Jun 10, 2026
* Add transient-error classification to APIFailure API.do_request now records the HTTP status code on every exception it raises (status_code attribute), and APIFailure gains is_transient_error(): True for gateway/connection-level failures (HTTP 408/502/503/504, dropped or reset connections, client-side timeouts) where retrying the same request may succeed, False for deterministic errors (400/401/403/404/429, wrapped unexpected errors). Classification is based on the recorded status code rather than exception class identity or message text, so it stays correct if a status code gains a dedicated subclass later. Motivated by SocketDev/socket-python-cli#232: the CLI retries transient full-scan upload failures and previously had to parse the status code out of catch-all APIFailure message text. * Hardcode 502 in APIBadGateway instead of accepting a status_code The class definitionally represents a 502, so there is no reason for construction sites to pass (or be able to override) the status.
socketdev 3.3.0 is now released, unblocking the >=3.3.0 floor bump.
|
Review the following changes in direct dependencies. Learn more about Socket for GitHub.
|
|
Review the following changes in direct dependencies. Learn more about Socket for GitHub.
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
The full-scan upload (
POST /v0/orgs/<org>/full-scans) makes exactly one attempt. When it fails, the entire run dies — for--reachruns that means a completed multi-hour tier-1 reachability analysis produces no scan and no report.Uploads can fail transiently at the gateway/connection level — an HTTP 502/503/504/408, a dropped or reset connection, or a client-side timeout — without the server having created the scan. A retried upload almost always succeeds.
Change
Core.create_full_scannow retries thefullscans.postcall:log.warningbefore each retry.socketdev>=3.3.0(Add transient-error classification to APIFailure socket-sdk-python#93), wheredo_requestrecords the HTTP status code on every exception it raises andAPIFailure.is_transient_error()owns the classification — true for HTTP 408/502/503/504, dropped/reset connections, and client-side timeouts. The CLI encodes no knowledge of the SDK exception hierarchy and parses no message text, so SDK restructuring can't silently break the classification.APIFailure.Safety
fullscans.post(use_lazy_loading=True)rebuilds itsLazyFileLoaders from the plain paths on every call, so re-invoking it is safe (the loaders from a failed attempt are exhausted/closed and must not be reused)..brfacts files survive retries: the retry loop lives inside the existingtry/finally, so the brotli-compressed.socket.facts.json.brsiblings are only cleaned up after the last attempt.Testing
New
tests/unit/test_full_scan_retry.py(16 tests): retry on 502 then success, exhausting attempts re-raises, retry on catch-all 408/503/504, retry on connection-level errors, no retry on 400 / dedicated 4xx classes / error payloads,.brtemp-file lifecycle across retries (success and exhaustion), and that the retry decision delegates toAPIFailure.is_transient_error()(the classification truth table itself is tested in the SDK, next to the code that raises). Full suite against socketdev 3.3.0: 364 passed, 2 pre-existing skips.