Retry transient full-scan upload failures (502/503/504/408, dropped connections) by mtorp · Pull Request #232 · SocketDev/socket-python-cli

Martin Torp (mtorp) · 2026-06-10T08:20:46Z

Problem

The full-scan upload (POST /v0/orgs/<org>/full-scans) makes exactly one attempt. When it fails, the entire run dies — for --reach runs that means a completed multi-hour tier-1 reachability analysis produces no scan and no report.

Uploads can fail transiently at the gateway/connection level — an HTTP 502/503/504/408, a dropped or reset connection, or a client-side timeout — without the server having created the scan. A retried upload almost always succeeds.

Change

Core.create_full_scan now retries the fullscans.post call:

Up to 3 total attempts, waiting ~10s before attempt 2 and ~30s before attempt 3 (plus up to 2s jitter so CI fleets don't retry in lock-step), with a log.warning before each retry.
Retries transient failures only, as classified by the SDK: requires socketdev>=3.3.0 (Add transient-error classification to APIFailure socket-sdk-python#93), where do_request records the HTTP status code on every exception it raises and APIFailure.is_transient_error() owns the classification — true for HTTP 408/502/503/504, dropped/reset connections, and client-side timeouts. The CLI encodes no knowledge of the SDK exception hierarchy and parses no message text, so SDK restructuring can't silently break the classification.
Never retried: deterministic errors (400/401/403/404/429), responses that came back with an error payload, and unexpected errors wrapped in a bare APIFailure.

Note: uv.lock is regenerated once socketdev 3.3.0 is released to PyPI (blocked on SocketDev/socket-sdk-python#93); CI is expected to fail on dependency resolution until then.

Safety

No duplicate scans: in these failure modes the server never finished reading the request body, so no scan was created — a retry cannot duplicate one. In the rare case where a gateway timeout races a request the server later completes, the extra scan is benign and superseded by the retry (as if the CLI had run twice).
Files are rebuilt per attempt: fullscans.post(use_lazy_loading=True) rebuilds its LazyFileLoaders from the plain paths on every call, so re-invoking it is safe (the loaders from a failed attempt are exhausted/closed and must not be reused).
Temp .br facts files survive retries: the retry loop lives inside the existing try/finally, so the brotli-compressed .socket.facts.json.br siblings are only cleaned up after the last attempt.

Testing

New tests/unit/test_full_scan_retry.py (16 tests): retry on 502 then success, exhausting attempts re-raises, retry on catch-all 408/503/504, retry on connection-level errors, no retry on 400 / dedicated 4xx classes / error payloads, .br temp-file lifecycle across retries (success and exhaustion), and that the retry decision delegates to APIFailure.is_transient_error() (the classification truth table itself is tested in the SDK, next to the code that raises). Full suite against socketdev 3.3.0: 364 passed, 2 pre-existing skips.

github-actions · 2026-06-10T08:21:34Z

🚀 Preview package published!

Install with:

pip install --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple socketsecurity==2.4.8.dev6

Docker image: socketdev/cli:pr-232

…onnections) A full-scan upload can fail transiently at the gateway/connection level - an HTTP 502/503/504/408, a dropped or reset connection, or a client-side timeout - without the server having created the scan. The CLI previously made exactly one attempt, so an entire run (including a completed reachability analysis) died on a single transient failure even though a retried upload almost always succeeds. create_full_scan now retries the fullscans POST up to 3 total attempts with increasing waits (~10s, then ~30s, plus jitter) on transient failures only: APIBadGateway (502), APIConnectionError, APITimeout, and catch-all APIFailure whose embedded original_status_code is 408/503/504. Dedicated 4xx classes, catch-all 400s, and error payloads are never retried. In these failure modes the server never finished reading the request body, so no scan was created and a retry does not duplicate one; in the rare case where a gateway timeout races a request the server later completes, the extra scan is benign and superseded by the retry (as if the CLI had run twice). The retry loop lives inside the existing try/finally so the brotli-compressed .socket.facts.json.br temp files survive until every attempt has finished; fullscans.post rebuilds its lazy file loaders from the plain paths on every call, so re-invoking it per attempt is safe. Assisted-by: Claude Code:claude-opus-4-8

Oskar Haarklou Veileborg (BarrensZeppelin)

LGTM

Address review feedback on the upload retry: - The retry decision now delegates to APIFailure.is_transient_error() (socketdev>=3.3.0, SocketDev/socket-sdk-python#93), which classifies by the HTTP status code the SDK records when raising. The CLI no longer encodes the SDK's exception hierarchy or parses status codes out of message text, so SDK restructuring can't silently break the classification. - The backoff schedule is now the single source of truth for the loop: FULL_SCAN_UPLOAD_BACKOFF_SCHEDULE_SECONDS = (10.0, 30.0, None), where each entry is the wait before the next attempt and the final None re-raises instead of retrying. FULL_SCAN_UPLOAD_MAX_ATTEMPTS is computed from its length. Note: uv.lock is intentionally not regenerated yet - socketdev 3.3.0 must be released to PyPI first (blocked on socket-sdk-python#93).

* Add transient-error classification to APIFailure API.do_request now records the HTTP status code on every exception it raises (status_code attribute), and APIFailure gains is_transient_error(): True for gateway/connection-level failures (HTTP 408/502/503/504, dropped or reset connections, client-side timeouts) where retrying the same request may succeed, False for deterministic errors (400/401/403/404/429, wrapped unexpected errors). Classification is based on the recorded status code rather than exception class identity or message text, so it stays correct if a status code gains a dedicated subclass later. Motivated by SocketDev/socket-python-cli#232: the CLI retries transient full-scan upload failures and previously had to parse the status code out of catch-all APIFailure message text. * Hardcode 502 in APIBadGateway instead of accepting a status_code The class definitionally represents a 502, so there is no reason for construction sites to pass (or be able to override) the status.

socketdev 3.3.0 is now released, unblocking the >=3.3.0 floor bump.

socket-security · 2026-06-10T11:44:16Z

Review the following changes in direct dependencies. Learn more about Socket for GitHub.

Diff	Package	Supply Chain Security	Vulnerability	Quality	Maintenance	License
	pypi/socketdev@3.2.1 ⏵ 3.3.0	⁺¹

View full report

socket-security-staging · 2026-06-10T11:44:38Z

Review the following changes in direct dependencies. Learn more about Socket for GitHub.

Diff	Package	Supply Chain Security	Vulnerability	Quality	Maintenance	License
	pypi/socketdev@3.2.1 ⏵ 3.3.0	⁺¹

View full report

Martin Torp (mtorp) temporarily deployed to socket-firewall June 10, 2026 08:20 — with GitHub Actions Inactive

Martin Torp (mtorp) temporarily deployed to socket-firewall June 10, 2026 09:07 — with GitHub Actions Inactive

Martin Torp (mtorp) force-pushed the retry-transient-full-scan-upload-failures branch from 5d25c74 to 320dcb8 Compare June 10, 2026 09:12

Martin Torp (mtorp) temporarily deployed to socket-firewall June 10, 2026 09:12 — with GitHub Actions Inactive

Martin Torp (mtorp) force-pushed the retry-transient-full-scan-upload-failures branch from 320dcb8 to d32806b Compare June 10, 2026 09:15

Martin Torp (mtorp) temporarily deployed to socket-firewall June 10, 2026 09:15 — with GitHub Actions Inactive

Martin Torp (mtorp) commented Jun 10, 2026

View reviewed changes

Comment thread socketsecurity/core/__init__.py Outdated

docs: drop the 'retry almost always succeeds' claim from retry comments

8dc0970

Martin Torp (mtorp) temporarily deployed to socket-firewall June 10, 2026 09:42 — with GitHub Actions Inactive

Martin Torp (mtorp) marked this pull request as ready for review June 10, 2026 09:43

Martin Torp (mtorp) requested a review from a team as a code owner June 10, 2026 09:43

Martin Torp (mtorp) requested a review from Oskar Haarklou Veileborg (BarrensZeppelin) June 10, 2026 09:43

Martin Torp (mtorp) temporarily deployed to socket-firewall June 10, 2026 09:43 — with GitHub Actions Inactive

Oskar Haarklou Veileborg (BarrensZeppelin) approved these changes Jun 10, 2026

View reviewed changes

Comment thread socketsecurity/core/__init__.py Outdated

Comment thread socketsecurity/core/__init__.py Outdated

Martin Torp (mtorp) mentioned this pull request Jun 10, 2026

Add transient-error classification to APIFailure SocketDev/socket-sdk-python#93

Merged

Martin Torp (mtorp) had a problem deploying to socket-firewall June 10, 2026 11:26 — with GitHub Actions Failure

Lock socketdev 3.3.0

6bc8fe8

socketdev 3.3.0 is now released, unblocking the >=3.3.0 floor bump.

Martin Torp (mtorp) temporarily deployed to socket-firewall June 10, 2026 11:44 — with GitHub Actions Inactive

Martin Torp (mtorp) merged commit 1af7934 into main Jun 10, 2026
26 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Retry transient full-scan upload failures (502/503/504/408, dropped connections)#232

Retry transient full-scan upload failures (502/503/504/408, dropped connections)#232
Martin Torp (mtorp) merged 4 commits into
mainfrom
retry-transient-full-scan-upload-failures

Martin Torp (mtorp) commented Jun 10, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Jun 10, 2026 •

edited

Loading

Uh oh!

Uh oh!

Oskar Haarklou Veileborg (BarrensZeppelin) left a comment

Uh oh!

Uh oh!

Uh oh!

socket-security Bot commented Jun 10, 2026

Uh oh!

socket-security-staging Bot commented Jun 10, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

Martin Torp (mtorp) commented Jun 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Change

Safety

Testing

Uh oh!

github-actions Bot commented Jun 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Oskar Haarklou Veileborg (BarrensZeppelin) left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

socket-security Bot commented Jun 10, 2026

Uh oh!

socket-security-staging Bot commented Jun 10, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Martin Torp (mtorp) commented Jun 10, 2026 •

edited

Loading

github-actions Bot commented Jun 10, 2026 •

edited

Loading