Skip to content

Retry transient full-scan upload failures (502/503/504/408, dropped connections)#232

Merged
Martin Torp (mtorp) merged 4 commits into
mainfrom
retry-transient-full-scan-upload-failures
Jun 10, 2026
Merged

Retry transient full-scan upload failures (502/503/504/408, dropped connections)#232
Martin Torp (mtorp) merged 4 commits into
mainfrom
retry-transient-full-scan-upload-failures

Conversation

@mtorp

@mtorp Martin Torp (mtorp) commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Problem

The full-scan upload (POST /v0/orgs/<org>/full-scans) makes exactly one attempt. When it fails, the entire run dies — for --reach runs that means a completed multi-hour tier-1 reachability analysis produces no scan and no report.

Uploads can fail transiently at the gateway/connection level — an HTTP 502/503/504/408, a dropped or reset connection, or a client-side timeout — without the server having created the scan. A retried upload almost always succeeds.

Change

Core.create_full_scan now retries the fullscans.post call:

  • Up to 3 total attempts, waiting ~10s before attempt 2 and ~30s before attempt 3 (plus up to 2s jitter so CI fleets don't retry in lock-step), with a log.warning before each retry.
  • Retries transient failures only, as classified by the SDK: requires socketdev>=3.3.0 (Add transient-error classification to APIFailure socket-sdk-python#93), where do_request records the HTTP status code on every exception it raises and APIFailure.is_transient_error() owns the classification — true for HTTP 408/502/503/504, dropped/reset connections, and client-side timeouts. The CLI encodes no knowledge of the SDK exception hierarchy and parses no message text, so SDK restructuring can't silently break the classification.
  • Never retried: deterministic errors (400/401/403/404/429), responses that came back with an error payload, and unexpected errors wrapped in a bare APIFailure.

Note: uv.lock is regenerated once socketdev 3.3.0 is released to PyPI (blocked on SocketDev/socket-sdk-python#93); CI is expected to fail on dependency resolution until then.

Safety

  • No duplicate scans: in these failure modes the server never finished reading the request body, so no scan was created — a retry cannot duplicate one. In the rare case where a gateway timeout races a request the server later completes, the extra scan is benign and superseded by the retry (as if the CLI had run twice).
  • Files are rebuilt per attempt: fullscans.post(use_lazy_loading=True) rebuilds its LazyFileLoaders from the plain paths on every call, so re-invoking it is safe (the loaders from a failed attempt are exhausted/closed and must not be reused).
  • Temp .br facts files survive retries: the retry loop lives inside the existing try/finally, so the brotli-compressed .socket.facts.json.br siblings are only cleaned up after the last attempt.

Testing

New tests/unit/test_full_scan_retry.py (16 tests): retry on 502 then success, exhausting attempts re-raises, retry on catch-all 408/503/504, retry on connection-level errors, no retry on 400 / dedicated 4xx classes / error payloads, .br temp-file lifecycle across retries (success and exhaustion), and that the retry decision delegates to APIFailure.is_transient_error() (the classification truth table itself is tested in the SDK, next to the code that raises). Full suite against socketdev 3.3.0: 364 passed, 2 pre-existing skips.

@github-actions

github-actions Bot commented Jun 10, 2026

Copy link
Copy Markdown

🚀 Preview package published!

Install with:

pip install --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple socketsecurity==2.4.8.dev6

Docker image: socketdev/cli:pr-232

@mtorp Martin Torp (mtorp) force-pushed the retry-transient-full-scan-upload-failures branch from 5d25c74 to 320dcb8 Compare June 10, 2026 09:12
…onnections)

A full-scan upload can fail transiently at the gateway/connection level -
an HTTP 502/503/504/408, a dropped or reset connection, or a client-side
timeout - without the server having created the scan. The CLI previously
made exactly one attempt, so an entire run (including a completed
reachability analysis) died on a single transient failure even though a
retried upload almost always succeeds.

create_full_scan now retries the fullscans POST up to 3 total attempts with
increasing waits (~10s, then ~30s, plus jitter) on transient failures only:
APIBadGateway (502), APIConnectionError, APITimeout, and catch-all APIFailure
whose embedded original_status_code is 408/503/504. Dedicated 4xx classes,
catch-all 400s, and error payloads are never retried. In these failure modes
the server never finished reading the request body, so no scan was created
and a retry does not duplicate one; in the rare case where a gateway timeout
races a request the server later completes, the extra scan is benign and
superseded by the retry (as if the CLI had run twice).

The retry loop lives inside the existing try/finally so the brotli-compressed
.socket.facts.json.br temp files survive until every attempt has finished;
fullscans.post rebuilds its lazy file loaders from the plain paths on every
call, so re-invoking it per attempt is safe.

Assisted-by: Claude Code:claude-opus-4-8
@mtorp Martin Torp (mtorp) force-pushed the retry-transient-full-scan-upload-failures branch from 320dcb8 to d32806b Compare June 10, 2026 09:15
Comment thread socketsecurity/core/__init__.py Outdated

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Comment thread socketsecurity/core/__init__.py Outdated
Comment thread socketsecurity/core/__init__.py Outdated
Address review feedback on the upload retry:

- The retry decision now delegates to APIFailure.is_transient_error()
  (socketdev>=3.3.0, SocketDev/socket-sdk-python#93), which classifies
  by the HTTP status code the SDK records when raising. The CLI no
  longer encodes the SDK's exception hierarchy or parses status codes
  out of message text, so SDK restructuring can't silently break the
  classification.
- The backoff schedule is now the single source of truth for the loop:
  FULL_SCAN_UPLOAD_BACKOFF_SCHEDULE_SECONDS = (10.0, 30.0, None), where
  each entry is the wait before the next attempt and the final None
  re-raises instead of retrying. FULL_SCAN_UPLOAD_MAX_ATTEMPTS is
  computed from its length.

Note: uv.lock is intentionally not regenerated yet - socketdev 3.3.0
must be released to PyPI first (blocked on socket-sdk-python#93).
Martin Torp (mtorp) added a commit to SocketDev/socket-sdk-python that referenced this pull request Jun 10, 2026
* Add transient-error classification to APIFailure

API.do_request now records the HTTP status code on every exception it
raises (status_code attribute), and APIFailure gains
is_transient_error(): True for gateway/connection-level failures
(HTTP 408/502/503/504, dropped or reset connections, client-side
timeouts) where retrying the same request may succeed, False for
deterministic errors (400/401/403/404/429, wrapped unexpected errors).

Classification is based on the recorded status code rather than
exception class identity or message text, so it stays correct if a
status code gains a dedicated subclass later.

Motivated by SocketDev/socket-python-cli#232: the CLI retries transient
full-scan upload failures and previously had to parse the status code
out of catch-all APIFailure message text.

* Hardcode 502 in APIBadGateway instead of accepting a status_code

The class definitionally represents a 502, so there is no reason for
construction sites to pass (or be able to override) the status.
socketdev 3.3.0 is now released, unblocking the >=3.3.0 floor bump.
@socket-security

Copy link
Copy Markdown

Review the following changes in direct dependencies. Learn more about Socket for GitHub.

Diff Package Supply Chain
Security
Vulnerability Quality Maintenance License
Updatedpypi/​socketdev@​3.2.1 ⏵ 3.3.098 +1100100100100

View full report

@socket-security-staging

Copy link
Copy Markdown

Review the following changes in direct dependencies. Learn more about Socket for GitHub.

Diff Package Supply Chain
Security
Vulnerability Quality Maintenance License
Updatedpypi/​socketdev@​3.2.1 ⏵ 3.3.098 +1100100100100

View full report

@mtorp Martin Torp (mtorp) merged commit 1af7934 into main Jun 10, 2026
26 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants