Skip to content

CLDSRV-918: raise SDK request timeout in functional test client#6187

Draft
tcarmet wants to merge 2 commits into
development/9.3from
bugfix/CLDSRV-918-raise-test-request-timeout
Draft

CLDSRV-918: raise SDK request timeout in functional test client#6187
tcarmet wants to merge 2 commits into
development/9.3from
bugfix/CLDSRV-918-raise-test-request-timeout

Conversation

@tcarmet

@tcarmet tcarmet commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

The AWS SDK v3 migration introduced a 5-second request timeout in the functional test client where SDK v2 previously ran with its 120-second default. Test operations that get no response bytes until the server finishes working - a 50MB server-side part copy, 1000-key batch deletes - regularly exceed 5 seconds of socket silence under CI load, producing Connection timed out after 5000 ms failures on more than a dozen unrelated branches over the past 60 days, including one merge-queue hit.

Raising the timeout to 30 seconds restores the pre-migration behavior within mocha's 40-second global cap: a genuine hang still fails the test, surfacing as the SDK's informative TimeoutError instead of a generic mocha timeout.

Why this also bumps arsenal (ARSN-594, scality/Arsenal#2640):

Raising the timeout uncovered a latent server-crash bug. In @smithy/node-http-handler, a requestTimeout below 6s is registered through request.setTimeout(), which Node cleans up when the request completes. At 6s and above, registration is deferred by 3s; for any request lasting longer than 3s the timer then attaches directly to the socket (socket.setTimeout()) and is never removed when the request finishes. With a keep-alive agent, the pooled socket keeps that stale timer and its handler later destroys the connection out from under whichever request is using it - so the 30s config turns every >3s test request into a delayed, random mid-stream disconnect.

A client vanishing mid-stream then hit a latent bug in arsenal's retrieveData: its once() guard around the iteration callback was created per data-layer callback invocation rather than per iteration, so a double callback during the teardown race ran the iteration callback twice, raising Callback was already called as an uncaughtException and killing the server. This crashed the file-ft-tests suites deterministically (3 out of 3 runs, 200+ cascading ECONNREFUSED failures each). The arsenal fix makes the guard iteration-scoped; a client disconnect can no longer take the server down.

Reference for the previous 120s behavior:

@bert-e

bert-e commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Hello tcarmet,

My role is to assist you with the merge of this
pull request. Please type @bert-e help to get information
on this process, or consult the user documentation.

Available options
name description privileged authored
/after_pull_request Wait for the given pull request id to be merged before continuing with the current one.
/bypass_author_approval Bypass the pull request author's approval
/bypass_build_status Bypass the build and test status
/bypass_commit_size Bypass the check on the size of the changeset TBA
/bypass_incompatible_branch Bypass the check on the source branch prefix
/bypass_jira_check Bypass the Jira issue check
/bypass_peer_approval Bypass the pull request peers' approval
/bypass_leader_approval Bypass the pull request leaders' approval
/approve Instruct Bert-E that the author has approved the pull request. ✍️
/create_pull_requests Allow the creation of integration pull requests.
/create_integration_branches Allow the creation of integration branches.
/no_octopus Prevent Wall-E from doing any octopus merge and use multiple consecutive merge instead
/unanimity Change review acceptance criteria from one reviewer at least to all reviewers
/wait Instruct Bert-E not to run until further notice.
Available commands
name description privileged
/help Print Bert-E's manual in the pull request.
/status Print Bert-E's current status in the pull request TBA
/clear Remove all comments from Bert-E from the history TBA
/retry Re-start a fresh build TBA
/build Re-start a fresh build TBA
/force_reset Delete integration branches & pull requests, and restart merge process from the beginning.
/reset Try to remove integration branches unless there are commits on them which do not appear on the source branch.

Status report is not available.

@bert-e

bert-e commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Incorrect fix version

The Fix Version/s in issue CLDSRV-918 contains:

  • None

Considering where you are trying to merge, I ignored possible hotfix versions and I expected to find:

  • 9.3.10

  • 9.4.0

Please check the Fix Version/s of CLDSRV-918, or the target
branch of this pull request.

@codecov

codecov Bot commented Jun 10, 2026

Copy link
Copy Markdown

❌ 5 Tests Failed:

Tests completed Failed Passed Skipped
8484 5 8479 0
View the top 2 failed test(s) by shortest run time
should log correct multipartDelete operation with all required fields::Server Access Logs - File Output With v4 signature should log correct multipartDelete operation with all required fields
Stack Traces | 0.045s run time
Expected 3 log entries, got 4

4 !== 3
should log correct objectPutCopyPart operation with all required fields::Server Access Logs - File Output With v4 signature should log correct objectPutCopyPart operation with all required fields
Stack Traces | 0.06s run time
Expected 5 log entries, got 6

6 !== 5
View the full list of 6 ❄️ flaky test(s)
"after all" hook for "should batch delete objects where requester has permission"::Multi-Object Delete Access "after all" hook for "should batch delete objects where requester has permission"

Flake rate in main: 100.00% (Passed 0 times, Failed 7 times)

Stack Traces | 2.3s run time
The bucket you tried to delete is not empty.
"after each" hook for "should fail if trying to overwrite a delete marker"::MPU with x-scal-s3-version-id header With default signature "after each" hook for "should fail if trying to overwrite a delete marker"

Flake rate in main: 100.00% (Passed 0 times, Failed 11 times)

Stack Traces | 0.013s run time
We encountered an internal error. Please try again.
"after each" hook for "should fail if trying to overwrite a delete marker"::MPU with x-scal-s3-version-id header With v4 signature "after each" hook for "should fail if trying to overwrite a delete marker"

Flake rate in main: 100.00% (Passed 0 times, Failed 2 times)

Stack Traces | 0.017s run time
We encountered an internal error. Please try again.
"before all" hook for "should not delete locked objects"::Multi-Object Delete with Object Lock "before all" hook for "should not delete locked objects"

Flake rate in main: 100.00% (Passed 0 times, Failed 7 times)

Stack Traces | 0.006s run time
Object Lock configuration cannot be enabled on existing buckets
"before all" hook for "should return access denied error for each object where no acl permission"::Multi-Object Delete Access "before all" hook for "should return access denied error for each object where no acl permission"

Flake rate in main: 100.00% (Passed 0 times, Failed 7 times)

Stack Traces | 0.127s run time
socket hang up
should create a bunch of objects and their versions::put and head object with versioning With v4 signature on versioning suspended then enabled bucket w/ null version should create a bunch of objects and their versions

Flake rate in main: 13.81% (Passed 668 times, Failed 107 times)

Stack Traces | 600s run time
Timeout of 600000ms exceeded. For async tests and hooks, ensure "done()" is called; if returning a Promise, ensure it resolves. (.../test/versioning/objectHead.js)

To view more test analytics, go to the Test Analytics Dashboard
📋 Got 3 mins? Take this short survey to help us improve Test Analytics.

@claude

claude Bot commented Jun 10, 2026

Copy link
Copy Markdown

LGTM

Review by Claude Code

@bert-e

bert-e commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Request integration branches

Waiting for integration branch creation to be requested by the user.

To request integration branches, please comment on this pull request with the following command:

/create_integration_branches

Alternatively, the /approve and /create_pull_requests commands will automatically
create the integration branches.

@claude

claude Bot commented Jun 10, 2026

Copy link
Copy Markdown

LGTM

Review by Claude Code

The SDK v2 test config set no timeout, so all tests ran with the v2
default of 120s (aws-sdk lib/config.js httpOptions.timeout). The v3
migration introduced a 5s requestTimeout, which heavy server-side
operations (multi-megabyte UploadPartCopy, 1000-key batch deletes)
exceed under CI load since the server sends no response bytes until
they complete.

Raise it to 30s: enough headroom for heavy operations, while staying
below mocha's 40s global timeout so a genuine hang still surfaces as
the SDK TimeoutError.
@tcarmet tcarmet force-pushed the bugfix/CLDSRV-918-raise-test-request-timeout branch from 28c72f9 to f908622 Compare June 11, 2026 00:18
@tcarmet tcarmet changed the title CLDSRV-918: raise SDK request timeout for heavy-operation test suites CLDSRV-918: raise SDK request timeout in functional test client Jun 11, 2026
@claude

claude Bot commented Jun 11, 2026

Copy link
Copy Markdown

LGTM

Review by Claude Code

@tcarmet tcarmet requested a review from a team June 11, 2026 17:33
@tcarmet tcarmet marked this pull request as ready for review June 11, 2026 17:37
@tcarmet tcarmet requested a review from a team June 11, 2026 17:38
Pulls the ARSN-594 fix for the retrieveData double-callback that
crashes the server when a client connection drops mid-stream, which
the raised SDK request timeout surfaced in the file-ft test suites.
@tcarmet tcarmet marked this pull request as draft June 11, 2026 18:48
Comment thread package.json
"@hapi/joi": "^17.1.1",
"@smithy/node-http-handler": "^3.0.0",
"arsenal": "git+https://github.com/scality/Arsenal#8.4.4",
"arsenal": "git+https://github.com/scality/Arsenal#77ae4021c8b294b07e3f34818cecbe6ae7d9d1cb",

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Arsenal is pinned to a commit hash instead of a tag. Git-based dependencies should be pinned to a released tag. Tags 8.4.5 and 8.4.6 already exist — if the ARSN-594 fix is included in one of them, pin to that tag instead.

— Claude Code

@claude

claude Bot commented Jun 11, 2026

Copy link
Copy Markdown
  • Arsenal dependency pinned to a commit hash instead of a tag (package.json). Tags 8.4.5 and 8.4.6 already exist — pin to a released tag instead.

    The timeout fix in config.js looks correct: 30s is well within mocha's 40s cap, the socketTimeout-to-requestTimeout rename aligns with the NodeHttpHandler API, and the constant avoids drift between the two handler configs.

    Review by Claude Code

Comment on lines +25 to +30
// The SDK v2 config set no timeout, so tests ran with the v2 default of
// 120s; the v3 migration introduced a 5s timeout that heavy server-side
// operations (multi-megabyte UploadPartCopy, 1000-key batch deletes) exceed
// under CI load, as the server sends no response bytes until they complete.
// 30s stays below mocha's 40s global timeout so a genuine hang still
// surfaces as the SDK TimeoutError rather than a generic mocha timeout.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would argue that when we read this comment in the future, we will not care about the history (migration changed it to 5s), but only about why we chose 30s as the value.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants