CLDSRV-918: raise SDK request timeout in functional test client#6187
CLDSRV-918: raise SDK request timeout in functional test client#6187tcarmet wants to merge 2 commits into
Conversation
Hello tcarmet,My role is to assist you with the merge of this Available options
Available commands
Status report is not available. |
Incorrect fix versionThe
Considering where you are trying to merge, I ignored possible hotfix versions and I expected to find:
Please check the |
❌ 5 Tests Failed:
View the top 2 failed test(s) by shortest run time
View the full list of 6 ❄️ flaky test(s)
To view more test analytics, go to the Test Analytics Dashboard |
|
LGTM |
Request integration branchesWaiting for integration branch creation to be requested by the user. To request integration branches, please comment on this pull request with the following command: Alternatively, the |
|
LGTM |
The SDK v2 test config set no timeout, so all tests ran with the v2 default of 120s (aws-sdk lib/config.js httpOptions.timeout). The v3 migration introduced a 5s requestTimeout, which heavy server-side operations (multi-megabyte UploadPartCopy, 1000-key batch deletes) exceed under CI load since the server sends no response bytes until they complete. Raise it to 30s: enough headroom for heavy operations, while staying below mocha's 40s global timeout so a genuine hang still surfaces as the SDK TimeoutError.
28c72f9 to
f908622
Compare
|
LGTM |
Pulls the ARSN-594 fix for the retrieveData double-callback that crashes the server when a client connection drops mid-stream, which the raised SDK request timeout surfaced in the file-ft test suites.
| "@hapi/joi": "^17.1.1", | ||
| "@smithy/node-http-handler": "^3.0.0", | ||
| "arsenal": "git+https://github.com/scality/Arsenal#8.4.4", | ||
| "arsenal": "git+https://github.com/scality/Arsenal#77ae4021c8b294b07e3f34818cecbe6ae7d9d1cb", |
There was a problem hiding this comment.
Arsenal is pinned to a commit hash instead of a tag. Git-based dependencies should be pinned to a released tag. Tags 8.4.5 and 8.4.6 already exist — if the ARSN-594 fix is included in one of them, pin to that tag instead.
— Claude Code
|
| // The SDK v2 config set no timeout, so tests ran with the v2 default of | ||
| // 120s; the v3 migration introduced a 5s timeout that heavy server-side | ||
| // operations (multi-megabyte UploadPartCopy, 1000-key batch deletes) exceed | ||
| // under CI load, as the server sends no response bytes until they complete. | ||
| // 30s stays below mocha's 40s global timeout so a genuine hang still | ||
| // surfaces as the SDK TimeoutError rather than a generic mocha timeout. |
There was a problem hiding this comment.
I would argue that when we read this comment in the future, we will not care about the history (migration changed it to 5s), but only about why we chose 30s as the value.
The AWS SDK v3 migration introduced a 5-second request timeout in the functional test client where SDK v2 previously ran with its 120-second default. Test operations that get no response bytes until the server finishes working - a 50MB server-side part copy, 1000-key batch deletes - regularly exceed 5 seconds of socket silence under CI load, producing
Connection timed out after 5000 msfailures on more than a dozen unrelated branches over the past 60 days, including one merge-queue hit.Raising the timeout to 30 seconds restores the pre-migration behavior within mocha's 40-second global cap: a genuine hang still fails the test, surfacing as the SDK's informative
TimeoutErrorinstead of a generic mocha timeout.Why this also bumps arsenal (ARSN-594, scality/Arsenal#2640):
Raising the timeout uncovered a latent server-crash bug. In
@smithy/node-http-handler, arequestTimeoutbelow 6s is registered throughrequest.setTimeout(), which Node cleans up when the request completes. At 6s and above, registration is deferred by 3s; for any request lasting longer than 3s the timer then attaches directly to the socket (socket.setTimeout()) and is never removed when the request finishes. With a keep-alive agent, the pooled socket keeps that stale timer and its handler later destroys the connection out from under whichever request is using it - so the 30s config turns every >3s test request into a delayed, random mid-stream disconnect.A client vanishing mid-stream then hit a latent bug in arsenal's
retrieveData: itsonce()guard around the iteration callback was created per data-layer callback invocation rather than per iteration, so a double callback during the teardown race ran the iteration callback twice, raisingCallback was already calledas an uncaughtException and killing the server. This crashed thefile-ft-testssuites deterministically (3 out of 3 runs, 200+ cascading ECONNREFUSED failures each). The arsenal fix makes the guard iteration-scoped; a client disconnect can no longer take the server down.Reference for the previous 120s behavior:
aws-sdk@2.1692.0shipshttpOptions: { timeout: 120000 }(lib/config.js#L626-L628), documented as "Sets the socket to timeout after timeout milliseconds of inactivity on the socket. Defaults to two minutes (120000)" (AWS.Config docs).tests/functional/aws-node-sdk/test/support/config.jsatea4659029configured no timeout at all, so every functional test ran with the 120s default.