perf(e2e): warp proven-checkpoint waits, tighten in-process polling, and instrument setup spans #24452

Open
spalladino wants to merge 7 commits into merge-train/spartan-v5 from spl/e2e-speed-up-3

@spalladino spalladino commented Jul 2, 2026

Contributor

Motivation

Span-instrumented full CI runs (since #24407 landed) now show where the remaining e2e wall-clock goes: a large uninstrumented setup layer, a cluster of proven-checkpoint/epoch waits that burn real time while nothing is being built, and a 1s in-process poll cadence that adds ~0.5-0.9s of dead sleep to nearly every awaited tx. This PR takes the low-risk warp folds and polling wins that are safe today, and adds span instrumentation to the setup layer so the next round can attack the ~5,000s of currently-invisible beforeAll work. Everything multi-node is validated on CI (hence ci-no-fail-fast); the last commit is a deliberate slot-cut experiment kept isolated so it can be reverted on its own.

Changes

  • poll in-process nodes at 250ms instead of 1s — TestWallet now defaults its send().wait() poll interval to 0.25s (in-process nodes reach CHECKPOINTED synchronously under automine and cheaply otherwise), with the spartan/worker sites explicitly restored to the 1s default since they talk to remote JSON-RPC nodes. The e2e-local wait helpers (wait_helpers.ts, waitForProvenChain, gas-portal block advance, L1->L2 message poll) drop from 1s to 0.25s. Removes ~0.5-0.9s of dead sleep per awaited tx across ~600 tests. Production aztec.js/wallet-sdk defaults are untouched.
  • warp past the epoch boundary in waitForProvenCheckpoint — after the multi-node block-production fixture stops its sequencers, warp the L1 clock one epoch forward (forward-only, skipped if already proven) so the epoch closes and the fake prover can prove+submit without waiting the epoch out in real time. Targets ~716s of suite-summed wait:proven-checkpoint across proposed_chain (~430s), deploy_and_call_ordering (~143s), cross_chain_messages (~143s), plus blob_promotion.
  • warp epoch waits in multi_proof and upload_failed_proof — replace waitUntilEpochStarts with warpToEpochStart in these two 12s-slot proving tests (both passed under warp in the round-1 CI sweep). proof_fails is deliberately left untouched.
  • register-only TestContract in automine pxe test — the test only calls the noinitcheck private emit_nullifier, so register the contract instead of deploying it, dropping a deployment tx and its checkpoint cycle.
  • compute genesis values on an ephemeral world state — generateGenesisValues used a full fsync-on NativeWorldStateService.tmp per e2e container just to read one tree root; switch to the fsync-off ephemeral backend. Adds a unit test asserting that tmp and ephemeral produce identical archive roots for a funded-accounts genesis with a non-zero timestamp (this path is consensus-critical: CLI deploy paths compute the on-chain genesis root through it).
  • instrument setup-layer deploys and mints with spans — wrap the top-offender setup helpers (fees harness token/FPC deploys + mints + fee-juice bridge, cross-chain token/bridge deploys + mints, shared auth-registry publish, gas-portal bridge) in testSpan under the span taxonomy from #24407, "test(e2e): instrument common spans for wall-clock tracking" (deploy:*, tx:mint, setup:bridge, setup:auth-registry). Zero behavior change (testSpan is a passthrough without TEST_TIMING_FILE); this is the data source for round 4's attack on the ~5,000s of uninstrumented beforeAll work.
  • cut multi_validator_node slots 36s -> 16s (experiment, last commit) — lower aztecSlotDuration 36->16 and blockDurationMs 6000->2000 together (eth stays 8). Small expected saving; kept as the final isolated commit so it can be reverted alone if CI shows committee/attestation trouble on this file.
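The forward-only, skip-if-proven warp described in the waitForProvenCheckpoint item can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `WarpableClock`, `EpochTiming`, `makeFakeClock`, and `warpOneEpochForward` are hypothetical names, and the choice to measure the epoch from the moment the sequencers stopped is an assumption.

```typescript
// Hypothetical minimal clock that a test can move forward (not the real cheat-code API).
interface WarpableClock {
  now(): number; // current L1 timestamp, in seconds
  warpTo(timestamp: number): void;
}

// Hypothetical timing parameters: seconds per slot and slots per epoch.
interface EpochTiming {
  slotDurationSec: number;
  slotsPerEpoch: number;
}

// In-memory clock used only for this illustration.
function makeFakeClock(start: number): WarpableClock {
  let current = start;
  return {
    now: () => current,
    warpTo: (timestamp: number) => {
      if (timestamp < current) throw new Error("clock cannot move backwards");
      current = timestamp;
    },
  };
}

// Warp one epoch past `stoppedAt` (assumed: when the sequencers were stopped).
// Returns true if the clock was moved, false if the warp was skipped.
function warpOneEpochForward(
  clock: WarpableClock,
  timing: EpochTiming,
  stoppedAt: number,
  alreadyProven: boolean,
): boolean {
  if (alreadyProven) return false; // nothing left to wait for
  const target = stoppedAt + timing.slotDurationSec * timing.slotsPerEpoch;
  if (target <= clock.now()) return false; // forward-only: never rewind the clock
  clock.warpTo(target);
  return true;
}

// Usage: a 12s slot and 4 slots per epoch give a 48s epoch.
const demoClock = makeFakeClock(1_000);
warpOneEpochForward(demoClock, { slotDurationSec: 12, slotsPerEpoch: 4 }, 1_000, false);
// demoClock.now() is now 1048
```

Returning a boolean rather than throwing keeps the helper safe to call unconditionally from a wait routine, which is the property the "forward-only, skipped if already proven" wording suggests.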

One planned item was dropped: an opt-in warp for ChainMonitor.waitUntilL2Slot. All three candidate call sites turned out to cover deliberate real-time building (live-sequencer coordination, inactivity accumulation across an epoch, and the proof-boundary critical window), so the opt-in API would have had no safe callers.

Verification

Locally: full yarn build, yarn format --check, and yarn lint on the touched packages all pass. The new world-state genesis-equivalence unit test passes (tmp and ephemeral roots identical). The automine/pxe.test.ts e2e passes as a smoke test for the register-only change and the 250ms polling. Everything multi-node (the warp folds, the timing cut) is validated on CI. Note the final commit is a deliberate slot-cut experiment that can be reverted in isolation.

Measured impact

Full green CI run of this PR (9b4cc967, CI 1782961631439529) vs the base-proxy full run (d160265b, CI 1782938936852228 — the branch point plus one unrelated one-file test change). Identical test populations: 2051 rows, all passed, in both runs. Sums are across parallel processes, not wall-clock (methodology in the Linear "Times tracking" doc).

| Bucket | Base | This PR | Δ |
| --- | --- | --- | --- |
| Overall | 7h 26m 03s | 6h 36m 20s | −49m 43s (−11.1%) |
| Setup (before-hooks) | 2h 14m 31s | 1h 45m 44s | −21.4% |
| of which setup.ts | 42m 53s | 34m 13s | −20.2% |
| Body | 5h 05m 14s | 4h 45m 03s | −6.6% |
| Teardown | 6m 17s | 5m 33s | −11.6% |
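The percentage deltas in the table can be reproduced from the displayed durations. The sketch below is illustrative only: `parseDuration` and `percentDelta` are hypothetical helpers, not part of the timing tooling, and because they work from whole-second display values a row may differ in the last digit from figures computed on unrounded data.

```typescript
// Parse a duration such as "7h 26m 03s" or "14m 12s" into whole seconds.
function parseDuration(text: string): number {
  const match = /^(?:(\d+)h\s*)?(?:(\d+)m\s*)?(?:(\d+)s)?$/.exec(text.trim());
  if (!match || match[0] === "") throw new Error(`unrecognized duration: "${text}"`);
  const hours = Number(match[1] ?? 0);
  const minutes = Number(match[2] ?? 0);
  const seconds = Number(match[3] ?? 0);
  return hours * 3600 + minutes * 60 + seconds;
}

// Signed percentage change from base to current (negative means faster).
function percentDelta(baseSec: number, currentSec: number): number {
  if (baseSec === 0) throw new Error("base duration must be non-zero");
  return ((currentSec - baseSec) / baseSec) * 100;
}

// Overall row: 26763s -> 23780s
const overall = percentDelta(parseDuration("7h 26m 03s"), parseDuration("6h 36m 20s"));
console.log(overall.toFixed(1)); // prints "-11.1"
```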

By mechanism:

  • Proven-checkpoint warp: wait:proven-checkpoint 14m 12s → 3m 32s (−75%), worst single wait 215s → 79s. Suite deltas match span deltas ~1:1 — proposed_chain −67%, deploy_and_call_ordering −45%, cross_chain_messages −21%. Hard attribution.
  • Epoch warps: multi_proof −52% (wait:epoch 92s → 40s). upload_failed_proof's warp also worked (35s → 20s) but was masked in the suite total by one-shot proving noise in that run.
  • 250ms polling + ephemeral genesis: ~30–35 min spread across the whole suite — 141 of 152 suites improved; every setup:env:* span dropped ~16–50% with identical counts (same work, less waiting).
  • Slot-cut experiment: multi_validator_node 103s → 84s (−18.6%) and passed CI — the experiment survives.
  • Register-only pxe test: 18.2s → 10.1s.
  • New setup spans: ~44 min/run of previously untagged setup time is now attributable (setup:auth-registry 16m, setup:bridge 10m, tx:mint 8m, deploy:token 7m, deploy:fpc 2m) — the target list for round 4.
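The "dead sleep" that the 250ms polling change removes can be illustrated with a small model. It assumes a poller that sleeps for one interval and then checks, repeating until the condition holds; `observedAtMs` and `deadSleepMs` are hypothetical names used only for this illustration and do not describe TestWallet's internals.

```typescript
// Time at which a sleep-then-check poller first sees a condition that became
// true at readyAtMs. Checks happen at intervalMs, 2 * intervalMs, and so on.
function observedAtMs(readyAtMs: number, intervalMs: number): number {
  if (intervalMs <= 0) throw new Error("intervalMs must be positive");
  const polls = Math.max(1, Math.ceil(readyAtMs / intervalMs));
  return polls * intervalMs;
}

// Time spent sleeping after the condition was already true.
function deadSleepMs(readyAtMs: number, intervalMs: number): number {
  return observedAtMs(readyAtMs, intervalMs) - readyAtMs;
}

// A tx that is ready after 100ms wastes 900ms at a 1s cadence
// but only 150ms at a 250ms cadence.
console.log(deadSleepMs(100, 1000)); // prints 900
console.log(deadSleepMs(100, 250)); // prints 150
```

Under this model the wasted time per wait is bounded by the poll interval, which is consistent with the ~0.5-0.9s per awaited tx reported at the 1s cadence when results become available early in the interval.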

11 suites regressed (~8 min total vs ~58 min of improvements), all in untagged real-time slashing/proving bodies whose tagged waits are unchanged — consistent with run-to-run variance, not PR effects (per-suite noise floor between same-day runs is median ~11s / p90 ~52s).

@spalladino added the ci-no-fail-fast (sets NO_FAIL_FAST in the CI so the run is not aborted on the first failure) and S-do-not-merge (Status: Do not merge this PR) labels on Jul 2, 2026