perf(e2e): warp proven-checkpoint waits, tighten in-process polling, and instrument setup spans #24452
Open
spalladino wants to merge 7 commits into
Motivation
Span-instrumented full CI runs (since #24407 landed) now show where the remaining e2e wall-clock goes: a large uninstrumented setup layer, a cluster of proven-checkpoint/epoch waits that burn real time while nothing is being built, and a 1s in-process poll cadence that adds ~0.5-0.9s of dead sleep to nearly every awaited tx. This PR takes the low-risk warp folds and polling wins that are safe today, and adds span instrumentation to the setup layer so the next round can attack the ~5,000s of currently-invisible beforeAll work. Everything multi-node is validated on CI (hence ci-no-fail-fast); the last commit is a deliberate slot-cut experiment kept isolated so it can be reverted on its own.

Changes
- send().wait() poll interval to 0.25s (in-process nodes reach CHECKPOINTED synchronously under automine and cheaply otherwise), with the spartan/worker sites explicitly restored to the 1s default since they talk to remote JSON-RPC nodes. The e2e-local wait helpers (wait_helpers.ts, waitForProvenChain, gas-portal block advance, L1->L2 message poll) drop from 1s to 0.25s. Removes ~0.5-0.9s of dead sleep per awaited tx across ~600 tests. Production aztec.js/wallet-sdk defaults are untouched.
- wait:proven-checkpoint across proposed_chain (~430s), deploy_and_call_ordering (~143s), cross_chain_messages (~143s), plus blob_promotion.
- waitUntilEpochStarts with warpToEpochStart in these two 12s-slot proving tests (both passed under warp in the round-1 CI sweep). proof_fails is deliberately left untouched.
- emit_nullifier, so register the contract instead of deploying it, dropping a deployment tx and its checkpoint cycle.
- generateGenesisValues used a full fsync-on NativeWorldStateService.tmp per e2e container just to read one tree root; switch to the fsync-off ephemeral backend. Adds a unit test asserting tmp and ephemeral produce identical archive roots for a funded-accounts genesis with non-zero timestamp (this path is consensus-critical: CLI deploy paths compute the on-chain genesis root through it).
- testSpan under the #24407 taxonomy ("test(e2e): instrument common spans for wall-clock tracking"): deploy:*, tx:mint, setup:bridge, setup:auth-registry. Zero behavior change (testSpan is a passthrough without TEST_TIMING_FILE); this is the data source for round 4's attack on the ~5,000s of uninstrumented beforeAll work.
- aztecSlotDuration 36 -> 16 and blockDurationMs 6000 -> 2000 together (eth stays 8). Small expected saving; kept as the final isolated commit so it can be reverted alone if CI shows committee/attestation trouble on this file.

One planned item was dropped: an opt-in warp for ChainMonitor.waitUntilL2Slot. All three candidate call sites turned out to cover deliberate real-time building (live-sequencer coordination, inactivity accumulation across an epoch, and the proof-boundary critical window), so the opt-in API would have had no safe callers.

Verification
Locally: full yarn build, yarn format --check, and yarn lint on the touched packages all pass. The new world-state genesis-equivalence unit test passes (tmp and ephemeral roots identical). The automine/pxe.test.ts e2e passes as a smoke test for the register-only change and the 250ms polling. Everything multi-node (the warp folds, the timing cut) is validated on CI. Note the final commit is a deliberate slot-cut experiment that can be reverted in isolation.

Measured impact
Full green CI run of this PR (9b4cc967, CI 1782961631439529) vs the base-proxy full run (d160265b, CI 1782938936852228; the branch point plus one unrelated one-file test change). Identical test populations: 2051 rows, all passed, in both runs. Sums are across parallel processes, not wall-clock (methodology in the Linear "Times tracking" doc).

By mechanism:
- wait:proven-checkpoint 14m 12s → 3m 32s (−75%), worst single wait 215s → 79s. Suite deltas match span deltas ~1:1: proposed_chain −67%, deploy_and_call_ordering −45%, cross_chain_messages −21%. Hard attribution.
- setup:env:* span dropped ~16–50% with identical counts (same work, less waiting).
- setup:auth-registry 16m, setup:bridge 10m, tx:mint 8m, deploy:token 7m, deploy:fpc 2m), the target list for round 4.

11 suites regressed (~8 min total vs ~58 min of improvements), all in untagged real-time slashing/proving bodies whose tagged waits are unchanged, which is consistent with run-to-run variance rather than PR effects (per-suite noise floor between same-day runs is median ~11s / p90 ~52s).
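
For anyone re-deriving the figures above, here is a minimal sketch of the arithmetic. It is not part of this PR or of the timing tooling: the helper names parseDuration, percentChange, and exceedsNoise are hypothetical, and the only inputs used are the durations quoted in this description.

```typescript
// Hypothetical helpers for sanity-checking the reported deltas.
// None of these names exist in the repository.

/** Parses a duration such as "14m 12s", "215s", or "16m" into seconds. */
function parseDuration(text: string): number {
  const match = /^(?:(\d+)m)?\s*(?:(\d+)s)?$/.exec(text.trim());
  if (!match || (match[1] === undefined && match[2] === undefined)) {
    throw new Error(`Unrecognized duration: ${text}`);
  }
  return Number(match[1] ?? 0) * 60 + Number(match[2] ?? 0);
}

/** Relative change from `before` to `after`, rounded to a whole percent. */
function percentChange(before: number, after: number): number {
  return Math.round(((after - before) / before) * 100);
}

/** True when a per-suite delta (in seconds) is larger than the noise floor. */
function exceedsNoise(deltaSeconds: number, noiseFloorSeconds: number): boolean {
  return Math.abs(deltaSeconds) > noiseFloorSeconds;
}

// wait:proven-checkpoint, summed across parallel processes.
const before = parseDuration("14m 12s"); // 852 seconds
const after = parseDuration("3m 32s"); // 212 seconds
console.log(percentChange(before, after)); // -75

// Average regression per suite (~8 min over 11 suites) against the p90 noise floor.
console.log(exceedsNoise((8 * 60) / 11, 52)); // false
```

The second check uses an average, so it only shows that the regressions are plausibly within noise in aggregate; an individual suite could still sit above the p90 floor.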