feat: merge-train/spartan-v5#24454
Queued
AztecBot wants to merge 4 commits into
Queued
Conversation
…failures (#24426) ## Problem In the docs-examples CI job (`docs/examples/bootstrap.sh execute`), `example_swap` and `recursive_verification` exhausted all 5 retry attempts on the same signature during `.send()` fee estimation: ``` Error: An unknown error occurred while executing the contract function "getTimestampForSlot". args: (54) Details: L1 RPC request failed code: -32702 ``` Stack: `sendTx -> completeFeeOptions -> getMinFees -> aztecNode.getPredictedMinFees / getCurrentMinFees` (over JSON-RPC to the node). The failing reads are node-side, in `FeeProviderImpl.computeCurrentMinFees` (`getTimestampForSlot` / `getManaMinFeeAt`) and `FeePredictor.fetchState`. Failing run: http://ci.aztec-labs.com/66ef70e6044f13bd (branch `merge-train/spartan-v5`, commit `0e1827b945`). ## Root cause `FeeProviderImpl.getCurrentMinFees` and `FeePredictor.getState` cache a promise keyed on the L1 block number, refreshing only when the block advances. When the cached computation rejects (a transient `L1 RPC request failed` blip against the single 1-CPU anvil the docs-examples job runs), the rejected promise is replayed on every subsequent call at the same L1 block. `getCurrentMinFees` was worse: it chained the next computation via `currentMinFees.then(() => computeCurrentMinFees())`. Once `currentMinFees` is rejected, `.then(onFulfilled)` propagates the rejection and never invokes `computeCurrentMinFees` again — so the cache stays poisoned permanently, regardless of L1 block advancement, until the node restarts. Because fee estimation fails before any L1 tx is submitted, no new L1 block is mined by that path, so a single blip permanently wedges `.send()` on the node. The CI log confirms this: after one transient failure on `getManaMinFeeAt`, `getTimestampForSlot(54)` failed identically across all 5 retries of both examples — even though the node kept mining L1 blocks (148-152) for the examples' own contract deploys in the same second. This rules out a live node/anvil stall and pinpoints the poisoned promise cache. ## Fix - `getCurrentMinFees`: chain off the previous promise's settlement (via a swallowing `.catch`) rather than its fulfillment, so a prior rejection no longer short-circuits the new computation; and reset the cached L1 block number when the computation rejects so the next call recomputes at the same block. - `getState` (FeePredictor): reset the cached L1 block number on rejection for the same reason. Both are idempotent read paths, so recomputing on the next call is safe. ## Testing - Added unit tests to `fee_provider.test.ts` and `fee_predictor.test.ts` that fail (red) against the current code and pass (green) with the fix: a transient rejection followed by a call at the same L1 block must recompute rather than replay the cached rejection. - Red/green verified locally: both new tests fail on the unpatched source and the full `sequencer-client` fee suites pass (13 tests) with the fix. - A deterministic docs-examples repro was not feasible (it depends on a transient L1 RPC blip under CI contention), so the tests target the poisoning mechanism directly.
…24441) Fixes a genuine timing flake in `single-node/proving/proof_fails.parallel.test.ts` › `does not allow submitting proof after epoch end`, observed in CI run `99ed463db48b9cf2` on the v5 line (`Expected: 2, Received: 0` at the final epoch assertion). This test ran into a situation where the proposer `prune`d the unproven chain without `propose`ing a new checkpoint, so the check for "has the last checkpoint been mined in epoch 2" failed, since there was _no checkpoint at all_ in the entire chain. This fixes it so we wait until there's actually a checkpoint mined. In parallel, I'm looking into _why_ the proposer decided to prune without proposing.
Any commits made after this event will not be merged.
Collaborator
Author
|
🤖 Auto-merge enabled after 4 hours of inactivity. This PR will be merged automatically once all checks pass. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
BEGIN_COMMIT_OVERRIDE
fix(sequencer): recover fee-estimation caches from transient L1 read failures (#24426)
test(e2e): fix epoch-2 tip race in proof_fails after-epoch-end test (#24441)
END_COMMIT_OVERRIDE