fix(metrics): dedupe HA validator replicas in slot fill-rate alert #24453
Draft
AztecBot wants to merge 1 commit into
Problem
The netlowSlotFillRate alert (Sequencer - low slot fill rate) fires spuriously on HA-deployed networks (e.g. next-net). It computes slot_filled_count / slot_total_count, and aztec_sequencer_slot_total_count is inflated ~2× while slot_filled_count stays accurate, so the ratio reads ~50% and trips the 80% threshold even when every slot is filled.

Observed on next-net: ~100 slots counted with ~50 filled over 1h, where 3600s / 72s = 50 is the true slot count.

Root cause
With VALIDATOR_HA_REPLICAS > 0, each logical validator is deployed as a primary + HA pair that share attesters (e.g. validator-0 and validator-ha-1-0). Both pods run the checkpoint build loop every slot, so both call SequencerMetrics.incOpenSlot() → aztec_sequencer_slot_total_count += 1 (yarn-project/sequencer-client/src/sequencer/metrics.ts:278, from the build path at checkpoint_proposal_job.ts:565).

Only the pod that wins publish leader-election lands the checkpoint on L1 and calls incFilledSlot() → aztec_sequencer_slot_filled_count += 1 (checkpoint_proposal_job.ts:312). HA PostgreSQL coordination prevents the second pod from double-signing/publishing.

Net effect per slot: 2 opens, 1 fill. The in-process lastSeenSlot dedup in incOpenSlot can't help: the two counts come from two separate pods. Confirmed in next-net logs: two distinct pods emit "Starting checkpoint proposal for slot N" per slot, one emits "Checkpoint published for slot N". No fisherman node is involved (fishermanMode=False on all validators).

This also inflates the miss count (total - filled), since a genuinely missed slot is built by both pods and published by neither.

Fix
Both HA pods emit aztec_sequencer_slot_total_count with the same aztec_block_proposer label (the on-chain selected proposer), differing only by pod/instance labels. Collapse them with max by (k8s_namespace_name, aztec_block_proposer) before summing across proposers, in both the ratio denominator and the >= 5 data-sufficiency guard.

slot_filled_count is left as a plain sum; it is not doubled (only the publisher increments it). Non-HA namespaces are unaffected: max by (proposer) over a single pod is that pod's count.

Notes / follow-ups
Calling incOpenSlot only on the active leader pod, so total_count is one-per-slot at emission time, is a larger sequencer change and is left as a follow-up. This PR is the query-level dedupe.
network-tps.json is separately broken (max(total) - max(total) always evaluates to 0) and is out of scope here.

Analysis with code refs and example log lines: https://gist.github.com/AztecBot/df6854ddd11bf951e3a3375909cd64d3
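
For reviewers who want to check the arithmetic, here is a rough Python model of the aggregation described under Fix. It is not the alert's actual PromQL; it only imitates "max by (k8s_namespace_name, aztec_block_proposer), then sum" on hand-written samples. The proposer addresses (0xA, 0xB), the second validator pair (validator-1, validator-ha-1-1), and the per-pod split of the 50 slots are all made-up values for illustration.

```python
from collections import defaultdict

# Hypothetical per-pod counter increases over a 1h window (3600s / 72s = 50 slots).
# Each sample: (k8s_namespace_name, pod, aztec_block_proposer, value).
# Both pods of an HA pair count every slot their shared proposer is selected for.
total_samples = [
    ("next-net", "validator-0", "0xA", 30),
    ("next-net", "validator-ha-1-0", "0xA", 30),
    ("next-net", "validator-1", "0xB", 20),
    ("next-net", "validator-ha-1-1", "0xB", 20),
]
# Only the pod that wins publish leader-election counts a fill,
# so fills are split across the pair but never duplicated.
filled_samples = [
    ("next-net", "validator-0", "0xA", 18),
    ("next-net", "validator-ha-1-0", "0xA", 12),
    ("next-net", "validator-1", "0xB", 20),
]


def plain_sum(samples):
    """Sum every sample per namespace (no dedupe)."""
    out = defaultdict(int)
    for namespace, _pod, _proposer, value in samples:
        out[namespace] += value
    return dict(out)


def deduped_sum(samples):
    """Take the max per (namespace, proposer), then sum per namespace."""
    per_proposer = {}
    for namespace, _pod, proposer, value in samples:
        key = (namespace, proposer)
        per_proposer[key] = max(per_proposer.get(key, 0), value)
    out = defaultdict(int)
    for (namespace, _proposer), value in per_proposer.items():
        out[namespace] += value
    return dict(out)


filled = plain_sum(filled_samples)["next-net"]
naive_total = plain_sum(total_samples)["next-net"]
deduped_total = deduped_sum(total_samples)["next-net"]

naive_ratio = filled / naive_total
deduped_ratio = filled / deduped_total

print(f"naive:   {filled}/{naive_total} = {naive_ratio:.2f}")
print(f"deduped: {filled}/{deduped_total} = {deduped_ratio:.2f}")

# A non-HA namespace has one pod per proposer, so the max is that pod's count.
non_ha = [("devnet", "validator-0", "0xA", 50)]
non_ha_unchanged = deduped_sum(non_ha) == plain_sum(non_ha)
```

With these sample values the naive ratio is 50/100 and the deduped ratio is 50/50, matching the ~50% reading seen on next-net when every slot is filled.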
Created by claudebox · group: slackbot