fix(metrics): dedupe HA validator replicas in slot fill-rate alert #24453

Draft
AztecBot wants to merge 1 commit into next from cb/slot-fill-rate-ha-dedupe

AztecBot (Collaborator) commented Jul 2, 2026

Problem

The netlowSlotFillRate alert (Sequencer - low slot fill rate) fires spuriously on HA-deployed networks (e.g. next-net). It computes aztec_sequencer_slot_filled_count / aztec_sequencer_slot_total_count. On these networks aztec_sequencer_slot_total_count is inflated ~2× while aztec_sequencer_slot_filled_count stays accurate, so the ratio reads ~50% and falls below the 80% threshold even when every slot is filled.

Observed on next-net: ~100 slots counted with ~50 filled over 1h, where 3600s / 72s = 50 is the true slot count.
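The arithmetic behind that observation can be checked with a short sketch. The 72s slot duration, the 1h window and the 80% threshold come from the description above; the observed counts are the rounded figures quoted there, and the variable names are illustrative only.

```python
# Figures quoted in the description (observed counts are approximate).
SLOT_DURATION_S = 72
WINDOW_S = 3600
ALERT_THRESHOLD = 0.80

# True number of slots in the window.
expected_slots = WINDOW_S // SLOT_DURATION_S

# Rounded values observed on next-net over the same window.
observed_total = 100
observed_filled = 50

# Ratio the alert sees versus the ratio against the true slot count.
reported_ratio = observed_filled / observed_total
true_ratio = observed_filled / expected_slots

# The alert fires when the reported ratio is below the threshold.
alert_fires = reported_ratio < ALERT_THRESHOLD

print(expected_slots, reported_ratio, true_ratio, alert_fires)
```

With these inputs the reported ratio is 0.5 although every one of the 50 slots was filled, so the alert fires on a healthy network.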

Root cause

With VALIDATOR_HA_REPLICAS > 0, each logical validator is deployed as a primary + HA pair that share attesters (e.g. validator-0 and validator-ha-1-0). Both pods run the checkpoint build loop every slot, so both call SequencerMetrics.incOpenSlot(), which does aztec_sequencer_slot_total_count += 1 (yarn-project/sequencer-client/src/sequencer/metrics.ts:278, from the build path at checkpoint_proposal_job.ts:565).

Only the pod that wins publish leader-election lands the checkpoint on L1 and calls incFilledSlot(), which does aztec_sequencer_slot_filled_count += 1 (checkpoint_proposal_job.ts:312). HA PostgreSQL coordination prevents the second pod from double-signing/publishing.

Net effect per slot: 2 opens, 1 fill. The in-process lastSeenSlot dedup in incOpenSlot can't help — the two counts come from two separate pods. Confirmed in next-net logs: two distinct pods emit "Starting checkpoint proposal for slot N" per slot, one emits "Checkpoint published for slot N". No fisherman node is involved (fishermanMode=False on all validators).

This also inflates the miss count (total - filled), since a genuinely missed slot is built by both pods and published by neither.
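The counting behaviour described above can be reproduced with a toy model. This is a sketch under simplifying assumptions, not the sequencer code: `PodMetrics` is a hypothetical stand-in for one pod's counters, a single logical validator proposes every slot, and the first pod always wins leader election.

```python
class PodMetrics:
    """Hypothetical stand-in for the counters held by one pod."""

    def __init__(self):
        self.slot_total = 0
        self.slot_filled = 0
        self._last_seen_slot = None

    def inc_open_slot(self, slot):
        # In-process dedup: a slot is counted at most once per pod.
        if slot != self._last_seen_slot:
            self._last_seen_slot = slot
            self.slot_total += 1

    def inc_filled_slot(self):
        self.slot_filled += 1


def simulate(num_slots, missed_slots, pods_per_validator):
    """Return (total, filled) summed across all pods, as a plain sum would."""
    pods = [PodMetrics() for _ in range(pods_per_validator)]
    for slot in range(num_slots):
        for pod in pods:
            pod.inc_open_slot(slot)
            pod.inc_open_slot(slot)  # repeat within one pod is deduped
        if slot not in missed_slots:
            pods[0].inc_filled_slot()  # only the publishing pod counts a fill
    total = sum(p.slot_total for p in pods)
    filled = sum(p.slot_filled for p in pods)
    return total, filled


missed = {7, 21}  # two genuinely missed slots, chosen arbitrarily
ha_total, ha_filled = simulate(50, missed, pods_per_validator=2)
solo_total, solo_filled = simulate(50, missed, pods_per_validator=1)

print("HA:    ", ha_total, ha_filled, ha_total - ha_filled)
print("non-HA:", solo_total, solo_filled, solo_total - solo_filled)
```

In this model the per-pod dedup removes repeats inside a pod but not across pods, so the HA run reports 100 opens for 50 slots and a miss count of 52 where the true figure is 2.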

Fix

Both HA pods emit aztec_sequencer_slot_total_count with the same aztec_block_proposer label (the on-chain selected proposer), differing only by pod/instance labels. Collapse them with max by (k8s_namespace_name, aztec_block_proposer) before summing across proposers, in both the ratio denominator and the >= 5 data-sufficiency guard:

```promql
sum by (k8s_namespace_name) (
  max by (k8s_namespace_name, aztec_block_proposer) (
    increase(aztec_sequencer_slot_total_count{...}[10m])
  )
)
```

slot_filled_count is left as a plain sum — it is not doubled (only the publisher increments it). Non-HA namespaces are unaffected: max by (proposer) over a single pod is that pod's count.
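To make the aggregation concrete, here is a small Python model of the `max by` then `sum by` step applied to per-pod increases. It only mimics the grouping semantics and is not how Prometheus evaluates the query; the sample values, the `solo-net` namespace, the proposer addresses and the pod name validator-ha-1-1 are made up for illustration.

```python
from collections import defaultdict


def plain_sum(samples):
    """sum by (namespace) over raw per-pod increases."""
    totals = defaultdict(int)
    for namespace, _proposer, _pod, value in samples:
        totals[namespace] += value
    return dict(totals)


def max_by_then_sum(samples):
    """max by (namespace, proposer), then sum by (namespace)."""
    per_proposer = {}
    for namespace, proposer, _pod, value in samples:
        key = (namespace, proposer)
        per_proposer[key] = max(per_proposer.get(key, value), value)
    totals = defaultdict(int)
    for (namespace, _proposer), value in per_proposer.items():
        totals[namespace] += value
    return dict(totals)


# (namespace, aztec_block_proposer, pod, increase over the window)
samples = [
    # HA namespace: each proposer is reported by a primary and an HA pod.
    ("next-net", "0xaaa", "validator-0", 5),
    ("next-net", "0xaaa", "validator-ha-1-0", 5),
    ("next-net", "0xbbb", "validator-1", 3),
    ("next-net", "0xbbb", "validator-ha-1-1", 3),
    # Non-HA namespace: one pod per proposer.
    ("solo-net", "0xccc", "validator-0", 4),
    ("solo-net", "0xddd", "validator-1", 4),
]

raw = plain_sum(samples)
deduped = max_by_then_sum(samples)
print(raw)
print(deduped)
```

The HA namespace drops from 16 to 8 once the replica pair is collapsed per proposer, while the non-HA namespace reads 8 either way.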

Notes / follow-ups

  • The proper source-side fix is to only incOpenSlot on the active leader pod so total_count is one-per-slot at emission time; that is a larger sequencer change and is left as a follow-up. This PR is the query-level dedupe.
  • The "Missed Slots" panel (id 56) in network-tps.json is separately broken (max(total) - max(total) always evaluates to 0) and is out of scope here.

Analysis with code refs and example log lines: https://gist.github.com/AztecBot/df6854ddd11bf951e3a3375909cd64d3


Created by claudebox · group: slackbot

AztecBot added the ci-draft, ci-no-fail-fast and claudebox labels Jul 2, 2026