fix(metrics): dedupe HA validator replicas in slot fill-rate alert by AztecBot · Pull Request #24453 · AztecProtocol/aztec-packages

AztecBot · 2026-07-02T07:44:59Z

Problem

The netlowSlotFillRate alert (Sequencer - low slot fill rate) fires spuriously on HA-deployed networks (e.g. next-net). It computes slot_filled_count / slot_total_count, and aztec_sequencer_slot_total_count is inflated ~2× while slot_filled_count stays accurate — so the ratio reads ~50% and trips the 80% threshold even when every slot is filled.

Observed on next-net: ~100 slots counted with ~50 filled over 1h, where 3600s / 72s = 50 is the true slot count.
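The arithmetic behind that observation can be checked with a few lines. This is a minimal sketch using the figures quoted above (72s slots, a 1h window, an 80% threshold); the variable names are illustrative, and it assumes the alert fires when the ratio drops below the threshold.

```python
SLOT_DURATION_S = 72      # slot length quoted above
WINDOW_S = 3600           # 1h observation window
ALERT_THRESHOLD = 0.80    # assumption: the alert fires below this ratio

# True number of slots in the window.
true_slots = WINDOW_S // SLOT_DURATION_S

# With an HA pair, both pods count every slot as opened,
# but only the publishing pod counts it as filled.
reported_total = 2 * true_slots
reported_filled = true_slots

observed_ratio = reported_filled / reported_total
actual_ratio = reported_filled / true_slots
alert_fires = observed_ratio < ALERT_THRESHOLD

print(true_slots, reported_total, observed_ratio, actual_ratio, alert_fires)
```

Every slot is filled in this scenario, yet the reported ratio sits well under the threshold.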

Root cause

With VALIDATOR_HA_REPLICAS > 0, each logical validator is deployed as a primary + HA pair that share attesters (e.g. validator-0 and validator-ha-1-0). Both pods run the checkpoint build loop every slot, so both call SequencerMetrics.incOpenSlot() → aztec_sequencer_slot_total_count += 1 (yarn-project/sequencer-client/src/sequencer/metrics.ts:278, from the build path at checkpoint_proposal_job.ts:565).

Only the pod that wins publish leader-election lands the checkpoint on L1 and calls incFilledSlot() → aztec_sequencer_slot_filled_count += 1 (checkpoint_proposal_job.ts:312). HA PostgreSQL coordination prevents the second pod from double-signing/publishing.

Net effect per slot: 2 opens, 1 fill. The in-process lastSeenSlot dedup in incOpenSlot can't help — the two counts come from two separate pods. Confirmed in next-net logs: two distinct pods emit "Starting checkpoint proposal for slot N" per slot, one emits "Checkpoint published for slot N". No fisherman node is involved (fishermanMode=False on all validators).

This also inflates the miss count (total - filled), since a genuinely missed slot is built by both pods and published by neither.
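The per-slot accounting described above can be reproduced with a toy model. This is a sketch only: `PodMetrics` is a hypothetical stand-in for one pod's counters, not the real `SequencerMetrics` class, and it assumes the in-process dedup simply skips a repeated call for the same slot.

```python
from dataclasses import dataclass
from typing import Optional, Set, Tuple


@dataclass
class PodMetrics:
    """Hypothetical model of the counters held by a single pod."""
    slot_total: int = 0
    slot_filled: int = 0
    last_seen_slot: Optional[int] = None

    def inc_open_slot(self, slot: int) -> None:
        # In-process dedup: a repeated call for the same slot is ignored.
        if slot == self.last_seen_slot:
            return
        self.last_seen_slot = slot
        self.slot_total += 1

    def inc_filled_slot(self) -> None:
        self.slot_filled += 1


def run_ha_pair(num_slots: int, missed: Set[int]) -> Tuple[int, int]:
    """Simulate a primary + HA replica and return the summed (total, filled)."""
    primary, replica = PodMetrics(), PodMetrics()
    for slot in range(num_slots):
        for pod in (primary, replica):
            pod.inc_open_slot(slot)
            pod.inc_open_slot(slot)  # same process, same slot: deduped
        if slot not in missed:
            primary.inc_filled_slot()  # assume the primary wins publishing
    total = primary.slot_total + replica.slot_total
    filled = primary.slot_filled + replica.slot_filled
    return total, filled


total, filled = run_ha_pair(num_slots=50, missed={7})
print(total, filled, total - filled)
```

With one genuinely missed slot out of 50, the summed counters give a miss count of 51 rather than 1, because the dedup state is per process and never sees the other pod's increment.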

Fix

Both HA pods emit aztec_sequencer_slot_total_count with the same aztec_block_proposer label (the on-chain selected proposer), differing only by pod/instance labels. Collapse them with max by (k8s_namespace_name, aztec_block_proposer) before summing across proposers, in both the ratio denominator and the >= 5 data-sufficiency guard:

sum by (k8s_namespace_name) (
  max by (k8s_namespace_name, aztec_block_proposer) (
    increase(aztec_sequencer_slot_total_count{...}[10m])
  )
)

slot_filled_count is left as a plain sum because it is not doubled (only the publisher increments it). Non-HA namespaces are unaffected: max by (k8s_namespace_name, aztec_block_proposer) over a single pod is that pod's count.
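To illustrate what the nested aggregation does, here is a small emulation of `sum by (...) (max by (...) (...))` over hand-written samples. The sample values, the proposer addresses, and the `testnet` namespace are made up for the example; only the label names and pod naming pattern come from the description above.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (k8s_namespace_name, aztec_block_proposer, pod, increase over the window)
Sample = Tuple[str, str, str, float]

samples: List[Sample] = [
    # HA namespace: each proposer is reported by a primary and an HA replica.
    ("next-net", "0xaaa", "validator-0", 4.0),
    ("next-net", "0xaaa", "validator-ha-1-0", 4.0),
    ("next-net", "0xbbb", "validator-1", 3.0),
    ("next-net", "0xbbb", "validator-ha-1-1", 3.0),
    # Non-HA namespace (hypothetical): one pod per proposer.
    ("testnet", "0xccc", "validator-0", 5.0),
]


def plain_sum(rows: List[Sample]) -> Dict[str, float]:
    """Emulates sum by (k8s_namespace_name) over all series."""
    out: Dict[str, float] = defaultdict(float)
    for namespace, _proposer, _pod, value in rows:
        out[namespace] += value
    return dict(out)


def deduped_sum(rows: List[Sample]) -> Dict[str, float]:
    """Emulates sum by (namespace) of max by (namespace, proposer)."""
    per_proposer: Dict[Tuple[str, str], float] = {}
    for namespace, proposer, _pod, value in rows:
        key = (namespace, proposer)
        per_proposer[key] = max(per_proposer.get(key, 0.0), value)
    out: Dict[str, float] = defaultdict(float)
    for (namespace, _proposer), value in per_proposer.items():
        out[namespace] += value
    return dict(out)


print(plain_sum(samples))
print(deduped_sum(samples))
```

The HA namespace drops from 14 to 7 once replicas are collapsed, while the single-pod namespace yields 5 either way.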

Notes / follow-ups

The proper source-side fix is to only incOpenSlot on the active leader pod so total_count is one-per-slot at emission time; that is a larger sequencer change and is left as a follow-up. This PR is the query-level dedupe.
The "Missed Slots" panel (id 56) in network-tps.json is separately broken (max(total) - max(total) always evaluates to 0) and is out of scope here.
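For the source-side follow-up in the first note, the idea can be sketched as a leader-gated counter. This is purely illustrative: `GatedPodMetrics` and its `is_leader` flag are hypothetical, the real sequencer is TypeScript, and how leadership would be determined there is not specified in this PR.

```python
class GatedPodMetrics:
    """Hypothetical counter that only the active leader pod increments."""

    def __init__(self, is_leader: bool) -> None:
        self.is_leader = is_leader  # assumption: leadership is known up front
        self.slot_total = 0
        self.last_seen_slot = None

    def inc_open_slot(self, slot: int) -> None:
        # Non-leader pods skip the increment entirely.
        if not self.is_leader:
            return
        # Keep the existing in-process dedup for repeated calls.
        if slot == self.last_seen_slot:
            return
        self.last_seen_slot = slot
        self.slot_total += 1


leader = GatedPodMetrics(is_leader=True)
follower = GatedPodMetrics(is_leader=False)
for slot in range(50):
    leader.inc_open_slot(slot)
    follower.inc_open_slot(slot)

summed_total = leader.slot_total + follower.slot_total
print(summed_total)
```

Under that assumption the counter is one-per-slot at emission time, so a plain sum across pods would already be correct without the query-level max.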

Analysis with code refs and example log lines: https://gist.github.com/AztecBot/df6854ddd11bf951e3a3375909cd64d3

Created by claudebox · group: slackbot

fix(metrics): dedupe HA validator replicas in slot fill-rate alert

93ad55d

AztecBot added the ci-draft (Run CI on draft PRs), ci-no-fail-fast (Sets NO_FAIL_FAST in the CI so the run is not aborted on the first failure), and claudebox (Owned by claudebox; it can push to this PR) labels on Jul 2, 2026

AztecBot wants to merge 1 commit into next from cb/slot-fill-rate-ha-dedupe
