otel metrics for autobahn/avail and p2p/mux by pompon0 · Pull Request #3682 · sei-protocol/sei-chain

pompon0 · 2026-07-01T13:17:49Z

It will give us insight into consensus state and rpc performance.

cursor · 2026-07-01T13:18:36Z

PR Summary

Low Risk
Observability-only changes on existing code paths; metric helpers dedupe QCs and use lightweight counters/histograms with no consensus or wire-protocol behavior changes.

Overview
Adds OpenTelemetry instrumentation for consensus availability and multiplexed P2P RPC so operators can track finality progress and stream performance.

Autobahn avail gets a new metrics package with gauges for latest app/commit road and global block indices, a histogram for proposal-to-commit latency, and commit-to-commit interval counters labeled by capped view timeouts. ObserveAppQC and ObserveCommitQC run from prune and from PushCommitQC, with mutex-backed dedup so stale or duplicate QCs are ignored.

P2P mux records per-stream open/close latency, in-flight streams, and send/recv message and byte counts, broken out by role (accept vs connect) and rpc_name. StreamKindConfig.Name is populated from RPC registration (request type name) when building the mux config. Stream lifecycle hooks live on open, send, recv, and close.

Also simplifies the legacy P2P metricsLabelCache mutex (embedded sync.RWMutex, any instead of interface{}) and a small rpc Serve refactor plus a client-side rate-limit comment typo fix.

^{Reviewed by Cursor Bugbot for commit ba94963. Bugbot is set up for automated code reviews on this repo. Configure here.}

github-actions · 2026-07-01T13:19:27Z

The latest Buf updates on your PR. Results from workflow Buf / buf (pull_request).

Build	Format	Lint	Breaking	Updated (UTC)
`✅ passed`	`✅ passed`	`✅ passed`	`✅ passed`	Jul 1, 2026, 1:41 PM

codecov · 2026-07-01T13:20:49Z

Codecov Report

❌ Patch coverage is 96.70330% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 58.04%. Comparing base (2378fca) to head (ba94963).
⚠️ Report is 9 commits behind head on main.

Files with missing lines	Patch %	Lines
sei-tendermint/internal/p2p/mux/metrics/metrics.go	93.10%	1 Missing and 1 partial ⚠️
sei-tendermint/internal/autobahn/avail/state.go	66.66%	0 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3682      +/-   ##
==========================================
- Coverage   58.97%   58.04%   -0.94%     
==========================================
  Files        2263     2181      -82     
  Lines      187223   177649    -9574     
==========================================
- Hits       110421   103117    -7304     
+ Misses      66858    65384    -1474     
+ Partials     9944     9148     -796

Flag	Coverage Δ
sei-chain-pr	`78.03% <96.70%> (?)`
sei-db	`70.41% <ø> (ø)`
sei-db-state-db	`?`

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines	Coverage Δ
sei-tendermint/internal/autobahn/avail/inner.go	`97.46% <100.00%> (+0.06%)`	⬆️
...dermint/internal/autobahn/avail/metrics/metrics.go	`100.00% <100.00%> (ø)`
sei-tendermint/internal/p2p/metrics.go	`100.00% <100.00%> (ø)`
sei-tendermint/internal/p2p/mux/mux.go	`78.85% <100.00%> (+0.18%)`	⬆️
sei-tendermint/internal/p2p/mux/stream.go	`92.85% <100.00%> (+0.35%)`	⬆️
sei-tendermint/internal/p2p/mux/stream_state.go	`91.66% <100.00%> (ø)`
sei-tendermint/internal/p2p/rpc/rpc.go	`85.33% <100.00%> (+1.75%)`	⬆️
sei-tendermint/internal/autobahn/avail/state.go	`79.26% <66.66%> (+0.10%)`	⬆️
sei-tendermint/internal/p2p/mux/metrics/metrics.go	`93.10% <93.10%> (ø)`

... and 85 files with indirect coverage changes

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

seidroid

Adds OpenTelemetry metrics for autobahn/avail consensus QCs and p2p/mux stream I/O, plus minor refactors. The instrumentation is reasonable, but the new mux metrics file is not gofmt-clean (will fail the lint CI), and several monotonic totals are declared as UpDownCounters.

Findings: 2 blocking | 4 non-blocking | 2 posted inline

Blockers

sei-tendermint/internal/p2p/mux/metrics/metrics.go is not gofmt-clean, which violates the repo's formatting requirement and will fail golangci/make lint in CI (see inline comment). Run gofmt -s -w on the file.
1 blocking issue(s) flagged inline on specific lines.

Non-blocking

Repository review guidelines (REVIEW_GUIDELINES.md) were empty/missing, so no repo-specific standards were applied beyond AGENTS.md.
The Cursor second-opinion pass (cursor-review.md) produced no output.
In avail/metrics.go ObserveCommitQC, proposalToCommitLatency.Record(...) is called before the ordering guard last.val.Index() >= qc.Index(), so a re-observed/stale CommitQC still records a latency sample. Consider moving the record after the guard if only fresh QCs should be measured (minor, metrics-only).
1 suggestion(s)/nit(s) flagged inline on specific lines.

seidroid · 2026-07-01T13:24:02Z

+	"go.opentelemetry.io/otel"
+	"go.opentelemetry.io/otel/attribute"
+	"go.opentelemetry.io/otel/metric"
+	prometheus "github.com/prometheus/client_golang/prometheus"


[blocker] This file is not gofmt-clean and will fail the lint CI. Two issues: (1) within this import group the prometheus "github.com/prometheus/client_golang/prometheus" line is out of order — gofmt sorts by path, so github.com/... must come before the go.opentelemetry.io/... imports; (2) the following blank line (line 11) contains a trailing tab. Run gofmt -s -w to fix. (Also flagged by Codex.)

seidroid · 2026-07-01T13:24:02Z

+var meter = otel.Meter("tendermint_internal_p2p_mux")
+var latency = utils.OrPanic1(meter.Float64Histogram("latency", metric.WithUnit("s"), buckets(0.001, 1.3, 30)))
+var inFlight = utils.OrPanic1(meter.Int64UpDownCounter("inflight"))
+var sendMsgs = utils.OrPanic1(meter.Int64UpDownCounter("send_msgs"))


[suggestion] sendMsgs, sendBytes, recvMsgs, and recvBytes are only ever incremented (Add(+n)), so they are monotonic totals and should be Int64Counter, not Int64UpDownCounter. UpDownCounter exports as a non-monotonic gauge, which loses the counter/_total semantics that rate() and similar queries expect. (inFlight correctly stays an UpDownCounter.) Also flagged by Codex.

cursor

Cursor Bugbot has reviewed your changes using default effort and found 2 potential issues.

^{❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.}

^{Reviewed by Cursor Bugbot for commit ba94963. Configure here.}

cursor · 2026-07-01T13:42:56Z

+			commitToCommitLatencySum.Add(ctx, now.Sub(last.time).Seconds(), attrs)
+			commitToCommitLatencyCount.Add(ctx, 1, attrs)
+		}
+		proposalToCommitLatency.Record(ctx, now.Sub(qc.Proposal().Timestamp()).Seconds())


Restart skews commit latency histogram

Medium Severity

On startup, observing the persisted prune anchor records proposal_to_commit_latency as wall time since the anchor proposal’s original timestamp. That can be hours or days of downtime, producing a bogus latency sample that distorts histograms and SLO views after every restart.

Additional Locations (1)

sei-tendermint/internal/autobahn/avail/inner.go#L91-L93

^{Reviewed by Cursor Bugbot for commit ba94963. Configure here.}

cursor · 2026-07-01T13:42:56Z

+)
+
+func buckets(start float64, factor float64, count int) metric.HistogramOption {
+	return metric.WithExplicitBucketBoundaries(prometheus.ExponentialBuckets(start, factor, count)...)


Histogram buckets use wrong helper

Low Severity

New OTel histograms build bucket boundaries via prometheus.ExponentialBuckets in a local helper. Repo convention for new OTel metrics is ExponentialBucketsRange so bucket spans stay consistent with instruments like mempool compact duration.

Additional Locations (1)

sei-tendermint/internal/p2p/mux/metrics/metrics.go#L14-L16

^{Triggered by learned rule: sei-tendermint: new metrics must be OTel native with established conventions}

^{Reviewed by Cursor Bugbot for commit ba94963. Configure here.}

seidroid

Observability-only PR adding OpenTelemetry metrics for autobahn/avail consensus QCs and p2p/mux stream traffic. The changes are correct and well-structured (dedup guards, bounded label cardinality, safe mutex-by-value refactor); the only substantive concern is that the in-flight stream gauge can drift upward when streams are abandoned without a local Close.

Findings: 0 blocking | 4 non-blocking | 1 posted inline

Blockers

None at the file/PR level.

Non-blocking

No tests were added for the new metrics helpers (ObserveCommitQC/ObserveAppQC dedup logic and the Stream metric lifecycle). A small unit test around the monotonic dedup guards (out-of-order QCs are ignored) and Open/Close idempotency would guard against regressions; acceptable to skip for an observability-only change, but worth noting.
Cursor's second-opinion pass (cursor-review.md) produced no output, and REVIEW_GUIDELINES.md was empty on the base branch, so no repo-specific standards were applied. Codex's single medium finding (the inFlight counter leak) is incorporated as an inline comment.
Minor: the Send metric is recorded when a message is enqueued to the frame queue (stream.go:95) rather than when it is actually written to the connection, so send_bytes/send_msgs slightly lead the wire. This is a reasonable approximation but worth a comment if precise byte accounting is ever expected.
1 suggestion(s)/nit(s) flagged inline on specific lines.

seidroid · 2026-07-01T13:46:00Z

 		return
 	}
 	inner.closed.local = true
+	inner.metrics.Close()


[suggestion] The inFlight up/down counter is only decremented here in the local close() path. Open() increments it once a stream's recv end is established, but the decrement never runs if a stream is abandoned without a local Close() — e.g. Mux.Run exits on a connection/read/write error, or a caller observes remote close (tryPrune in runRecv only fully prunes once both local and remote are closed) but never calls Close(). Because inFlight is a process-global counter shared across all connections, this drifts permanently upward across peer churn, making the gauge unreliable over a long-running node. Consider decrementing symmetrically on prune/teardown (e.g. from tryPrune or a runner shutdown hook) so an opened stream is always balanced by exactly one decrement. (Matches Codex's medium finding.)

Superseded: latest AI review found no blocking issues.

bdchatham

xreview summary (opentelemetry / sei-networking / systems / idiom lenses). Nice, well-scoped instrumentation — most of it is clean. Three points worth resolving before merge, plus a few smaller ones inline.

🔴 MeterProvider wiring — where is the global MeterProvider set up? otel.Meter() is a no-op unless a provider with a Prometheus/OTLP reader and a service.name resource is registered in the binary, and I don't see it in this diff. If it isn't already wired in main, every instrument here silently records nothing — worth confirming before merge.

On your weighted-histogram question: you don't need to call Record more than once. You have exactly one latency value per commit, so it's a single Record into a Float64Histogram — the sum+count counter split was solving a problem the code doesn't have. Details in the inline comment on commitToCommitLatency.

bdchatham · 2026-07-01T15:38:39Z

+	"tendermint_internal_autobahn_avail__app_global_block_number",
+	metric.WithDescription("global block number of the highest observed appQC"),
+))
+var proposalToCommitLatency = utils.OrPanic1(meter.Float64Histogram(


🟡 Add metric.WithUnit("s") on the avail latency instruments — the mux-side latency sets it but these don't, so the exporter won't append _seconds and the two packages end up with inconsistently-named latency series.

bdchatham · 2026-07-01T15:38:39Z

-	return &RPC[API, Req, Resp]{kind, limit, req, resp}
+	service[kind] = &rpcConfig{
+		limit: limit,
+		name:  fmt.Sprintf("%T", utils.Zero[Req]()),


🟡 fmt.Sprintf("%T", ...) bakes the fully-qualified Go type into the rpc_name label, so renaming or moving the request type silently breaks dashboard continuity. Prefer an explicit, stable name string.

bdchatham · 2026-07-01T15:38:39Z

 			f.Header.PayloadSize = utils.Alloc(uint64(len(msg)))
 			f.Header.MsgEnd = utils.Alloc(true)
 		}
+		inner.metrics.Send(len(msg))


🟡 Each .Add allocates ~16 B via the variadic AddConfig, and all same-kind streams share one aggregator mutex that's taken here while the stream lock is held — ~64 B / 4 allocs / 4 shared-lock acquisitions per Send+Recv round-trip. Fine at today's rates; if a mux ever gets hot, batch these or move the Add outside the stream lock.

bdchatham · 2026-07-01T15:41:04Z

Weighted histograms in OTel — summary of the thread

TL;DR on "how do I migrate the weighted histograms without calling Record N times":

There's no weighted Record in the OTel Go API — a histogram counts observations, so weighting a single latency by N blocks means either N Record calls or emitting a pre-aggregated point yourself. Two things worth separating: is looping actually a problem (no), and what's the right metric (probably not a weighted histogram at all).

1. Looping Record is cheap at our scale. 1–120 blocks at single-digit commits/sec is single-digit µs/commit — invisible next to the sequencing work. The only real footgun is allocating the attribute set inside the loop, so hoist it:

recordOpts := metric.WithAttributeSet(attrSet) // build once, outside the loop
for i := uint64(0); i < blocks; i++ {
    seqLatency.Record(ctx, latencySec, recordOpts)
}

2. But I'd reconsider the weighted histogram itself. Replaying one shared per-commit measurement into N identical points inflates the statistical support and quietly redefines _count (blocks, not commits). The honest signal keeps latency and load orthogonal:

seqLatency.Record(ctx, latencySec, opts)            // per-commit, unweighted → commit-health p99
blocksSequenced.Add(ctx, int64(blocks), opts)       // blocks_sequenced_total → throughput
// optional: blocksPerCommit.Record(ctx, float64(blocks), opts) // batch-size distribution

Now rate(blocks_sequenced_total) gives throughput, the histogram stays meaningful, and you can correlate "latency rose when blocks/commit rose" on one panel — strictly more information than one weighted blob.

3. If you genuinely need a pre-aggregated / weighted histogram in OTel, it's doable natively via a custom metric.Producer — the SDK analogue of Prometheus MustNewConstHistogram. You register it on the reader (metric.WithProducer(...), passed into prometheus.New(...) — verify the exact symbol on our pinned SDK) and hand back a point whose Count/Sum/BucketCounts you set directly, so weighting is "add N to the bucket, N to count":

func (p *seqLatencyProducer) Observe(latencySec float64, blocks uint64) {
    p.mu.Lock(); defer p.mu.Unlock()
    i := sort.SearchFloat64s(p.bounds, latencySec) // first bound >= value
    p.bucketCounts[i] += blocks                    // weight by N, O(1)
    p.count += blocks
    p.sum += latencySec * float64(blocks)
}

func (p *seqLatencyProducer) Produce(ctx context.Context) ([]metricdata.ScopeMetrics, error) {
    p.mu.Lock()
    bc := append([]uint64(nil), p.bucketCounts...) // snapshot under lock
    dp := metricdata.HistogramDataPoint[float64]{
        Attributes: p.attrs, StartTime: p.start, Time: time.Now(),
        Count: p.count, Bounds: p.bounds, BucketCounts: bc, Sum: p.sum,
    }
    p.mu.Unlock()
    return []metricdata.ScopeMetrics{{
        Scope: instrumentation.Scope{Name: "consensus"},
        Metrics: []metricdata.Metrics{{
            Name: "consensus_block_sequencing_latency", Unit: "s",
            Data: metricdata.Histogram[float64]{
                Temporality: metricdata.CumulativeTemporality,
                DataPoints:  []metricdata.HistogramDataPoint[float64]{dp},
            },
        }},
    }}, nil
}

The catch: you now own the cumulative-temporality invariants (bucket counts monotonically non-decreasing + a stable StartTime, or Prometheus reads a reset and histogram_quantile breaks), your own concurrency, your own attribute sets, and you lose Views + exemplars (you've bypassed the SDK aggregator). That's worth it only when the bucket counts already exist upstream (a pre-aggregated subprocess/library/batch source) or the per-observation path is genuinely hot — i.e. folding in an aggregation, not synthesizing one. Not our case here, so I'd go with option 1 or 2.

On the API-ergonomics point: yes, client_golang's MustNewConstHistogram is more ergonomic for pre-aggregated data — that's the price of OTel being backend/temporality-agnostic (push+pull, delta+cumulative, Views, exemplars). If we ever want the raw ergonomics for a few metrics, we can register a plain client_golang Collector on the same Prometheus registry and it scrapes alongside the OTel series — but only under a pull/Prometheus exporter, not OTLP push.

wen-coding · 2026-07-01T20:05:32Z

+			}
+			// Constructed once per CommitQC, which we should afford.
+			attrs := metric.WithAttributeSet(attribute.NewSet(
+				// Timeouts capped: 20 means [20,inf)


I suppose we almost never have 20+ timeouts on one RoadIndex? What happens if the cluster gets stuck?

wen-coding · 2026-07-01T20:22:57Z

+}
+
+var observedCommitQC = newObserved[*types.CommitQC]()
+var observedAppQC = newObserved[*types.AppQC]()


Do we need the full QC? We are only using RoadIndex, View Index, and timestamp I suppose?

wen-coding · 2026-07-01T20:58:48Z

 		return false, nil
 	}
 	i.latestAppQC = utils.Some(appQC)
+	metrics.ObserveAppQC(appQC)


This could be much later than we actually observe AppQC for this RoadIndex right? Because the block proposal packing AppQC is optional.

wen-coding · 2026-07-01T20:59:18Z

 	i.commitQCs.prune(idx)
 	if i.commitQCs.next == idx {
 		i.commitQCs.pushBack(commitQC)
+		metrics.ObserveCommitQC(c, commitQC)


Sorry I'm confused, why do we observe these QCs inside prune() instead of right after the QCs are verified?

pompon0 added 4 commits July 1, 2026 13:01

rpc metrics

b19b54a

mux metrics

fa43d93

consensus metrics

f1d08a4

metrics for autobahn/avail and p2p/mux

eb45dad

pompon0 requested review from bdchatham and wen-coding July 1, 2026 13:18

pompon0 added the non-app-hash-breaking label Jul 1, 2026

cursor Bot reviewed Jul 1, 2026

View reviewed changes

Comment thread sei-tendermint/internal/autobahn/avail/inner.go

Comment thread sei-tendermint/internal/autobahn/avail/metrics/metrics.go

Comment thread sei-tendermint/internal/autobahn/avail/metrics/metrics.go Outdated

seidroid Bot previously requested changes Jul 1, 2026

View reviewed changes

pompon0 added 2 commits July 1, 2026 15:30

long names

1dc194a

removed updown

d6737d1

cursor Bot reviewed Jul 1, 2026

View reviewed changes

Comment thread sei-tendermint/internal/autobahn/avail/metrics/metrics.go

pompon0 added 2 commits July 1, 2026 15:37

cap on timeouts in metrics

df9c86a

moved latency recording

ba94963

cursor Bot reviewed Jul 1, 2026

View reviewed changes

seidroid Bot approved these changes Jul 1, 2026

View reviewed changes

bdchatham reviewed Jul 1, 2026

View reviewed changes

bdchatham approved these changes Jul 1, 2026

View reviewed changes

wen-coding reviewed Jul 1, 2026

View reviewed changes

Uh oh!

Conversation

pompon0 commented Jul 1, 2026

Uh oh!

cursor Bot commented Jul 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Summary

Uh oh!

github-actions Bot commented Jul 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov Bot commented Jul 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Uh oh!

Uh oh!

Uh oh!

seidroid Bot left a comment

Choose a reason for hiding this comment

Blockers

Non-blocking

Uh oh!

seidroid Bot Jul 1, 2026

Choose a reason for hiding this comment

Uh oh!

seidroid Bot Jul 1, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

cursor Bot left a comment

Choose a reason for hiding this comment

Uh oh!

cursor Bot Jul 1, 2026

Choose a reason for hiding this comment

Restart skews commit latency histogram

Uh oh!

cursor Bot Jul 1, 2026

Choose a reason for hiding this comment

Histogram buckets use wrong helper

Uh oh!

seidroid Bot left a comment

Choose a reason for hiding this comment

Blockers

Non-blocking

Uh oh!

seidroid Bot Jul 1, 2026

Choose a reason for hiding this comment

Uh oh!

bdchatham left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

bdchatham Jul 1, 2026

Choose a reason for hiding this comment

Uh oh!

bdchatham Jul 1, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

bdchatham Jul 1, 2026

Choose a reason for hiding this comment

Uh oh!

bdchatham commented Jul 1, 2026

Weighted histograms in OTel — summary of the thread

Uh oh!

wen-coding Jul 1, 2026

Choose a reason for hiding this comment

Uh oh!

wen-coding Jul 1, 2026

Choose a reason for hiding this comment

Uh oh!

wen-coding Jul 1, 2026

Choose a reason for hiding this comment

Uh oh!

wen-coding Jul 1, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

cursor Bot commented Jul 1, 2026 •

edited

Loading

github-actions Bot commented Jul 1, 2026 •

edited

Loading

codecov Bot commented Jul 1, 2026 •

edited

Loading