Skip to content

otel metrics for autobahn/avail and p2p/mux#3682

Open
pompon0 wants to merge 8 commits into
mainfrom
gprusak-metrics
Open

otel metrics for autobahn/avail and p2p/mux#3682
pompon0 wants to merge 8 commits into
mainfrom
gprusak-metrics

Conversation

@pompon0

@pompon0 pompon0 commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

It will give us insight into consensus state and rpc performance.

@pompon0 pompon0 requested review from bdchatham and wen-coding July 1, 2026 13:18
@cursor

cursor Bot commented Jul 1, 2026

Copy link
Copy Markdown

PR Summary

Low Risk
Observability-only changes on existing code paths; metric helpers dedupe QCs and use lightweight counters/histograms with no consensus or wire-protocol behavior changes.

Overview
Adds OpenTelemetry instrumentation for consensus availability and multiplexed P2P RPC so operators can track finality progress and stream performance.

Autobahn avail gets a new metrics package with gauges for latest app/commit road and global block indices, a histogram for proposal-to-commit latency, and commit-to-commit interval counters labeled by capped view timeouts. ObserveAppQC and ObserveCommitQC run from prune and from PushCommitQC, with mutex-backed dedup so stale or duplicate QCs are ignored.

P2P mux records per-stream open/close latency, in-flight streams, and send/recv message and byte counts, broken out by role (accept vs connect) and rpc_name. StreamKindConfig.Name is populated from RPC registration (request type name) when building the mux config. Stream lifecycle hooks live on open, send, recv, and close.

Also simplifies the legacy P2P metricsLabelCache mutex (embedded sync.RWMutex, any instead of interface{}) and a small rpc Serve refactor plus a client-side rate-limit comment typo fix.

Reviewed by Cursor Bugbot for commit ba94963. Bugbot is set up for automated code reviews on this repo. Configure here.

@github-actions

github-actions Bot commented Jul 1, 2026

Copy link
Copy Markdown

The latest Buf updates on your PR. Results from workflow Buf / buf (pull_request).

BuildFormatLintBreakingUpdated (UTC)
✅ passed✅ passed✅ passed✅ passedJul 1, 2026, 1:41 PM

@codecov

codecov Bot commented Jul 1, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 96.70330% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 58.04%. Comparing base (2378fca) to head (ba94963).
⚠️ Report is 9 commits behind head on main.

Files with missing lines Patch % Lines
sei-tendermint/internal/p2p/mux/metrics/metrics.go 93.10% 1 Missing and 1 partial ⚠️
sei-tendermint/internal/autobahn/avail/state.go 66.66% 0 Missing and 1 partial ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #3682      +/-   ##
==========================================
- Coverage   58.97%   58.04%   -0.94%     
==========================================
  Files        2263     2181      -82     
  Lines      187223   177649    -9574     
==========================================
- Hits       110421   103117    -7304     
+ Misses      66858    65384    -1474     
+ Partials     9944     9148     -796     
Flag Coverage Δ
sei-chain-pr 78.03% <96.70%> (?)
sei-db 70.41% <ø> (ø)
sei-db-state-db ?

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
sei-tendermint/internal/autobahn/avail/inner.go 97.46% <100.00%> (+0.06%) ⬆️
...dermint/internal/autobahn/avail/metrics/metrics.go 100.00% <100.00%> (ø)
sei-tendermint/internal/p2p/metrics.go 100.00% <100.00%> (ø)
sei-tendermint/internal/p2p/mux/mux.go 78.85% <100.00%> (+0.18%) ⬆️
sei-tendermint/internal/p2p/mux/stream.go 92.85% <100.00%> (+0.35%) ⬆️
sei-tendermint/internal/p2p/mux/stream_state.go 91.66% <100.00%> (ø)
sei-tendermint/internal/p2p/rpc/rpc.go 85.33% <100.00%> (+1.75%) ⬆️
sei-tendermint/internal/autobahn/avail/state.go 79.26% <66.66%> (+0.10%) ⬆️
sei-tendermint/internal/p2p/mux/metrics/metrics.go 93.10% <93.10%> (ø)

... and 85 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Comment thread sei-tendermint/internal/autobahn/avail/inner.go
Comment thread sei-tendermint/internal/autobahn/avail/metrics/metrics.go
Comment thread sei-tendermint/internal/autobahn/avail/metrics/metrics.go Outdated
seidroid[bot]
seidroid Bot previously requested changes Jul 1, 2026

@seidroid seidroid Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adds OpenTelemetry metrics for autobahn/avail consensus QCs and p2p/mux stream I/O, plus minor refactors. The instrumentation is reasonable, but the new mux metrics file is not gofmt-clean (will fail the lint CI), and several monotonic totals are declared as UpDownCounters.

Findings: 2 blocking | 4 non-blocking | 2 posted inline

Blockers

  • sei-tendermint/internal/p2p/mux/metrics/metrics.go is not gofmt-clean, which violates the repo's formatting requirement and will fail golangci/make lint in CI (see inline comment). Run gofmt -s -w on the file.
  • 1 blocking issue(s) flagged inline on specific lines.

Non-blocking

  • Repository review guidelines (REVIEW_GUIDELINES.md) were empty/missing, so no repo-specific standards were applied beyond AGENTS.md.
  • The Cursor second-opinion pass (cursor-review.md) produced no output.
  • In avail/metrics.go ObserveCommitQC, proposalToCommitLatency.Record(...) is called before the ordering guard last.val.Index() >= qc.Index(), so a re-observed/stale CommitQC still records a latency sample. Consider moving the record after the guard if only fresh QCs should be measured (minor, metrics-only).
  • 1 suggestion(s)/nit(s) flagged inline on specific lines.

"go.opentelemetry.io/otel"
"go.opentelemetry.io/otel/attribute"
"go.opentelemetry.io/otel/metric"
prometheus "github.com/prometheus/client_golang/prometheus"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[blocker] This file is not gofmt-clean and will fail the lint CI. Two issues: (1) within this import group the prometheus "github.com/prometheus/client_golang/prometheus" line is out of order — gofmt sorts by path, so github.com/... must come before the go.opentelemetry.io/... imports; (2) the following blank line (line 11) contains a trailing tab. Run gofmt -s -w to fix. (Also flagged by Codex.)

var meter = otel.Meter("tendermint_internal_p2p_mux")
var latency = utils.OrPanic1(meter.Float64Histogram("latency", metric.WithUnit("s"), buckets(0.001, 1.3, 30)))
var inFlight = utils.OrPanic1(meter.Int64UpDownCounter("inflight"))
var sendMsgs = utils.OrPanic1(meter.Int64UpDownCounter("send_msgs"))

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[suggestion] sendMsgs, sendBytes, recvMsgs, and recvBytes are only ever incremented (Add(+n)), so they are monotonic totals and should be Int64Counter, not Int64UpDownCounter. UpDownCounter exports as a non-monotonic gauge, which loses the counter/_total semantics that rate() and similar queries expect. (inFlight correctly stays an UpDownCounter.) Also flagged by Codex.

Comment thread sei-tendermint/internal/autobahn/avail/metrics/metrics.go

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes using default effort and found 2 potential issues.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit ba94963. Configure here.

commitToCommitLatencySum.Add(ctx, now.Sub(last.time).Seconds(), attrs)
commitToCommitLatencyCount.Add(ctx, 1, attrs)
}
proposalToCommitLatency.Record(ctx, now.Sub(qc.Proposal().Timestamp()).Seconds())

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Restart skews commit latency histogram

Medium Severity

On startup, observing the persisted prune anchor records proposal_to_commit_latency as wall time since the anchor proposal’s original timestamp. That can be hours or days of downtime, producing a bogus latency sample that distorts histograms and SLO views after every restart.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit ba94963. Configure here.

)

func buckets(start float64, factor float64, count int) metric.HistogramOption {
return metric.WithExplicitBucketBoundaries(prometheus.ExponentialBuckets(start, factor, count)...)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Histogram buckets use wrong helper

Low Severity

New OTel histograms build bucket boundaries via prometheus.ExponentialBuckets in a local helper. Repo convention for new OTel metrics is ExponentialBucketsRange so bucket spans stay consistent with instruments like mempool compact duration.

Additional Locations (1)
Fix in Cursor Fix in Web

Triggered by learned rule: sei-tendermint: new metrics must be OTel native with established conventions

Reviewed by Cursor Bugbot for commit ba94963. Configure here.

@seidroid seidroid Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Observability-only PR adding OpenTelemetry metrics for autobahn/avail consensus QCs and p2p/mux stream traffic. The changes are correct and well-structured (dedup guards, bounded label cardinality, safe mutex-by-value refactor); the only substantive concern is that the in-flight stream gauge can drift upward when streams are abandoned without a local Close.

Findings: 0 blocking | 4 non-blocking | 1 posted inline

Blockers

  • None at the file/PR level.

Non-blocking

  • No tests were added for the new metrics helpers (ObserveCommitQC/ObserveAppQC dedup logic and the Stream metric lifecycle). A small unit test around the monotonic dedup guards (out-of-order QCs are ignored) and Open/Close idempotency would guard against regressions; acceptable to skip for an observability-only change, but worth noting.
  • Cursor's second-opinion pass (cursor-review.md) produced no output, and REVIEW_GUIDELINES.md was empty on the base branch, so no repo-specific standards were applied. Codex's single medium finding (the inFlight counter leak) is incorporated as an inline comment.
  • Minor: the Send metric is recorded when a message is enqueued to the frame queue (stream.go:95) rather than when it is actually written to the connection, so send_bytes/send_msgs slightly lead the wire. This is a reasonable approximation but worth a comment if precise byte accounting is ever expected.
  • 1 suggestion(s)/nit(s) flagged inline on specific lines.

return
}
inner.closed.local = true
inner.metrics.Close()

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[suggestion] The inFlight up/down counter is only decremented here in the local close() path. Open() increments it once a stream's recv end is established, but the decrement never runs if a stream is abandoned without a local Close() — e.g. Mux.Run exits on a connection/read/write error, or a caller observes remote close (tryPrune in runRecv only fully prunes once both local and remote are closed) but never calls Close(). Because inFlight is a process-global counter shared across all connections, this drifts permanently upward across peer churn, making the gauge unreliable over a long-running node. Consider decrementing symmetrically on prune/teardown (e.g. from tryPrune or a runner shutdown hook) so an opened stream is always balanced by exactly one decrement. (Matches Codex's medium finding.)

@seidroid seidroid Bot dismissed their stale review July 1, 2026 13:46

Superseded: latest AI review found no blocking issues.

@bdchatham bdchatham left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

xreview summary (opentelemetry / sei-networking / systems / idiom lenses). Nice, well-scoped instrumentation — most of it is clean. Three points worth resolving before merge, plus a few smaller ones inline.

🔴 MeterProvider wiring — where is the global MeterProvider set up? otel.Meter() is a no-op unless a provider with a Prometheus/OTLP reader and a service.name resource is registered in the binary, and I don't see it in this diff. If it isn't already wired in main, every instrument here silently records nothing — worth confirming before merge.

On your weighted-histogram question: you don't need to call Record more than once. You have exactly one latency value per commit, so it's a single Record into a Float64Histogram — the sum+count counter split was solving a problem the code doesn't have. Details in the inline comment on commitToCommitLatency.

Comment thread sei-tendermint/internal/p2p/mux/metrics/metrics.go
Comment thread sei-tendermint/internal/autobahn/avail/metrics/metrics.go
"tendermint_internal_autobahn_avail__app_global_block_number",
metric.WithDescription("global block number of the highest observed appQC"),
))
var proposalToCommitLatency = utils.OrPanic1(meter.Float64Histogram(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Add metric.WithUnit("s") on the avail latency instruments — the mux-side latency sets it but these don't, so the exporter won't append _seconds and the two packages end up with inconsistently-named latency series.

return &RPC[API, Req, Resp]{kind, limit, req, resp}
service[kind] = &rpcConfig{
limit: limit,
name: fmt.Sprintf("%T", utils.Zero[Req]()),

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 fmt.Sprintf("%T", ...) bakes the fully-qualified Go type into the rpc_name label, so renaming or moving the request type silently breaks dashboard continuity. Prefer an explicit, stable name string.

Comment thread sei-tendermint/internal/autobahn/avail/metrics/metrics.go
f.Header.PayloadSize = utils.Alloc(uint64(len(msg)))
f.Header.MsgEnd = utils.Alloc(true)
}
inner.metrics.Send(len(msg))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Each .Add allocates ~16 B via the variadic AddConfig, and all same-kind streams share one aggregator mutex that's taken here while the stream lock is held — ~64 B / 4 allocs / 4 shared-lock acquisitions per Send+Recv round-trip. Fine at today's rates; if a mux ever gets hot, batch these or move the Add outside the stream lock.

@bdchatham

Copy link
Copy Markdown
Contributor

Weighted histograms in OTel — summary of the thread

TL;DR on "how do I migrate the weighted histograms without calling Record N times":

There's no weighted Record in the OTel Go API — a histogram counts observations, so weighting a single latency by N blocks means either N Record calls or emitting a pre-aggregated point yourself. Two things worth separating: is looping actually a problem (no), and what's the right metric (probably not a weighted histogram at all).

1. Looping Record is cheap at our scale. 1–120 blocks at single-digit commits/sec is single-digit µs/commit — invisible next to the sequencing work. The only real footgun is allocating the attribute set inside the loop, so hoist it:

recordOpts := metric.WithAttributeSet(attrSet) // build once, outside the loop
for i := uint64(0); i < blocks; i++ {
    seqLatency.Record(ctx, latencySec, recordOpts)
}

2. But I'd reconsider the weighted histogram itself. Replaying one shared per-commit measurement into N identical points inflates the statistical support and quietly redefines _count (blocks, not commits). The honest signal keeps latency and load orthogonal:

seqLatency.Record(ctx, latencySec, opts)            // per-commit, unweighted → commit-health p99
blocksSequenced.Add(ctx, int64(blocks), opts)       // blocks_sequenced_total → throughput
// optional: blocksPerCommit.Record(ctx, float64(blocks), opts) // batch-size distribution

Now rate(blocks_sequenced_total) gives throughput, the histogram stays meaningful, and you can correlate "latency rose when blocks/commit rose" on one panel — strictly more information than one weighted blob.

3. If you genuinely need a pre-aggregated / weighted histogram in OTel, it's doable natively via a custom metric.Producer — the SDK analogue of Prometheus MustNewConstHistogram. You register it on the reader (metric.WithProducer(...), passed into prometheus.New(...) — verify the exact symbol on our pinned SDK) and hand back a point whose Count/Sum/BucketCounts you set directly, so weighting is "add N to the bucket, N to count":

func (p *seqLatencyProducer) Observe(latencySec float64, blocks uint64) {
    p.mu.Lock(); defer p.mu.Unlock()
    i := sort.SearchFloat64s(p.bounds, latencySec) // first bound >= value
    p.bucketCounts[i] += blocks                    // weight by N, O(1)
    p.count += blocks
    p.sum += latencySec * float64(blocks)
}

func (p *seqLatencyProducer) Produce(ctx context.Context) ([]metricdata.ScopeMetrics, error) {
    p.mu.Lock()
    bc := append([]uint64(nil), p.bucketCounts...) // snapshot under lock
    dp := metricdata.HistogramDataPoint[float64]{
        Attributes: p.attrs, StartTime: p.start, Time: time.Now(),
        Count: p.count, Bounds: p.bounds, BucketCounts: bc, Sum: p.sum,
    }
    p.mu.Unlock()
    return []metricdata.ScopeMetrics{{
        Scope: instrumentation.Scope{Name: "consensus"},
        Metrics: []metricdata.Metrics{{
            Name: "consensus_block_sequencing_latency", Unit: "s",
            Data: metricdata.Histogram[float64]{
                Temporality: metricdata.CumulativeTemporality,
                DataPoints:  []metricdata.HistogramDataPoint[float64]{dp},
            },
        }},
    }}, nil
}

The catch: you now own the cumulative-temporality invariants (bucket counts monotonically non-decreasing + a stable StartTime, or Prometheus reads a reset and histogram_quantile breaks), your own concurrency, your own attribute sets, and you lose Views + exemplars (you've bypassed the SDK aggregator). That's worth it only when the bucket counts already exist upstream (a pre-aggregated subprocess/library/batch source) or the per-observation path is genuinely hot — i.e. folding in an aggregation, not synthesizing one. Not our case here, so I'd go with option 1 or 2.

On the API-ergonomics point: yes, client_golang's MustNewConstHistogram is more ergonomic for pre-aggregated data — that's the price of OTel being backend/temporality-agnostic (push+pull, delta+cumulative, Views, exemplars). If we ever want the raw ergonomics for a few metrics, we can register a plain client_golang Collector on the same Prometheus registry and it scrapes alongside the OTel series — but only under a pull/Prometheus exporter, not OTLP push.

}
// Constructed once per CommitQC, which we should afford.
attrs := metric.WithAttributeSet(attribute.NewSet(
// Timeouts capped: 20 means [20,inf)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose we almost never have 20+ timeouts on one RoadIndex? What happens if the cluster gets stuck?

}

var observedCommitQC = newObserved[*types.CommitQC]()
var observedAppQC = newObserved[*types.AppQC]()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need the full QC? We are only using RoadIndex, View Index, and timestamp I suppose?

return false, nil
}
i.latestAppQC = utils.Some(appQC)
metrics.ObserveAppQC(appQC)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be much later than we actually observe AppQC for this RoadIndex right? Because the block proposal packing AppQC is optional.

i.commitQCs.prune(idx)
if i.commitQCs.next == idx {
i.commitQCs.pushBack(commitQC)
metrics.ObserveCommitQC(c, commitQC)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I'm confused, why do we observe these QCs inside prune() instead of right after the QCs are verified?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants