feat(autobahn): Add epoch.Registry to maintain committee/stake (CON-358)#3632
feat(autobahn): Add epoch.Registry to maintain committee/stake (CON-358)#3632wen-coding wants to merge 71 commits into
Conversation
PR SummaryHigh Risk Overview Prepare/Commit/FullCommit/Timeout QCs verify against Avail and data are wired to Minor: Reviewed by Cursor Bugbot for commit 40877c0. Bugbot is set up for automated code reviews on this repo. Configure here. |
|
The latest Buf updates on your PR. Results from workflow Buf / buf (pull_request).
|
|
The latest Buf updates on your PR. Results from workflow Buf / buf (pull_request).
|
d764443 to
705d8c9
Compare
ab4c67c to
9cce5d2
Compare
f1f1d55 to
09e677e
Compare
| TimeoutQC utils.Option[*TimeoutQC] | ||
| CommitQC utils.Option[*CommitQC] | ||
| TimeoutQC utils.Option[*TimeoutQC] | ||
| FirstBlock GlobalBlockNumber // genesis InitialHeight; added to lane-relative block numbers to produce absolute global block numbers |
There was a problem hiding this comment.
It seems to me that ViewSpec should have the whole EpochConfig instead.
There was a problem hiding this comment.
currently treatment of firstblock and genesisTimestamp is inconsistent
There was a problem hiding this comment.
Put Epoch into ViewSpec.
|
|
||
| // Verify verifies the FullProposal against the current view. | ||
| func (m *FullProposal) Verify(c *Committee, vs ViewSpec) error { | ||
| func (m *FullProposal) Verify(c *Committee, vs ViewSpec, genesisTimestamp time.Time) error { |
There was a problem hiding this comment.
genesisTimestamp seems kind of random in this context and that's why IMO ViewSpec should be self-contained.
| Timestamp: TimeConv.Encode(m.timestamp), | ||
| LaneRanges: LaneRangeConv.EncodeSlice(laneRanges), | ||
| App: AppProposalConv.EncodeOpt(m.app), | ||
| GlobalFirst: &globalFirst, |
There was a problem hiding this comment.
nit: utils.Alloc(uint64(m.firstBlock))
There was a problem hiding this comment.
why inconsistent naming (globalFirst vs firstBlock)?
| app, | ||
| ) | ||
| if m.GlobalFirst == nil { | ||
| return nil, fmt.Errorf("global_first is required") |
There was a problem hiding this comment.
nit: "global_first: missing" for consistency
| // GenProposal generates a random Proposal. | ||
| func GenProposal(rng utils.Rng) *Proposal { | ||
| return newProposal(GenView(rng), time.Now(), utils.GenSlice(rng, GenLaneRange), utils.Some(GenAppProposal(rng))) | ||
| return newProposal(GenView(rng), time.Now(), utils.GenSlice(rng, GenLaneRange), utils.Some(GenAppProposal(rng)), 0) |
There was a problem hiding this comment.
randomize the generated data
| Number: GenViewNumber(rng) % view.Number, | ||
| } | ||
| p := newProposal(pView, time.Now(), utils.GenSlice(rng, GenLaneRange), utils.Some(GenAppProposal(rng))) | ||
| p := newProposal(pView, time.Now(), utils.GenSlice(rng, GenLaneRange), utils.Some(GenAppProposal(rng)), 0) |
| // | ||
| // Currently the committee is fixed at genesis. Dynamic committee support | ||
| // will be wired up when the execution layer is ready. | ||
| type Registry struct { |
There was a problem hiding this comment.
I would like to have a type representing a complete context for an epoch (perhaps just named "Epoch"):
committee,
firstBlock, (or globalRange?)
firstTimestamp,
epochID (?)
...
I currently don't understand what "Registry" represents, but that's perhaps because it is not implemented yet.
There was a problem hiding this comment.
types.Epoch added.
Registry is basically a list of Epochs protected by rwlock, so we can keep generating new Epochs and throwing into it, it will return Epoch from EpochId for readers and purge old Epochs.
| // EpochWindow returns the epoch→committee map for message acceptance across | ||
| // epoch transitions. With a fixed genesis committee it always returns a | ||
| // single entry. | ||
| func (r *Registry) EpochWindow() map[Index]*types.Committee { |
There was a problem hiding this comment.
what is "window"? I don't see it in the doc.
There was a problem hiding this comment.
This is mostly for the Epoch transition, we need to accept lane blocks for the current Epoch and possibly the Epoch before (repair blocks) or after it (others have got the commitQC to transition into new Epoch before me).
| // EpochFor returns the epoch index for the given RoadIndex. | ||
| // Currently always returns 0 (genesis epoch); dynamic lookup will be added | ||
| // with the execution layer. | ||
| func (r *Registry) EpochFor(_ types.RoadIndex) Index { |
There was a problem hiding this comment.
If feasible, I'd like the all the messages to include the EpochID, so that we don't have to map RoadIndex to EpochID. Perhaps we could even make the RoadIndex start from 0 at every epoch start.
There was a problem hiding this comment.
Added EpochID whenever we don't have a proposal attached.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Wire epoch_index into AppProposal, LaneQC, and TimeoutVote so each message self-declares its epoch. Cross-validate in FullProposal.Verify (LaneQC epoch vs proposal epoch) and TimeoutQC/FullTimeoutVote.Verify (vote epoch vs expected epoch). Replace EpochFor(roadIndex) in PushAppVote with EpochByIndex from the proposal's self-declared epoch_index. Thread *Epoch (instead of *Committee) through block_votes.pushVote and inner.laneQC so the epoch index is available when constructing LaneQCs. Fix nil-deref in data.State.PushBlock: drop pruned-block entries silently rather than dereferencing a nil QC map entry. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Remove Registry.EpochFor (road-index lookup) now that all messages carry epoch_index and callers use EpochByIndex directly. Test helpers that built ViewSpec now set Epoch from registry.LatestEpoch() and read vs.Epoch.Committee() directly. Fix nil map lookup in data.State.PushBlock: drop blocks whose QC entry was already pruned rather than dereferencing a nil pointer. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…oposal.Verify The field is not part of the signed LaneVote payload; it is cross-checked against the signed proposal's epoch_index in FullProposal.Verify. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ushAppVote Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Verify AppProposal.epoch_index matches the enclosing proposal's epoch_index in FullProposal.Verify. Reject AppProposal and LaneQC on decode when epoch_index is missing rather than silently defaulting to zero. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…paths
- AppProposal.Verify: check epoch_index matches CommitQC's proposal
- PushAppQC: check AppProposal epoch_index matches CommitQC epoch_index
- FullProposal.Verify: check proposal road_index is within Epoch.Roads()
- Add OpenRoadRange() helper; use it in registry, testonly, and tests
instead of RoadRange{} (zero Last) or bare math.MaxUint64
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… in sync The advisory note on LaneQC.epoch_index belongs in the Go struct comment in lane_qc.go, not the proto file which would require regenerating pb.go. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…dy checked pre-lock VerifyInWindow now returns all epochs whose committee accepted the block. PushBlock skips the in-lock re-verify when blockEp is already in that set, avoiding duplicate crypto work in the common single-epoch case. Regenerate autobahn.pb.go to match current proto. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ed set Removes in-lock block.Verify entirely. If blockEp is not in the set returned by VerifyInWindow, the block was signed by the wrong epoch and is rejected. No crypto inside the lock. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…FullCommitQC Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…orter; rename insertQC param ep -> registry Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…d cutover in Proposal decoder Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Restores range-bounds and max-length validation as defense-in-depth. Lane membership is omitted (needs committee context, enforced by FullProposal.Verify). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…f hardcoding 0 TestCommitQC now takes *types.Epoch instead of committee+firstBlock+genesisTimestamp, ensuring laneQCs and appQC use the same epochIndex as the proposal being built. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
qc.Verify(c) already binds the committee, so epochIndex in LaneQC is redundant. Drop it from the struct, proto, encode/decode, and all callers. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Follow-up to the LaneQC epochIndex removal. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When AddEpoch is wired, recovery must resolve the epoch from the persisted QC/proposal rather than assuming LatestEpoch. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Cover wrong epochIndex, wrong firstBlock, and out-of-roads road index. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
69bfc5f to
86c7599
Compare
There was a problem hiding this comment.
A large, mostly-mechanical refactor introducing epoch.Registry and baking first_block/epoch_index into proposals, with solid added epoch-binding verification and comprehensive test updates. No blocking correctness issues in the currently single-epoch code paths; the notable findings are forward-looking hazards for when dynamic epochs (AddEpoch) are wired, which are already documented with TODOs.
Findings: 0 blocking | 7 non-blocking | 2 posted inline
Blockers
- None at the file/PR level.
Non-blocking
- REVIEW_GUIDELINES.md is empty (no repo-specific standards applied) and cursor-review.md produced no output — only the Codex second-opinion pass contributed findings, both of which are forward-looking.
- Codex P1 (registry.go AddEpoch): closing an epoch replaces the slice entry, but consensus/avail cache *types.Epoch pointers (inner.ep, myView.Epoch) taken from LatestEpoch(). Those cached epochs retain Roads().Last=MaxUint64, so once AddEpoch is wired a QC for a post-boundary road could still be accepted under the old committee until every cached pointer is rotated. Not exploitable in this PR (AddEpoch has no production callers; a single open epoch exists) and TODOs in consensus/inner.go and avail/state.go already flag the rotation gap — worth resolving in the same PR that wires AddEpoch.
- Codex P1 (registry.go VerifyInWindow / data.PushBlock): VerifyInWindow only checks the latest epoch, while data.PushBlock rejects a block whose QC resolves to an epoch not in that returned set ("signed by epoch X, not in window"). After a future epoch transition, syncing blocks finalized under the previous epoch would fail. Again inert today (one epoch) and covered by the VerifyInWindow TODO to expand to neighbor epochs — track alongside dynamic-epoch enablement.
- Proto migration hard-rejects messages missing first_block/epoch_index with no backward-compat path; the code comments justify this as acceptable because Autobahn is pre-production. Confirm no persisted state or in-flight peers predate these fields before deploying, since this is an app-hash/wire-incompatible change despite the non-app-hash-breaking label.
- avail.PushAppVote resolves the committee for signature verification from the vote's own (attacker-controlled) proposal.EpochIndex() before cross-checking it against the finalizing CommitQC's epoch; harmless with one epoch and the later equality check makes it consistent, but once multiple epochs exist consider ordering the epoch/CommitQC agreement check before trusting the declared epoch's committee.
- 2 suggestion(s)/nit(s) flagged inline on specific lines.
| // retroactively reassigning already-finalized roads to the new committee. | ||
| // Requires the caller to pass the commit watermark once the execution bridge is wired. | ||
| // Replace latest with a closed version (Roads().Last = startRoad-1). | ||
| r.epochs[len(r.epochs)-1] = types.NewEpoch( |
There was a problem hiding this comment.
[suggestion] AddEpoch closes the previous epoch by allocating a new *types.Epoch and replacing the slice entry, but consumers cache the *types.Epoch pointer they obtained earlier from LatestEpoch() (e.g. consensus inner.ep, myView.Epoch, avail inner.ep). Those cached pointers keep Roads().Last = MaxUint64, so a QC for a road past the boundary could still verify under the old committee until each cached pointer is rotated. Inert today because AddEpoch has no production callers, but when it is wired ensure all cached epochs are refreshed (the TODOs in consensus/inner.go and avail/state.go PushCommitQC track this).
| // Returns a slice of all matching epochs so callers can skip re-verification for any | ||
| // epoch already checked here. | ||
| // TODO: expand to neighbor epochs (previous and next) once multi-epoch transitions are wired up. | ||
| func (r *Registry) VerifyInWindow(fn func(*types.Committee) error) ([]*types.Epoch, error) { |
There was a problem hiding this comment.
[suggestion] VerifyInWindow only verifies against the latest epoch and returns just that epoch. data.PushBlock then rejects any block whose QC resolves to an epoch not in this returned set ("signed by epoch X, not in window"). After a future epoch transition, still-needed blocks finalized under the previous epoch would fail to sync. Not reachable while only one epoch exists; resolve together with the AddEpoch wiring per the existing TODO to include neighbor epochs.
EpochByIndex(Index(p.EpochIndex())) is the direct call; the wrapper adds no logic beyond an inline-able error message. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes using default effort and found 2 potential issues.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 40877c0. Configure here.
| return fmt.Errorf("laneRange[%v]: non-zero hash for empty range", r.Lane()) | ||
| } | ||
| if got := r.Len(); got > MaxLaneRangeInProposal { | ||
| return fmt.Errorf("laneRange[%v].Len() = %d, want <= %d", r.Lane(), got, MaxLaneRangeInProposal) |
There was a problem hiding this comment.
Invalid lanes crash full QC build
High Severity
Proposal.Verify no longer ensures each lane range belongs to the committee; that check moved to FullProposal.Verify only. CommitQC.Verify still calls Proposal.Verify, so a commit QC whose proposal includes a non-committee lane can pass verification and be stored. When availability later assembles a FullCommitQC, header collection walks only committee lanes while GlobalRange() counts every lane in the proposal, so NewFullCommitQC can panic on length mismatch.
Reviewed by Cursor Bugbot for commit 40877c0. Configure here.
| vs := ViewSpec{ | ||
| CommitQC: prev, | ||
| Epoch: NewEpoch(0, OpenRoadRange(), genesisTimestamp, committee, firstBlock), | ||
| } |
There was a problem hiding this comment.
BuildCommitQC hardcodes epoch zero
Low Severity
BuildCommitQC always builds its ViewSpec with NewEpoch(0, …) regardless of the caller’s epoch. Callers such as data.TestCommitQC pass ep.FirstBlock() and attach app proposals using ep.EpochIndex(), so for any non-genesis epoch index the resulting commit QC proposal and app metadata disagree and epoch binding checks fail or build misleading test fixtures.
Reviewed by Cursor Bugbot for commit 40877c0. Configure here.
There was a problem hiding this comment.
A large but internally consistent refactor introducing epoch.Registry and baking firstBlock/epochIndex into consensus messages so GlobalRange() is committee-free; no live correctness bugs found. The Codex-flagged multi-epoch issues are real future concerns but are dead-code-guarded (AddEpoch is never called in production) and explicitly documented with TODOs.
Findings: 0 blocking | 5 non-blocking | 1 posted inline
Blockers
- None at the file/PR level.
Non-blocking
- Second-opinion coverage: cursor-review.md is empty (Cursor produced no output); Codex ran but could not execute tests (Go 1.25.6 unavailable in its sandbox). Review rests on the Claude pass plus Codex's static findings.
- Latent multi-epoch limitation (from Codex, kept but downgraded to non-blocking): the signature-verification path VerifyInWindow (registry.go:117) and consensus's inner.ep pinned to registry.LatestEpoch() (inner.go:117, pushCommitQC inner.go:132/141) only ever consult the latest epoch. This is NOT a live bug — AddEpoch is never invoked in production (NewRegistry always builds a single genesis epoch) and the code carries explicit TODOs (registry.go:116, inner.go:114-116). Flagging so the follow-up PR that wires AddEpoch resolves the epoch from the QC/road index and rotates inner.ep on epoch transitions/restart before enabling dynamic epochs; otherwise valid older-epoch blocks/votes would be rejected and consensus could use the wrong committee.
- AddEpoch is exported and reachable but unused, and its guard only rejects startRoad <= latest.Roads().First without rejecting startRoad <= the highest already-committed road (its own TODO at registry.go:90-92). Consider keeping it unexported or panicking until the execution bridge is wired, to prevent a future caller from enabling it before the safety checks exist.
- Minor redundancy: FullProposal.Verify runs proposal.Verify(vs.Epoch) (which now does structural lane-range bound/len checks) and then separately re-loops proposal.laneRanges to call r.Verify(c) plus an identical Len(...) > MaxLaneRangeInProposal check (proposal.go). The max-length check in particular is duplicated; could be consolidated.
- 1 suggestion(s)/nit(s) flagged inline on specific lines.
| // Returns a slice of all matching epochs so callers can skip re-verification for any | ||
| // epoch already checked here. | ||
| // TODO: expand to neighbor epochs (previous and next) once multi-epoch transitions are wired up. | ||
| func (r *Registry) VerifyInWindow(fn func(*types.Committee) error) ([]*types.Epoch, error) { |
There was a problem hiding this comment.
[suggestion] Non-blocking / forward-looking: VerifyInWindow only verifies against LatestEpoch(). Callers (data.State.PushBlock, avail.State.PushBlock/PushVote) verify the block/vote signature here before resolving the authoritative epoch from the QC. Today this is fine because AddEpoch is never called in production, so the registry always holds exactly one epoch — but once AddEpoch is wired, a valid block/vote signed by an older-but-unpruned epoch's committee would be rejected (data.State.PushBlock even fails the epochInSet(blockEp, preEps) check with "not in window"). The existing TODO here acknowledges this; please ensure the neighbor-epoch expansion lands together with AddEpoch.


Summary
epoch.Registrywhich consolidatesfirstBlockandgenesisTimestamp— genesis config that doesn't change across epochs — alongside the committeefirstBlockintoProposalsoGlobalRange()returns absolute block numbers directly, removing it as a call-site parameter throughout the stack. The proto message is now self-contained on the wire, making it much easier to interpret during emergency debugging. Also fixes a reproposal bug whereTimeoutQC.reproposal()was double-countingfirstBlockregistry.CommitteeFor(roadIndex). The registry currently returns the genesis committee for all RoadIndexes — we wire the read side first so that any bug introduced when implementing dynamic epochs in follow-up PRs is immediately visible at the call sites rather than requiring a cascading refactor to find it🤖 Generated with Claude Code