Skip to content

Add offline rollback utility for LittDB#3684

Open
Kbhat1 wants to merge 2 commits into
mainfrom
litt-offline-rollback
Open

Add offline rollback utility for LittDB#3684
Kbhat1 wants to merge 2 commits into
mainfrom
litt-offline-rollback

Conversation

@Kbhat1

@Kbhat1 Kbhat1 commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

Describe your changes and provide context

  • RollbackLittDB(dataDirs, filter) — offline tool that rewinds a littDB to a chosen point (e.g. a block height) while the DB is stopped
  • Walks each table's key files newest -> oldest, calls filter(key, isPrimary); first match = cutoff -> keeps that key's group + everything older, deletes everything newer from the segment files
  • Discards the keymap + snapshot dirs so the DB rebuilds them from the truncated segments on next start

Testing performed to validate your change

  • Unit tests
  • Currently verifying on node

Adds rollback.RollbackLittDB(dataDirs, rollbackFilter): a static, offline
utility that rewinds a LittDB instance to a chosen point — e.g. to roll a
node's state back to a specific block height while the database is stopped.

For each table it walks the key files newest-to-oldest, invoking the filter
once per record (isPrimary is derived from the record's KeyKind). The first
record the filter accepts is the rollback point: that record's whole key
group and everything written before it are kept; everything written after the
group is permanently removed from the segment files.

- segment.Segment.RollbackToKeyCount truncates a sealed segment in place:
  atomic key-file swap, value-file truncation, metadata key-count update.
- The orchestrator locks the data dirs (so it won't touch a live DB), then
  discards the table's keymap and snapshot, deletes whole newer segments
  (highest index first), and truncates the rollback segment. The keymap and
  snapshot are derived state and are rebuilt from the truncated segments on the
  next start, which keeps them consistent with the rolled-back data — the same
  approach cli/prune.go uses to finalize an offline mutation.

Steps are ordered for crash-safety: discarding the keymap first ensures the DB
always rebuilds it from whatever segment state exists rather than trusting a
keymap that points into truncated data.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@cursor

cursor Bot commented Jul 1, 2026

Copy link
Copy Markdown

PR Summary

High Risk
Destructive, offline mutations to segment files and derived indexes; incorrect use or filter logic could permanently lose chain state, though locks and snapshot checks mitigate live-DB corruption.

Overview
Adds an offline rollback path for LittDB so operators can rewind on-disk state (e.g. to a block height) while the DB is stopped.

rollback.RollbackLittDB takes storage roots and a RollbackFilter. For each table it scans keys newest→oldest; the first filter match defines the cutoff, keeping that key’s primary/secondary group and all older data. Newer segments are deleted; the pivot segment is truncated via new Segment.RollbackToKeyCount (atomic key-file swap, value-file truncate, metadata update). Keymap and snapshot dirs are removed so the next startup rebuilds from segments (same pattern as offline prune). Directory locks, snapshot symlink refusal, and ordered steps aim for safe partial-failure behavior.

Unit tests cover mid-history rollback, no-match/no-op, filter errors, and secondary key groups.

Reviewed by Cursor Bugbot for commit a5c712d. Bugbot is set up for automated code reviews on this repo. Configure here.

@github-actions

github-actions Bot commented Jul 1, 2026

Copy link
Copy Markdown

The latest Buf updates on your PR. Results from workflow Buf / buf (pull_request).

BuildFormatLintBreakingUpdated (UTC)
✅ passed✅ passed✅ passed✅ passedJul 1, 2026, 11:04 PM

}
if int(survivingKeyCount) == len(keys) {
// Nothing was written after the boundary; the segment is already in its target state.
return nil

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Early exit skips metadata repair

Medium Severity

When survivingKeyCount equals the number of keys read from disk, RollbackToKeyCount returns immediately without updating segment metadata or re-running value-file truncation. After an interrupted rollback that already rewrote the key file but failed before metadata was written, a retry hits this path and leaves stale keyCount in the metadata file permanently, so reopened tables report an inflated KeyCount() even though data reads are correct.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit c242ef6. Configure here.

@codecov

codecov Bot commented Jul 1, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 61.39241% with 61 lines in your changes missing coverage. Please review.
✅ Project coverage is 58.18%. Comparing base (1c66d87) to head (a5c712d).

Files with missing lines Patch % Lines
sei-db/db_engine/litt/rollback/rollback.go 67.22% 21 Missing and 18 partials ⚠️
...ei-db/db_engine/litt/disktable/segment/rollback.go 43.58% 12 Missing and 10 partials ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #3684      +/-   ##
==========================================
- Coverage   59.29%   58.18%   -1.12%     
==========================================
  Files        2272     2180      -92     
  Lines      188210   177279   -10931     
==========================================
- Hits       111594   103144    -8450     
+ Misses      66565    64984    -1581     
+ Partials    10051     9151     -900     
Flag Coverage Δ
sei-chain-pr 67.48% <61.39%> (+10.83%) ⬆️
sei-db 70.41% <ø> (ø)
sei-db-state-db ?

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
...ei-db/db_engine/litt/disktable/segment/rollback.go 43.58% <43.58%> (ø)
sei-db/db_engine/litt/rollback/rollback.go 67.22% <67.22%> (ø)

... and 165 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

seidroid[bot]
seidroid Bot previously requested changes Jul 1, 2026

@seidroid seidroid Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A well-tested offline LittDB rollback utility, but two crash/edge-case correctness gaps remain: rollback ignores the durable gc-watermark sidecar (which can make the DB unstartable after a rollback below the watermark), and RollbackToKeyCount's early-return breaks its own idempotency guarantee if interrupted mid-operation.

Findings: 2 blocking | 4 non-blocking | 2 posted inline

Blockers

  • gc-watermark not reconciled during rollback (sei-db/db_engine/litt/rollback/rollback.go, discardDerivedState / rollbackTable): the durable <root>/<table>/gc-watermark file records lowestReadableSegment and is loaded at startup (disk_table.go:229-256). Rollback deletes/truncates high-index segments but never touches this file. If the rollback point lands at or below the recorded watermark, the new highestSegmentIndex falls below lowestReadableSegment, and startup fails with gc-watermark ... exceeds highest segment ... on disk — leaving the DB unstartable. Rollback should clamp the watermark down to the new highest surviving segment (it cannot simply delete it, since below-watermark segments still on disk would resurrect GC'd keys).
  • 1 blocking issue(s) flagged inline on specific lines.

Non-blocking

  • REVIEW_GUIDELINES.md is empty/missing, so no repo-specific standards were applied.
  • cursor-review.md is empty — the Cursor second-opinion pass produced no output.
  • Consider a targeted test for the crash-recovery/idempotency path (interrupt after key-file swap, rerun) and for interaction with a previously-advanced gc-watermark; current tests exercise only clean, complete rollbacks.
  • 1 suggestion(s)/nit(s) flagged inline on specific lines.

// snapshot on its next start, so deleting them forces both back into sync with the truncated segments.
// Leaving them would let a stale keymap reference discarded keys, or let a snapshot's hard links pin the
// rolled-back data on disk. Removing a directory that does not exist is a no-op.
func discardDerivedState(roots []string, tableName string) error {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[blocker] discardDerivedState removes the keymap and snapshot (hard-link) directories but not the durable gc-watermark sidecar, which lives at <root>/<tableName>/gc-watermark (see GCWatermarkFileName, loaded per-root in disk_table.go:229). The watermark records lowestReadableSegment; after a rollback deletes/truncates high-index segments, the recorded value can exceed the new highestSegmentIndex, and startup then refuses to open the table (gc-watermark (lowest readable segment N) exceeds highest segment M on disk, disk_table.go:252-257), making the DB unstartable. Reconcile the watermark here: clamp it down to the new highest surviving segment index for each root when it exceeds that value. (Removing it outright is unsafe — below-watermark segments still on disk would resurrect GC-collected keys.)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's possible you started this branch before my keymap threading optimizations landed. The gc-watermark file was something introduced in my branch.

return fmt.Errorf("surviving key count %d exceeds the %d records in segment %d",
survivingKeyCount, len(keys), s.index)
}
if int(survivingKeyCount) == len(keys) {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[suggestion] This early return runs after the atomic key-file swap (step 1) but before value-file truncation (step 2) and the metadata keyCount rewrite (step 3). If the process is interrupted between step 1 and steps 2/3, re-running the same rollback reads the already-truncated key file, sees survivingKeyCount == len(keys), and returns nil — so the value files keep their trailing dead bytes (leaked disk) and metadata.keyCount stays inflated. Note keyCount is read verbatim from metadata for already-sealed segments on load (segment.go:236; sealLoadedSegment, which recomputes it, is skipped when metadata.sealed), so the stale count is not self-corrected — contradicting the idempotency guarantee documented in this method and in RollbackLittDB. Consider detecting an incomplete prior run (e.g. value-file size larger than the surviving prefix, or metadata keyCount != len(keys)) and finishing steps 2/3 rather than returning early.

@cody-littley cody-littley left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM once comments are addressed

Comment on lines +31 to +32
// For every table found under dataDirs, RollbackLittDB walks the key files from newest to oldest and
// invokes rollbackFilter for each key. The first key for which the filter returns true marks the rollback

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this walks multiple tables, then we need to add a table argument to the RollbackFilter. Key schema may be different across different tables.

// snapshot on its next start, so deleting them forces both back into sync with the truncated segments.
// Leaving them would let a stale keymap reference discarded keys, or let a snapshot's hard links pin the
// rolled-back data on disk. Removing a directory that does not exist is a no-op.
func discardDerivedState(roots []string, tableName string) error {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional:

Instead of dropping the keymap directory wholesale, we can prune the keymap to avoid the cost of rebuilding it. To do this, we'd iterate the keys we are removing, and issue batch deletion operations to the keymap. We'd need to make sure we delete the keys before the segments though, in order to ensure crash safety.

This may be a premature optimization, as rolling back LittDB data is likely going to be rare. I'll leave it up to you if you want to do this or not.

// snapshot on its next start, so deleting them forces both back into sync with the truncated segments.
// Leaving them would let a stale keymap reference discarded keys, or let a snapshot's hard links pin the
// rolled-back data on disk. Removing a directory that does not exist is a no-op.
func discardDerivedState(roots []string, tableName string) error {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's possible you started this branch before my keymap threading optimizations landed. The gc-watermark file was something introduced in my branch.

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes using default effort and found 3 potential issues.

There are 4 total unresolved issues (including 1 from previous review).

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit a5c712d. Configure here.

}
}
}
return nil

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stale gc-watermark blocks startup

High Severity

discardDerivedState removes the keymap and snapshot dirs but leaves the table-root gc-watermark file intact. After rollback shrinks the highest segment index below the durable watermark, OpenDiskTable rejects startup with a gc-watermark/segment mismatch, or reload skips segments below the stale floor even though their files were kept.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit a5c712d. Configure here.

// primaries and primaries that own secondary keys) and false for secondary keys. The first record for
// which the filter returns true is the rollback point: that record's group is kept along with everything
// written before it, and everything written after the group is discarded.
type RollbackFilter func(key []byte, isPrimary bool) (bool, error)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filter lacks table name

Medium Severity

RollbackFilter receives only key and isPrimary, yet RollbackLittDB applies the same callback to every table. Tables use different key encodings, so a block-height filter cannot choose the correct rollback point per table and may leave some tables unchanged or truncate others at the wrong boundary.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit a5c712d. Configure here.

for _, table := range tables {
if err := rollbackTable(logger, roots, table, rollbackFilter); err != nil {
return fmt.Errorf("failed to roll back table %q: %w", table, err)
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Partial multi-table rollback

Medium Severity

RollbackLittDB rolls back tables sequentially and returns on the first error. Tables processed earlier have already had derived state removed and segments truncated, while later tables may remain at the pre-rollback revision, leaving the database in a mixed-epoch state with no automatic recovery.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit a5c712d. Configure here.

@seidroid seidroid Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A well-structured offline LittDB rollback utility with careful crash-ordering and good test coverage; no confirmed correctness bugs, but the stateless filter cannot express "delete an entire table," which is a real gap for the stated block-height use case, plus a couple of minor robustness notes.

Findings: 0 blocking | 6 non-blocking | 2 posted inline

Blockers

  • None at the file/PR level.

Non-blocking

  • Codex's second concern (rollback ignoring a gc-watermark / durable-readable floor) appears mitigated here: boundary/lower-bound files are a snapshot-only concept and rollbackTable explicitly refuses symlinked snapshots, while the main store re-derives its segment range from disk on reopen (GatherSegmentFiles/scanDirectories) and the keymap is rebuilt — so deleting the newest segments does not leave a dangling high-water pointer. Worth a confirming glance, but not blocking.
  • The Cursor second-opinion review (cursor-review.md) was empty — that pass produced no output.
  • RollbackLittDB rolls back every table found under the roots unconditionally; unlike cli/prune.go it offers no table allowlist. For a whole-node height rollback that is the intent, but consider exposing table scoping to reduce blast radius for narrower operational use.
  • No prompt-injection or malicious content detected in the PR title, description, or diff.
  • 2 suggestion(s)/nit(s) flagged inline on specific lines.

if err != nil {
return err
}
if pivot == nil {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[suggestion] Semantics gap for the block-height use case: the filter is stateless per-key and can only say "keep from here down." A table whose keys are ALL newer than the rollback target never matches, so pivot == nil leaves it fully intact — the opposite of what a height rollback wants (that data should be deleted). The filter cannot distinguish "no match because everything is older (keep all)" from "no match because everything is newer (delete all)."

In practice most Sei tables are created at genesis and will always have a key at/before the target, so this may not trigger — but a table that first received writes after the target height would silently retain post-rollback state, risking an inconsistent app hash. Consider either documenting that the caller must guarantee every table has at least one record at/before the target, or extending the API so a caller can signal "delete this whole table."

return fmt.Errorf("surviving key count %d exceeds the %d records in segment %d",
survivingKeyCount, len(keys), s.index)
}
if int(survivingKeyCount) == len(keys) {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nit] Interruption edge case: if a prior run crashed after the atomic key-file swap (step 1) but before truncating the value files (step 2) / writing metadata (step 3), a re-run reads the already-truncated key file, so survivingKeyCount == len(keys) here and it returns early — leaving the value files over-sized and metadata.keyCount stale permanently. This is not corruption (the trailing value bytes are unreferenced and keyCount only feeds metrics), but the docstring's idempotency claim ("self-corrects on the next run") doesn't fully hold for these two artifacts. Worth a one-line caveat in the comment.

@seidroid seidroid Bot dismissed their stale review July 1, 2026 23:10

Superseded: latest AI review found no blocking issues.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants