Skip to content

HBASE-29435 Add MapreduceRestoreSnapshotHelper.java to guard against accidental data loss#8248

Open
SwaraliJoshi wants to merge 3 commits into
apache:masterfrom
SwaraliJoshi:W-22336363-hbase-29435-limit-hfile-actions
Open

HBASE-29435 Add MapreduceRestoreSnapshotHelper.java to guard against accidental data loss#8248
SwaraliJoshi wants to merge 3 commits into
apache:masterfrom
SwaraliJoshi:W-22336363-hbase-29435-limit-hfile-actions

Conversation

@SwaraliJoshi

@SwaraliJoshi SwaraliJoshi commented May 18, 2026

Copy link
Copy Markdown

Problem

When a MapReduce snapshot-scanning job is misconfigured with restoreDir pointing to (or under) the production HBase data directory (hbase.rootdir/data/), RestoreSnapshotHelper.copySnapshotForScanner() creates HFileLink reference files directly inside the live data directory. When the MR job ends and cleanup triggers HFileArchiver, production HFiles get moved into the archive, where HFileCleaner (default TTL: 5 minutes) permanently deletes them — causing silent, irreversible data loss.

The existing same-filesystem and sub-directory checks in RestoreSnapshotHelper did not catch the exact-match case (restoreDir == rootDir), and the resulting error message gave no hint about the data-loss risk.

JIRA HBASE-29435

Current Impact (Without Fix)

A single misconfigured restoreDir parameter (e.g., set to hbase.rootdir or hbase.rootdir/data/...) can cause permanent production data loss.
The failure mode is silent — no exception, no warning; data simply disappears after the HFileCleaner TTL elapses.
Every MapReduce snapshot-scanning path is exposed: TableSnapshotInputFormat, MultiTableSnapshotInputFormat, and VerifyReplication (peer cluster).
Stricter validation cannot simply be added to RestoreSnapshotHelper itself — Master-side procedures (RestoreSnapshotProcedure, CloneSnapshotProcedure) legitimately invoke that helper against the production hbase.rootdir, and tightening it would break those flows.

Approach

An earlier revision of this PR forked the entire RestoreSnapshotHelper (~830 lines) into the MapReduce module. Following review feedback (thanks @gvprathyusha6 and @sairampola), it has been reworked so that no security-critical core restore/clone logic is duplicated:

  1. Pluggable archiver, no forked restore logic. A new RestoreSnapshotArchiver interface (in hbase-server) abstracts the only operations that differ between the server-side and MapReduce restore paths — region and family archiving. RestoreSnapshotHelper now routes its two archive call sites through an injectable RestoreSnapshotArchiver, defaulting to one backed by HFileArchiver. Master-side RestoreSnapshotProcedure / CloneSnapshotProcedure keep using the default and are behaviorally unchanged.

  2. Thin MapReduce wrapper. MapreduceRestoreSnapshotHelper is now a small wrapper with no copied logic. It performs the MR-specific safety guard and then delegates to RestoreSnapshotHelper.copySnapshotForScanner(..., archiver), injecting the MR-local MapreduceHFileArchiver.

  3. Tighter path validation. The guard rejects both the sub-directory case and the previously-uncovered exact-match case (restoreDir == rootDir), logs at ERROR for ops visibility, then throws IllegalArgumentException with a clear data-loss message and remediation — before any filesystem work:

String rootPath = rootDir.toUri().getPath();
String restorePath = restoreDir.toUri().getPath();
if (restorePath.equals(rootPath) || restorePath.startsWith(rootPath + "/")) {
  String message = "BLOCKED: MapReduce restore directory cannot be the HBase root directory "
    + "or a sub directory of it. This could lead to accidental archival and permanent "
    + "data loss if the path falls under " + rootDir + "/data/. Use a temporary directory "
    + "outside of hbase.rootdir for MR snapshot scanning. RootDir: " + rootDir
    + ", restoreDir: " + restoreDir;
  LOG.error(message);
  throw new IllegalArgumentException(message);
}
  1. All in-module call sites migrated to the MR-local helper: TableSnapshotInputFormatImpl.setInput(), MultiTableSnapshotInputFormatImpl, and VerifyReplication.restoreSnapshotForPeerCluster(). CompactionTool is intentionally left untouched — it operates on production data by design — and now carries a comment documenting that it is exempt from the HBASE-29435 guard.

Note: every shipped caller restores into a fresh, unique (UUID) directory, so the operation always takes the non-destructive clone path. The guard is a fail-fast safety net for misconfigured or custom callers that pass a pre-populated / production directory.

Impact After Fix

  • Misconfigured restoreDir (whether equal to hbase.rootdir or any sub-path of it) immediately throws IllegalArgumentException with a clear, HBASE-29435-tagged error message.
  • The job fails fast before any files are created or modified in production directories — zero risk of accidental archival or data loss from MR snapshot scanning paths.
  • LOG.error(message) provides clear ops visibility — blocked attempts surface in MR job logs without requiring stack-trace parsing.
  • No impact on legitimate operations: Master's RestoreSnapshotProcedure and CloneSnapshotProcedure continue to use the original RestoreSnapshotHelper and remain unchanged.
  • No performance overhead — two string comparisons at job-setup time.
  • No core restore/clone logic is duplicated; the ~1,400-line fork is avoided. Master-side restore/clone procedures use the unchanged default archiver.
Entry Point Coverage
TableSnapshotInputFormatImpl.setInput() Single-table snapshot scans, mapred API wrappers, ClientSideRegionScanner
MultiTableSnapshotInputFormatImpl.setInput() Multi-table snapshot scans
VerifyReplication.restoreSnapshotForPeerCluster() Replication verification with peer snapshots

Files Changed

File Change
snapshot/RestoreSnapshotArchiver.java (new, hbase-server) Pluggable archiver strategy; DEFAULT backed by HFileArchiver
snapshot/RestoreSnapshotHelper.java (hbase-server) Routes archiving through the injectable archiver; adds a copySnapshotForScanner(..., archiver) overload and an archiver-accepting constructor
util/MapreduceHFileArchiver.java (hbase-mapreduce) MR-local archiver; now implements RestoreSnapshotArchiver; added ASF license header + @InterfaceAudience.Private; removed dead archiveExecutor field
mapreduce/MapreduceRestoreSnapshotHelper.java (hbase-mapreduce) Rewritten as a thin guard + delegate (no copied core logic)
mapreduce/TableSnapshotInputFormatImpl.java Routes restore through the MR helper
mapreduce/MultiTableSnapshotInputFormatImpl.java Routes restore through the MR helper
mapreduce/replication/VerifyReplication.java Routes peer-cluster restore through the MR helper
regionserver/CompactionTool.java Comment documenting its intentional exemption from the guard
snapshot/TestRestoreSnapshotArchiver.java (new test) Verifies the injected archiver is used on region removal and not used on the fresh clone path; covers the default archiver
mapreduce/TestMapreduceRestoreSnapshotHelper.java (test) Adds guard throw-path tests (exact-match + sub-dir) plus clone/TTL delegation; removed unused imports/helpers
util/TestMapreduceHFileArchiver.java (new test) FS-level tests confirming the fork implements the interface and archives region/family correctly

Testing

All passing (run via mvn test on hbase-server and hbase-mapreduce):

Test Result
TestRestoreSnapshotArchiver 4/4
TestRestoreSnapshotHelper 6/6
TestMobRestoreSnapshotHelper 6/6
TestMapreduceRestoreSnapshotHelper 4/4
TestMapreduceHFileArchiver 4/4

JaCoCo coverage on the new/changed code: RestoreSnapshotArchiver 100% line/method; MapreduceRestoreSnapshotHelper 100% line/branch/method; MapreduceHFileArchiver 97% method coverage (happy-path archiving fully covered); the RestoreSnapshotHelper paths touched by this change are exercised.

@SwaraliJoshi SwaraliJoshi marked this pull request as draft May 18, 2026 06:50
@SwaraliJoshi SwaraliJoshi force-pushed the W-22336363-hbase-29435-limit-hfile-actions branch from 7f5b2a8 to 498e80b Compare May 25, 2026 05:48
// Guard: ensure restoreDir is not under production data directory (HBASE-29435)
MapReduceArchiveGuard.validateNotProductionDataPath(rootDir, restoreDir);

RestoreSnapshotHelper.copySnapshotForScanner(conf, fs, rootDir, restoreDir, snapshotName);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, these are the APIs that are being used across mapreduce module and server module (for the Snapshot procedure code), lets separate these as part of this change

… data

Introduce MapreduceRestoreSnapshotHelper, a MapReduce-local fork of
RestoreSnapshotHelper, so the MR module can layer additional safety
validation on snapshot restoration without affecting Master/server-side
restore and clone procedures, which legitimately operate against the
production hbase.rootdir.

In copySnapshotForScanner(), the path-prefix check is tightened to also
reject the exact-match case (restoreDir == rootDir), and the error
message is rewritten to call out the data-loss risk, the remediation
(use a temporary directory outside hbase.rootdir), and HBASE-29435 for
traceability. The guard now also logs at ERROR level before throwing so
ops can grep blocked attempts in MR job logs.

Migrate all hbase-mapreduce call sites of
RestoreSnapshotHelper.copySnapshotForScanner to the new MR-local helper:
- TableSnapshotInputFormatImpl.setInput()
- MultiTableSnapshotInputFormatImpl.setInput()
- VerifyReplication.restoreSnapshotForPeerCluster()
- TestTableSnapshotInputFormat (the test that directly invoked the helper)

CompactionTool is intentionally left untouched as it operates on
production data by design.

Add TestMapreduceRestoreSnapshotHelper covering the migration and
validation logic. The hbase-mapreduce module now depends on
hbase-procedure's test-jar so MR tests can use ProcedureTestingUtility.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@SwaraliJoshi SwaraliJoshi force-pushed the W-22336363-hbase-29435-limit-hfile-actions branch from 498e80b to 2c76178 Compare May 26, 2026 07:23
@SwaraliJoshi SwaraliJoshi changed the title [WIP] HBASE-29435 Add RestoreSnapshotHFileArchiver to guard against accidental data loss HBASE-29435 Add RestoreSnapshotHFileArchiver to guard against accidental data loss Jun 1, 2026
@SwaraliJoshi SwaraliJoshi changed the title HBASE-29435 Add RestoreSnapshotHFileArchiver to guard against accidental data loss HBASE-29435 Add MapreduceRestoreSnapshotHelper.java to guard against accidental data loss Jun 1, 2026
@SwaraliJoshi SwaraliJoshi marked this pull request as ready for review June 1, 2026 05:43
* Restore the on-disk table to a specified snapshot state.
* @return the set of regions touched by the restore operation
*/
public MapreduceRestoreSnapshotHelper.RestoreMetaChanges restoreHdfsRegions() throws IOException {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We dont need these RestoreMetaChanges for map reduce right? Are we using it somewhere ?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, we are not using the RestoreMetaChanges anywhere in map reduce. I have removed them in the latest commit.

Comment on lines +624 to +642
private StoreFileInfo restoreStoreFile(final Path familyDir, final RegionInfo regionInfo,
final SnapshotProtos.SnapshotRegionManifest.StoreFile storeFile, final boolean createBackRef,
final StoreFileTracker tracker) throws IOException {
String hfileName = storeFile.getName();
StoreFileInfo info = null;
if (HFileLink.isHFileLink(hfileName) || StoreFileInfo.isMobFileLink(hfileName)) {
HFileLink hfileLink = tracker.createFromHFileLink(hfileName, createBackRef);
info = new StoreFileInfo(conf, fs, new Path(familyDir, hfileName), hfileLink);
return info;
} else if (StoreFileInfo.isReference(hfileName)) {
return restoreReferenceFile(familyDir, regionInfo, storeFile, tracker);
} else {
HFileLink hfileLink = tracker.createAndCommitHFileLink(regionInfo.getTable(),
regionInfo.getEncodedName(), hfileName, createBackRef);
return new StoreFileInfo(conf, fs, new Path(familyDir, HFileLink
.createHFileLinkName(regionInfo.getTable(), regionInfo.getEncodedName(), hfileName)),
hfileLink);
}
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand that the jira mentions to have a new set of helper classes, but on the second thoughts core logics like these should still be same across mapreduce and procedures, I think it is good to have both MapReduceRestoreSnapshotHelper, RestoreSnapshotHelper
but lets not copy core logics and call the RestoreSnapshotHelper itself from MapReduceRestoreSnapshotHelper and send in the MapReduceHFileArchiver as a parameter to the methods which involve HFileArchiving. @SwaraliJoshi

@sairampola sairampola left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for tackling this — the failure mode is real and nasty (silent, TTL-delayed production data loss), and the diagnosis of the uncovered exact-match case is correct. A few things I'd want addressed before this is mergeable, grouped by severity.

Blocking (CI will fail on each independently)

  1. Missing Apache license headerhbase-mapreduce/.../util/MapreduceHFileArchiver.java starts directly at package …; the ASF header block was dropped. apache-rat:check will fail. (MapreduceRestoreSnapshotHelper kept its header.)
  2. Missing @InterfaceAudience annotationMapreduceHFileArchiver is a public class with no audience annotation; the original HFileArchiver has @InterfaceAudience.Private and checkstyle enforces it. The import was dropped too.
  3. Line length > 100 cols — several lines in both new files exceed the limit (e.g. MapreduceRestoreSnapshotHelper L170, L330, L782, L823; MapreduceHFileArchiver L42, L107, L178), mostly from expanding nested-type names to fully-qualified form. mvn spotless:apply + dropping the redundant self-qualifier will fix it.
  4. Unused imports in the new testTestMapreduceRestoreSnapshotHelper has several unused imports (MetaTableAccessor, SnapshotManifest, MonitoredTask, HRegion, FSTableDescriptors, WALSplitUtil, ForeignExceptionDispatcher, MergeTableRegionsProcedure, Mockito), which fails checkstyle UnusedImports.

Design / correctness

  1. Most of the new validation is unreachable, and the HFileArchiver fork may be unnecessary. In copySnapshotForScanner, the existing check already rejects the sub-directory case:

    if (restoreDir.toUri().getPath().startsWith(rootDir.toUri().getPath() + "/")) {
      throw new IllegalArgumentException("Restore directory cannot be a sub directory ...");
    }

    The new block re-tests that same startsWith(rootPath + "/") condition, so that half is dead — the only genuinely new behavior is restorePath.equals(rootPath) (exact match). Suggest collapsing the two blocks into one guard with the better data-loss message.

    More broadly: the guard runs before the helper is constructed, and MapreduceHFileArchiver still archives/deletes exactly as the original does. Since the dangerous archival of production HFiles is only reachable when restoreDir == rootDir (or under it) — which the guard now rejects up front — it's not clear the ~570-line archiver fork buys any additional safety. Given hbase-mapreduce already depends on hbase-server (so the original HFileArchiver/RestoreSnapshotHelper are on the classpath), could the fix be just the guard on the MR call path, dropping MapreduceHFileArchiver entirely? Forking ~1,400 lines of security-critical code that will silently diverge from upstream is a significant maintenance cost, and I'd want a strong reason before taking it on.

  2. Guard comparison is authority-blind, weakest on the VerifyReplication peer path. The checks use toUri().getPath(), which strips scheme + authority. For the peer-cluster flow (different NameNode, operator-supplied location args) safety rests entirely on the filesystem-URI equality check; trailing slashes and .. traversal can also slip past a raw startsWith. The original HFileArchiver.archiveRecoveredEdits asserted fs scheme consistency before archiving — that's not carried here. Suggest comparing fs.makeQualified(restoreDir) vs fs.makeQualified(rootDir) (normalizes + includes authority) and asserting the passed fs matches rootDir's filesystem.

  3. The headline behavior isn't tested. The only assertThrows covers SnapshotTTLExpiredException (a pre-existing path). There's no assertion that the new guard throws for restoreDir == rootDir or a true sub-dir. testNoHFileLinkInRootDir uses restoreDir = "/hbase/.tmp-restore" against a deep getDefaultRootDirPath(), so it exercises the pass-through branch, not the throw. These can be fast unit tests (the guard throws before any FS work):

    • exact match (restoreDir == rootDir) throws
    • sub-dir (new Path(rootDir, "data/.tmp")) throws
    • sibling dir is allowed (negative contract)
    • different filesystem throws

    Asserting on a distinguishing substring of each message keeps the cases from passing on the wrong branch.

  4. PR description doesn't match the diff. It states hbase-mapreduce/pom.xml was changed to add an hbase-procedure test-jar "so MR tests can use ProcedureTestingUtility," but pom.xml isn't in the diff, the test never references ProcedureTestingUtility, and it compiles against the current pom. The "Files Changed" table also omits MapreduceHFileArchiver.java. Worth correcting so reviewers know the real surface area.

  5. Leftover dead code from the copy. archiveExecutor in MapreduceHFileArchiver is now an orphaned field (its users were removed). copySnapshotForScanner is public but returns the now-private RestoreMetaChanges, and all call sites discard it — returning void would be cleaner. The test also carries several unused private helpers (createAndAssertSnapshot, flipCompactions, getMasterProcedureExecutor, createSnapshotMock, verifyRestore, getReferredToFile).

Suggestions

  • testCopyExpiredSnapshotForScanner sleeps up to ~60s of wall-clock in a MediumTests test, which risks exceeding the category timeout on a busy CI agent. Injecting a manual EnvironmentEdge would make it deterministic and sub-second.
  • LOG.error(message) immediately before throw new IllegalArgumentException(message) double-logs the same text; the sibling checks throw without it.
  • A couple of fs.delete(...) boolean returns are ignored in the restore tree — low impact (throwaway dir), but a WARN on false would match surrounding conventions.

What's good

  • The fail-fast guard logs at ERROR and throws with a clear remediation + JIRA reference before any filesystem mutation.
  • The archiver's core data-loss guarantee (per-file failure → failures queue → FailedArchiveException) is preserved intact through the fork.
  • TestTableSnapshotInputFormat.testReadFromRestoredSnapshotViaMR is a solid contract test (restore → scan → succeed, then delete restore dir → assert failure).

Overall: the safety goal looks achievable with a small, well-tested guard; I'd lean toward minimizing (or avoiding) the fork and adding direct tests for the throw paths. Thanks again for catching this.

Rework the earlier RestoreSnapshotHelper fork so no core restore/clone
logic is duplicated:

- Add RestoreSnapshotArchiver strategy in hbase-server (default backed by
  HFileArchiver) and route RestoreSnapshotHelper's region/family archiving
  through it; add a copySnapshotForScanner overload that accepts an archiver.
- Rewrite MapreduceRestoreSnapshotHelper as a thin wrapper: validate that
  restoreDir is not the HBase root directory (or a sub directory of it),
  then delegate to RestoreSnapshotHelper injecting MapreduceHFileArchiver.
- MapreduceHFileArchiver now implements RestoreSnapshotArchiver; add the
  Apache license header and @InterfaceAudience.Private; drop the dead
  archiveExecutor field.
- Document that CompactionTool intentionally operates on production data and
  is exempt from the guard.
- Add TestRestoreSnapshotArchiver and TestMapreduceHFileArchiver; add guard
  throw-path tests and remove unused imports/helpers in
  TestMapreduceRestoreSnapshotHelper.

Co-authored-by: Cursor <cursoragent@cursor.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants