HBASE-28659 Fix NPE in hmaster RegionStates.setServerState when the ServerStateNode is missing #8339

Open
SwaraliJoshi wants to merge 1 commit into apache:master from SwaraliJoshi:HBASE-28659

@SwaraliJoshi SwaraliJoshi commented Jun 11, 2026

Problem statement

A ServerCrashProcedure (SCP) can crash the PEWorker with a NullPointerException while processing a dead RegionServer, after a Master restart:

ERROR [PEWorker-15] procedure2.ProcedureExecutor: CODE-BUG: Uncaught runtime exception:
  pid=48, state=RUNNABLE:SERVER_CRASH_SPLIT_LOGS, hasLock=true;
  ServerCrashProcedure hregion1,16020,1715424228375, splitWal=true, meta=true
java.lang.NullPointerException: null
  at o.a.h.h.master.assignment.RegionStates.setServerState(RegionStates.java:409)
  at o.a.h.h.master.assignment.RegionStates.logSplitting(RegionStates.java:435)
  at o.a.h.h.master.procedure.ServerCrashProcedure.executeFromState(ServerCrashProcedure.java:226)

setServerState dereferences the result of getServerNode(serverName) inside synchronized (serverNode), but getServerNode is documented to return null when no node exists:

private void setServerState(ServerName serverName, ServerState state) {
  ServerStateNode serverNode = getServerNode(serverName);
  synchronized (serverNode) {   // NPE when serverNode == null
    serverNode.setState(state);
  }
}

This is worse than a single failed procedure: the SCP does not support rollback, so the uncaught exception triggers an unsupported rollback (... does not support rollback but the execution failed and try to rollback, code bug?). The crashed server's WAL split and region reassignment never complete — and since this SCP carried meta, meta recovery is abandoned, which can wedge the cluster until manual intervention.

Root cause

ServerStateNodes live only in the in-memory RegionStates.serverMap; they are never persisted. On Master startup the map is rebuilt only from:

  • RegionServerTracker.upgrade() → createServer() for the live set: rsListStorage.getAll() ∪ MasterWalManager.getLiveServersFromWALDir() (WAL dirs that do not end in -splitting),
  • RegionServers re-registering (ServerManager.recordNewServerWithLock),
  • region locations loaded from hbase:meta.

Servers that only have an outstanding SCP are added to the deadservers list by findDeadServersAndProcess, but no ServerStateNode is created for them. So if a persisted SCP resumes for a server that:

  1. has its WAL dir already renamed to -splitting (counted as "splitting", explicitly excluded from "live"),
  2. has come back with a new start code (so re-registration creates a node for a different ServerName), and
  3. is no longer referenced by any region in meta,

…then getServerNode(oldServerName) returns null, and the next setServerState call (logSplitting at SERVER_CRASH_SPLIT_LOGS) NPEs.
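
The three conditions above can be reproduced in a toy model. This is a minimal sketch, not HBase code: the class `ServerMapRebuildSketch`, its `Node` type, and the server names `rs1,16020,100` and `rs1,16020,200` are all hypothetical stand-ins chosen for illustration. It only shows that a map rebuilt from the three listed sources has no entry for a server whose sole remaining trace is an outstanding SCP.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Toy model of the in-memory server map rebuild; not the real HBase classes. */
final class ServerMapRebuildSketch {

  /** Hypothetical stand-in for a per-server node, keyed by "host,port,startcode". */
  static final class Node {
    final String serverName;

    Node(String serverName) {
      this.serverName = serverName;
    }
  }

  /**
   * Rebuilds the map from the three sources named in the description:
   * the "live" set, re-registering servers, and region locations from meta.
   * Servers known only through an outstanding SCP are not an input at all.
   */
  static Map<String, Node> rebuild(List<String> liveServers,
                                   List<String> reRegistered,
                                   List<String> metaLocations) {
    Map<String, Node> serverMap = new ConcurrentHashMap<>();
    for (List<String> source : List.of(liveServers, reRegistered, metaLocations)) {
      for (String name : source) {
        serverMap.computeIfAbsent(name, Node::new);
      }
    }
    return serverMap;
  }

  public static void main(String[] args) {
    String oldName = "rs1,16020,100"; // made-up name: crashed instance, WAL dir already "-splitting"
    String newName = "rs1,16020,200"; // made-up name: same host, new start code

    // Condition 1: old name is not in the live set.
    // Condition 2: only the new name re-registers.
    // Condition 3: meta no longer references the old name.
    Map<String, Node> serverMap = rebuild(List.of(), List.of(newName), List.of());

    System.out.println("old node present: " + serverMap.containsKey(oldName));
    System.out.println("new node present: " + serverMap.containsKey(newName));
  }
}
```

In this model a lookup for the old name returns null regardless of timing, which matches the claim that the failure is state-dependent rather than a race.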

This is deterministic given that on-disk/in-memory state — it is not a timing race. It is "rare" only because reaching that state requires a Master restart to land inside an SCP's WAL-splitting window. A 2.5.8 → 3.0 upgrade (rolling Master failovers while RegionServers bounce) widens that window dramatically, which is why the reporter only saw it during migration.

Timeline (from the attached hbase--master log)

| Time (UTC) | Event |
| --- | --- |
| 10:43:51 | hregion1,16020,1715424228375 registers → createServer() → ServerStateNode created |
| 10:45:02 | hregion1 znode deleted → expiration → SCP pid=48 scheduled (carrying meta); runs through SERVER_CRASH_SPLIT_META_LOGS while the node still exists |
| 10:45:05 | hregion2 (now hosting meta) aborts: Failed to open region hbase:meta ... can not recover → SCP pid=51 scheduled |
| 10:45:08 | Terminating master: the Master process itself goes down (external/harness event, unrelated to the SCP) |
| 10:45:11 | STARTING service HMaster: fresh Master boots; in-memory serverMap is empty |
| 10:45:42 | Upgrading RegionServerTracker ... 2 existing ServerCrashProcedures, 0 possibly 'live' servers, and 2 'splitting'; old hregion1 is in the splitting set, not live, so no node is recreated |
| 10:45:49 | hregion1 re-registers with a new start code ...343997 → node created for the new name only |
| 10:45:57 | SCP pid=48 resumes for old ...228375 → logSplitting → getServerNode returns null → NPE → SCP fails → unsupported rollback |

The Master restart and the SCP are independently triggered events (a RegionServer crash created the SCP; the Master process was terminated separately). The bug is the coincidence of the two.

Solution

Guard against a missing ServerStateNode in setServerState. Since the split-state it sets is in-memory bookkeeping (the only consumers check isInState(ServerState.ONLINE), and an absent/crashed server is not ONLINE), there is nothing to update when the node is gone — so skip rather than NPE.

This single guard covers all four split helpers that route through setServerState: metaLogSplitting, metaLogSplit, logSplitting, and logSplit.

This is consistent with how the surrounding assignment code already treats a null server node as a tolerated, expected condition, e.g. AssignmentManager.submitServerCrash (explicit serverNode == null handling) and RegionStates.removeRegionFromServer (documented to allow a null node).

With the guard in place, the resumed SCP proceeds normally: WAL splitting is performed by the child SplitWALProcedure (which never needed the node), regions are reassigned, and SERVER_CRASH_FINISH → removeServer is a harmless no-op on the absent key. The SCP runs to completion instead of failing.
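
The guard described above can be sketched in a simplified, self-contained model. This is an illustration under stated assumptions, not the patch itself: the class `SetServerStateGuardSketch` is hypothetical, server names are plain strings instead of `ServerName`, the `ServerState` values and their mapping to the four helpers are assumed for the sketch, and `setServerState` returns a boolean here only so the skip is observable (the method shown in the description is `void`).

```java
import java.util.concurrent.ConcurrentHashMap;

/** Simplified model of the null guard; names mirror the PR text but this is not the HBase source. */
final class SetServerStateGuardSketch {

  /** Assumed state names for this sketch. */
  enum ServerState { ONLINE, SPLITTING_META, SPLITTING_META_DONE, SPLITTING, OFFLINE }

  /** Minimal stand-in for ServerStateNode. */
  static final class ServerStateNode {
    private ServerState state = ServerState.ONLINE;

    synchronized void setState(ServerState newState) {
      this.state = newState;
    }

    synchronized ServerState getState() {
      return state;
    }
  }

  private final ConcurrentHashMap<String, ServerStateNode> serverMap = new ConcurrentHashMap<>();

  ServerStateNode createServer(String serverName) {
    return serverMap.computeIfAbsent(serverName, k -> new ServerStateNode());
  }

  /** Returns null when no node exists, as in the description. */
  ServerStateNode getServerNode(String serverName) {
    return serverMap.get(serverName);
  }

  /** Removing an absent key is a no-op. */
  void removeServer(String serverName) {
    serverMap.remove(serverName);
  }

  /** Returns true if the state was updated, false if the node was missing and the update was skipped. */
  boolean setServerState(String serverName, ServerState state) {
    ServerStateNode serverNode = getServerNode(serverName);
    if (serverNode == null) {
      // Nothing to update: the node is in-memory bookkeeping and the server is already gone.
      return false;
    }
    synchronized (serverNode) {
      serverNode.setState(state);
    }
    return true;
  }

  // The four split helpers all route through the guarded method.
  boolean metaLogSplitting(String s) { return setServerState(s, ServerState.SPLITTING_META); }
  boolean metaLogSplit(String s) { return setServerState(s, ServerState.SPLITTING_META_DONE); }
  boolean logSplitting(String s) { return setServerState(s, ServerState.SPLITTING); }
  boolean logSplit(String s) { return setServerState(s, ServerState.OFFLINE); }

  public static void main(String[] args) {
    SetServerStateGuardSketch states = new SetServerStateGuardSketch();
    String missing = "rs1,16020,100"; // made-up name with no node
    System.out.println("updated missing node: " + states.logSplitting(missing));
    states.removeServer(missing); // absent key, nothing happens
    System.out.println("node created implicitly: " + (states.getServerNode(missing) != null));
  }
}
```

Because the guard sits in the one shared method, the helpers need no changes of their own, and a missing node is never created as a side effect of the lookup.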

Impact

  • Eliminates the NullPointerException in setServerState during SCP replay after a Master restart/failover.
  • Prevents the more damaging downstream effect: an SCP failure → unsupported rollback → abandoned WAL split / meta recovery → stuck regions requiring manual recovery.
  • No behavior change on the normal path (when the ServerStateNode exists). The skipped state update is purely informational bookkeeping for a server that is already gone.
  • Low-risk and easily backportable (single localized guard), which matters because this bug surfaces during version upgrades.

This is a targeted, defensive fix. A complementary follow-up (tracked separately) could restore the invariant by recreating ServerStateNodes for servers with outstanding SCPs at startup, but that carries the ONLINE-vs-CRASHED initial-state subtlety and a wider blast radius, so it is intentionally out of scope here.

Testing

  • Added a regression test in TestRegionStates:
    • testLogSplittingNoOpWhenServerNodeMissing — reproduces the post-restart scenario (a server with no ServerStateNode) and asserts all four split helpers (metaLogSplitting, metaLogSplit, logSplitting, logSplit) are safe no-ops, do not throw, and do not implicitly create a node. Before the fix this threw NPE.
    • testLogSplittingOkWhenServerNodePresent — contrast case confirming normal behavior when the node exists.
  • Verified the SCP integration suites pass: TestSCP, TestSCPWithMeta (meta-carrying scenario from the JIRA), TestRollbackSCP.

@SwaraliJoshi SwaraliJoshi marked this pull request as draft June 11, 2026 07:12
@SwaraliJoshi SwaraliJoshi changed the title HBASE-28659 Fix NPE in hmaster (setServerState function) HBASE-28659 Fix NPE in hmaster RegionStates.setServerState when the ServerStateNode is missing Jun 11, 2026
@SwaraliJoshi SwaraliJoshi marked this pull request as ready for review June 11, 2026 10:57
Apache9 commented Jun 13, 2026

Contributor

So this can only happen when there are actually no regions on the region server? Otherwise, when loading regions in AssignmentManager, we would create the server node?

And I think we should create the server node here, since this is not something incorrect; a warning message here will confuse users...
