HBASE-28659 Fix NPE in hmaster RegionStates.setServerState when the ServerStateNode is missing by SwaraliJoshi · Pull Request #8339 · apache/hbase

SwaraliJoshi · 2026-06-11T07:12:07Z

Problem statement

A ServerCrashProcedure (SCP) can crash the PEWorker with a NullPointerException while processing a dead RegionServer, after a Master restart:

ERROR [PEWorker-15] procedure2.ProcedureExecutor: CODE-BUG: Uncaught runtime exception:
  pid=48, state=RUNNABLE:SERVER_CRASH_SPLIT_LOGS, hasLock=true;
  ServerCrashProcedure hregion1,16020,1715424228375, splitWal=true, meta=true
java.lang.NullPointerException: null
  at o.a.h.h.master.assignment.RegionStates.setServerState(RegionStates.java:409)
  at o.a.h.h.master.assignment.RegionStates.logSplitting(RegionStates.java:435)
  at o.a.h.h.master.procedure.ServerCrashProcedure.executeFromState(ServerCrashProcedure.java:226)

setServerState dereferences the result of getServerNode(serverName) inside synchronized (serverNode), but getServerNode is documented to return null when no node exists:

private void setServerState(ServerName serverName, ServerState state) {
  ServerStateNode serverNode = getServerNode(serverName);
  synchronized (serverNode) {   // NPE when serverNode == null
    serverNode.setState(state);
  }
}

This is worse than a single failed procedure: the SCP does not support rollback, so the uncaught exception triggers an unsupported rollback (... does not support rollback but the execution failed and try to rollback, code bug?). The crashed server's WAL split and region reassignment never complete — and since this SCP carried meta, meta recovery is abandoned, which can wedge the cluster until manual intervention.

Root cause

ServerStateNodes live only in the in-memory RegionStates.serverMap; they are never persisted. On Master startup the map is rebuilt only from:

- RegionServerTracker.upgrade() → createServer() for the live set: rsListStorage.getAll() ∪ MasterWalManager.getLiveServersFromWALDir() (WAL dirs that do not end in -splitting),
- RegionServers re-registering (ServerManager.recordNewServerWithLock),
- region locations loaded from hbase:meta.

Servers that only have an outstanding SCP are added to the deadservers list by findDeadServersAndProcess, but no ServerStateNode is created for them. So if a persisted SCP resumes for a server that:

- has its WAL dir already renamed to -splitting (counted as "splitting", explicitly excluded from "live"),
- has come back with a new start code (so re-registration creates a node for a different ServerName), and
- is no longer referenced by any region in meta,

…then getServerNode(oldServerName) returns null, and the next setServerState call (logSplitting at SERVER_CRASH_SPLIT_LOGS) NPEs.
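The three rebuild sources above can be modelled in a few lines to show why the old ServerName ends up without a node. This is a simplified sketch, not HBase code: ServerMapRebuildSketch, its rebuild method and the stand-in ServerStateNode are hypothetical, server names are plain strings, and the new start code is made up for illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical model of how the in-memory server map is rebuilt after a Master restart. */
public class ServerMapRebuildSketch {

  /** Stand-in for ServerStateNode; the real class carries state and region lists. */
  static final class ServerStateNode {
    final String serverName;

    ServerStateNode(String serverName) {
      this.serverName = serverName;
    }
  }

  /**
   * Builds a fresh map from the three sources named in the description:
   * the live set, re-registering RegionServers, and locations found in meta.
   * Servers that only have an outstanding SCP appear in none of them.
   */
  static Map<String, ServerStateNode> rebuild(Set<String> liveServers,
      Set<String> reRegistered, Set<String> metaLocations) {
    Map<String, ServerStateNode> serverMap = new ConcurrentHashMap<>();
    for (Set<String> source : List.of(liveServers, reRegistered, metaLocations)) {
      for (String name : source) {
        serverMap.computeIfAbsent(name, ServerStateNode::new);
      }
    }
    return serverMap;
  }

  public static void main(String[] args) {
    String oldName = "hregion1,16020,1715424228375"; // name from the stack trace
    String newName = "hregion1,16020,1700000000000"; // made-up start code, illustration only

    // Old WAL dir ends in -splitting (not live), no region in meta points at the
    // old name, and the host re-registered under a new start code.
    Map<String, ServerStateNode> serverMap = rebuild(Set.of(), Set.of(newName), Set.of());

    System.out.println("node for old name present: " + serverMap.containsKey(oldName));
    System.out.println("node for new name present: " + serverMap.containsKey(newName));
  }
}
```

In this model a lookup for the old name returns null, which is the value setServerState then tries to synchronize on.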

This is deterministic given that on-disk/in-memory state — it is not a timing race. It is "rare" only because reaching that state requires a Master restart to land inside an SCP's WAL-splitting window. A 2.5.8 → 3.0 upgrade (rolling Master failovers while RegionServers bounce) widens that window dramatically, which is why the reporter only saw it during migration.

Timeline (from the attached hbase--master log)

| Time (UTC) | Event |
| --- | --- |
| 10:43:51 | hregion1,16020,1715424228375 registers → createServer() → ServerStateNode created |
| 10:45:02 | hregion1 znode deleted → expiration → SCP pid=48 scheduled (carrying meta); runs through SERVER_CRASH_SPLIT_META_LOGS while the node still exists |
| 10:45:05 | hregion2 (now hosting meta) aborts: Failed to open region hbase:meta ... can not recover → SCP pid=51 scheduled |
| 10:45:08 | Terminating master: the Master process itself goes down (external/harness event, unrelated to the SCP) |
| 10:45:11 | STARTING service HMaster: fresh Master boots; in-memory serverMap is empty |
| 10:45:42 | Upgrading RegionServerTracker ... 2 existing ServerCrashProcedures, 0 possibly 'live' servers, and 2 'splitting': old hregion1 is in the splitting set, not live, so no node is recreated |
| 10:45:49 | hregion1 re-registers with a new start code ...343997 → node created for the new name only |
| 10:45:57 | SCP pid=48 resumes for old ...228375 → logSplitting → getServerNode returns null → NPE → SCP fails → unsupported rollback |

The Master restart and the SCP are independently triggered events (a RegionServer crash created the SCP; the Master process was terminated separately). The bug is the coincidence of the two.

Solution

Guard against a missing ServerStateNode in setServerState. Since the split-state it sets is in-memory bookkeeping (the only consumers check isInState(ServerState.ONLINE), and an absent/crashed server is not ONLINE), there is nothing to update when the node is gone — so skip rather than NPE.

This single guard covers all four split helpers that route through setServerState: metaLogSplitting, metaLogSplit, logSplitting, and logSplit.
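As a rough illustration of the guard (not the literal patch), the sketch below models setServerState with a null check and routes the four split helpers through it. All types here are simplified stand-ins: server names are strings, RegionStatesSketch is hypothetical, and the state each helper sets reflects my reading of RegionStates, so treat the mapping as an assumption.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical, self-contained model of the null guard in setServerState. */
public class SetServerStateGuardSketch {

  /** Simplified stand-in for the ServerState enum. */
  enum ServerState { ONLINE, SPLITTING_META, SPLITTING_META_DONE, SPLITTING, OFFLINE }

  /** Simplified stand-in for ServerStateNode. */
  static final class ServerStateNode {
    private ServerState state = ServerState.ONLINE;

    void setState(ServerState state) {
      this.state = state;
    }

    ServerState getState() {
      return state;
    }
  }

  static final class RegionStatesSketch {
    private final Map<String, ServerStateNode> serverMap = new ConcurrentHashMap<>();

    ServerStateNode createServer(String serverName) {
      return serverMap.computeIfAbsent(serverName, k -> new ServerStateNode());
    }

    /** Returns null when no node exists, as in the description. */
    ServerStateNode getServerNode(String serverName) {
      return serverMap.get(serverName);
    }

    private void setServerState(String serverName, ServerState state) {
      ServerStateNode serverNode = getServerNode(serverName);
      if (serverNode == null) {
        // Node is gone (e.g. SCP replay after a Master restart): nothing to update.
        return;
      }
      synchronized (serverNode) {
        serverNode.setState(state);
      }
    }

    // The four helpers share the guard because they all route through setServerState.
    void metaLogSplitting(String serverName) { setServerState(serverName, ServerState.SPLITTING_META); }
    void metaLogSplit(String serverName) { setServerState(serverName, ServerState.SPLITTING_META_DONE); }
    void logSplitting(String serverName) { setServerState(serverName, ServerState.SPLITTING); }
    void logSplit(String serverName) { setServerState(serverName, ServerState.OFFLINE); }
  }

  public static void main(String[] args) {
    RegionStatesSketch states = new RegionStatesSketch();
    String missing = "hregion1,16020,1715424228375"; // no node was ever created for it here

    states.metaLogSplitting(missing);
    states.metaLogSplit(missing);
    states.logSplitting(missing);
    states.logSplit(missing);
    System.out.println("node created implicitly: " + (states.getServerNode(missing) != null));

    String present = "hregion2,16020,1"; // made-up name for the contrast case
    states.createServer(present);
    states.logSplitting(present);
    System.out.println("state of present node: " + states.getServerNode(present).getState());
  }
}
```

The early return keeps the normal path untouched: when the node exists, the synchronized update runs exactly as before.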

This is consistent with how the surrounding assignment code already treats a null server node as a tolerated, expected condition, e.g. AssignmentManager.submitServerCrash (explicit serverNode == null handling) and RegionStates.removeRegionFromServer (documented to allow a null node).

With the guard in place, the resumed SCP proceeds normally: WAL splitting is performed by the child SplitWALProcedure (which never needed the node), regions are reassigned, and SERVER_CRASH_FINISH → removeServer is a harmless no-op on the absent key. The SCP runs to completion instead of failing.

Impact

- Eliminates the NullPointerException in setServerState during SCP replay after a Master restart/failover.
- Prevents the more damaging downstream effect: an SCP failure → unsupported rollback → abandoned WAL split / meta recovery → stuck regions requiring manual recovery.
- No behavior change on the normal path (when the ServerStateNode exists). The skipped state update is purely informational bookkeeping for a server that is already gone.
- Low-risk and easily backportable (single localized guard), which matters because this bug surfaces during version upgrades.

This is a targeted, defensive fix. A complementary follow-up (tracked separately) could restore the invariant by recreating ServerStateNodes for servers with outstanding SCPs at startup, but that carries the ONLINE-vs-CRASHED initial-state subtlety and a wider blast radius, so it is intentionally out of scope here.
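To make the initial-state subtlety concrete, here is a hypothetical sketch of what that follow-up might look like. It assumes a freshly created node defaults to ONLINE, so a recreated node for a dead server would have to be moved to CRASHED explicitly. RecreateNodesAtStartupSketch and recreateForOutstandingScps are illustrative names only; nothing here is part of this PR.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch of recreating nodes for servers with outstanding SCPs. */
public class RecreateNodesAtStartupSketch {

  /** Reduced stand-in for ServerState; only the two states relevant here. */
  enum ServerState { ONLINE, CRASHED }

  /** Simplified stand-in for ServerStateNode. */
  static final class ServerStateNode {
    // Assumption: a new node starts as ONLINE unless told otherwise.
    private ServerState state = ServerState.ONLINE;

    synchronized void setState(ServerState state) {
      this.state = state;
    }

    synchronized boolean isInState(ServerState expected) {
      return state == expected;
    }
  }

  /**
   * Recreates a node per server that still has an SCP and marks it CRASHED.
   * Skipping the setState call would leave a dead server looking ONLINE to
   * any consumer that checks isInState(ServerState.ONLINE).
   */
  static Map<String, ServerStateNode> recreateForOutstandingScps(List<String> serversWithScp) {
    Map<String, ServerStateNode> serverMap = new ConcurrentHashMap<>();
    for (String serverName : serversWithScp) {
      ServerStateNode node = serverMap.computeIfAbsent(serverName, k -> new ServerStateNode());
      node.setState(ServerState.CRASHED);
    }
    return serverMap;
  }

  public static void main(String[] args) {
    String oldName = "hregion1,16020,1715424228375"; // name from the stack trace
    Map<String, ServerStateNode> serverMap = recreateForOutstandingScps(List.of(oldName));
    System.out.println("seen as ONLINE: " + serverMap.get(oldName).isInState(ServerState.ONLINE));
  }
}
```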

Testing

Added a regression test in TestRegionStates:
- testLogSplittingNoOpWhenServerNodeMissing — reproduces the post-restart scenario (a server with no ServerStateNode) and asserts all four split helpers (metaLogSplitting, metaLogSplit, logSplitting, logSplit) are safe no-ops, do not throw, and do not implicitly create a node. Before the fix this threw NPE.
- testLogSplittingOkWhenServerNodePresent — contrast case confirming normal behavior when the node exists.
Verified the SCP integration suites pass: TestSCP, TestSCPWithMeta (meta-carrying scenario from the JIRA), TestRollbackSCP.

Apache9 · 2026-06-13T08:52:54Z

So this can only happen when there are actually no regions on the region server? Otherwise, when loading regions in AssignmentManager, we would create the server node?

And I think we should create the server node here, since this is not an incorrect state; a warning message here will confuse users...

HBASE-28659 Fix NPE in hmaster (setServerState function)

a3fdab8

SwaraliJoshi marked this pull request as draft June 11, 2026 07:12

SwaraliJoshi changed the title ~~HBASE-28659 Fix NPE in hmaster (setServerState function)~~ HBASE-28659 Fix NPE in hmaster RegionStates.setServerState when the ServerStateNode is missing Jun 11, 2026

SwaraliJoshi marked this pull request as ready for review June 11, 2026 10:57

SwaraliJoshi wants to merge 1 commit into apache:master from SwaraliJoshi:HBASE-28659
