HBASE-28659 Fix NPE in hmaster RegionStates.setServerState when the ServerStateNode is missing#8339
Open
SwaraliJoshi wants to merge 1 commit into
Open
HBASE-28659 Fix NPE in hmaster RegionStates.setServerState when the ServerStateNode is missing#8339SwaraliJoshi wants to merge 1 commit into
SwaraliJoshi wants to merge 1 commit into
Conversation
Contributor
|
So this could only happen where is actually no regions on the region server? Otherwise when loading regions in AssignmentManger, we will create the server node? And I think we should create the server node here since this is not something incorrect, a warning message here will confusing the users... |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem statement
A ServerCrashProcedure (SCP) can crash the PEWorker with a NullPointerException while processing a dead RegionServer, after a Master restart:
setServerState dereferences the result of getServerNode(serverName) inside synchronized (serverNode), but getServerNode is documented to return null when no node exists:
This is worse than a single failed procedure: the SCP does not support rollback, so the uncaught exception triggers an unsupported rollback (
... does not support rollback but the execution failed and try to rollback, code bug?). The crashed server's WAL split and region reassignment never complete — and since this SCP carried meta, meta recovery is abandoned, which can wedge the cluster until manual intervention.Root cause
ServerStateNodes live only in the in-memory RegionStates.serverMap; they are never persisted. On Master startup the map is rebuilt only from:
Servers that only have an outstanding SCP are added to the deadservers list by findDeadServersAndProcess, but no ServerStateNode is created for them. So if a persisted SCP resumes for a server that:
…then getServerNode(oldServerName) returns null, and the next setServerState call (logSplitting at SERVER_CRASH_SPLIT_LOGS) NPEs.
This is deterministic given that on-disk/in-memory state — it is not a timing race. It is "rare" only because reaching that state requires a Master restart to land inside an SCP's WAL-splitting window. A 2.5.8 → 3.0 upgrade (rolling Master failovers while RegionServers bounce) widens that window dramatically, which is why the reporter only saw it during migration.
Timeline (from the attached hbase--master log)
The Master restart and the SCP are independently triggered events (a RegionServer crash created the SCP; the Master process was terminated separately). The bug is the coincidence of the two.
Solution
Guard against a missing ServerStateNode in setServerState. Since the split-state it sets is in-memory bookkeeping (the only consumers check isInState(ServerState.ONLINE), and an absent/crashed server is not ONLINE), there is nothing to update when the node is gone — so skip rather than NPE.
This single guard covers all four split helpers that route through setServerState: metaLogSplitting, metaLogSplit, logSplitting, and logSplit.
This is consistent with how the rest of the class already treats a null server node as a tolerated, expected condition — e.g. AssignmentManager.submitServerCrash (explicit serverNode == null handling) and RegionStates.removeRegionFromServer (documented to allow a null node).
With the guard in place, the resumed SCP proceeds normally: WAL splitting is performed by the child SplitWALProcedure (which never needed the node), regions are reassigned, and SERVER_CRASH_FINISH → removeServer is a harmless no-op on the absent key. The SCP runs to completion instead of failing.
Impact
This is a targeted, defensive fix. A complementary follow-up (tracked separately) could restore the invariant by recreating ServerStateNodes for servers with outstanding SCPs at startup, but that carries the ONLINE-vs-CRASHED initial-state subtlety and a wider blast radius, so it is intentionally out of scope here.
Testing