Skip to content

<fix>[compute]: allow no-nic HA fallback across clusters#4210

Open
zstack-robot-1 wants to merge 1 commit into
4.8.38from
sync/zstackio/fix/ZSTAC-72149
Open

<fix>[compute]: allow no-nic HA fallback across clusters#4210
zstack-robot-1 wants to merge 1 commit into
4.8.38from
sync/zstackio/fix/ZSTAC-72149

Conversation

@zstack-robot-1

Copy link
Copy Markdown
Collaborator

Summary

Fix no-NIC VM HA allocation on 4.8.38. HA first prefers the VM's current cluster, then retries without the cluster restriction only for Auto HA when vm.ha.across.clusters is enabled.

Tests

  • git diff --check
  • mvn -pl compute -am -DskipTests package
  • Real 3-host validation with NeverStop VM and NFS network isolation:
    • host2 NFS blocked: HA moved VM to host1 in the same cluster
    • host1 NFS blocked with host2 disabled: HA moved VM to host3 in another cluster
    • vm.ha.across.clusters=false: VM stayed stopped instead of crossing clusters

Jira

ZSTAC-72149

sync from gitlab !10169

@coderabbitai

coderabbitai Bot commented Jun 9, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 1af9af91-5964-4f7e-8e38-248bc364c429

📥 Commits

Reviewing files that changed from the base of the PR and between b63bfa0 and 06e1535.

📒 Files selected for processing (2)
  • compute/src/main/java/org/zstack/compute/vm/VmAllocateHostForStoppedVmFlow.java
  • test/src/test/groovy/org/zstack/test/integration/kvm/vm/HaStartVmWithNoNicCase.groovy
🚧 Files skipped from review as they are similar to previous changes (2)
  • test/src/test/groovy/org/zstack/test/integration/kvm/vm/HaStartVmWithNoNicCase.groovy
  • compute/src/main/java/org/zstack/compute/vm/VmAllocateHostForStoppedVmFlow.java

Walkthrough

VmAllocateHostForStoppedVmFlow 改为先按 VM 规格计算并尝试分配到首选集群;若失败且满足无网卡/未指定 requiredHost/场景为 Auto 且配置允许,则重试不带首选集群。新增 allocateHost/handleReply/handleRetryReply 重构回包处理;新增集成测试 HaStartVmWithNoNicCase 覆盖三种 HA 启动场景。

变更

VM 主机分配与 HA 启动跨集群回退

层 / 文件 总结
主机分配流程与重试策略实现
compute/src/main/java/org/zstack/compute/vm/VmAllocateHostForStoppedVmFlow.java
引入 ResourceConfigFacade;run() 新增首选集群计算并在满足条件时执行首选集群失败后的重试;抽取 allocateHost() 组装并发送 DesignatedAllocateHostMsg,handleReply()/handleRetryReply() 负责回包处理与 VmInstanceVO(lastHostUuid/hostUuid/clusterUuid) 更新并推进或失败工作流。
集成测试基础设施与场景注册
test/src/test/groovy/org/zstack/test/integration/kvm/vm/HaStartVmWithNoNicCase.groovy 第 1–162 行
environment() 声明双集群、多宿主、主存储与 L2/L3 网络及三个测试 VM;test() 注册子场景;setup()/clean() 管理生命周期。
测试场景验证与工具方法
test/src/test/groovy/org/zstack/test/integration/kvm/vm/HaStartVmWithNoNicCase.groovy 第 163–240 行
实现 testPreferOriginalCluster()/testFallbackToAnotherCluster()/testNoFallbackWhenHaAcrossClustersDisabled() 三个断言场景;实现 detachAllNics(), haStartVm(), makeOriginalClusterUnavailable();新增 ZSTAC72149HaStartVmJudger 恒返回 true。
  • Possibly related PRs:
    • MatheMatrix/zstack#4027: 与 VmInstanceVO host 字段更新和 HA 预 fence 标签创建相关,代码层面存在关联。

🎯 3 (Moderate) | ⏱️ ~25 minutes

"我是小兔写代码,跳跃在集群间,
首选尝试若失落,跨簇重试再出发。
测试三景来验证,宿主回退与否分明,
笑看云端主机寻家路,胡萝卜庆功来一盘! 🥕🐇"

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed 标题清晰准确地概括了变更的主要目的——允许无网卡HA虚拟机在集群间进行故障转移,直接反映了PR的核心功能改进。
Description check ✅ Passed 描述充分关联了变更内容,包括修复概览、测试说明、Jira追踪和来源,清楚地解释了为什么进行此改动以及如何验证。
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch sync/zstackio/fix/ZSTAC-72149

Comment @coderabbitai help to get the list of available commands and usage tips.

@ZStack-Robot

Copy link
Copy Markdown
Collaborator

Comment from yaohua.wu:

Review: MR !10169 — ZSTAC-72149

Background (preserved across rounds)

  • Jira: ZSTAC-72149 — 【VM HA】无网卡云主机无法跨集群 HA (Bug, P2)
  • Bug summary: 无网卡云主机开启 NeverStop HA 后,当原 cluster 内无可用 host 时,HA start 仍带着云主机当前 cluster 作为分配约束;即使共享存储可达、其他 cluster 有可用 host,也无法恢复。
  • Intent & scope: 在 VmAllocateHostForStoppedVmFlow 中做两阶段分配——先按当前 cluster 分配;首次失败且满足「Auto HA + 未指定 required host/cluster + 无网卡 + VM 级 vm.ha.across.clusters=true」时,去掉 cluster 约束重试一次。Scope: compute/.../VmAllocateHostForStoppedVmFlow.java(重构 run() + 抽出 4 个私有方法)+ 新增集成测试 HaStartVmWithNoNicCase.groovy
  • Round 1 initial findings: 0 × Critical, 1 × Warning(新增集成测试未本地执行),2 × Suggestion
  • Suggested fix direction: 合并前在 CI/稳定性框架跑通 HaStartVmWithNoNicCase,确认该用例所编码的隐式契约成立。

P0 — Critical

无。

🟡 Warning

# File:Line Issue Fix Reviewer Conf Route
1 test/.../kvm/vm/HaStartVmWithNoNicCase.groovy(新增, 全文件) 新增集成测试未在本地执行(Jira 自测说明本机缺 mysql 未完整跑),仅靠三节点实机覆盖行为。该用例编码了多个隐式契约:HA 路径确实把 AllocationScene 设为 Auto、对 Running 状态的无网卡 VM 发 HaStartVmInstanceMsg 能触发再分配、detachAllNics 能逐个摘除最后一张网卡、自定义 ZSTAC72149HaStartVmJudger 正确生效。任一不成立则用例失败,而这些目前都未被自动化验证。 合并前跑:mvn test -Dtest=StabilityTestCase -Dcases=org.zstack.test.integration.kvm.vm.HaStartVmWithNoNicCase -Dtimes=2,确认通过 testing 0.80 advisory → 作者

🟢 Suggestion

# File:Line Issue Fix Reviewer Conf Route
1 VmAllocateHostForStoppedVmFlow.java getPreferredClusterUuid / shouldRetryWithoutPreferredCluster 两个私有方法重复推导同一组条件(requiredClusterUuid==null / 无网卡 / vm.clusterUuid)。当前可读、非必须改。 可抽 isNoNicClusterPinned(spec) 复用,减少漂移风险 maintainability 0.55 advisory
2 VmAllocateHostForStoppedVmFlow.java handleReply 失败分支 跨 cluster 重试失败时,返回给调用方/HA 日志的是「去约束重试」后的错误,首次「本 cluster 无可用 host」的错误被覆盖,降低 HA 排障可读性。 重试失败时把首个 reply.getError() 一并记入日志 observability 0.50 advisory

Coverage

  • 正确性已核对getPreferredClusterUuid 与原 cluster 约束逻辑严格等价(原先「两分支都不命中→不设 clusterUuid」≡ 现在 setClusterUuid(null),未设字段默认即 null);重试触发条件是 preferred-cluster 路径的严格子集(vm.clusterUuid != null 保证首次确实带约束),不会出现"约束与上次相同的无意义重试";异步链单次完成——失败且需重试时 return 不调 chain.fail,仅重试回调的 handleReply 终结链,无双触发。
  • 框架兜底已确认(排除一条疑似 Critical)VM_HA_ACROSS_CLUSTERS@GlobalConfigDef(defaultValue = "true", type = Boolean.class)rcf.getResourceConfigValue(..., Boolean.class) 必回退到该默认值、不返回 null,因此 && 链中的自动拆箱不会 NPE
  • 手动启动不受影响VmInstanceSpec.allocationScene 默认 null,gate 用 AllocationScene.Auto.equals(...)(null-safe);非 HA(manual start)路径不进入跨 cluster 重试,符合 Jira「普通手动启动不改变」。
  • 共享存储约束保留:重试复用同一 allocateHostRequiredPrimaryStorageUuids 不变,跨 cluster 目标 host 仍须可达同一主存储,行为正确。
  • Upstream freshness:目标分支 4.8.38diverged_commits_count=0merge_status=can_be_merged,无 rebase/冲突风险。
  • 被 Quality Gate 抑制:自动拆箱 NPE(G3 框架兜底);import/编译类问题(G5,作者已 mvn -pl compute -am -DskipTests package 通过)。

Verdict: APPROVED

修复正确、完整,落点在正确的模块/层(VmAllocateHostForStoppedVmFlow),与 Jira 根因与解决方案逐条一致;三节点实机覆盖了本 cluster 优先 / 跨 cluster fallback / 关闭后不 fallback 三类核心场景。唯一待办:合并前在 CI/稳定性框架跑通新增的 HaStartVmWithNoNicCase(当前仅实机验证、自动化用例未本地执行)。两条 Suggestion 可选采纳。


🤖 Robot Reviewer

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@compute/src/main/java/org/zstack/compute/vm/VmAllocateHostForStoppedVmFlow.java`:
- Around line 39-40: Rename the autowired field rcf to the full, non-abbreviated
name resourceConfigFacade in class VmAllocateHostForStoppedVmFlow (type
ResourceConfigFacade) and update every use site in this class (replace rcf
references, e.g., the usage currently referenced later in the file) to
resourceConfigFacade so the code compiles and follows the
no-unnecessary-abbreviations guideline.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: d6c90dd6-45c0-415b-aeea-5ac16a1dd8da

📥 Commits

Reviewing files that changed from the base of the PR and between 18b8819 and b63bfa0.

📒 Files selected for processing (2)
  • compute/src/main/java/org/zstack/compute/vm/VmAllocateHostForStoppedVmFlow.java
  • test/src/test/groovy/org/zstack/test/integration/kvm/vm/HaStartVmWithNoNicCase.groovy

Comment on lines +39 to +40
@Autowired
private ResourceConfigFacade rcf;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

字段命名使用了不必要的缩写。

字段 rcfResourceConfigFacade 的缩写。根据编码指南,"不允许使用不必要的缩写",应使用完整单词提升可读性。建议重命名为 resourceConfigFacade

🔧 建议的修复
 `@Autowired`
-private ResourceConfigFacade rcf;
+private ResourceConfigFacade resourceConfigFacade;

同时需要更新行90处的引用:

-        && rcf.getResourceConfigValue(VmGlobalConfig.VM_HA_ACROSS_CLUSTERS,
+        && resourceConfigFacade.getResourceConfigValue(VmGlobalConfig.VM_HA_ACROSS_CLUSTERS,
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@compute/src/main/java/org/zstack/compute/vm/VmAllocateHostForStoppedVmFlow.java`
around lines 39 - 40, Rename the autowired field rcf to the full,
non-abbreviated name resourceConfigFacade in class
VmAllocateHostForStoppedVmFlow (type ResourceConfigFacade) and update every use
site in this class (replace rcf references, e.g., the usage currently referenced
later in the file) to resourceConfigFacade so the code compiles and follows the
no-unnecessary-abbreviations guideline.

Source: Coding guidelines

@zstack-robot-2

Copy link
Copy Markdown
Collaborator

Comment from ya.wang:

已核对该 review。

  • 结论:当前是 APPROVED,没有阻塞代码问题。
  • Warning 处理:我本地重新尝试执行新增稳定性用例,原评论里的全仓命令会在非 test 模块被 surefire 拦住;按仓库实际模块方式改为:
mvn -Ppremium -pl test -am test -Dtest=TestCaseStabilityTest -Dcases=org.zstack.test.integration.kvm.vm.HaStartVmWithNoNicCase -Dtimes=2 -Dsurefire.failIfNoSpecifiedTests=false

该命令已编译到 test 模块,但 case 启动前部署测试 DB 失败,原因是本机缺 mysql CLI:

build/deploydb.sh root
sudo: mysql: command not found

因此新增 case 仍无法在当前本机完成自动化执行。三节点真实 ZStack 环境已覆盖核心行为:

场景 结果
NeverStop 无网卡 VM,原 host 阻断 NFS,同 cluster 还有可用 host HA 到同 cluster host,验证本 cluster 优先
同 cluster 无可用 host,跨 cluster host 可用,vm.ha.across.clusters=true HA 到另一个 cluster host
同 cluster 无可用 host,跨 cluster host 可用,vm.ha.across.clusters=false 不跨 cluster,保持失败/Stopped

两条 suggestion 我先不改:当前两个 helper 的条件虽然有少量重复,但分别表达“首次 cluster 约束”和“失败后是否可去约束重试”,可读性更直接;重试失败返回最终分配错误符合当前 flow 行为,首次错误已可从真实 HA/allocator 日志排查,暂不扩大变更范围。

1. Why?
No-nic VMs keep no L3 constraint during HA start. The stopped
VM allocator still pinned them to their previous cluster, so HA could
fail when that cluster had no available host.

2. How?
- Prefer the VM's current cluster for no-nic HA allocation first.
- Retry without the cluster limit only for auto HA starts.
- Gate the retry by the VM HA across clusters resource config.
- Add coverage for same-cluster, fallback, and disabled fallback.

3. Side effects?
Manual starts and explicit host or cluster starts keep their existing
allocation constraints.

# Summary of changes (by module):
- compute: retry no-nic HA allocation without cluster when allowed
- test: cover no-nic HA cluster preference and fallback behavior

Resolves: ZSTAC-72149
Change-Id: I4ede874e00f33b9b1fbb386520b3096f2d5c910f
@MatheMatrix

Copy link
Copy Markdown
Owner

Comment from ya.wang:

两个 suggestion 已处理并推送到当前 MR:

  • 抽出 isNoNicVmWithCurrentCluster(spec),复用首次 cluster 约束和跨 cluster retry gate 的公共条件。
  • 增加 retry 失败日志:跨 cluster 重试也失败时,会同时记录首次 preferred cluster 分配错误和 retry 分配错误;对调用方返回的最终错误保持不变。

验证:

git diff --check
mvn -pl compute -am -DskipTests package

均通过。

@MatheMatrix MatheMatrix force-pushed the sync/zstackio/fix/ZSTAC-72149 branch from b63bfa0 to 06e1535 Compare June 9, 2026 14:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants