[https://nvbugs/6322073][fix] Add _needs_x86_nccl_pp_workaround() and init_nccl_pp_workaround() helpers… #15480

Open
tensorrt-cicd wants to merge 1 commit into NVIDIA:main from tensorrt-cicd:repair-bot-bug6322073

@tensorrt-cicd tensorrt-cicd commented Jun 18, 2026

Collaborator

Summary

  • Root cause: NCCL <2.30.4 NVLS/MNNVL static-connection setup deadlocks during ncclCommInitRank on x86_64 (B300 SM103) hosts when TRT-LLM forces NCCL_RUNTIME_CONNECT=0; PP-2 disagg workers hang before reaching readiness.
  • Fix: Add _needs_x86_nccl_pp_workaround() and init_nccl_pp_workaround() helpers in torch_custom_ops.py; call the latter from init_pp_comm before constructing PPCommNCCL. Honors user-set env via os.environ.setdefault. Diff limited to exactly two .py files (avoiding the prior rejection's binary scope creep).
  • Automated fix generated by repair-bot
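
The "honors user-set env" behavior in the fix bullet comes down to how `setdefault` treats keys that already exist. A minimal sketch, assuming a plain dict standing in for `os.environ` (which supports the same `setdefault` method); the `apply_defaults` helper is hypothetical and not part of the PR:

```python
def apply_defaults(env, defaults):
    """Write each default only when the key is absent.

    Returns the list of keys that were actually written, so a caller can
    tell which settings came from the workaround and which from the user.
    """
    written = []
    for key, value in defaults.items():
        if key not in env:
            written.append(key)
        # setdefault leaves an existing value untouched.
        env.setdefault(key, value)
    return written


# Simulated environment in which the user explicitly set NCCL_NVLS_ENABLE.
env = {"NCCL_NVLS_ENABLE": "1"}
written = apply_defaults(
    env,
    {"NCCL_NVLS_ENABLE": "0", "NCCL_MNNVL_ENABLE": "0"},
)
```

Here the user's `NCCL_NVLS_ENABLE` value survives and only `NCCL_MNNVL_ENABLE` is filled in, which is the override path a user would rely on to opt out of the workaround.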

Test plan

  • Verify fix on the same GPU type as the original failure
  • Check for regressions in related tests

Links

Summary by CodeRabbit

  • Bug Fixes
    • Improved stability of distributed pipeline-parallel initialization: on affected hosts (x86_64 or GB10, depending on the NCCL runtime version), NCCL_NVLS_ENABLE and NCCL_MNNVL_ENABLE now default to "0" unless the user has already set them, preventing a deadlock during NCCL communicator setup.

…ank on x86_64+NCCL<2.30.4

NCCL <2.30.4 NVLS/MNNVL static-connection setup deadlocks during
ncclCommInitRank on x86_64 hosts (B300 SM103 here) when TRT-LLM forces
NCCL_RUNTIME_CONNECT=0 in cpp/tensorrt_llm/runtime/ncclCommunicator.cpp.
Disagg test_disaggregated_ctxpp2_genpp2 hangs because ctx and gen workers
each try to create a PP-2 NCCL communicator and stall before "Init
COMPLETE", leaving the disagg server unable to reach readiness.

Generalize the existing GB10 (DGX Spark) workaround to also fire on
x86_64 hosts running NCCL <2.30.4, and disable BOTH NCCL_MNNVL_ENABLE and
NCCL_NVLS_ENABLE via os.environ.setdefault from init_pp_comm before
NcclCommunicatorOp triggers ncclCommInitRank.  Mirrors the platform-aware
pattern landed by 448e5e2577 for the AllReduce path
(_needs_nccl_symmetric_workaround) and the NCCL fix shipped in 2.30.4.
On GB200/aarch64 with NCCL >=2.30.4, behavior is unchanged (env vars are
not touched).
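
The gating described above can be sketched roughly as follows. This is not the PR's code: the function names, the `nccl_version` tuple parameter, the `is_gb10` flag, and the injectable `env` mapping are all hypothetical, and the sketch assumes the version gate applies to both x86_64 and GB10 hosts, which the commit message does not spell out.

```python
import os
import platform

# First NCCL release that, per the commit message, ships the fix.
FIXED_NCCL_VERSION = (2, 30, 4)


def needs_pp_workaround_sketch(nccl_version, machine=None, is_gb10=False):
    """Return True when the host/NCCL combination is assumed to be affected.

    nccl_version: (major, minor, patch) tuple supplied by the caller.
    machine:      architecture string; defaults to platform.machine().
    is_gb10:      caller-supplied flag (how GB10 is detected is out of scope).
    """
    machine = platform.machine() if machine is None else machine
    affected_host = machine == "x86_64" or is_gb10
    # Assumption: the version gate applies to x86_64 and GB10 alike.
    return affected_host and tuple(nccl_version) < FIXED_NCCL_VERSION


def apply_pp_workaround_sketch(nccl_version, env=None, machine=None,
                               is_gb10=False):
    """Disable NVLS and MNNVL by default on affected hosts.

    Returns True if the workaround path was taken, False otherwise.
    """
    env = os.environ if env is None else env
    if not needs_pp_workaround_sketch(nccl_version, machine, is_gb10):
        return False
    # setdefault keeps any value the user already exported.
    env.setdefault("NCCL_NVLS_ENABLE", "0")
    env.setdefault("NCCL_MNNVL_ENABLE", "0")
    return True
```

Passing `env` and `machine` explicitly keeps the predicate testable on any host; with both left as `None` the sketch reads the real process environment and architecture.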

Signed-off-by: tensorrt-cicd <90828364+tensorrt-cicd@users.noreply.github.com>

coderabbitai Bot commented Jun 18, 2026

Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 69efcef7-4318-40d2-bd98-e0b7347f8f3b

📥 Commits

Reviewing files that changed from the base of the PR and between 4a8b7af and 5b607dc.

📒 Files selected for processing (2)
  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
  • tensorrt_llm/_torch/distributed/communicator.py

📝 Walkthrough


Adds _NCCL_NVLS_ENABLE constant and init_nccl_pp_workaround() to torch_custom_ops.py. The helper checks NCCL runtime version and host architecture (x86_64 or GB10) to conditionally set NCCL_NVLS_ENABLE and NCCL_MNNVL_ENABLE to "0" via os.environ.setdefault. This function is then called in the non-MPI branch of init_pp_comm before PPCommNCCL construction.

Changes

NCCL PP NVLS/MNNVL deadlock workaround

| Layer / File(s) | Summary |
| --- | --- |
| init_nccl_pp_workaround helper and constants (tensorrt_llm/_torch/custom_ops/torch_custom_ops.py) | Adds platform import, _NCCL_NVLS_ENABLE constant, and init_nccl_pp_workaround() which conditionally sets NCCL_NVLS_ENABLE and NCCL_MNNVL_ENABLE to "0" based on NCCL runtime version and whether the host is x86_64 or GB10. Returns a bool indicating whether the workaround was applied. |
| Workaround invocation in init_pp_comm (tensorrt_llm/_torch/distributed/communicator.py) | Imports and calls init_nccl_pp_workaround in the non-MPI PP communicator initialization path before PPCommNCCL construction, preventing NCCL/NVLS/MNNVL hangs in disaggregated PP-2 workers. |
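
Why the call has to come before the communicator is built can be illustrated with stand-ins. In this sketch `FakeComm` and `init_pp_comm_sketch` are hypothetical; the fake communicator snapshots its configuration once at construction, as a stand-in for the initialization the PR says happens during ncclCommInitRank, and the version/architecture gate is left out for brevity.

```python
class FakeComm:
    """Stand-in for a communicator that reads its settings once, at init."""

    def __init__(self, env):
        # Values set after this point would not be seen by this object.
        self.nvls_at_init = env.get("NCCL_NVLS_ENABLE", "unset")
        self.mnnvl_at_init = env.get("NCCL_MNNVL_ENABLE", "unset")


def init_pp_comm_sketch(env, use_mpi=False):
    """Hypothetical ordering: apply env defaults, then build the communicator."""
    if use_mpi:
        # The MPI branch is outside the scope of this sketch.
        return None
    # Step 1: set defaults while the communicator does not exist yet.
    env.setdefault("NCCL_NVLS_ENABLE", "0")
    env.setdefault("NCCL_MNNVL_ENABLE", "0")
    # Step 2: construct the communicator, which snapshots the settings.
    return FakeComm(env)


comm = init_pp_comm_sketch({})
late = FakeComm({})  # built with no defaults applied beforehand
```

`comm` sees both variables as "0", while `late` sees them as unset, which mirrors why the workaround is invoked ahead of PPCommNCCL construction rather than after it.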

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

  • NVIDIA/TensorRT-LLM#12902: Introduced _is_gb10() detection in the same torch_custom_ops.py module for GB10-specific NCCL workaround logic that init_nccl_pp_workaround now also relies on.

Suggested reviewers

  • HuiGao-NV
  • Tabrizian
  • hyukn
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately reflects the main change: adding two helper functions to fix an NCCL/NVLS deadlock issue on x86_64 hosts, with proper issue reference and fix type designation. |
| Description check | ✅ Passed | The description covers the root cause, fix details, test verification, and links to the bug report. However, it lacks explicit Test Coverage and PR Checklist sections required by the template. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

