[https://nvbugs/6322073][fix] Add `_needs_x86_nccl_pp_workaround()` and `init_nccl_pp_workaround()` helpers… by tensorrt-cicd · Pull Request #15480 · NVIDIA/TensorRT-LLM

tensorrt-cicd · 2026-06-18T17:23:24Z

Summary

Root cause: In NCCL versions before 2.30.4, the NVLS/MNNVL static-connection setup deadlocks inside ncclCommInitRank on x86_64 hosts (observed on B300, SM103) when TRT-LLM forces NCCL_RUNTIME_CONNECT=0. As a result, PP-2 disaggregated workers hang before reaching readiness.
Fix: Add the _needs_x86_nccl_pp_workaround() and init_nccl_pp_workaround() helpers to torch_custom_ops.py, and call the latter from init_pp_comm before constructing PPCommNCCL. The workaround uses os.environ.setdefault, so any value the user has already set is honored. The diff is limited to exactly two .py files, avoiding the binary scope creep that caused the prior rejection.
Automated fix generated by repair-bot
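
The gating condition described above can be sketched roughly as follows. This is an illustration only, not the code in the PR: the function name `needs_x86_workaround_sketch`, its parameters, and the constant `FIXED_NCCL_VERSION` are hypothetical, and the NCCL version is passed in as a `(major, minor, patch)` tuple rather than queried from the runtime.

```python
import platform

# Assumption: 2.30.4 is the first NCCL release with the fix, per the PR text.
FIXED_NCCL_VERSION = (2, 30, 4)


def needs_x86_workaround_sketch(nccl_version, machine=None):
    """Return True when the host is x86_64 and NCCL predates the fix.

    nccl_version: (major, minor, patch) tuple supplied by the caller.
    machine: architecture string; defaults to platform.machine().
    """
    if machine is None:
        machine = platform.machine()
    # platform.machine() reports "x86_64" on Linux and "AMD64" on Windows.
    is_x86_64 = machine.lower() in ("x86_64", "amd64")
    # Tuples compare element by element, so (2, 28, 9) < (2, 30, 4).
    return is_x86_64 and tuple(nccl_version) < FIXED_NCCL_VERSION
```

Passing the version and architecture in as arguments keeps the check testable on any machine, which may help when the failing GPU type is not available.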

Test plan

Verify fix on the same GPU type as the original failure
Check for regressions in related tests

Links

Bug: https://nvbugs/6322073

Summary by CodeRabbit

Bug Fixes
- Improved stability of distributed pipeline initialization by applying targeted configuration adjustments for specific hardware and NCCL runtime combinations, preventing potential deadlock scenarios in certain deployment configurations.

…ank on x86_64+NCCL<2.30.4

NCCL <2.30.4 NVLS/MNNVL static-connection setup deadlocks during ncclCommInitRank on x86_64 hosts (B300 SM103 here) when TRT-LLM forces NCCL_RUNTIME_CONNECT=0 in cpp/tensorrt_llm/runtime/ncclCommunicator.cpp. Disagg test_disaggregated_ctxpp2_genpp2 hangs because ctx and gen workers each try to create a PP-2 NCCL communicator and stall before "Init COMPLETE", leaving the disagg server unable to reach readiness.

Generalize the existing GB10 (DGX Spark) workaround to also fire on x86_64 hosts running NCCL <2.30.4, and disable BOTH NCCL_MNNVL_ENABLE and NCCL_NVLS_ENABLE via os.environ.setdefault from init_pp_comm before NcclCommunicatorOp triggers ncclCommInitRank.

Mirrors the platform-aware pattern landed by 448e5e2577 for the AllReduce path (_needs_nccl_symmetric_workaround) and the NCCL fix shipped in 2.30.4. On GB200/aarch64 with NCCL >=2.30.4, behavior is unchanged (env vars are not touched).

Signed-off-by: tensorrt-cicd <90828364+tensorrt-cicd@users.noreply.github.com>

coderabbitai · 2026-06-18T17:28:06Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 69efcef7-4318-40d2-bd98-e0b7347f8f3b

📥 Commits

Reviewing files that changed from the base of the PR and between 4a8b7af and 5b607dc.

📒 Files selected for processing (2)

tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
tensorrt_llm/_torch/distributed/communicator.py

Walkthrough

Adds _NCCL_NVLS_ENABLE constant and init_nccl_pp_workaround() to torch_custom_ops.py. The helper checks NCCL runtime version and host architecture (x86_64 or GB10) to conditionally set NCCL_NVLS_ENABLE and NCCL_MNNVL_ENABLE to "0" via os.environ.setdefault. This function is then called in the non-MPI branch of init_pp_comm before PPCommNCCL construction.
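
A minimal sketch of the setdefault behavior described in this walkthrough, assuming the decision about whether the workaround is needed has already been made elsewhere. The function name `apply_pp_env_defaults` and its `env` parameter are hypothetical and exist only so the behavior can be exercised against a plain dict; only the two variable names and the `"0"` value come from the PR.

```python
import os


def apply_pp_env_defaults(should_apply, env=None):
    """Set the two NCCL variables to "0" unless the user already set them.

    Returns True when the workaround path ran, False otherwise.
    """
    env = os.environ if env is None else env
    if not should_apply:
        return False
    for name in ("NCCL_NVLS_ENABLE", "NCCL_MNNVL_ENABLE"):
        # setdefault writes only when the key is absent, so an explicit
        # user setting such as NCCL_NVLS_ENABLE=1 is left untouched.
        env.setdefault(name, "0")
    return True
```

Because `setdefault` never overwrites, a user who exports one of these variables before launch keeps that value, while the other variable still receives the default.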

Changes

NCCL PP NVLS/MNNVL deadlock workaround

| Layer / File(s) | Summary |
| --- | --- |
| `init_nccl_pp_workaround` helper and constants (`tensorrt_llm/_torch/custom_ops/torch_custom_ops.py`) | Adds `platform` import, `_NCCL_NVLS_ENABLE` constant, and `init_nccl_pp_workaround()` which conditionally sets `NCCL_NVLS_ENABLE` and `NCCL_MNNVL_ENABLE` to `"0"` based on NCCL runtime version and whether the host is x86_64 or GB10. Returns a bool indicating whether the workaround was applied. |
| Workaround invocation in `init_pp_comm` (`tensorrt_llm/_torch/distributed/communicator.py`) | Imports and calls `init_nccl_pp_workaround` in the non-MPI PP communicator initialization path before `PPCommNCCL` construction, preventing NCCL/NVLS/MNNVL hangs in disaggregated PP-2 workers. |
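
The ordering constraint in the second row (environment first, communicator second) can be illustrated with stand-ins. Everything in this sketch is hypothetical: `FakePPComm` replaces `PPCommNCCL`, `init_pp_comm_sketch` is not the real `init_pp_comm`, and the MPI branch is a placeholder because the source does not describe it.

```python
class FakePPComm:
    """Hypothetical stand-in for PPCommNCCL.

    Records the NVLS setting visible at construction time, which is when
    the real communicator would trigger ncclCommInitRank.
    """

    def __init__(self, env):
        self.nvls_at_init = env.get("NCCL_NVLS_ENABLE")


def init_pp_comm_sketch(use_mpi, workaround, env):
    """Run the workaround before building the communicator (non-MPI path)."""
    if use_mpi:
        return None  # placeholder: the MPI path is outside this sketch
    workaround(env)  # must run first so NCCL sees the variables at init
    return FakePPComm(env)


def demo_workaround(env):
    # Mirrors the setdefault approach: existing user values are kept.
    env.setdefault("NCCL_NVLS_ENABLE", "0")
    env.setdefault("NCCL_MNNVL_ENABLE", "0")
```

If the two calls were swapped, the communicator would be constructed before the variables were set, which is the situation the PR is trying to avoid.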

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

NVIDIA/TensorRT-LLM#12902: Introduced _is_gb10() detection in the same torch_custom_ops.py module for GB10-specific NCCL workaround logic that init_nccl_pp_workaround now also relies on.

Suggested reviewers

HuiGao-NV
Tabrizian
hyukn

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately reflects the main change: adding two helper functions to fix an NCCL/NVLS deadlock issue on x86_64 hosts, with proper issue reference and fix type designation. |
| Description check | ✅ Passed | The description covers the root cause, fix details, test verification, and links to the bug report. However, it lacks explicit Test Coverage and PR Checklist sections required by the template. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


tensorrt-cicd requested review from a team as code owners June 18, 2026 17:23

tensorrt-cicd requested a review from hyukn June 18, 2026 17:23

tensorrt-cicd assigned chuangz0 Jun 18, 2026

tensorrt-cicd requested a review from symphonylyh June 18, 2026 17:23

github-actions Bot assigned tensorrt-cicd Jun 18, 2026

tensorrt-cicd wants to merge 1 commit into NVIDIA:main from tensorrt-cicd:repair-bot-bug6322073


2 participants
