[https://nvbugs/6322073][fix] Add _needs_x86_nccl_pp_workaround() and init_nccl_pp_workaround() helpers…#15480
…ank on x86_64+NCCL<2.30.4

In NCCL <2.30.4, NVLS/MNNVL static-connection setup deadlocks during ncclCommInitRank on x86_64 hosts (B300 SM103 here) when TRT-LLM forces NCCL_RUNTIME_CONNECT=0 in cpp/tensorrt_llm/runtime/ncclCommunicator.cpp. The disagg test test_disaggregated_ctxpp2_genpp2 hangs because the ctx and gen workers each try to create a PP-2 NCCL communicator and stall before "Init COMPLETE", so the disagg server never reaches readiness.

Generalize the existing GB10 (DGX Spark) workaround so that it also fires on x86_64 hosts running NCCL <2.30.4, and disable BOTH NCCL_MNNVL_ENABLE and NCCL_NVLS_ENABLE via os.environ.setdefault from init_pp_comm, before NcclCommunicatorOp triggers ncclCommInitRank.

This mirrors the platform-aware pattern landed by 448e5e2577 for the AllReduce path (_needs_nccl_symmetric_workaround) and the NCCL fix shipped in 2.30.4. On GB200/aarch64 with NCCL >=2.30.4, behavior is unchanged (the env vars are not touched).

Signed-off-by: tensorrt-cicd <90828364+tensorrt-cicd@users.noreply.github.com>
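The gating described in the commit message can be sketched roughly as follows. This is not the code from the PR: the function signatures, the explicit `machine`, `nccl_version` and `env` parameters, and the `_FIXED_NCCL_VERSION` constant are assumptions made so the example is self-contained, and the existing GB10 branch is not modelled. How the real helpers obtain the NCCL version is not shown in this thread.

```python
import os
import platform

# Assumed constant: the first NCCL release that ships the fix, per the commit message.
_FIXED_NCCL_VERSION = (2, 30, 4)


def needs_x86_nccl_pp_workaround(machine, nccl_version):
    """Return True on x86_64 hosts whose NCCL predates 2.30.4 (sketch only)."""
    return machine == "x86_64" and tuple(nccl_version) < _FIXED_NCCL_VERSION


def init_nccl_pp_workaround(machine, nccl_version, env=None):
    """Disable MNNVL and NVLS unless the user already set them.

    `env` defaults to os.environ; passing a plain dict makes the
    behaviour easy to check without touching the real environment.
    """
    env = os.environ if env is None else env
    machine = platform.machine() if machine is None else machine
    if not needs_x86_nccl_pp_workaround(machine, nccl_version):
        return False  # e.g. aarch64, or NCCL >= 2.30.4: leave env untouched
    # setdefault only writes when the key is absent, so user overrides win.
    env.setdefault("NCCL_MNNVL_ENABLE", "0")
    env.setdefault("NCCL_NVLS_ENABLE", "0")
    return True
```

Comparing version tuples rather than strings keeps 2.30.4 ordered after 2.9.x, which a plain string comparison would get wrong.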
No actionable comments were generated in the recent review.
Walkthrough
Adds …
Changes: NCCL PP NVLS/MNNVL deadlock workaround
Estimated code review effort: 2 (Simple), ~10 minutes
Pre-merge checks: 5 passed
Summary
Add _needs_x86_nccl_pp_workaround() and init_nccl_pp_workaround() helpers in torch_custom_ops.py; call the latter from init_pp_comm before constructing PPCommNCCL.
Honors user-set env via os.environ.setdefault.
Diff limited to exactly two .py files (avoiding the prior rejection's binary scope creep).
Test plan
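A quick check of the ordering and env precedence described in the summary might look like the sketch below. Everything here is a stand-in: `FakePPComm` is a hypothetical substitute for PPCommNCCL that only records the variables it sees at construction, `apply_workaround` omits the platform and version gating, and `init_pp_comm_sketch` is not the real init_pp_comm.

```python
import os

_VARS = ("NCCL_MNNVL_ENABLE", "NCCL_NVLS_ENABLE")


def apply_workaround(env):
    # Stand-in for init_nccl_pp_workaround(); gating is omitted here.
    for name in _VARS:
        env.setdefault(name, "0")


class FakePPComm:
    """Hypothetical stand-in for PPCommNCCL: snapshots env at construction."""

    def __init__(self, env):
        self.seen = {name: env.get(name) for name in _VARS}


def init_pp_comm_sketch(env=None):
    env = os.environ if env is None else env
    # The workaround must run first: NCCL reads these variables when the
    # communicator is initialised, so setting them afterwards is too late.
    apply_workaround(env)
    return FakePPComm(env)
```

With an empty environment the communicator sees both variables set to "0"; if the user exported NCCL_NVLS_ENABLE=1 beforehand, that value is preserved and only NCCL_MNNVL_ENABLE is defaulted.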
Links