feat(moe): add MoE inference and expert parallel support by qinyiqun · Pull Request #444 · InfiniTensor/InfiniLM

qinyiqun · 2026-06-18T02:17:45Z

Summary

Add a generic MoE layer stack under csrc/layers/moe.
Route Qwen3-MoE through the generic SparseMoeBlock, TopKRouter, FusedMoeExperts, and FusedMoE runner.
Add MoE EP dispatchers for local_allreduce and allgather_reducescatter.
Add a reserved deepep backend interface for future integration.
Move the old per-expert MoeMLP into csrc/layers/moe/legacy and keep DeepSeek-V2 on the legacy path.
Pass MoE EP config through Python args and model config instead of bench-owned environment variables.
Optimize rank-local safetensors loading for EP expert weights.
Support Qwen3/Qwen3Next GQA cases where num_key_value_heads < tp_size.
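
For reviewers unfamiliar with rank-local loading, here is a minimal sketch of the idea: decide which experts an EP rank owns, then skip checkpoint keys for experts it does not own. It is an illustration under stated assumptions, not the loader in this PR. The key pattern is modeled on Hugging Face style Qwen3-MoE checkpoints, the contiguous even split is assumed, and `local_expert_range` and `keys_for_rank` are hypothetical names. Because safetensors allows reading tensors one key at a time, filtering keys up front avoids reading expert weights that a rank would discard.

```python
import re

# Assumed key layout ("...mlp.experts.<id>...."); the real checkpoint
# naming used by InfiniLM may differ.
EXPERT_KEY = re.compile(r"\.mlp\.experts\.(\d+)\.")


def local_expert_range(num_experts: int, ep_size: int, ep_rank: int) -> range:
    """Contiguous block of expert ids owned by one EP rank (even split assumed)."""
    if num_experts % ep_size != 0:
        raise ValueError("num_experts must be divisible by ep_size")
    per_rank = num_experts // ep_size
    return range(ep_rank * per_rank, (ep_rank + 1) * per_rank)


def keys_for_rank(keys, num_experts, ep_size, ep_rank):
    """Keep non-expert tensors plus the experts owned by this rank.

    TP sharding of the kept tensors is out of scope for this sketch.
    """
    owned = local_expert_range(num_experts, ep_size, ep_rank)
    kept = []
    for key in keys:
        match = EXPERT_KEY.search(key)
        if match is None or int(match.group(1)) in owned:
            kept.append(key)
    return kept


all_keys = [
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.0.mlp.gate.weight",
] + [f"model.layers.0.mlp.experts.{e}.up_proj.weight" for e in range(4)]

# With 4 experts and EP=2, rank 1 keeps experts 2 and 3 only.
rank1_keys = keys_for_rank(all_keys, num_experts=4, ep_size=2, ep_rank=1)
```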

Motivation

Closes #

InfiniLM needs a reusable MoE inference path that can support Qwen3-MoE models and provide a clear abstraction boundary for future high-performance EP backends such as DeepEP.

The current implementation focuses on correctness and data-flow alignment first:

TP-only MoE works through the standard dispatcher.
DP=1 EP uses local_allreduce as the preferred current path.
allgather_reducescatter is available as a correctness-oriented backend.
DeepEP is explicitly reserved but not implemented in this PR.
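
One way to read the local_allreduce data flow, assuming activations are replicated on every EP rank (DP=1): each rank evaluates only the experts it owns for all tokens, and a sum allreduce combines the partial outputs. The sketch below simulates this in plain Python and checks it against a single-device reference. The stand-in expert, the routing table, and all helper names are illustrative and do not come from this PR.

```python
def expert_forward(expert_id, x):
    # Stand-in expert: scales the input. A real expert is a gated MLP.
    return [(expert_id + 1) * v for v in x]


def moe_reference(tokens, routing):
    """Single-device MoE: weighted sum of the selected experts per token."""
    out = []
    for x, choices in zip(tokens, routing):
        y = [0.0] * len(x)
        for expert_id, weight in choices:
            y = [a + weight * b for a, b in zip(y, expert_forward(expert_id, x))]
        out.append(y)
    return out


def local_partial(tokens, routing, owned):
    """One rank's contribution: only experts in `owned` are evaluated."""
    filtered = [[(e, w) for e, w in choices if e in owned] for choices in routing]
    return moe_reference(tokens, filtered)


def allreduce_sum(partials):
    """Simulated allreduce: elementwise sum over ranks."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]


tokens = [[1.0, 2.0], [4.0, -2.0]]
# Per token: (expert_id, routing_weight) pairs, as a top-2 router might emit.
routing = [[(0, 0.25), (3, 0.75)], [(1, 0.5), (2, 0.5)]]
owned_by_rank = [{0, 1}, {2, 3}]  # EP=2, two experts per rank

partials = [local_partial(tokens, routing, owned) for owned in owned_by_rank]
combined = allreduce_sum(partials)
reference = moe_reference(tokens, routing)
```

The allreduce moves a full hidden-state tensor per MoE layer regardless of how many tokens a rank actually served, which is consistent with the communication-bound decode noted in this PR.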

Type of Change

feat — new feature / new model
refactor — code restructuring without behavior change
perf — performance improvement (no behavioral change)
fix — bug fix
test — adding or fixing tests only
docs — documentation only
build / ci — build system or CI configuration
chore — tooling, formatting, or other non-code changes
Breaking change

Test Results of Involved Models on Supported Platforms (Please attach screenshots)

Please attach screenshots for the final tested commands.

Suggested coverage:

Qwen3-30B-A3B, TP=1, EP disabled
Qwen3-30B-A3B, TP=2, EP=2, local_allreduce
Qwen3-30B-A3B, TP=2, EP=2, allgather_reducescatter
Qwen3-235B-A22B, TP=8, EP=8, local_allreduce
Qwen3-8B-base non-MoE regression, TP=2, graph enabled
DeepSeek-V2-Lite loading/regression for legacy MoE path if applicable

Benchmark / Performance Impact

Initial measured examples on A100:

Qwen3-30B-A3B, TP=2/EP=2, local_allreduce, graph enabled:
- Prefill and decode are functional.
- Decode performance is currently limited by MoE communication and by the quality of the temporary fused MoE kernel.
Qwen3-235B-A22B, TP=8/EP=8, local_allreduce, graph enabled:
- Model loading and decode are functional.
- An nsys profile shows that decode is dominated by communication, especially on allreduce-heavy paths.

This PR does not claim final high-performance MoE EP parity with vLLM/SGLang. It establishes the correct abstraction and execution path for later DeepEP/fused MoE work.

Notes for Reviewers

local_allreduce is the recommended current EP backend for DP=1.
allgather_reducescatter is correctness-oriented and expected to be slower.
deepep is intentionally a placeholder interface.
The prepare_moe_input-style CUTLASS grouped GEMM flow is not used by the current InfiniLM MoE runner.
DeepSeek-V2 remains on layers/moe/legacy and is not migrated to the new fused Qwen3-MoE path.
Non-MoE models should show MoE EP backend: disabled.
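
For comparison with local_allreduce, the following is a rough sketch of an allgather plus reduce-scatter flow, assuming each rank starts with its own slice of the tokens. Whether the backend in this PR shards tokens this way is an assumption; the helpers and the stand-in expert are hypothetical. Tokens are gathered so every rank sees all of them, each rank evaluates only its own experts, and the reduce-scatter sums the partial outputs and hands each rank back its slice.

```python
def expert_forward(expert_id, x):
    # Stand-in expert: scales the input. A real expert is a gated MLP.
    return [(expert_id + 1) * v for v in x]


def partial_moe(tokens, routing, owned):
    """Weighted expert outputs, restricted to the experts in `owned`."""
    out = []
    for x, choices in zip(tokens, routing):
        y = [0.0] * len(x)
        for e, w in choices:
            if e in owned:
                y = [a + w * b for a, b in zip(y, expert_forward(e, x))]
        out.append(y)
    return out


def all_gather(shards):
    """Simulated allgather: concatenate every rank's shard in rank order."""
    return [item for shard in shards for item in shard]


def reduce_scatter(partials, shard_sizes):
    """Simulated reduce-scatter: sum over ranks, then split back into shards."""
    summed = [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]
    shards, start = [], 0
    for size in shard_sizes:
        shards.append(summed[start:start + size])
        start += size
    return shards


token_shards = [[[1.0, 2.0]], [[4.0, -2.0]]]  # one token per rank
routing_shards = [[[(0, 0.25), (3, 0.75)]], [[(1, 0.5), (2, 0.5)]]]
owned_by_rank = [{0, 1}, {2, 3}]

tokens = all_gather(token_shards)
routing = all_gather(routing_shards)
partials = [partial_moe(tokens, routing, owned) for owned in owned_by_rank]
output_shards = reduce_scatter(partials, [len(s) for s in token_shards])
```

Every rank processes the full gathered batch here, which is one reason such a path is simple to validate but can be slower than a dispatch that sends each token only where it is needed.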

Checklist

Title, Branch, and Commits

PR title follows Conventional Commits.
Branch name follows <type>/xxx-yyyy-zzzz.
Each commit message follows Conventional Commits.
Small PR is a single squashable commit; or, for a large PR, every commit is meaningful, well-formed, and independently reviewable.
No stray merge commits from main.
No fixup! / squash! / wip commits remain.
Existing PR/branch/commit that followed the legacy issue format.

Scope and Design

Changes are scoped to MoE inference, EP config/loading, and required model compatibility.
No debug prints or temporary MoE logs are left behind.
Public API changes are intentional and reflected in Python/C++ callers.

C++ Specific

Changed files are formatted by scripts/format.py.
Project builds cleanly on NVIDIA.

Python Specific

Changed files are formatted by scripts/format.py.

Testing

Passed single request test, or reason for skipping is documented.
Passed offline performance test, or reason for skipping is documented.
Passed sanity test, or reason for skipping is documented.
Passed service test, or reason for skipping is documented.

- add reusable MoE router, dispatcher, runner, and expert abstractions
- enable Qwen3 MoE fused inference with TP-local expert parallel routing
- add graph-safe MoE workspace handling and EP backend selection through engine config
- preserve legacy MoE path for existing DeepSeek V2 code
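
As background for the router abstraction named in the commit message, a top-k router typically applies a softmax over the gate logits, keeps the k largest probabilities, and optionally renormalizes them to sum to 1. The sketch below shows that generic technique in plain Python. It is not the TopKRouter code from this PR; `topk_route` and its `renormalize` flag are hypothetical names.

```python
import math


def topk_route(logits, top_k, renormalize=True):
    """Return (expert_id, weight) pairs for the top_k experts of one token."""
    # Numerically stable softmax over the gate logits.
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    probs = [v / total for v in exps]

    # Indices of the top_k largest probabilities, highest first.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    weights = [probs[i] for i in order]

    if renormalize:
        # Rescale so the selected weights sum to 1.
        kept = sum(weights)
        weights = [w / kept for w in weights]
    return list(zip(order, weights))


# One token, four experts, top-2 routing.
choices = topk_route([2.0, 0.5, 1.0, -1.0], top_k=2)
expert_ids = [e for e, _ in choices]
weights = [w for _, w in choices]
```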

qinyiqun requested a review from a team June 18, 2026 02:17

qinyiqun force-pushed the moe branch from adb5ae9 to 4b3058a Compare June 18, 2026 09:30

qinyiqun wants to merge 1 commit into InfiniTensor:main from qinyiqun:moe