
feat(moe): add MoE inference and expert parallel support #444

Open
qinyiqun wants to merge 1 commit into InfiniTensor:main from qinyiqun:moe


@qinyiqun

Contributor

Summary

  • Add a generic MoE layer stack under csrc/layers/moe.
  • Route Qwen3-MoE through the generic SparseMoeBlock, TopKRouter, FusedMoeExperts, and FusedMoE runner.
  • Add MoE EP dispatchers for local_allreduce and allgather_reducescatter.
  • Add a reserved deepep backend interface for future integration.
  • Move the old per-expert MoeMLP into csrc/layers/moe/legacy and keep DeepSeek-V2 on the legacy path.
  • Pass MoE EP config through Python args and model config instead of bench-owned environment variables.
  • Optimize rank-local safetensors loading for EP expert weights.
  • Support Qwen3/Qwen3Next GQA cases where num_key_value_heads < tp_size.
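
The last bullet covers grouped-query attention when there are fewer KV heads than tensor-parallel ranks. The description does not say how this PR handles the case. A common approach in inference engines is to replicate each KV head across several ranks, and the sketch below illustrates only that approach. `kv_heads_for_rank` is a hypothetical helper, not a function from this PR, and the divisibility requirements are assumptions.

```python
def kv_heads_for_rank(rank: int, tp_size: int, num_kv_heads: int) -> list[int]:
    """Return the KV head indices one TP rank would hold (hypothetical helper).

    Assumption: when num_kv_heads < tp_size, each KV head is replicated on
    tp_size // num_kv_heads consecutive ranks. Otherwise the heads are split
    evenly across ranks.
    """
    if not 0 <= rank < tp_size:
        raise ValueError("rank must be in [0, tp_size)")

    if num_kv_heads >= tp_size:
        # Regular case: every rank owns a contiguous slice of KV heads.
        if num_kv_heads % tp_size != 0:
            raise ValueError("num_kv_heads must be divisible by tp_size")
        per_rank = num_kv_heads // tp_size
        return list(range(rank * per_rank, (rank + 1) * per_rank))

    # Replicated case: several consecutive ranks share the same KV head.
    if tp_size % num_kv_heads != 0:
        raise ValueError("tp_size must be divisible by num_kv_heads")
    replicas = tp_size // num_kv_heads
    return [rank // replicas]
```

With 4 KV heads and `tp_size=8`, for example, ranks 0 and 1 would both hold head 0, ranks 2 and 3 head 1, and so on.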

Motivation

Closes #

InfiniLM needs a reusable MoE inference path that can support Qwen3-MoE models and provide a clear abstraction boundary for future high-performance EP backends such as DeepEP.

The current implementation focuses on correctness and data-flow alignment first:

  • TP-only MoE works through the standard dispatcher.
  • With DP=1, EP uses local_allreduce, which is currently the preferred path.
  • allgather_reducescatter is available as a correctness-oriented backend.
  • DeepEP is explicitly reserved but not implemented in this PR.
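
The data flow behind `local_allreduce` is not spelled out in this description. One plausible reading, inferred from the name and the DP=1 restriction, is that every rank sees the same tokens, runs only the experts it owns, and an all-reduce sums the partial outputs. The toy simulation below sketches that reading in pure Python. All function names are hypothetical, the experts are scalar stand-ins for MLPs, and the router normalizes weights over the selected experts, which is an assumption.

```python
import math

def topk_route(logits, k):
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    top = sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)[:k]
    m = max(logits[e] for e in top)
    exps = [math.exp(logits[e] - m) for e in top]
    total = sum(exps)
    return [(e, w / total) for e, w in zip(top, exps)]

def expert_forward(expert_id, x):
    """Toy expert that scales a scalar hidden state; stands in for an MLP."""
    return (expert_id + 1) * x

def rank_partial(tokens, routes, local_experts):
    """One rank's output: all tokens are visible, only owned experts run."""
    return [
        sum(w * expert_forward(e, x) for e, w in route if e in local_experts)
        for x, route in zip(tokens, routes)
    ]

def all_reduce_sum(partials):
    """Stand-in for a collective all-reduce: elementwise sum across ranks."""
    return [sum(vals) for vals in zip(*partials)]

def moe_local_allreduce(tokens, router_logits, num_experts, ep_size, k):
    """Simulate EP where each rank owns a contiguous block of experts."""
    routes = [topk_route(row, k) for row in router_logits]
    per_rank = num_experts // ep_size
    partials = []
    for rank in range(ep_size):
        local = set(range(rank * per_rank, (rank + 1) * per_rank))
        partials.append(rank_partial(tokens, routes, local))
    return all_reduce_sum(partials)

def moe_reference(tokens, router_logits, k):
    """Single-device reference: every selected expert runs locally."""
    return [
        sum(w * expert_forward(e, x) for e, w in topk_route(row, k))
        for x, row in zip(tokens, router_logits)
    ]
```

Under these assumptions the summed partial outputs match the single-device reference up to floating-point rounding. This is the kind of equivalence a correctness-oriented backend such as allgather_reducescatter would also be expected to satisfy.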

Type of Change

  • feat — new feature / new model
  • refactor — code restructuring without behavior change
  • perf — performance improvement (no behavioral change)
  • fix — bug fix
  • test — adding or fixing tests only
  • docs — documentation only
  • build / ci — build system or CI configuration
  • chore — tooling, formatting, or other non-code changes
  • Breaking change

Test Results of Involved Models on Supported Platforms (Please attach screenshots)

Please attach screenshots for the final tested commands.

Suggested coverage:

  • Qwen3-30B-A3B, TP=1, EP disabled
  • Qwen3-30B-A3B, TP=2, EP=2, local_allreduce
  • Qwen3-30B-A3B, TP=2, EP=2, allgather_reducescatter
  • Qwen3-235B-A22B, TP=8, EP=8, local_allreduce
  • Qwen3-8B-base non-MoE regression, TP=2, graph enabled
  • DeepSeek-V2-Lite loading/regression for legacy MoE path if applicable

Benchmark / Performance Impact

Initial measured examples on A100:

  • Qwen3-30B-A3B, TP=2/EP=2, local_allreduce, graph enabled:
    • Prefill and decode are functional.
    • Decode performance is currently limited by MoE communication and by the quality of the interim fused MoE kernel.
  • Qwen3-235B-A22B, TP=8/EP=8, local_allreduce, graph enabled:
    • Model loading and decode are functional.
    • Nsys profiling shows that decode is dominated by communication, especially on allreduce-heavy paths.

This PR does not claim final high-performance MoE EP parity with vLLM/SGLang. It establishes the correct abstraction and execution path for later DeepEP/fused MoE work.

Notes for Reviewers

  • local_allreduce is the recommended current EP backend for DP=1.
  • allgather_reducescatter is correctness-oriented and expected to be slower.
  • deepep is intentionally a placeholder interface.
  • The prepare_moe_input-style CUTLASS grouped GEMM flow is not used by the current InfiniLM MoE runner.
  • DeepSeek-V2 remains on layers/moe/legacy and is not migrated to the new fused Qwen3-MoE path.
  • Non-MoE models should show MoE EP backend: disabled.
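
The backend rules in these notes can be summarized as a small selection function. The sketch below is illustrative only: `MoeEpBackend` and `resolve_moe_ep_backend` are hypothetical names, not the PR's actual config API, and treating `ep_size <= 1` as disabled is an assumption.

```python
from enum import Enum

class MoeEpBackend(Enum):
    """Hypothetical enum mirroring the backend strings named in this PR."""
    DISABLED = "disabled"
    LOCAL_ALLREDUCE = "local_allreduce"
    ALLGATHER_REDUCESCATTER = "allgather_reducescatter"
    DEEPEP = "deepep"

def resolve_moe_ep_backend(requested: str, is_moe_model: bool, ep_size: int) -> MoeEpBackend:
    """Map a requested backend string to the backend that would actually run."""
    # Non-MoE models always report the backend as disabled.
    if not is_moe_model or ep_size <= 1:  # ep_size rule is an assumption
        return MoeEpBackend.DISABLED

    backend = MoeEpBackend(requested)  # raises ValueError on unknown names
    if backend is MoeEpBackend.DEEPEP:
        # deepep is a reserved placeholder interface in this PR.
        raise NotImplementedError("deepep backend is reserved but not implemented")
    return backend
```

Failing loudly on `deepep` rather than silently falling back keeps the placeholder from being mistaken for a working backend.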

Checklist

Title, Branch, and Commits

  • PR title follows Conventional Commits.
  • Branch name follows <type>/xxx-yyyy-zzzz.
  • Each commit message follows Conventional Commits.
  • Small PR is a single squashable commit; or, for a large PR, every commit is meaningful, well-formed, and independently reviewable.
  • No stray merge commits from main.
  • No fixup! / squash! / wip commits remain.
  • Existing PR/branch/commit that followed the legacy issue format.

Scope and Design

  • Changes are scoped to MoE inference, EP config/loading, and required model compatibility.
  • No debug prints or temporary MoE logs are left behind.
  • Public API changes are intentional and reflected in Python/C++ callers.

C++ Specific

  • Changed files are formatted by scripts/format.py.
  • Project builds cleanly on NVIDIA.

Python Specific

  • Changed files are formatted by scripts/format.py.

Testing

  • Passed single request test, or reason for skipping is documented.
  • Passed offline performance test, or reason for skipping is documented.
  • Passed sanity test, or reason for skipping is documented.
  • Passed service test, or reason for skipping is documented.

@qinyiqun qinyiqun requested a review from a team June 18, 2026 02:17
- add reusable MoE router, dispatcher, runner, and expert abstractions
- enable Qwen3 MoE fused inference with TP-local expert parallel routing
- add graph-safe MoE workspace handling and EP backend selection through engine config
- preserve legacy MoE path for existing DeepSeek V2 code