fix: optimize InfiniLM paged attention kernels #735

Merged
voltjia merged 1 commit into master from fix/infinops-paged-attention-performance on Jun 22, 2026

@voltjia voltjia commented Jun 22, 2026

Collaborator

Summary

  • Port the faster InfiniLM paged-attention decode and prefill kernel paths into src/native/cuda/ops/paged_attention_infinilm/kernel.cuh and src/native/cuda/ops/paged_attention_prefill_infinilm/kernel.cuh.
  • Use split-KV CTA decode for the common head_dim=128 path, with workspace-backed combine.
  • Use the pipelined warp-CTA prefill path for head_dim=128 and block_size=256, while keeping existing fallback coverage.
  • Clean up the migrated internal kernel helper names, template parameters, and parameter lists to follow Google C++ style conventions.
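
The split-KV decode with a combine step described above can be illustrated with a small host-side reference. This is a minimal sketch in plain Python, not the CUDA kernel from this PR: the function names (`partial_attention`, `combine_partials`, `split_kv_attention`) are hypothetical, and the list of partials merely stands in for whatever layout the real workspace buffer uses. It shows the usual idea behind such a combine: each KV chunk produces a running max, an exponent sum, and an unnormalized output, and these are rescaled to a common max before the final normalization.

```python
import math


def partial_attention(q, keys, values):
    """Attend one query over one KV chunk.

    Returns (max_score, exp_sum, unnormalized_output) for that chunk.
    """
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    out = [
        sum(w * v[d] for w, v in zip(weights, values))
        for d in range(len(values[0]))
    ]
    return m, sum(weights), out


def combine_partials(partials):
    """Merge per-chunk partials into the final softmax-weighted output."""
    global_max = max(m for m, _, _ in partials)
    total = 0.0
    acc = [0.0] * len(partials[0][2])
    for m, s, out in partials:
        # Rescale each chunk from its local max to the global max.
        factor = math.exp(m - global_max)
        total += factor * s
        for d, o in enumerate(out):
            acc[d] += factor * o
    return [a / total for a in acc]


def split_kv_attention(q, keys, values, num_splits):
    """Split the KV sequence into chunks, attend per chunk, then combine."""
    chunk = math.ceil(len(keys) / num_splits)
    partials = [
        partial_attention(q, keys[i:i + chunk], values[i:i + chunk])
        for i in range(0, len(keys), chunk)
    ]
    return combine_partials(partials)
```

Combining a single partial reduces to ordinary softmax attention, so the split result should match the unsplit one up to floating-point rounding. On the GPU, the per-chunk partials would typically be written by separate CTAs and merged in a second pass, which is why a workspace is needed.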

Motivation

The InfiniOps integration path for InfiniLM paged attention was significantly slower than the previous InfiniCore implementation because it did not use the old fast paged-attention kernels for common InfiniLM shapes.

Before optimization, the integrated InfiniOps path measured about 5427.75 ms total time for the benchmark below, versus about 3069.50 ms for the no-InfiniOps baseline on the same model/shape. This PR restores the decode split-KV CTA path and the pipelined prefill path, bringing the optimized InfiniOps path back to about 3096.99 ms.

Benchmark setup:

  • Hardware: NVIDIA GPU2 on the remote NVIDIA test machine.
  • Model: /data-aisoft/mechdancer/models/9g_8b_thinking.
  • Command shape: bs=1, input_len=256, output_len=256, bfloat16, paged attention + graph enabled.
  • Command: python examples/bench.py --device nvidia --model=/data-aisoft/mechdancer/models/9g_8b_thinking --enable-paged-attn --enable-graph --input-len=256 --output-len=256 --batch-size=1 --warmup.

Closes #

Type of Change

  • feat — new feature / new operator / new platform
  • fix — bug fix
  • perf — performance improvement (no behavioral change)
  • refactor — code restructuring without behavior change
  • test — adding or fixing tests only
  • docs — documentation only
  • build / ci — build system or CI configuration
  • chore — tooling, formatting, or other non-code changes
  • Breaking change (requires a ! in the Conventional Commits prefix or a BREAKING CHANGE: footer)

Platforms Affected

  • CPU (WITH_CPU)
  • NVIDIA (WITH_NVIDIA)
  • Iluvatar (WITH_ILUVATAR)
  • MetaX (WITH_METAX)
  • Cambricon (WITH_CAMBRICON)
  • Moore (WITH_MOORE)
  • Ascend (WITH_ASCEND)
  • PyTorch C++ bindings (WITH_TORCH)
  • Build system / CMake / CI
  • Python bindings / user-facing API

Smoke Test Result

The exact InfiniOps `pytest -m smoke` suite was not run in this workspace. Validation was instead done through the InfiniCore/InfiniLM integration path, which exercises these InfiniLM paged-attention kernels on NVIDIA SM80.

# Integrated InfiniCore + InfiniOps NVIDIA SM80 build
infinicore import ok <module 'infinicore' from '/workspace/perf/core-ops-sm80/python/infinicore/__init__.py'>
libinfiniops.so installed under /workspace/perf/root-ops-sm80/lib

# InfiniLM rebuild against the integrated environment
build ok
install ok
imports ok <module 'infinicore' ...> <module 'infinilm' ...>

Test Results on Supported Platforms

| Platform | Affected | Build / Smoke Result | Full Result / Notes |
| --- | --- | --- | --- |
| NVIDIA | Yes | Integrated SM80 build/import passed; InfiniLM build/import passed. | Full InfiniOps pytest not run. |
| Iluvatar | No | N/A - not affected | N/A - not affected |
| MetaX | No | N/A - not affected | N/A - not affected |
| Cambricon | No | N/A - not affected | N/A - not affected |
| Moore | No | N/A - not affected | N/A - not affected |
| Ascend | No | N/A - not affected | N/A - not affected |

Full `pytest` output (optional)
N/A - full pytest was not run for this NVIDIA-only InfiniLM paged-attention kernel change.

Benchmark / Performance Impact

Focused benchmark on NVIDIA GPU2 with 9g_8b_thinking, bs=1, input_len=256, output_len=256, bfloat16, paged attention + graph enabled:

| Path | total_time | Prefill TTFT | Decode throughput |
| --- | --- | --- | --- |
| InfiniOps before optimization | 5427.75 ms | 45.36 ms | 47.38 tok/s |
| no-InfiniOps baseline | 3069.50 ms | 30.13 ms | 83.93 tok/s |
| This PR after style cleanup | 3096.99 ms | 31.83 ms | 83.21 tok/s |

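For this focused case, the optimized InfiniOps path is within about 1% of the measured no-InfiniOps baseline in total time and decode throughput. Prefill TTFT remains about 6% higher than the baseline (31.83 ms vs 30.13 ms).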
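
The relative gaps can be recomputed from the reported numbers. The snippet below is a small arithmetic sketch in Python; the dictionary keys and the `rel_diff_percent` helper are illustrative names, and the values are copied from the benchmark figures in this section. It gives roughly a 0.9% gap in total time, a 5.6% gap in prefill TTFT, and about a 1.75x speedup over the pre-optimization InfiniOps path.

```python
# Benchmark figures as reported in this PR (bs=1, input_len=256, output_len=256).
results = {
    "before": {"total_ms": 5427.75, "ttft_ms": 45.36, "decode_tok_s": 47.38},
    "baseline": {"total_ms": 3069.50, "ttft_ms": 30.13, "decode_tok_s": 83.93},
    "this_pr": {"total_ms": 3096.99, "ttft_ms": 31.83, "decode_tok_s": 83.21},
}


def rel_diff_percent(value, reference):
    """Signed difference of value relative to reference, in percent."""
    return (value - reference) / reference * 100.0


base = results["baseline"]
pr = results["this_pr"]

total_gap = rel_diff_percent(pr["total_ms"], base["total_ms"])
ttft_gap = rel_diff_percent(pr["ttft_ms"], base["ttft_ms"])
decode_gap = rel_diff_percent(pr["decode_tok_s"], base["decode_tok_s"])
speedup_vs_before = results["before"]["total_ms"] / pr["total_ms"]

print(f"total_time gap vs baseline: {total_gap:.2f}%")
print(f"prefill TTFT gap vs baseline: {ttft_gap:.2f}%")
print(f"decode throughput gap vs baseline: {decode_gap:.2f}%")
print(f"speedup vs pre-optimization InfiniOps: {speedup_vs_before:.2f}x")
```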

Notes for Reviewers

  • The moved kernels are now contained directly in the InfiniOps kernel.cuh files for decode and prefill; there are no legacy helper files and no cross-tree includes back into InfiniCore.
  • Please pay particular attention to the dispatch conditions for the fast paths: decode head_dim=128 split-KV CTA and prefill head_dim=128 && block_size=256 pipelined warp-CTA.
  • The fallback paths remain in place for unsupported shapes.
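
For orientation, the dispatch conditions named above can be written as simple predicates. This is a hypothetical sketch in Python, not the dispatch code in kernel.cuh: the function names and the returned path labels are made up for illustration, and the real conditions may also depend on dtype, architecture, or other parameters not stated in this description.

```python
def select_decode_path(head_dim: int) -> str:
    """Pick the decode kernel path (illustrative labels only)."""
    # Assumption: the fast path is keyed on head_dim alone, as described above.
    if head_dim == 128:
        return "split_kv_cta"
    return "fallback"


def select_prefill_path(head_dim: int, block_size: int) -> str:
    """Pick the prefill kernel path (illustrative labels only)."""
    # Assumption: both conditions must hold for the pipelined warp-CTA path.
    if head_dim == 128 and block_size == 256:
        return "pipelined_warp_cta"
    return "fallback"
```

Under these assumptions, a shape such as head_dim=128 with block_size=16 would take the fast decode path but the fallback prefill path, which is the kind of mixed case worth checking during review.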

voltjia changed the title from "fix: optimize infinilm paged attention kernels" to "fix: optimize InfiniLM paged attention kernels" on Jun 22, 2026
voltjia force-pushed the fix/infinops-paged-attention-performance branch from 9fd6b5f to c796425 on June 22, 2026 12:03
voltjia marked this pull request as ready for review on June 22, 2026 12:04
voltjia requested a review from a team on June 22, 2026 12:04
voltjia merged commit d7a03c7 into master on Jun 22, 2026
18 checks passed
voltjia deleted the fix/infinops-paged-attention-performance branch on June 22, 2026 12:14