fix: optimize InfiniLM paged attention kernels by voltjia · Pull Request #735 · InfiniTensor/InfiniOps

voltjia · 2026-06-22T11:36:51Z

Summary

Port the faster InfiniLM paged-attention decode and prefill kernel paths into src/native/cuda/ops/paged_attention_infinilm/kernel.cuh and src/native/cuda/ops/paged_attention_prefill_infinilm/kernel.cuh.
Use split-KV CTA decode for the common head_dim=128 path, with workspace-backed combine.
Use the pipelined warp-CTA prefill path for head_dim=128 and block_size=256, while keeping existing fallback coverage.
Clean up the migrated internal kernel helper names, template parameters, and parameter lists to follow Google C++ style conventions.
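As a rough illustration of what "split-KV decode with a workspace-backed combine" means numerically, the sketch below splits the KV sequence for one query into chunks, computes a partial softmax-weighted sum per chunk, and merges the partials with a log-sum-exp rescale. It is a pure-Python reference under stated assumptions, not the CUDA kernel from this PR: the function names, the list-based `workspace`, and the `1/sqrt(head_dim)` scale are all hypothetical choices made for the example.

```python
import math

def attend_partial(q, keys, values, scale):
    """Attention over one KV split: returns (max score, exp-sum, unnormalized output)."""
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    acc = [0.0] * len(values[0])
    for w, v in zip(weights, values):
        for d, vd in enumerate(v):
            acc[d] += w * vd
    return m, sum(weights), acc

def combine_partials(partials):
    """Merge per-split partials by rescaling each one to the global max score."""
    m_all = max(m for m, _, _ in partials)
    denom = 0.0
    out = [0.0] * len(partials[0][2])
    for m, exp_sum, acc in partials:
        rescale = math.exp(m - m_all)
        denom += rescale * exp_sum
        for d, a in enumerate(acc):
            out[d] += rescale * a
    return [o / denom for o in out]

def split_kv_decode(q, keys, values, num_splits):
    """Single-query decode attention; assumes a non-empty KV sequence."""
    scale = 1.0 / math.sqrt(len(q))  # assumed softmax scale, for illustration only
    chunk = (len(keys) + num_splits - 1) // num_splits
    # Hypothetical stand-in for the workspace: one partial result per KV split.
    workspace = [
        attend_partial(q, keys[i:i + chunk], values[i:i + chunk], scale)
        for i in range(0, len(keys), chunk)
    ]
    return combine_partials(workspace)
```

With `num_splits=1` the function reduces to ordinary softmax attention, so comparing its output against a larger split count is a quick way to check that the combine step preserves the result up to floating-point rounding.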

Motivation

The InfiniOps integration path for InfiniLM paged attention was significantly slower than the previous InfiniCore implementation because it did not use the old fast paged-attention kernels for common InfiniLM shapes.

Before optimization, the integrated InfiniOps path measured about 5427.75 ms total time for the benchmark below, versus about 3069.50 ms for the no-InfiniOps baseline on the same model/shape. This PR restores the decode split-KV CTA path and the pipelined prefill path, bringing the optimized InfiniOps path back to about 3096.99 ms.

Benchmark setup:

Hardware: NVIDIA GPU2 on the remote NVIDIA test machine.
Model: /data-aisoft/mechdancer/models/9g_8b_thinking.
Command shape: bs=1, input_len=256, output_len=256, bfloat16, paged attention + graph enabled.
Command: python examples/bench.py --device nvidia --model=/data-aisoft/mechdancer/models/9g_8b_thinking --enable-paged-attn --enable-graph --input-len=256 --output-len=256 --batch-size=1 --warmup.

Closes #

Type of Change

feat — new feature / new operator / new platform
fix — bug fix
perf — performance improvement (no behavioral change)
refactor — code restructuring without behavior change
test — adding or fixing tests only
docs — documentation only
build / ci — build system or CI configuration
chore — tooling, formatting, or other non-code changes
Breaking change (requires a ! in the Conventional Commits prefix or a BREAKING CHANGE: footer)

Platforms Affected

Smoke Test Result

The InfiniOps smoke suite (pytest -m smoke) was not run in this workspace. Validation instead went through the InfiniCore/InfiniLM integration path, which exercises these InfiniLM paged-attention kernels on NVIDIA SM80.

# Integrated InfiniCore + InfiniOps NVIDIA SM80 build
infinicore import ok <module 'infinicore' from '/workspace/perf/core-ops-sm80/python/infinicore/__init__.py'>
libinfiniops.so installed under /workspace/perf/root-ops-sm80/lib

# InfiniLM rebuild against the integrated environment
build ok
install ok
imports ok <module 'infinicore' ...> <module 'infinilm' ...>

Test Results on Supported Platforms

Platform	Affected	Build / Smoke Result	Full Result / Notes
NVIDIA	Yes	Integrated SM80 build/import passed; InfiniLM build/import passed.	Full InfiniOps pytest not run.
Iluvatar	No	N/A - not affected	N/A - not affected
MetaX	No	N/A - not affected	N/A - not affected
Cambricon	No	N/A - not affected	N/A - not affected
Moore	No	N/A - not affected	N/A - not affected
Ascend	No	N/A - not affected	N/A - not affected

Full `pytest` output (optional)

N/A - full pytest was not run for this NVIDIA-only InfiniLM paged-attention kernel change.

Benchmark / Performance Impact

Focused benchmark on NVIDIA GPU2 with 9g_8b_thinking, bs=1, input_len=256, output_len=256, bfloat16, paged attention + graph enabled:

Path	Total time	Prefill TTFT	Decode throughput
InfiniOps before optimization	5427.75 ms	45.36 ms	47.38 tok/s
no-InfiniOps baseline	3069.50 ms	30.13 ms	83.93 tok/s
This PR after style cleanup	3096.99 ms	31.83 ms	83.21 tok/s

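For this focused case, the optimized InfiniOps path is within about 1% of the measured no-InfiniOps baseline in total time (3096.99 ms vs 3069.50 ms) and decode throughput (83.21 tok/s vs 83.93 tok/s). Prefill TTFT remains about 5.6% higher (31.83 ms vs 30.13 ms).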
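The relative gaps between the rows of the table above can be recomputed directly from the reported figures. The snippet below is only arithmetic over the numbers in this PR description; the dictionary keys and the `rel_diff_pct` helper are names invented for the example.

```python
# Figures copied from the benchmark table in this PR description.
runs = {
    "before": {"total_ms": 5427.75, "ttft_ms": 45.36, "decode_tok_s": 47.38},
    "baseline": {"total_ms": 3069.50, "ttft_ms": 30.13, "decode_tok_s": 83.93},
    "this_pr": {"total_ms": 3096.99, "ttft_ms": 31.83, "decode_tok_s": 83.21},
}

def rel_diff_pct(value, reference):
    """Signed difference of value relative to reference, in percent."""
    return (value - reference) / reference * 100.0

# Gap between this PR and the no-InfiniOps baseline, per metric.
gaps_vs_baseline = {
    metric: rel_diff_pct(runs["this_pr"][metric], runs["baseline"][metric])
    for metric in runs["baseline"]
}

# End-to-end speedup of this PR over the pre-optimization InfiniOps path.
speedup_vs_before = runs["before"]["total_ms"] / runs["this_pr"]["total_ms"]

for metric, gap in gaps_vs_baseline.items():
    print(f"{metric}: {gap:+.2f}% vs baseline")
print(f"total-time speedup vs before: {speedup_vs_before:.2f}x")
```

Total time and decode throughput land within 1% of the baseline, prefill TTFT is several percent slower, and total time improves by roughly 1.75x over the pre-optimization path.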

Notes for Reviewers

The moved kernels are now contained directly in the InfiniOps kernel.cuh files for decode and prefill; there are no legacy helper files and no cross-tree includes back into InfiniCore.
Please pay particular attention to the dispatch conditions for the fast paths: decode head_dim=128 split-KV CTA and prefill head_dim=128 && block_size=256 pipelined warp-CTA.
The fallback paths remain in place for unsupported shapes.
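To make the dispatch conditions named in these notes concrete, here is a minimal sketch of the selection logic as described: decode takes the split-KV CTA path when head_dim is 128, and prefill takes the pipelined warp-CTA path when head_dim is 128 and block_size is 256. The enum and function names are hypothetical, and the real dispatch in the kernel.cuh files may check further conditions (data type, architecture, and so on) that this description does not spell out.

```python
from enum import Enum

class DecodePath(Enum):
    SPLIT_KV_CTA = "split_kv_cta"
    FALLBACK = "fallback"

class PrefillPath(Enum):
    PIPELINED_WARP_CTA = "pipelined_warp_cta"
    FALLBACK = "fallback"

def select_decode_path(head_dim: int) -> DecodePath:
    """Fast decode path only for head_dim == 128 (condition taken from the PR notes)."""
    if head_dim == 128:
        return DecodePath.SPLIT_KV_CTA
    return DecodePath.FALLBACK

def select_prefill_path(head_dim: int, block_size: int) -> PrefillPath:
    """Fast prefill path only for head_dim == 128 and block_size == 256."""
    if head_dim == 128 and block_size == 256:
        return PrefillPath.PIPELINED_WARP_CTA
    return PrefillPath.FALLBACK
```

A reviewer could use a table like this as a checklist: any shape that fails one of the conditions should be observed taking the fallback path rather than a fast kernel.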

voltjia changed the title ~~fix: optimize infinilm paged attention kernels~~ fix: optimize InfiniLM paged attention kernels Jun 22, 2026

fix: optimize infinilm paged attention kernels

c796425

voltjia force-pushed the fix/infinops-paged-attention-performance branch from 9fd6b5f to c796425 Compare June 22, 2026 12:03

voltjia marked this pull request as ready for review June 22, 2026 12:04

voltjia requested a review from a team June 22, 2026 12:04

voltjia merged commit d7a03c7 into master Jun 22, 2026
18 checks passed

voltjia deleted the fix/infinops-paged-attention-performance branch June 22, 2026 12:14

voltjia merged 1 commit into master from fix/infinops-paged-attention-performance
