aarch64: add SVE-accelerated memcpy and memzero #1300
Conversation
Introduce Vector Length Agnostic (VLA) SVE implementations for _mi_memzero and _mi_memcpy on AArch64 systems.

Key features:
- Runtime SVE feature detection using AT_HWCAP
- Compiler capability gating for older GCC/Clang toolchains
- Vector-length-aware activation gating
- Predicated VLA loops with unrolling for reduced branch overhead
- Safe fallback to existing scalar/Neon implementations

Performance improvements observed on Neoverse-V1 (Graviton3) for large allocations include:
- Reduced retired instructions
- Significant branch-miss reduction
- Improved memcpy/memzero throughput

Signed-off-by: maajidkhann <maajidkhan.n@fujitsu.com>
@microsoft-github-policy-service agree company="Fujitsu Research of India Private Ltd"
Hi @daanx, could you help review this PR?
Hi @maajidkhann, thank you for the PR. I am quite hesitant though to apply this. I am trying to keep mimalloc simple and I'd like to defer to the provided
Thanks for the review, Daan. That's a fair concern. My motivation was that mimalloc currently provides architecture-specific fast paths already (e.g. the x64 rep movsb optimization), and I viewed the SVE implementation as an analogous optimization for modern AArch64 systems. That said, I understand the preference to keep mimalloc lightweight and defer to the platform libc implementations where possible. I'll look into whether this work would be more appropriate upstream in glibc or other ARM-focused runtime libraries. Thanks again for taking the time to review it.
Summary
This PR introduces high-performance Vector Length Agnostic (VLA) primitives for memzero and memcpy on AArch64 using the Scalable Vector Extension (SVE).
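For readers unfamiliar with the style, here is a minimal sketch of what a predicated, vector-length-agnostic loop can look like with the Arm ACLE SVE intrinsics. It is not the code from this PR: the names `sketch_memzero` and `sketch_memcpy` are hypothetical, the unroll factor of two is an arbitrary choice, and for simplicity the SVE path is compiled only when `__ARM_FEATURE_SVE` is defined (for example when building with `-march=armv8-a+sve`), rather than through per-function target attributes. On any other build the functions simply call `memset`/`memcpy`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumption: SVE is enabled for the whole translation unit. */
#if defined(__aarch64__) && defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>
#define SKETCH_HAVE_SVE 1
#else
#define SKETCH_HAVE_SVE 0
#endif

/* Hypothetical name; zeroes n bytes starting at dst. */
static void sketch_memzero(void* dst, size_t n) {
  assert(dst != NULL || n == 0);
#if SKETCH_HAVE_SVE
  uint8_t* d = (uint8_t*)dst;
  const uint64_t vl = svcntb();          /* bytes per vector, known only at run time */
  const svbool_t all = svptrue_b8();     /* predicate with every byte lane active */
  const svuint8_t zero = svdup_n_u8(0);
  uint64_t i = 0;
  /* Main loop: two full vectors per iteration to reduce branch overhead. */
  for (; i + 2 * vl <= n; i += 2 * vl) {
    svst1_u8(all, d + i, zero);
    svst1_u8(all, d + i + vl, zero);
  }
  /* Tail: the predicate enables only the lanes with index < n. */
  for (; i < n; i += vl) {
    svbool_t pg = svwhilelt_b8_u64(i, (uint64_t)n);
    svst1_u8(pg, d + i, zero);
  }
#else
  memset(dst, 0, n);                     /* fallback on non-SVE builds */
#endif
}

/* Hypothetical name; copies n bytes from src to dst (regions must not overlap). */
static void sketch_memcpy(void* dst, const void* src, size_t n) {
  assert((dst != NULL && src != NULL) || n == 0);
#if SKETCH_HAVE_SVE
  uint8_t* d = (uint8_t*)dst;
  const uint8_t* s = (const uint8_t*)src;
  const uint64_t vl = svcntb();
  const svbool_t all = svptrue_b8();
  uint64_t i = 0;
  for (; i + 2 * vl <= n; i += 2 * vl) {
    svuint8_t v0 = svld1_u8(all, s + i);
    svuint8_t v1 = svld1_u8(all, s + i + vl);
    svst1_u8(all, d + i, v0);
    svst1_u8(all, d + i + vl, v1);
  }
  for (; i < n; i += vl) {
    svbool_t pg = svwhilelt_b8_u64(i, (uint64_t)n);
    svst1_u8(pg, d + i, svld1_u8(pg, s + i));
  }
#else
  memcpy(dst, src, n);                   /* fallback on non-SVE builds */
#endif
}
```

Because the loop bounds are expressed in terms of `svcntb()` and the tail is handled by a `svwhilelt` predicate, the same binary works for any hardware vector length without a scalar remainder loop.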
Key Technical Features:
* VLA Implementation: Scales across all SVE-capable hardware (128-bit to 2048-bit vectors).
* Runtime Vector-Length Awareness: The SVE fast path is enabled only on systems with vector lengths of 256 bits or more. Platforms with 128-bit SVE implementations (where optimized NEON/ASIMD paths remain competitive) automatically fall back to the existing standard primitives to avoid performance regressions from SVE setup overhead.
* Dual Primitive Support: Includes optimized SVE paths for both memory clearing (memzero) and data movement (memcpy).
* Runtime Safety: Automatic fallback to standard primitives on non-SVE hardware via AT_HWCAP feature detection.
* Ecosystem Impact: Directly enhances PyTorch on ARM (which recently adopted mimalloc as default) by accelerating tensor lifecycle management on ARMv9-A hardware.
* Toolchain Compatibility (Compiler Support): Adds compiler-safe SVE integration through compile-time feature detection and function-level target attributes (target("+sve") / target("sve")), allowing mimalloc to preserve baseline AArch64 compatibility while selectively enabling optimized SVE code paths on supported GCC and Clang toolchains. Older or non-SVE-capable compilers automatically fall back to the existing implementations.
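The runtime gating described in the list above could be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's implementation: the helper names (`sve_vector_bytes`, `sve_fast_path_allowed`, `use_sve_fast_path`) are hypothetical, and reading the vector length through `prctl(PR_SVE_GET_VL)` is just one option (the PR may instead read it with an SVE instruction such as `svcntb()`). The fallback `#define` values mirror the Linux arm64 UAPI headers and only take effect when the libc headers do not already provide them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#if defined(__linux__) && defined(__aarch64__)
#include <sys/auxv.h>    /* getauxval, AT_HWCAP */
#include <sys/prctl.h>   /* prctl */
#ifndef HWCAP_SVE
#define HWCAP_SVE (1UL << 22)
#endif
#ifndef PR_SVE_GET_VL
#define PR_SVE_GET_VL 51
#endif
#ifndef PR_SVE_VL_LEN_MASK
#define PR_SVE_VL_LEN_MASK 0xffff
#endif
#endif

/* Hypothetical policy helper: enable the fast path only for SVE hardware
   with vectors of at least 256 bits (32 bytes). */
static bool sve_fast_path_allowed(bool has_sve, size_t vl_bytes) {
  assert(vl_bytes % 16 == 0);   /* SVE vector lengths are multiples of 128 bits */
  return has_sve && vl_bytes >= 32;
}

/* Hypothetical helper: SVE vector length in bytes, or 0 when SVE is
   unavailable (or when not running on Linux/AArch64). */
static size_t sve_vector_bytes(void) {
#if defined(__linux__) && defined(__aarch64__)
  if ((getauxval(AT_HWCAP) & HWCAP_SVE) == 0) return 0;  /* kernel reports no SVE */
  int vl = prctl(PR_SVE_GET_VL);                         /* length plus flag bits */
  if (vl < 0) return 0;
  return (size_t)(vl & PR_SVE_VL_LEN_MASK);
#else
  return 0;
#endif
}

/* Hypothetical entry point: callers would evaluate this once at startup,
   cache the result, and use the existing primitives when it is false. */
static bool use_sve_fast_path(void) {
  size_t vl = sve_vector_bytes();
  return sve_fast_path_allowed(vl != 0, vl);
}
```

Keeping the policy (`sve_fast_path_allowed`) separate from the probing (`sve_vector_bytes`) makes the 256-bit threshold easy to test and tune without SVE hardware at hand.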
Key Performance Highlights
Instruction Efficiency
Throughput Improvements
Branch Predictability
Hardware-Aware Runtime Gating
Operational Efficiency
Benchmarking Environment
Full Performance Comparison: AArch64 SVE vs. OSS (Standard)
Environment: AWS Graviton3 (Neoverse-V1), 32 vCPUs, 256-bit SVE hardware | 100,000 iterations per size
Test script:
fast_bench.cpp
Benchmark Source
Compile:
g++ -O3 -Drestrict=__restrict -I/home/maajid/mimalloc_build_apr28/mimalloc/include fast_bench.cpp -lpthread -o bench_sve
Test
./bench_sve 2048 1 (where the first argument is the threshold size, n = 2048, and the second is the iteration count, 1)
Final Benchmarking Script: