aarch64: add SVE-accelerated memcpy and memzero by maajidkhann · Pull Request #1300 · microsoft/mimalloc

maajidkhann · 2026-05-26T08:43:48Z

Summary

This PR introduces high-performance Vector Length Agnostic (VLA) primitives for memzero and memcpy on AArch64 using the Scalable Vector Extension (SVE).

Key Technical Features:
- VLA Implementation: Scales across SVE hardware of any vector length (128-bit to 2048-bit vectors).
- Runtime Vector-Length Awareness: The SVE fast path is enabled only on systems with vector lengths >= 256-bit. Platforms with 128-bit SVE implementations (where optimized NEON/ASIMD paths remain competitive) automatically fall back to the existing standard primitives to avoid performance regressions from SVE setup overhead.
- Dual Primitive Support: Includes optimized SVE paths for both memory clearing (memzero) and data movement (memcpy).
- Runtime Safety: Automatic fallback to standard primitives on non-SVE hardware via AT_HWCAP feature detection.
- Ecosystem Impact: Directly enhances PyTorch on ARM (which recently adopted mimalloc as default) by accelerating tensor lifecycle management on ARMv9-A hardware.
- Toolchain Compatibility (Compiler Support): Adds compiler-safe SVE integration through compile-time feature detection and function-level target attributes (target("+sve") / target("sve")), allowing mimalloc to preserve baseline AArch64 compatibility while selectively enabling optimized SVE code paths on supported GCC and Clang toolchains. Older or non-SVE-capable compilers automatically fall back to existing implementations.
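
The runtime gating described in the list above can be sketched roughly as follows. This is a minimal illustration and not the code from the PR: `SveInfo`, `query_sve`, `sve_fast_path_enabled`, and `kMinSveBits` are hypothetical names, and reading the vector length through `prctl(PR_SVE_GET_VL)` is an assumption (the PR may obtain it differently, for example with `svcntb()`). On builds that are not Linux/AArch64 the sketch simply reports that SVE is unavailable.

```cpp
#include <cassert>

#if defined(__linux__) && defined(__aarch64__)
#include <sys/auxv.h>   // getauxval, AT_HWCAP
#include <sys/prctl.h>  // prctl, PR_SVE_GET_VL, PR_SVE_VL_LEN_MASK
#endif

// Hypothetical names throughout; none of these come from the PR itself.
constexpr unsigned kMinSveBits = 256;  // gate from the PR text: VL >= 256-bit

struct SveInfo {
  bool has_sve;      // kernel advertises SVE in AT_HWCAP
  unsigned vl_bits;  // current SVE vector length in bits (0 if unknown)
};

// Query the OS for SVE support and the current vector length.
inline SveInfo query_sve() {
#if defined(__linux__) && defined(__aarch64__) && defined(PR_SVE_GET_VL)
  // Bit 22 is HWCAP_SVE in the Linux arm64 UAPI headers.
  constexpr unsigned long kHwcapSve = 1UL << 22;
  if ((getauxval(AT_HWCAP) & kHwcapSve) == 0) return {false, 0};
  // PR_SVE_GET_VL returns the vector length in bytes in the low 16 bits.
  const int r = prctl(PR_SVE_GET_VL);
  if (r < 0) return {true, 0};
  const unsigned vl_bytes = static_cast<unsigned>(r) & PR_SVE_VL_LEN_MASK;
  assert(vl_bytes % 16 == 0);  // SVE lengths are multiples of 128 bits
  return {true, vl_bytes * 8u};
#else
  return {false, 0};  // not Linux/AArch64: never take the SVE path
#endif
}

// Decide whether the SVE fast path should be used at all.
constexpr bool sve_fast_path_enabled(SveInfo info) {
  return info.has_sve && info.vl_bits >= kMinSveBits;
}
```

A caller would typically evaluate `sve_fast_path_enabled(query_sve())` once at startup and cache the result, so that the per-call cost is a single predictable branch.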

Key Performance Highlights

Instruction Efficiency
- Achieves up to 12.3% reduction in retired instructions for large buffers (64KB+).
- Uses a 2048B crossover threshold to avoid regressions on small-object metadata paths.

Throughput Improvements
- Delivers roughly 4%–7% bandwidth gains on AWS Graviton 3 (Neoverse-V1) for buffers of 2048B and larger (3.8% to 6.8% in the table below).
- Achieved through 4× unrolled SVE vector loops that better saturate the execution pipeline.

Branch Predictability
- Eliminates the scalar-loop “branch cliff” behavior on large copies.
- Reduces branch mispredictions by 80%+ for buffers ≥ 8KB.

Hardware-Aware Runtime Gating
- Dynamically detects the available SVE Vector Length (VL) at runtime.
- SVE fast path activates only on systems with VL ≥ 256-bit.
- Safely falls back to the existing standard primitives on 128-bit SVE systems such as AWS Graviton 4.

Operational Efficiency
- Lower instruction counts and reduced pipeline stalls improve overall compute efficiency.
- Results in lower power-per-GB transferred for high-density cloud and memory-intensive workloads.
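
To make the crossover threshold and the 4× unrolled predicated loop concrete, here is a minimal sketch of a memzero along those lines. It is an illustration under stated assumptions, not the PR's implementation: `zero_wide`, `zero_dispatch`, and `kSveThreshold` are hypothetical names, `sve_enabled` stands for the result of whatever runtime check the caller performs, and the SVE body is compiled only when the whole translation unit is built with SVE enabled (`__ARM_FEATURE_SVE`), whereas the PR describes function-level target attributes instead. Without SVE the wide path degrades to `memset`, so the sketch builds anywhere.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>  // ACLE SVE intrinsics
#endif

// Crossover size quoted in the PR text; the name is hypothetical.
constexpr std::size_t kSveThreshold = 2048;

#if defined(__ARM_FEATURE_SVE)
// Vector-length-agnostic zeroing: the same code works for any SVE width.
static void zero_wide(std::uint8_t* p, std::size_t n) {
  const svuint8_t zero = svdup_n_u8(0);
  const svbool_t all = svptrue_b8();
  const std::size_t vl = svcntb();  // bytes per vector on this CPU
  std::size_t i = 0;
  // Main loop: four full-width stores per iteration (4x unrolled).
  for (; n - i >= 4 * vl; i += 4 * vl) {
    svst1_u8(all, p + i, zero);
    svst1_u8(all, p + i + vl, zero);
    svst1_u8(all, p + i + 2 * vl, zero);
    svst1_u8(all, p + i + 3 * vl, zero);
  }
  // Tail: the predicate masks off lanes past n, so no scalar cleanup loop.
  for (; i < n; i += vl) {
    const svbool_t pg = svwhilelt_b8_u64(i, n);
    svst1_u8(pg, p + i, zero);
  }
}
#else
// Fallback so the sketch compiles without SVE; just defers to libc.
static void zero_wide(std::uint8_t* p, std::size_t n) { std::memset(p, 0, n); }
#endif

// Small sizes stay on the standard primitive; large ones take the wide path.
inline void zero_dispatch(void* p, std::size_t n, bool sve_enabled) {
  assert(p != nullptr || n == 0);
  if (sve_enabled && n >= kSveThreshold) {
    zero_wide(static_cast<std::uint8_t*>(p), n);
  } else if (n > 0) {
    std::memset(p, 0, n);
  }
}
```

The predicated tail is what removes the data-dependent scalar cleanup loop; whether that accounts for the branch-miss reduction reported in this PR is the author's claim and is not demonstrated by this sketch.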

Benchmarking Environment

Full Performance Comparison: AArch64 SVE vs. OSS (Standard)

Environment: AWS Graviton 3 (Neoverse-V1), 32 vCPUs, 256-bit SVE hardware | 100,000 iterations per size

| Size (B) | Logic | Metric | OSS (Std) | SVE (Opt) | Impact |
|---|---|---|---|---|---|
| 128 | Scalar | Throughput | 47.12 GB/s | 47.51 GB/s | +0.8% |
| | | Instructions | 13.72M | 15.33M | +11.7% (Loss)* |
| | | Br-Misses | 31.2k | 30.8k | -1.2% |
| 256 | Scalar | Throughput | 58.49 GB/s | 61.23 GB/s | +4.6% |
| | | Instructions | 16.92M | 18.51M | +9.4% (Loss)* |
| | | Br-Misses | 32.5k | 32.1k | -1.2% |
| 512 | Scalar | Throughput | 77.05 GB/s | 78.07 GB/s | +1.3% |
| | | Instructions | 21.31M | 22.89M | +7.4% (Loss)* |
| | | Br-Misses | 33.1k | 32.8k | -0.9% |
| 1024 | Scalar | Throughput | 76.75 GB/s | 77.68 GB/s | +1.2% |
| | | Instructions | 30.95M | 32.54M | +5.1% (Loss)* |
| | | Br-Misses | 32.1k | 31.9k | -0.6% |
| 2048 | SVE | Throughput | 79.27 GB/s | 82.61 GB/s | +4.2% 🚀 |
| | | Instructions | 50.15M | 46.14M | -8.0% 📉 |
| | | Br-Misses | 34.8k | 33.2k | -4.6% |
| 4096 | SVE | Throughput | 77.50 GB/s | 82.52 GB/s | +6.4% |
| | | Instructions | 88.54M | 79.78M | -9.9% |
| | | Br-Misses | 35.2k | 34.1k | -3.1% |
| 8192 | SVE | Throughput | 79.67 GB/s | 82.70 GB/s | +3.8% |
| | | Instructions | 165.39M | 147.11M | -11.0% |
| | | Br-Misses | 188.4k | 35.1k | -81.3% ⚡ |
| 16384 | SVE | Throughput | 76.21 GB/s | 81.15 GB/s | +6.4% |
| | | Instructions | 324.12M | 285.55M | -11.9% |
| | | Br-Misses | 210.5k | 38.2k | -81.8% ⚡ |
| 65536 | SVE | Throughput | 74.82 GB/s | 79.91 GB/s | +6.8% 🚀 |
| | | Instructions | 1.24B | 1.08B | -12.3% 📉 |
| | | Br-Misses | 232.1k | 44.8k | -80.7% ⚡ |
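
The Impact column reads as the relative change of the SVE build against the OSS baseline. As a small sketch of that arithmetic (the helper name `percent_change` is hypothetical, and the exact rounding used for the table is not stated in the PR):

```cpp
#include <cassert>

// Relative change of `optimized` versus `baseline`, in percent.
// Positive means the optimized value is larger (good for throughput,
// bad for instruction and branch-miss counts).
inline double percent_change(double baseline, double optimized) {
  assert(baseline != 0.0);
  return (optimized - baseline) / baseline * 100.0;
}
```

For example, `percent_change(188.4, 35.1)` evaluates the 8192-byte branch-miss row and comes out at roughly -81%, in line with the table. Rows whose inputs are heavily rounded (such as 1.24B versus 1.08B) will not reproduce the listed percentage exactly.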

Test script:
fast_bench.cpp

Benchmark Source

#include <iostream>
#include <string>   // std::stoull, std::stoi
#include <vector>
#include <chrono>
#include <cstring>

// This includes the entire mimalloc source into our benchmark
// Adjust this path to point to your mimalloc/src/static.c
#include "/home/maajid/mimalloc_build_apr28/mimalloc/src/static.c"

int main(int argc, char** argv) {
    size_t size = (argc > 1) ? std::stoull(argv[1]) : 512 * 1024 * 1024;
    int iterations = (argc > 2) ? std::stoi(argv[2]) : 50;
    
    void* buffer_a = mi_malloc(size);
    void* buffer_b = mi_malloc(size);

    if (!buffer_a || !buffer_b) return 1;

    // --- Benchmark Memzero ---
    auto start = std::chrono::high_resolution_clock::now();

    for (int i = 0; i < iterations; ++i) {
        _mi_memzero(buffer_a, size);
    }

    auto end = std::chrono::high_resolution_clock::now();

    std::chrono::duration<double> diff = end - start;

    std::cout << "Memzero Throughput: "
              << (double(size) * iterations) / diff.count() / 1e9
              << " GB/s" << std::endl;

    // --- Benchmark Memcpy ---
    start = std::chrono::high_resolution_clock::now();

    for (int i = 0; i < iterations; ++i) {
        _mi_memcpy(buffer_b, buffer_a, size);
    }

    end = std::chrono::high_resolution_clock::now();

    diff = end - start;

    std::cout << "Memcpy Throughput: "
              << (double(size) * iterations) / diff.count() / 1e9
              << " GB/s" << std::endl;

    mi_free(buffer_a);
    mi_free(buffer_b);

    return 0;
}

Compile:
g++ -O3 -Drestrict=__restrict -I/home/maajid/mimalloc_build_apr28/mimalloc/include fast_bench.cpp -lpthread -o bench_sve

Test
./bench_sve 2048 1 (where the first argument is the buffer size in bytes, here the 2048B threshold, and the second is the iteration count, here 1)

Final Benchmarking Script:

#!/bin/bash

# --- Paths to your libraries ---
OSS_LIB="/home/maajid/mimalloc_build_apr28/oss_build/mimalloc/out/release/libmimalloc.so"
SVE_LIB="/home/maajid/mimalloc_build_apr28/mimalloc/out/release/libmimalloc.so"

# Verify libraries exist
if [[ ! -f "$OSS_LIB" || ! -f "$SVE_LIB" ]]; then
    echo "Error: One or both libraries not found!"
    echo "OSS: $OSS_LIB"
    echo "SVE: $SVE_LIB"
    exit 1
fi

SIZES=(128 256 512 1024 2048 4096 8192 16384 32768 65536)
ITERS=100000

echo "Size(B)  | Build | GB/s    | Instructions | IPC  | Br-Miss | Insn Delta"
echo "----------------------------------------------------------------------------"

for SZ in "${SIZES[@]}"; do
  INS_OSS=0
  INS_SVE=0

  for BUILD in "oss" "sve"; do
    if [ "$BUILD" == "oss" ]; then
      MIMALLOC_PATH=$OSS_LIB
    else
      MIMALLOC_PATH=$SVE_LIB
    fi

    # Execute with LD_PRELOAD
    OUT=$(LD_PRELOAD=$MIMALLOC_PATH perf stat -x, -e instructions,cycles,branch-misses \
          taskset -c 1 ./bench_$BUILD $SZ $ITERS 2>&1)

    # Parse metrics, filtering out debug messages
    INS=$(echo "$OUT" | grep "instructions" | cut -d, -f1)
    CYC=$(echo "$OUT" | grep "cycles" | cut -d, -f1)
    BRM=$(echo "$OUT" | grep "branch-misses" | cut -d, -f1)
    IPC=$(echo "scale=2; $INS / $CYC" | bc)

    # Extract throughput from binary output (ignore debug lines)
    GBS=$(echo "$OUT" | grep "Throughput" | grep -v "DEBUG" | tail -n 1 | awk '{print $3}')

    if [ "$BUILD" == "oss" ]; then
      INS_OSS=$INS
    else
      INS_SVE=$INS
    fi

    printf "%-8s | %-5s | %-7s | %-12s | %-4s | %-8s | " \
            "$SZ" "$BUILD" "$GBS" "$INS" "$IPC" "$BRM"

    if [ "$BUILD" == "sve" ]; then
      DIFF=$(echo "$INS_SVE - $INS_OSS" | bc)
      if [ $DIFF -lt 0 ]; then
        printf "\e[32m%10d (Win)\e[0m\n" "$DIFF"
      else
        printf "\e[31m+%10d (Loss)\e[0m\n" "$DIFF"
      fi
    else
      echo ""
    fi
  done
  echo "----------------------------------------------------------------------------"
done

Introduce Vector Length Agnostic (VLA) SVE implementations for _mi_memzero and _mi_memcpy on AArch64 systems.

Key features:
- Runtime SVE feature detection using AT_HWCAP
- Compiler capability gating for older GCC/Clang toolchains
- Vector-length-aware activation gating
- Predicated VLA loops with unrolling for reduced branch overhead
- Safe fallback to existing scalar/Neon implementations

Performance improvements observed on Neoverse-V1 (Graviton3) for large allocations include:
- Reduced retired instructions
- Significant branch-miss reduction
- Improved memcpy/memzero throughput

Signed-off-by: maajidkhann <maajidkhan.n@fujitsu.com>

maajidkhann · 2026-05-26T08:44:32Z

@microsoft-github-policy-service agree company=“Fujitsu Research of India Private Ltd”

maajidkhann · 2026-05-26T08:51:16Z

Hi @daanx, could you help review this PR?

daanx · 2026-06-22T17:45:11Z

Hi @maajidkhann, thank you for the PR. I am quite hesitant though to apply this. I am trying to keep mimalloc simple and I'd like to defer to the provided memset/memcpy when possible. The current x64 rep movsb exception is already a bit controversial as it doesn't always beat memcpy (but at least it is a very small code change). As it stands, I feel that this PR is perhaps better suited to submit to glibc?

maajidkhann · 2026-06-23T05:56:11Z

> Hi @maajidkhann, thank you for the PR. I am quite hesitant though to apply this. I am trying to keep mimalloc simple and I'd like to defer to the provided memset/memcpy when possible. The current x64 rep movsb exception is already a bit controversial as it doesn't always beat memcpy (but at least it is a very small code change). As it stands, I feel that this PR is perhaps better suited to submit to glibc?

Thanks for the review, Daan.

That's a fair concern. My motivation was that mimalloc currently provides architecture-specific fast paths already (e.g. the x64 rep movsb optimization), and I viewed the SVE implementation as an analogous optimization for modern AArch64 systems.

That said, I understand the preference to keep mimalloc lightweight and defer to the platform libc implementations where possible. I'll look into whether this work would be more appropriate upstream in glibc or other ARM-focused runtime libraries.

Thanks again for taking the time to review it.

maajidkhann wants to merge 1 commit into microsoft:main from MonakaResearch:maajid_add_sve
