Optimize GPU radix sort: ballot kernels, fused range, skip-pass, tuning by shreyas-omkar · Pull Request #93 · JuliaGPU/AcceleratedKernels.jl

shreyas-omkar · 2026-07-01T07:44:23Z

Follow-up to #90 (which added the opt-in RadixSort()). This makes the GPU
radix sort substantially faster while keeping it correct on every backend.

On an RTX 5080, 4M elements, default settings, RadixSort() now beats the
existing GPU MergeSort() for every supported type and is ~2.3–7.7× faster
than the radix sort merged in #90.

Benchmarks (RTX 5080, 4M elements)

| Type | RadixSort (this PR) | MergeSort | RadixSort speedup vs MergeSort |
|---|---|---|---|
| Int32 | 1.44 ms | 1.54 ms | 1.05× faster |
| Float32 | 1.28 ms | 1.75 ms | 1.37× faster |
| Float64 | 2.37 ms | 4.81 ms | 1.94× faster |

Structured / small-range data is much faster via pass skipping (e.g. UInt32
values in [0, 255] sort in a single pass: ~0.53 ms, 7.9 GE/s).
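
To make the single-pass claim concrete, here is a minimal Python sketch of an LSD radix sort with pass skipping. It is not the Julia code in this PR: the names `passes_needed` and `radix_sort_skip` are hypothetical, the keys are assumed to be non-negative integers that fit in `key_bytes` bytes, and everything runs sequentially on the host.

```python
def passes_needed(lo, hi, key_bytes=4):
    """Return the byte indices (0 = least significant) that still need a pass.

    Byte b can be skipped when lo and hi agree on every bit from byte b
    upward: every key in [lo, hi] then carries the same value in that byte.
    """
    return [b for b in range(key_bytes) if (lo >> (8 * b)) != (hi >> (8 * b))]


def radix_sort_skip(keys, key_bytes=4):
    """LSD radix sort, one byte per pass, skipping passes that cannot reorder."""
    if not keys:
        return []
    out = list(keys)
    lo, hi = min(out), max(out)
    for b in passes_needed(lo, hi, key_bytes):
        shift = 8 * b
        digits = [(k >> shift) & 0xFF for k in out]

        # Histogram of this byte.
        hist = [0] * 256
        for d in digits:
            hist[d] += 1

        # Exclusive prefix sum gives each bucket's starting offset.
        offsets = [0] * 256
        running = 0
        for d in range(256):
            offsets[d] = running
            running += hist[d]

        # Stable scatter into the next buffer.
        nxt = [0] * len(out)
        for k, d in zip(out, digits):
            nxt[offsets[d]] = k
            offsets[d] += 1
        out = nxt
    return out
```

With `lo = 0` and `hi = 255`, only byte 0 differs between the two bounds, so `passes_needed` returns a single pass, which matches the [0, 255] case quoted above.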

What changed

- Fused min/max range. The pass-skip range is now computed with a single
  mapreduce over the sort keys instead of separate minimum + maximum
  passes. This alone was the largest win (up to ~5× on the range step for
  64-bit types, which dominated the previous timing).
- Skip-pass. A byte's histogram+scan+scatter is skipped when the min and
  max keys share the whole suffix from that byte up, which is free for
  structured or small-range data.
- Ballot-aggregated histogram (CUDA). Per-warp leaders add same-digit
  popcounts to per-warp sub-histograms via sub_group_ballot, i.e. O(warps)
  per block instead of the O(N²) broadcast scan. No shared-memory atomics
  (Atomix has no @localmem atomic path on CUDA/POCL).
- Ballot intra-block scatter rank (CUDA).
- Faster generic histogram. The scan histogram now precomputes each
  element's digit once instead of recomputing it inside the per-bucket scan
  (~2.6× on that phase); used on all non-CUDA backends.
- block_size tuning. GPU sort block_size now defaults to nothing, and
  each algorithm resolves its own tuned default (merge 256, radix 512); an
  explicit block_size is still honored.
- Reuse accumulate! scratch across passes to avoid per-pass allocation.
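
The fused min/max range can be illustrated with a small Python stand-in for a single `mapreduce`. This is a sketch only: the pair encoding and the `fused_minmax` name are assumptions, not the operator used in the Julia source.

```python
from functools import reduce


def fused_minmax(keys):
    """Return (min, max) of keys in one pass over the data.

    Map: each key k becomes the pair (k, k).
    Reduce: pairs combine with an associative operator, so a parallel
    reduction can evaluate it in any grouping.
    Raises TypeError on empty input because no initial value is supplied.
    """
    def combine(a, b):
        return (min(a[0], b[0]), max(a[1], b[1]))

    return reduce(combine, ((k, k) for k in keys))
```

Because the keys are read once instead of twice, the saving grows with key width, which is consistent with the larger gain reported for 64-bit types.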

Non-CUDA backends use the scan-based histogram and scatter throughout (the
ballot kernels require a real 32-lane sub-group).
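
The ballot-aggregated histogram can be emulated on the host to show the idea. The sketch below is Python, not the CUDA path in this PR: `ballot` and `warp_histogram` are hypothetical names, the peer mask is built by direct comparison rather than by whatever sequence of ballots the real kernel issues, and lanes are visited in a loop instead of running in lockstep.

```python
WARP = 32  # assumed sub-group width, as stated for the CUDA path


def ballot(predicate_per_lane):
    """Emulate a sub-group ballot: bit i is set when lane i's predicate holds."""
    mask = 0
    for lane, flag in enumerate(predicate_per_lane):
        if flag:
            mask |= 1 << lane
    return mask


def warp_histogram(digits, n_buckets=256):
    """Histogram one warp's digits with a single add per distinct digit."""
    assert len(digits) <= WARP
    hist = [0] * n_buckets
    for lane, d in enumerate(digits):
        # Mask of lanes holding the same digit as this lane (always includes itself).
        peers = ballot([other == d for other in digits])
        # The lowest set bit identifies the leader lane for this digit.
        leader = (peers & -peers).bit_length() - 1
        if lane == leader:
            # Only the leader writes, adding the whole group's count at once.
            hist[d] += bin(peers).count("1")
    return hist
```

Since exactly one lane per distinct digit writes to the sub-histogram, no two lanes of a warp update the same bucket, which is why this scheme does not need shared-memory atomics.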

Correctness

- CPU (POCL) suite: 34,072 / 34,072 pass.
- GPU (CUDA, RTX 5080): 205 / 205 across all six element types, block sizes
  64–512, ascending and descending, plus edge cases (all-equal, narrow ranges,
  tiny sizes).

…ing)

Rework the opt-in GPU LSD radix sort for correctness and speed. On an RTX 5080 at 4M elements this is ~2.3x faster than the previous radix implementation for UInt32 and now beats merge sort for every supported type (UInt32 1.05x, Float32 1.37x, Float64 1.94x); small-range/structured data is much faster still via pass skipping (UInt32 in [0,255]: ~0.53 ms).

Correctness: the previous CUDA path had never been verified (only timed) and in fact produced wrong output: KA's get_sub_group_local_id() is 1-based, so the ballot lane masks were off by one and no per-digit leader was ever elected. Now verified 34072/34072 CPU tests and 205/205 GPU cases across all 6 element types, block sizes 64-512, and both directions.

Changes:
- Fused min/max: compute the (min, max) sort-key range in a single reduction over the keys instead of separate minimum + maximum passes.
- Skip-pass: skip a byte's histogram+scan+scatter when min and max share the whole suffix from that byte up (correct sufficient condition).
- Ballot-aggregated histogram (CUDA): per-warp leaders add same-digit popcounts to per-warp sub-histograms in shared memory (no localmem atomics; Atomix has no such path on CUDA/POCL). O(warps) vs O(N^2).
- Ballot intra-block scatter rank (CUDA), 0-based lane fix.
- Faster generic scan histogram: precompute each element's digit once instead of recomputing it inside the per-bucket scan (~2.6x that phase).
- Tune block_size: default to 512 for radix (vs 256 for merge) via a per-algorithm default; sort block_size now defaults to nothing.

Non-CUDA backends use the scan histogram + scatter throughout.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
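
The off-by-one described in this commit can be reproduced in a small host-side emulation. The Python below is an illustration under assumptions, not the kernel code: it shows one plausible way a 1-based lane id breaks a "lanes below me" mask, and `ranks_and_leaders` is a hypothetical helper.

```python
def popcount(x):
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")


def ranks_and_leaders(digits, one_based=False):
    """Per-lane rank among same-digit peers, plus a leader flag.

    rank   = number of same-digit lanes strictly below this lane
    leader = no same-digit lane below this one (rank == 0)
    With one_based=True the lane id is taken as lane + 1, mimicking a
    1-based id query being used where a 0-based index is expected.
    """
    ranks, leaders = [], []
    for lane0, d in enumerate(digits):
        # Ballot result: bit i set when lane i holds the same digit.
        peers = sum(1 << i for i, other in enumerate(digits) if other == d)
        lane_id = lane0 + 1 if one_based else lane0
        below = peers & ((1 << lane_id) - 1)  # intended: lanes strictly below
        ranks.append(popcount(below))
        leaders.append(below == 0)
    return ranks, leaders
```

In this emulation the 1-based variant always includes the lane's own bit in `below`, so every rank is one too high and no lane ever sees an empty mask. That is consistent with the symptom reported above, where no per-digit leader was elected.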

The per-pass exclusive prefix sum allocated its block-prefix buffer on every call. Preallocate it once (sized to the histogram's scan blocks) and pass it as accumulate!'s temp, removing n_passes allocations per sort and shaving ~10% off the accumulate phase.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
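
The scratch-reuse pattern can be sketched as a blocked exclusive scan that takes its block-prefix buffer from the caller. This is sequential Python for illustration, not the accumulate! implementation; `exclusive_scan_blocked` and its parameters are hypothetical.

```python
def exclusive_scan_blocked(values, block, temp):
    """In-place blocked exclusive prefix sum using a caller-provided temp buffer.

    temp must hold at least one slot per block; it is overwritten on each
    call, so one allocation can serve every radix pass.
    """
    n_blocks = (len(values) + block - 1) // block
    assert len(temp) >= n_blocks

    # Phase 1: total of each block.
    for b in range(n_blocks):
        temp[b] = sum(values[b * block:(b + 1) * block])

    # Phase 2: exclusive scan of the block totals.
    running = 0
    for b in range(n_blocks):
        temp[b], running = running, running + temp[b]

    # Phase 3: scan inside each block, starting from its block prefix.
    for b in range(n_blocks):
        running = temp[b]
        for i in range(b * block, min((b + 1) * block, len(values))):
            values[i], running = running, running + values[i]
    return values
```

A driver would allocate `temp` once before the pass loop and hand the same buffer to every call, rather than letting each call allocate its own.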

The driver synchronized the host after every kernel launch (histogram, accumulate!, scatter) on each pass. The kernels are enqueued on a single in-order backend stream, so they are already correctly sequenced; a single synchronize before return preserves the synchronous semantics. ~1.4-1.6x faster with identical results, verified on an RTX 5080 (4M elements, all supported types correct).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…kends

_accumulate_block!'s inclusive-scan path re-read the block's last element from global memory (v[(iblock+1)*block_size*2] or v[len]) inside a single-thread branch, after the scan's barriers. That divergent global read is undefined behaviour: it traps with an illegal address on ROCm (AMDGPU) and deadlocks on POCL; only CUDA tolerated it. As a result accumulate!, and everything built on it (e.g. the GPU radix sort), crashed on AMD and hung on POCL for any array spanning more than one workgroup.

Load each thread's two elements into registers up front and capture the array's last element v[len] into a one-slot shared buffer, then use those in the inclusive shift instead of re-reading v. The values are identical, so CUDA results are unchanged.

Verified on CUDA (RTX 5080), AMD (RX 9060 XT) and oneAPI (Intel Iris Xe): accumulate! (ScanPrefixes) and radix sort correct across sizes and types.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
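
The "capture first, then shift" idea can be shown with a single-block Python sketch. It is not _accumulate_block! itself: `inclusive_from_exclusive` is a hypothetical name, the scan is sequential, and a plain local variable stands in for the one-slot shared buffer.

```python
def inclusive_from_exclusive(v):
    """Turn v into its inclusive prefix sum via an exclusive scan plus a shift.

    The last input element is captured before anything is overwritten, so
    the final slot never needs to re-read the original array.
    """
    if not v:
        return v
    last_value = v[-1]  # stand-in for the one-slot shared buffer

    # Exclusive scan: exclusive[i] = sum of v[0..i-1].
    exclusive = []
    running = 0
    for x in v:
        exclusive.append(running)
        running += x

    # Inclusive shift: inclusive[i] = exclusive[i + 1] for all but the last slot.
    for i in range(len(v) - 1):
        v[i] = exclusive[i + 1]
    # The last slot has no right neighbour, so it uses the captured value.
    v[-1] = exclusive[-1] + last_value
    return v
```

The result is the same as re-reading the last element afterwards; the difference is that the only read of the original data happens up front.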

shreyas-omkar and others added 4 commits July 5, 2026 00:54

shreyas-omkar force-pushed the sh/radix-optim branch 2 times, most recently from 6191014 to 59fdb29 Compare July 4, 2026 19:35

fix: Renamed KernelIntrinsics and KernelInterface

b6a54c1

shreyas-omkar wants to merge 5 commits into JuliaGPU:main from shreyas-omkar:sh/radix-optim
