
Optimize GPU radix sort: ballot kernels, fused range, skip-pass, tuning #93

Draft
shreyas-omkar wants to merge 5 commits into JuliaGPU:main from shreyas-omkar:sh/radix-optim

@shreyas-omkar

Member

Follow-up to #90 (which added the opt-in RadixSort()). This makes the GPU
radix sort substantially faster while keeping it correct on every backend.

On an RTX 5080, 4M elements, default settings, RadixSort() now beats the
existing GPU MergeSort() for every supported type and is ~2.3–7.7× faster
than the radix sort merged in #90.

Benchmarks (RTX 5080, 4M elements)

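| Type    | RadixSort (this PR) | MergeSort | RadixSort vs MergeSort |
|---------|---------------------|-----------|------------------------|
| Int32   | 1.44 ms             | 1.54 ms   | 1.05× faster           |
| Float32 | 1.28 ms             | 1.75 ms   | 1.37× faster           |
| Float64 | 2.37 ms             | 4.81 ms   | 1.94× faster           |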

Structured or small-range data is much faster thanks to pass skipping (e.g.
UInt32 values in [0, 255] sort in a single pass: ~0.53 ms, about 7.9
giga-elements/s).

What changed

  • Fused min/max range. The pass-skip range is now computed with a single
    mapreduce over the sort keys instead of separate minimum + maximum
    passes. This alone was the largest win (up to ~5× on the range step for
    64-bit types, which dominated the previous timing).
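  • Skip-pass. A byte's histogram+scan+scatter is skipped when the min and
    max keys agree on every bit from that byte up to the most significant
    one, so every key has the same digit there and the pass would be a
    no-op. This is free for structured or small-range data.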
  • Ballot-aggregated histogram (CUDA). Per-warp leaders add same-digit
    popcounts to per-warp sub-histograms via sub_group_ballot, i.e. O(warps)
    per block instead of the O(N²) broadcast scan. No shared-memory atomics
    (Atomix has no @localmem atomic path on CUDA/POCL).
  • Ballot intra-block scatter rank (CUDA).
  • Faster generic histogram. The scan histogram now precomputes each
    element's digit once instead of recomputing it inside the per-bucket scan
    (~2.6× on that phase); used on all non-CUDA backends.
  • block_size tuning. GPU sort block_size now defaults to nothing, and
    each algorithm resolves its own tuned default (merge 256, radix 512); an
    explicit block_size is still honored.
  • Reuse accumulate! scratch across passes to avoid per-pass allocation.
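
The fused range and skip-pass items above can be illustrated with a small CPU reference model. This is only a sketch in Python, not the kernels in this PR: the function names (`fused_min_max`, `passes_needed`, `radix_sort`) are hypothetical, keys are assumed to be unsigned integers already mapped to sort-key order, and the digit width is assumed to be 8 bits.

```python
from functools import reduce

def fused_min_max(keys):
    """One reduction that carries (min, max) together, instead of two passes."""
    return reduce(lambda acc, k: (min(acc[0], k), max(acc[1], k)),
                  keys, (keys[0], keys[0]))

def passes_needed(lo, hi, key_bits=32, digit_bits=8):
    """Byte indices whose pass cannot be skipped.

    Pass b is skipped when lo and hi agree on every bit from byte b upward.
    Every key lies in [lo, hi], so every key then has the same digit in
    byte b and a stable scatter on it would not move anything.
    """
    return [b for b in range(key_bits // digit_bits)
            if (lo >> (b * digit_bits)) != (hi >> (b * digit_bits))]

def radix_sort(keys, key_bits=32, digit_bits=8):
    """LSD radix sort on non-negative integers (CPU reference, assumed layout)."""
    if not keys:
        return []
    lo, hi = fused_min_max(keys)
    radix = 1 << digit_bits
    src = list(keys)
    for b in passes_needed(lo, hi, key_bits, digit_bits):
        shift = b * digit_bits
        # Compute each element's digit once and reuse it below.
        digits = [(k >> shift) & (radix - 1) for k in src]
        hist = [0] * radix
        for d in digits:
            hist[d] += 1
        # Exclusive prefix sum turns counts into starting offsets.
        offsets, total = [0] * radix, 0
        for d in range(radix):
            offsets[d] = total
            total += hist[d]
        # Stable scatter: elements with equal digits keep their order.
        dst = [0] * len(src)
        for k, d in zip(src, digits):
            dst[offsets[d]] = k
            offsets[d] += 1
        src = dst
    return src
```

With keys confined to [0, 255], `passes_needed` returns only byte 0, which matches the single-pass case quoted in the benchmarks. The condition is sufficient rather than necessary: a byte that happens to be constant across all keys while higher bytes differ is still processed.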

Non-CUDA backends use the scan-based histogram and scatter throughout (the
ballot kernels require a real 32-lane sub-group).
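
To make the ballot-aggregated histogram idea concrete, here is a CPU simulation in Python. It is a sketch under assumptions, not the CUDA kernel: the helper names are hypothetical, the peer mask is assumed to be built from one ballot per digit bit, and the leader is assumed to be the lowest lane among the peers.

```python
WARP = 32  # assumed sub-group width

def ballot(preds):
    """Model of a warp ballot: bit i is set when lane i's predicate is true."""
    mask = 0
    for lane, p in enumerate(preds):
        if p:
            mask |= 1 << lane
    return mask

def peer_masks(digits, digit_bits=8):
    """For each lane, the mask of lanes in the warp holding the same digit."""
    active = (1 << len(digits)) - 1
    peers = [active] * len(digits)
    for bit in range(digit_bits):
        # One collective ballot per digit bit.
        vote = ballot([(d >> bit) & 1 for d in digits])
        for lane, d in enumerate(digits):
            # Keep lanes that agree with this lane on the current bit.
            peers[lane] &= vote if (d >> bit) & 1 else (~vote & active)
    return peers

def warp_histogram(digits, radix=256):
    """One warp's sub-histogram: only the leader of each digit group adds."""
    hist = [0] * radix
    for lane, (d, peers) in enumerate(zip(digits, peer_masks(digits))):
        leader = (peers & -peers).bit_length() - 1  # lowest set bit, 0-based
        if lane == leader:
            hist[d] += bin(peers).count("1")        # popcount of the group
    return hist

def block_histogram(digits, radix=256):
    """Sum the per-warp sub-histograms for one block."""
    subs = [warp_histogram(digits[i:i + WARP], radix)
            for i in range(0, len(digits), WARP)]
    return [sum(h[d] for h in subs) for d in range(radix)]
```

Each digit group in a warp contributes exactly one addition, and each warp writes only to its own sub-histogram, which is why this scheme does not need shared-memory atomics.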

Correctness

  • CPU (POCL) suite: 34,072 / 34,072 pass.
  • GPU (CUDA, RTX 5080): 205 / 205 across all six element types, block sizes
    64–512, ascending and descending, plus edge cases (all-equal, narrow ranges,
    tiny sizes).

shreyas-omkar and others added 4 commits July 5, 2026 00:54
…ing)

Rework the opt-in GPU LSD radix sort for correctness and speed. On an
RTX 5080 at 4M elements this is ~2.3x faster than the previous radix
implementation for UInt32 and now beats merge sort for every supported
type (UInt32 1.05x, Float32 1.37x, Float64 1.94x); small-range/structured
data is much faster still via pass skipping (UInt32 in [0,255]: ~0.53 ms).

Correctness: the previous CUDA path had never been verified (only timed)
and in fact produced wrong output — KA's get_sub_group_local_id() is
1-based, so the ballot lane masks were off by one and no per-digit leader
was ever elected. Now verified 34072/34072 CPU tests and 205/205 GPU
cases across all 6 element types, block sizes 64-512, and both directions.
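
One way such an off-by-one can suppress leader election is sketched below in Python. This is an assumed model of the failure, not the original kernel code: `elected_leaders` is a hypothetical helper, and the leader rule (lane ID equals the index of the lowest set bit in the peer mask) is an assumption.

```python
def elected_leaders(digits, lane_id_base):
    """Positions (0-based) of lanes that consider themselves a group leader.

    lane_id_base = 0 models a 0-based lane ID; lane_id_base = 1 models a
    1-based lane ID used without adjustment.
    """
    leaders = []
    for pos, d in enumerate(digits):
        # Peer mask: bit p is set for every lane p holding the same digit.
        peers = sum(1 << p for p, other in enumerate(digits) if other == d)
        lowest = (peers & -peers).bit_length() - 1  # 0-based bit index
        lane_id = pos + lane_id_base
        if lane_id == lowest:
            leaders.append(pos)
    return leaders

digits = [5, 5, 7, 5, 7, 9]
print(elected_leaders(digits, lane_id_base=0))  # [0, 2, 5]
print(elected_leaders(digits, lane_id_base=1))  # []
```

In this model a lane always appears in its own peer mask, so the lowest set bit is never above the lane's 0-based position. A 1-based ID is therefore always one too large to match, and no lane is ever elected.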

Changes:
- Fused min/max: compute the (min, max) sort-key range in a single
  reduction over the keys instead of separate minimum + maximum passes.
- Skip-pass: skip a byte's histogram+scan+scatter when min and max share
  the whole suffix from that byte up (correct sufficient condition).
- Ballot-aggregated histogram (CUDA): per-warp leaders add same-digit
  popcounts to per-warp sub-histograms in shared memory (no localmem
  atomics; Atomix has no such path on CUDA/POCL). O(warps) vs O(N^2).
- Ballot intra-block scatter rank (CUDA), 0-based lane fix.
- Faster generic scan histogram: precompute each element's digit once
  instead of recomputing it inside the per-bucket scan (~2.6x that phase).
- Tune block_size: default to 512 for radix (vs 256 for merge) via a
  per-algorithm default; sort block_size now defaults to nothing.

Non-CUDA backends use the scan histogram + scatter throughout.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The per-pass exclusive prefix sum allocated its block-prefix buffer on
every call. Preallocate it once (sized to the histogram's scan blocks)
and pass it as accumulate!'s temp, removing n_passes allocations per
sort and shaving ~10% off the accumulate phase.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
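
The scratch-reuse idea in the commit above can be sketched as a two-level exclusive scan whose block-total buffer is supplied by the caller. This is a Python model under assumptions, not the accumulate! implementation: `exclusive_scan_blocked` and its `temp` argument are hypothetical names.

```python
def exclusive_scan_blocked(v, block, temp):
    """In-place exclusive prefix sum of v, using caller-provided scratch.

    temp must hold at least ceil(len(v) / block) entries; it is overwritten.
    """
    n_blocks = (len(v) + block - 1) // block
    assert len(temp) >= n_blocks
    # 1) Scan each block independently and record its total.
    for b in range(n_blocks):
        running = 0
        for i in range(b * block, min((b + 1) * block, len(v))):
            v[i], running = running, running + v[i]
        temp[b] = running
    # 2) Exclusive scan over the block totals.
    running = 0
    for b in range(n_blocks):
        temp[b], running = running, running + temp[b]
    # 3) Add each block's offset back to its elements.
    for b in range(n_blocks):
        for i in range(b * block, min((b + 1) * block, len(v))):
            v[i] += temp[b]
    return v

# Allocate the scratch once, then reuse it for every pass.
block, radix, n_passes = 64, 256, 4
temp = [0] * ((radix + block - 1) // block)
for _ in range(n_passes):
    hist = [1] * radix  # stand-in for one pass's histogram
    exclusive_scan_blocked(hist, block, temp)
```

The buffer is fully overwritten in step 1 on every call, so it does not need to be cleared between passes.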
The driver synchronized the host after every kernel launch (histogram,
accumulate!, scatter) on each pass. The kernels are enqueued on a single
in-order backend stream, so they are already correctly sequenced; a
single synchronize before return preserves the synchronous semantics.

~1.4-1.6x faster with identical results, verified on an RTX 5080 (4M
elements, all supported types correct).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…kends

_accumulate_block!'s inclusive-scan path re-read the block's last element
from global memory (v[(iblock+1)*block_size*2] or v[len]) inside a
single-thread branch, after the scan's barriers. That divergent global
read is undefined behaviour: it traps with an illegal address on ROCm
(AMDGPU) and deadlocks on POCL — only CUDA tolerated it. As a result
accumulate!, and everything built on it (e.g. the GPU radix sort), crashed
on AMD and hung on POCL for any array spanning more than one workgroup.

Load each thread's two elements into registers up front and capture the
array's last element v[len] into a one-slot shared buffer, then use those
in the inclusive shift instead of re-reading v. The values are identical,
so CUDA results are unchanged.

Verified on CUDA (RTX 5080), AMD (RX 9060 XT) and oneAPI (Intel Iris Xe):
accumulate! (ScanPrefixes) and radix sort correct across sizes and types.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
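
The inclusive shift described in the commit above can be modelled on the CPU as follows. This is a simplified Python sketch, not _accumulate_block! itself: `inclusive_scan_block` is a hypothetical name, and a single variable stands in for the one-slot shared buffer that holds the last element.

```python
def inclusive_scan_block(v):
    """Inclusive prefix sum built from an exclusive scan plus a shift.

    The last input element is captured before the scan overwrites the data,
    so the shift never has to re-read the original array afterwards.
    """
    if not v:
        return []
    last = v[-1]           # captured up front (stands in for the shared slot)
    excl, running = [], 0
    for x in v:            # exclusive scan: excl[i] = sum(v[:i])
        excl.append(running)
        running += x
    # Shift left by one; the final slot is the last exclusive value plus
    # the captured last element.
    return excl[1:] + [excl[-1] + last]

print(inclusive_scan_block([1, 2, 3, 4]))  # [1, 3, 6, 10]
```

Because the captured value equals what the late global read would have returned, the result is unchanged; only the point at which the value is read moves.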
@shreyas-omkar shreyas-omkar force-pushed the sh/radix-optim branch 2 times, most recently from 6191014 to 59fdb29 Compare July 4, 2026 19:35