Optimize GPU radix sort: ballot kernels, fused range, skip-pass, tuning#93
Draft
shreyas-omkar wants to merge 5 commits into
…ing)

Rework the opt-in GPU LSD radix sort for correctness and speed. On an RTX 5080 at 4M elements this is ~2.3x faster than the previous radix implementation for UInt32 and now beats merge sort for every supported type (UInt32 1.05x, Float32 1.37x, Float64 1.94x); small-range/structured data is much faster still via pass skipping (UInt32 in [0,255]: ~0.53 ms).

Correctness: the previous CUDA path had never been verified (only timed) and in fact produced wrong output: KA's get_sub_group_local_id() is 1-based, so the ballot lane masks were off by one and no per-digit leader was ever elected. Now verified 34072/34072 CPU tests and 205/205 GPU cases across all 6 element types, block sizes 64-512, and both directions.

Changes:
- Fused min/max: compute the (min, max) sort-key range in a single reduction over the keys instead of separate minimum + maximum passes.
- Skip-pass: skip a byte's histogram+scan+scatter when min and max share the whole suffix from that byte up (correct sufficient condition).
- Ballot-aggregated histogram (CUDA): per-warp leaders add same-digit popcounts to per-warp sub-histograms in shared memory (no localmem atomics; Atomix has no such path on CUDA/POCL). O(warps) vs O(N^2).
- Ballot intra-block scatter rank (CUDA), 0-based lane fix.
- Faster generic scan histogram: precompute each element's digit once instead of recomputing it inside the per-bucket scan (~2.6x that phase).
- Tune block_size: default to 512 for radix (vs 256 for merge) via a per-algorithm default; sort block_size now defaults to nothing.

Non-CUDA backends use the scan histogram + scatter throughout.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
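To illustrate the lane-indexing bug described in this commit, here is a minimal CPU model in Julia of one 32-lane warp building a ballot-aggregated histogram. It is a sketch under assumptions, not the PR's kernels: `ballot`, `warp_histogram`, and the `lane_base` parameter are hypothetical names, and the leader test (lane id equals the lowest set bit of the peer mask) is only one plausible way to elect a per-digit leader.

```julia
# CPU model of a single 32-lane warp (hypothetical names, not the PR's code).
const WARP = 32

# Bit i (0-based) is set when lane i holds `digit`, mimicking a warp ballot.
function ballot(digits, digit)
    bits = (UInt32(1) << i for i in 0:WARP-1 if digits[i + 1] == digit)
    return reduce(|, bits; init = UInt32(0))
end

# `lane_base` models what the lane-id query reports for the first lane:
# 0 for a 0-based id, 1 for a 1-based id.
function warp_histogram(digits; lane_base = 0)
    hist = zeros(Int, 256)
    for i in 0:WARP-1                      # i is the true 0-based lane
        lane = i + lane_base               # the id the kernel would see
        peers = ballot(digits, digits[i + 1])
        # The leader is the lane whose id equals the lowest set bit of its
        # peer mask; it adds the whole group's popcount in one update.
        if trailing_zeros(peers) == lane
            hist[digits[i + 1] + 1] += count_ones(peers)
        end
    end
    return hist
end

digits = UInt8[i % 4 for i in 0:WARP-1]    # four digits, eight lanes each
good = warp_histogram(digits)               # 0-based lane ids
bad = warp_histogram(digits; lane_base = 1) # 1-based lane ids used as bit indices
```

In this model `good` counts all 32 elements with only four histogram updates (one per distinct digit), while `bad` stays empty because a 1-based id can never equal the lowest set bit of the lane's own peer mask. That is consistent with the "no per-digit leader was ever elected" symptom, under the assumptions stated above.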
The per-pass exclusive prefix sum allocated its block-prefix buffer on every call. Preallocate it once (sized to the histogram's scan blocks) and pass it as accumulate!'s temp, removing n_passes allocations per sort and shaving ~10% off the accumulate phase.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The driver synchronized the host after every kernel launch (histogram, accumulate!, scatter) on each pass. The kernels are enqueued on a single in-order backend stream, so they are already correctly sequenced; a single synchronize before return preserves the synchronous semantics. ~1.4-1.6x faster with identical results, verified on an RTX 5080 (4M elements, all supported types correct).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…kends

_accumulate_block!'s inclusive-scan path re-read the block's last element from global memory (v[(iblock+1)*block_size*2] or v[len]) inside a single-thread branch, after the scan's barriers. That divergent global read is undefined behaviour: it traps with an illegal address on ROCm (AMDGPU) and deadlocks on POCL; only CUDA tolerated it. As a result accumulate!, and everything built on it (e.g. the GPU radix sort), crashed on AMD and hung on POCL for any array spanning more than one workgroup.

Load each thread's two elements into registers up front and capture the array's last element v[len] into a one-slot shared buffer, then use those in the inclusive shift instead of re-reading v. The values are identical, so CUDA results are unchanged.

Verified on CUDA (RTX 5080), AMD (RX 9060 XT) and oneAPI (Intel Iris Xe): accumulate! (ScanPrefixes) and radix sort correct across sizes and types.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
6191014 to 59fdb29
Follow-up to #90 (which added the opt-in `RadixSort()`). This makes the GPU radix sort substantially faster while keeping it correct on every backend.

On an RTX 5080, 4M elements, default settings, `RadixSort()` now beats the existing GPU `MergeSort()` for every supported type and is ~2.3–7.7× faster than the radix sort merged in #90.
Benchmarks (RTX 5080, 4M elements)
Structured / small-range data is much faster via pass skipping (e.g. `UInt32` values in `[0, 255]` sort in a single pass: ~0.53 ms, 7.9 GE/s).

What changed
- Fused min/max: compute the (min, max) sort-key range in a single `mapreduce` over the sort keys instead of separate `minimum` + `maximum` passes. This alone was the largest win (up to ~5× on the range step for 64-bit types, which dominated the previous timing).
- Skip-pass: skip a byte's histogram + scan + scatter when the min and max keys share the whole suffix from that byte up (that byte and every more significant byte are identical); free for structured or small-range data.
- Ballot-aggregated histogram (CUDA): per-warp leaders add same-digit popcounts to per-warp sub-histograms via `sub_group_ballot`, i.e. O(warps) per block instead of the O(N²) broadcast scan. No shared-memory atomics (Atomix has no `@localmem` atomic path on CUDA/POCL).
- Faster generic scan histogram: precompute each element's digit once instead of recomputing it inside the per-bucket scan (~2.6× on that phase); used on all non-CUDA backends.
- Tuned `block_size`: the `sort` `block_size` now defaults to `nothing`, and each algorithm resolves its own tuned default (merge 256, radix 512); an explicit `block_size` is still honored.
- Reuse the `accumulate!` scratch across passes to avoid per-pass allocation.

Non-CUDA backends use the scan-based histogram and scatter throughout (the ballot kernels require a real 32-lane sub-group).
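
The fused range and skip-pass items above can be sketched on the CPU as follows. This is an illustration under assumptions rather than the PR's code: `fused_minmax` and `skip_pass` are hypothetical names, Base's `mapreduce` stands in for the GPU reduction, and the keys are assumed to be already mapped to unsigned integers (as a radix sort typically does for signed and floating-point types).

```julia
# One pass over the keys yields both extremes (requires a non-empty input).
fused_minmax(keys) = mapreduce(
    k -> (k, k),                                     # each key is its own (min, max)
    (a, b) -> (min(a[1], b[1]), max(a[2], b[2])),    # combine two partial ranges
    keys,
)

# Pass `pass` (0-based, least significant byte first) can be skipped when
# lo and hi agree on that byte and every more significant byte. Every key
# lies between lo and hi, so all keys then share those bytes as well.
# This is a sufficient condition, not a necessary one.
function skip_pass(lo::T, hi::T, pass::Integer) where {T<:Unsigned}
    shift = 8 * pass
    return (lo >> shift) == (hi >> shift)
end

keys = UInt32[7, 200, 0, 255, 42]
lo, hi = fused_minmax(keys)

# Passes that still have to run for these keys.
active = [p for p in 0:sizeof(UInt32)-1 if !skip_pass(lo, hi, p)]
```

For keys confined to `[0, 255]` only pass 0 remains in `active`, which matches the single-pass behaviour quoted for small-range `UInt32` data. On the CPU, Base's `extrema` returns the same pair; the explicit `mapreduce` form is shown because that is the operation the list names.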
Correctness
Verified across all 6 element types, block sizes 64–512, ascending and descending, plus edge cases (all-equal, narrow ranges, tiny sizes).