Fix DecoupledLookback exclusive carries by maleadt · Pull Request #91 · JuliaGPU/AcceleratedKernels.jl

maleadt · 2026-06-30T16:33:28Z

`accumulate!(...; alg=DecoupledLookback())` produces incorrect results for exclusive multi-block scans on non-uniform data. The default `ScanPrefixes()` path was fixed in #90 (and is what issue #84 reported); this PR tracks the separate, opt-in `DecoupledLookback()` exclusive bug, which #90 deliberately deferred by excluding it from the non-uniform exclusive test.

What's here

- test: re-enables `DecoupledLookback()` coverage in the non-uniform exclusive accumulate test (it's excluded on main).
- fix (cherry-pick of the reverted a279bf5): corrects the exclusive carries by publishing each block's cumulative aggregate through a dedicated `aggregates[]` array instead of inferring it from `v[block_last]` (which only holds the inclusive aggregate for inclusive scans).
- docs: an inline KNOWN BUG / TODO documenting why this is not mergeable yet.
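
To illustrate why the carry cannot be read back from `v[block_last]` in an exclusive scan, here is a minimal CPU-only sketch of a blocked exclusive scan. It is not the kernel from this PR: `blocked_exclusive_scan` and `block_size` are hypothetical names, and only the role of the `aggregates` array follows the description above.

```julia
# CPU model of a blocked exclusive scan (sequential, no GPU involved).
# `blocked_exclusive_scan` and `block_size` are hypothetical names for this sketch.
function blocked_exclusive_scan(op, v::Vector{T}, init::T; block_size::Int=4) where {T}
    n = length(v)
    out = similar(v)
    nblocks = cld(n, block_size)
    aggregates = Vector{T}(undef, nblocks)  # inclusive total of blocks 1..b

    carry = init                            # exclusive prefix entering block b
    for b in 1:nblocks
        lo = (b - 1) * block_size + 1
        hi = min(b * block_size, n)
        acc = carry
        for i in lo:hi
            out[i] = acc                    # exclusive: everything before v[i]
            acc = op(acc, v[i])
        end
        aggregates[b] = acc                 # includes v[hi]; out[hi] does not
        carry = aggregates[b]
    end
    return out, aggregates
end

v = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]
out, aggregates = blocked_exclusive_scan(+, v, 0)

@assert out == [0; cumsum(v)[1:end-1]]
# The last output of block 1 is 8, but the carry into block 2 must be 9.
@assert out[4] == 8 && aggregates[1] == 9
```

In the exclusive case the last output of a block is missing that block's final input element, so a successor that reads it as the carry is off by `v[hi]`. Storing the inclusive total separately avoids that.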

What's missing

The aggregate fix is arithmetically correct, but the publish/consume protocol is not memory-coherent across blocks. As a result it fails nondeterministically, roughly 40% of the time, on smaller GPUs (reproduced on an A100 MIG 1g.5gb, sm_80, ~14 SMs), and it does so for both inclusive and exclusive scans. There are two distinct problems:

1. Coherence. `aggregates[i]` is read with an ordinary (L1-cacheable) load while the companion `flags[i]` is read atomically. A consumer block can observe a fresh `flags[i] == ACC_FLAG_A` yet read a stale L1 copy of `aggregates[i]`. The aggregate needs an L1-bypassing (relaxed/monotonic atomic) load + store, like the flag.
2. Ordering. The producer's aggregate store must be globally visible before its flag store, and the consumer's aggregate load must happen after its flag load. This is a device-scope release/acquire ordering.
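
The required handshake can be modelled on the CPU with Julia's per-field atomics. This is only a sketch of the ordering contract, not the GPU fix: it uses acquire/release atomics, which work on the CPU but, as described below, do not lower on recent NVPTX toolchains. `Slot`, `publish!`, `consume`, `FLAG_EMPTY` and `FLAG_READY` are hypothetical names; `FLAG_READY` stands in for `ACC_FLAG_A`.

```julia
# CPU model of the publish/consume handshake for one block's aggregate.
# All names here are hypothetical and exist only for this sketch.
const FLAG_EMPTY = UInt8(0)
const FLAG_READY = UInt8(1)   # stands in for ACC_FLAG_A

mutable struct Slot
    @atomic aggregate::Int    # payload: the block's cumulative aggregate
    @atomic flag::UInt8       # status flag polled by the successor block
end

function publish!(slot::Slot, value::Int)
    @atomic :monotonic slot.aggregate = value  # atomic store, no ordering by itself
    @atomic :release slot.flag = FLAG_READY    # release: the payload store comes first
    return nothing
end

function consume(slot::Slot)
    # The acquire load pairs with the release store in `publish!`.
    while (@atomic :acquire slot.flag) != FLAG_READY
        yield()
    end
    return @atomic :monotonic slot.aggregate   # cannot be observed before the flag
end

slot = Slot(0, FLAG_EMPTY)
consumer = Threads.@spawn consume(slot)
publish!(slot, 42)
@assert fetch(consumer) == 42
```

Making both fields atomic addresses the coherence problem, and the release/acquire pair on the flag addresses the ordering problem. On the GPU the same contract has to be expressed with monotonic atomics plus a device-scope fence.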

The blocker: no portable device threadfence

The natural fix for (2) is a device-scope fence, but:

- `UnsafeAtomics.fence` (system scope) and acquire/release atomic load/store fail to select on NVPTX with recent toolchains (LLVM 18, sm_80); only `monotonic` selects. So the current fence-based code (and main's DecoupledLookback path) doesn't even compile on Julia 1.12; CI only survives because it runs Julia 1.10/1.11 with an older toolchain.
- The only primitive that lowers on both old and new toolchains is a native device threadfence (`membar.gl` / `CUDA.threadfence()` / AMDGPU equivalent), which KernelAbstractions does not expose as a backend-agnostic primitive.

A verified working approach (0 failures over 80k randomized iterations on the A100 MIG, across Int32/Int64/Float32/Float64): make flags and aggregates monotonic atomics, and order them with an overridable _decoupled_fence() — a generic no-op fallback overridden per backend via @device_override in CUDA/AMDGPU package extensions.

…kback Re-enable DecoupledLookback() coverage in the non-uniform exclusive accumulate test (excluded on the main branch). This branch carries the work-in-progress fix for DecoupledLookback's exclusive carries, so it must exercise that path. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…need The decoupled-lookback aggregate publish/consume protocol is not memory-coherent across blocks (stale L1 aggregate reads + missing device-scope ordering), causing ~40% failures on smaller GPUs. The proper fix needs a native device threadfence; `UnsafeAtomics.fence` and acquire/release atomics do not lower on recent NVPTX toolchains (LLVM 18, sm_80); only `monotonic` does. Document this inline as the path forward. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

maleadt and others added 3 commits June 30, 2026 12:32

fix(scan): correct DecoupledLookback exclusive carries

d4a6160

maleadt wants to merge 3 commits into main from tb/decoupled_loopback
