
Fix DecoupledLookback exclusive carries#91

Draft
maleadt wants to merge 3 commits into main from tb/decoupled_loopback

maleadt commented Jun 30, 2026

Member

accumulate!(...; alg=DecoupledLookback()) produces incorrect results for exclusive multi-block scans on non-uniform data. The default ScanPrefixes() path was fixed in #90 (and is what issue #84 reported); this PR tracks the separate, opt-in DecoupledLookback() exclusive bug, which #90 deliberately deferred by excluding it from the non-uniform exclusive test.

What's here

  • test: re-enables DecoupledLookback() coverage in the non-uniform exclusive accumulate test (it's excluded on main).
  • fix (cherry-pick of the reverted a279bf5): corrects the exclusive carries by publishing each block's cumulative aggregate through a dedicated aggregates[] array instead of inferring it from v[block_last] (which only holds the inclusive aggregate for inclusive scans).
  • docs: an inline KNOWN BUG / TODO documenting why this is not mergeable yet.
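
To illustrate why the carry cannot be taken from `v[block_last]` in the exclusive case, here is a minimal serial model of a blocked exclusive prefix sum. It is a sketch only, not the package's kernel: the function name `blocked_exclusive_sum` and the `carry_from_output` switch are hypothetical, and the model is restricted to `+` with a zero identity.

```julia
# Serial model of a blocked exclusive prefix sum (illustrative names, not the package API).
# `carry_from_output=true` mimics inferring a block's total from its last output element,
# which for an exclusive scan misses the block's last input value.
function blocked_exclusive_sum(x::Vector{T}; block_size::Int,
                               carry_from_output::Bool=false) where {T}
    n = length(x)
    v = similar(x)
    nblocks = cld(n, block_size)
    aggregates = Vector{T}(undef, nblocks)  # per-block totals, stored separately

    # Phase 1: exclusive scan within each block; record the block total.
    for b in 1:nblocks
        lo = (b - 1) * block_size + 1
        hi = min(b * block_size, n)
        acc = zero(T)
        for i in lo:hi
            v[i] = acc        # exclusive: sum of the block's elements before i
            acc += x[i]
        end
        aggregates[b] = acc   # total including x[hi]
    end

    # Phase 2: add the running carry of all preceding blocks.
    carry = zero(T)
    for b in 1:nblocks
        lo = (b - 1) * block_size + 1
        hi = min(b * block_size, n)
        # Read the block-local total before v[hi] is modified below.
        block_total = carry_from_output ? v[hi] : aggregates[b]
        for i in lo:hi
            v[i] += carry
        end
        carry += block_total
    end
    return v
end

x = [1, 2, 3, 4, 5, 6]
good = blocked_exclusive_sum(x; block_size=3)
bad  = blocked_exclusive_sum(x; block_size=3, carry_from_output=true)
```

In this model `good` is `[0, 1, 3, 6, 10, 15]`, while `bad` is `[0, 1, 3, 3, 7, 12]`: the carry into the second block is 3 instead of 6 because the last output of an exclusive block scan excludes that block's final element.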

What's missing

The aggregate fix is arithmetically correct but the publish/consume protocol is not memory-coherent across blocks, so it fails ~40% of the time, nondeterministically, on smaller GPUs (reproduced on an A100 MIG 1g.5gb, sm_80, ~14 SMs) — for both inclusive and exclusive scans. Two distinct problems:

  1. Coherence. aggregates[i] is read with an ordinary (L1-cacheable) load while the companion flags[i] is read atomically. A consumer block can observe a fresh flags[i] == ACC_FLAG_A yet read a stale L1 copy of aggregates[i]. Both the load and the store of the aggregate need to bypass L1 (relaxed/monotonic atomics), like the flag.
  2. Ordering. The producer's aggregate store must be globally visible before its flag store, and the consumer's aggregate load must happen after its flag load — a device-scope release/acquire ordering.
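
The two requirements above can be sketched with a host-side analog using Julia's CPU atomics (Julia 1.7 or later). This is only a model of the intended publish/consume ordering, not GPU code: the `Slot` type, `publish!`, `consume`, and the flag constants are hypothetical names, and the `:release`/`:acquire` orderings used here are exactly the ones the description says do not select on NVPTX with recent toolchains.

```julia
# Host-side model of the publish/consume protocol (hypothetical names, CPU atomics only).
mutable struct Slot
    @atomic flag::Int
    @atomic aggregate::Int
end

const FLAG_EMPTY = 0
const FLAG_READY = 1

# Producer: write the aggregate first, then release-store the flag so the
# aggregate is visible to anyone who observes FLAG_READY.
function publish!(slot::Slot, value::Int)
    @atomic :monotonic slot.aggregate = value
    @atomic :release slot.flag = FLAG_READY
    return nothing
end

# Consumer: acquire-load the flag until it is ready, and only then read the
# aggregate with an atomic (non-cached) load.
function consume(slot::Slot)
    while (@atomic :acquire slot.flag) != FLAG_READY
        yield()  # let the producer run, also when only one thread is available
    end
    return @atomic :monotonic slot.aggregate
end

slot = Slot(FLAG_EMPTY, 0)
consumer = Threads.@spawn consume(slot)
publish!(slot, 42)
result = fetch(consumer)
```

The design point is the pairing: the release store on the flag orders the earlier aggregate store before it, and the acquire load on the flag orders the later aggregate load after it. On the GPU the same pairing would have to be built from monotonic atomics plus a device-scope fence.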

The blocker: no portable device threadfence

The natural fix for (2) is a device-scope fence, but:

  • UnsafeAtomics.fence (system scope) and acquire/release atomic load/store fail to select on NVPTX with recent toolchains (LLVM 18, sm_80) — only monotonic selects. So the current fence-based code (and main's DL path) doesn't even compile on Julia 1.12; CI only survives because it runs Julia 1.10/1.11 with an older toolchain.
  • The only primitive that lowers on both old and new toolchains is a native device threadfence (membar.gl / CUDA.threadfence() / AMDGPU equivalent), which KernelAbstractions does not expose as a backend-agnostic primitive.

A verified working approach (0 failures over 80k randomized iterations on the A100 MIG, across Int32/Int64/Float32/Float64): make flags and aggregates monotonic atomics, and order them with an overridable _decoupled_fence() — a generic no-op fallback overridden per backend via @device_override in CUDA/AMDGPU package extensions.

maleadt and others added 3 commits June 30, 2026 12:32
…kback

Re-enable DecoupledLookback() coverage in the non-uniform exclusive accumulate
test (excluded on the main branch). This branch carries the work-in-progress
fix for DecoupledLookback's exclusive carries, so it must exercise that path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…need

The decoupled-lookback aggregate publish/consume protocol is not memory-
coherent across blocks (stale L1 aggregate reads + missing device-scope
ordering), causing ~40% failures on smaller GPUs. The proper fix needs a
native device threadfence; `UnsafeAtomics.fence` and acquire/release atomics
do not lower on recent NVPTX toolchains (LLVM 18, sm_80) — only `monotonic`
does. Document this inline as the path forward.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>