fix(cuda): push element count, not byte length, for slice kernel args #6

Open
haixuanTao wants to merge 2 commits into dimforge:main from haixuanTao:upstream/cuda-slice-arg-element-count

@haixuanTao

Contributor

Problem

The CUDA backend's write_arg pushes a buffer's byte length as the second argument for slice bindings, but a kernel &[T] parameter is lowered to a (ptr, len) pair in which len is an element count. A kernel calling slice.len() therefore sees byte_len (for example, 4× too large for &[u32]) and reads out of bounds.

This is latent for kernels that bound their loops by other means (shape uniforms, explicit counts), but it corrupts any kernel that uses slice.len() directly. For example, an indirect-dispatch setup that derives a workgroup grid from num_keys.len() produces a garbage grid and an illegal memory access.
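
To make the inflation concrete, here is a minimal host-side sketch of how a grid derived from a length changes when the byte length is passed where an element count is expected. The workgroup size of 256 and the names `WORKGROUP_SIZE` and `workgroups_for` are assumptions chosen for illustration; they are not taken from the kernels in this PR.

```rust
/// Hypothetical workgroup size; the real kernels may use a different value.
const WORKGROUP_SIZE: usize = 256;

/// Number of workgroups needed to cover `len` items (ceiling division).
fn workgroups_for(len: usize) -> usize {
    len.div_ceil(WORKGROUP_SIZE)
}

fn main() {
    let keys: Vec<u32> = vec![0; 1000];

    // What a `&[u32]` kernel parameter should report as its length.
    let element_count = keys.len();
    // What was pushed before the fix: the size of the buffer in bytes.
    let byte_len = std::mem::size_of_val(keys.as_slice());

    println!("element count = {element_count}, byte length = {byte_len}");
    println!(
        "workgroups: correct = {}, inflated = {}",
        workgroups_for(element_count),
        workgroups_for(byte_len)
    );
}
```

With a 4-byte element type the inflated grid covers four times as many items as the buffer holds, so any thread indexing past the real element count touches memory outside the buffer.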

Fix

In the three CUDA write_arg arms (GpuBuffer / GpuBufferSlice / GpuBufferSliceMut, all generic over T), push byte_len / size_of::<T>() (the element count) instead of byte_len.

Sized arrays (&[T; N]), scalars (&T), and uniforms reconstruct via the pointer and ignore this value, so they are unaffected. Only true &[T] slices change, and they now receive the correct length.
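
The conversion itself can be sketched as a small generic helper. This is an illustration under stated assumptions, not the code in the diff: `element_count` is a hypothetical name, and the handling of zero-sized types and the divisibility check are additions that the PR description does not mention.

```rust
use std::mem::size_of;

/// Converts a buffer's byte length into the number of `T` elements it holds.
/// Hypothetical helper for illustration; not part of the backend's API.
fn element_count<T>(byte_len: usize) -> usize {
    let elem_size = size_of::<T>();
    if elem_size == 0 {
        // Assumption: report zero elements for zero-sized types rather
        // than dividing by zero.
        return 0;
    }
    // A buffer of `T` should always be a whole number of elements long.
    debug_assert!(
        byte_len % elem_size == 0,
        "byte length is not a multiple of the element size"
    );
    byte_len / elem_size
}

fn main() {
    assert_eq!(element_count::<u32>(4000), 1000);
    assert_eq!(element_count::<[f32; 4]>(64), 4);
    assert_eq!(element_count::<u8>(7), 7);
    println!("all conversions match");
}
```

For single-byte element types the byte length and element count coincide, which is one more way the original bug could go unnoticed.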

Verification

Tested on an RTX 5090 with cuda-oxide-compiled kernels: batched physics and tensor workloads that previously hit CUDA_ERROR_ILLEGAL_ADDRESS via slice.len() now run correctly and match the WebGPU backend bit-for-bit.

🤖 Generated with Claude Code

haixuanTao and others added 2 commits June 8, 2026 14:36
cuda-oxide lowers a `&[T]` kernel parameter to a `(ptr, len)` pair where
`len` is an ELEMENT count, but the khal CUDA backend was pushing the
buffer's BYTE length. So a kernel calling `slice.len()` got `byte_len`
(4x too large for `&[u32]`) and read out of bounds — fatal in the batched
broad-phase radix sort (`gpu_init_sort_dispatch` derives the workgroup
grid from `num_keys_arr.len()`, producing a garbage indirect dispatch and
an illegal memory access) and in the LBVH traversal (`*_len.len()`).

Fix: in the three CUDA `write_arg` arms (`GpuBuffer`/`GpuBufferSlice`/
`GpuBufferSliceMut`, all generic over `T`), push `byte_len / size_of::<T>()`.
Sized arrays (`&[T; N]`), scalars (`&T`) and uniforms reconstruct via the
pointer and ignore this value, so they are unaffected; only true `&[T]`
slices change, and they now receive the correct length. Latent in vortx
(its kernels bound loops by `Shape` uniforms, never `slice.len()`) and in
single-env physics; surfaced once N>1 batched physics ran on native CUDA.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@haixuanTao

Contributor Author

Fixed formatting
