fix(cuda): push element count, not byte length, for slice kernel args by haixuanTao · Pull Request #6 · dimforge/khal

haixuanTao · 2026-06-08T12:37:11Z

Problem

The CUDA backend's write_arg pushes a buffer's byte length as the second argument for slice bindings, but a kernel &[T] parameter is passed as a (ptr, len) pair whose len is an element count. A kernel calling slice.len() therefore sees byte_len (e.g. 4× too large for &[u32]) and reads out of bounds.

This is latent for kernels that bound their loops by other means (shape uniforms, explicit counts), but corrupts any kernel that uses slice.len() directly — e.g. an indirect-dispatch setup deriving a workgroup grid from num_keys.len() produces a garbage grid and an illegal memory access.
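To make the size of the mismatch concrete, here is a minimal host-side sketch in plain Rust. It is not khal or cuda-oxide code: the function names `len_before_fix`, `len_after_fix`, and `overrun_elements` are hypothetical, and an ordinary slice stands in for a GPU buffer.

```rust
use std::mem::{size_of, size_of_val};

/// Length the kernel saw before the fix: the buffer's size in bytes.
/// (Hypothetical helper; a host slice stands in for the GPU buffer.)
fn len_before_fix<T>(buffer: &[T]) -> usize {
    size_of_val(buffer)
}

/// Length a `&[T]` kernel parameter expects: the number of `T` elements.
/// Assumes `T` is not zero-sized, which holds for typical buffer element types.
fn len_after_fix<T>(buffer: &[T]) -> usize {
    size_of_val(buffer) / size_of::<T>()
}

/// Number of elements past the end that a `0..slice.len()` loop would touch
/// if it trusted the byte length.
fn overrun_elements<T>(buffer: &[T]) -> usize {
    len_before_fix(buffer) - len_after_fix(buffer)
}
```

For a buffer of eight `u32` keys, `len_before_fix` reports 32 where the kernel expects 8, so a loop bounded by `slice.len()` would index 24 elements beyond the allocation.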

Fix

In the three CUDA write_arg arms (GpuBuffer / GpuBufferSlice / GpuBufferSliceMut, all generic over T), push byte_len / size_of::<T>() (element count).
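The shape of the change can be sketched as follows. This is an illustration under assumptions, not the actual khal diff: `KernelArgs` and `push_slice` are hypothetical stand-ins for the backend's argument list and its `write_arg` arms, and the device pointer is modelled as a plain `u64`.

```rust
use std::mem::size_of;

/// Hypothetical stand-in for a kernel argument list; not khal's real type.
#[derive(Default)]
struct KernelArgs {
    words: Vec<u64>,
}

impl KernelArgs {
    /// Pushes a `&[T]` parameter as a `(ptr, len)` pair, where `len` is an
    /// element count derived from the buffer's byte length.
    fn push_slice<T>(&mut self, device_ptr: u64, byte_len: u64) {
        let elem_size = size_of::<T>() as u64;
        // Guard against zero-sized `T` (an assumption of this sketch; the PR
        // text only shows `byte_len / size_of::<T>()`).
        let len = if elem_size == 0 { 0 } else { byte_len / elem_size };
        self.words.push(device_ptr);
        self.words.push(len);
    }
}
```

With this sketch, a 64-byte buffer bound as `&[u32]` is pushed with `len == 16` rather than 64.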

Sized arrays (&[T; N]), scalars (&T), and uniforms reconstruct via the pointer and ignore this value, so they are unaffected. Only true &[T] slices change, and they now receive the correct length.

Verification

Tested on an RTX 5090 with cuda-oxide-compiled kernels: batched physics + tensor workloads that previously hit CUDA_ERROR_ILLEGAL_ADDRESS via slice.len() now run correctly and bit-exact vs the WebGPU backend.
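For readers unfamiliar with the term, a bit-exact comparison between two backends' outputs could look like the sketch below. It is a hypothetical helper (`first_bit_mismatch`), not the test harness used for this PR, and it assumes both results have been read back to the host as `f32` slices.

```rust
/// Returns the index of the first element whose bit pattern differs, or the
/// shorter length if the slices differ in length. `None` means bit-exact.
/// Comparing via `to_bits` distinguishes `0.0` from `-0.0` and treats
/// identical NaN payloads as equal, unlike `==` on floats.
fn first_bit_mismatch(a: &[f32], b: &[f32]) -> Option<usize> {
    if a.len() != b.len() {
        return Some(a.len().min(b.len()));
    }
    a.iter()
        .zip(b)
        .position(|(x, y)| x.to_bits() != y.to_bits())
}
```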

🤖 Generated with Claude Code

cuda-oxide lowers a `&[T]` kernel parameter to a `(ptr, len)` pair where `len` is an ELEMENT count, but the khal CUDA backend was pushing the buffer's BYTE length. So a kernel calling `slice.len()` got `byte_len` (4x too large for `&[u32]`) and read out of bounds. This was fatal in the batched broad-phase radix sort (`gpu_init_sort_dispatch` derives the workgroup grid from `num_keys_arr.len()`, producing a garbage indirect dispatch and an illegal memory access) and in the LBVH traversal (`*_len.len()`).

Fix: in the three CUDA `write_arg` arms (`GpuBuffer`/`GpuBufferSlice`/`GpuBufferSliceMut`, all generic over `T`), push `byte_len / size_of::<T>()`. Sized arrays (`&[T; N]`), scalars (`&T`) and uniforms reconstruct via the pointer and ignore this value, so they are unaffected; only true `&[T]` slices change, and they now receive the correct length.

Latent in vortx (its kernels bound loops by `Shape` uniforms, never `slice.len()`) and in single-env physics; surfaced once N>1 batched physics ran on native CUDA.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

https://claude.ai/code/session_01PKvqHpCVv3JN7wHCjUG6HP

haixuanTao · 2026-06-16T18:00:07Z

Fixed formatting

haixuanTao and others added 2 commits June 8, 2026 14:36

style: fix rustfmt formatting on slice element-count fix

2552c99

https://claude.ai/code/session_01PKvqHpCVv3JN7wHCjUG6HP

haixuanTao wants to merge 2 commits into dimforge:main from haixuanTao:upstream/cuda-slice-arg-element-count
