Design: cuda.core Texture/Surface API surface  #2188

@rparolin

Purpose

Design discussion for the texture / surface API in cuda.core — to settle the API shape and
naming before code review of the implementation. Reviewers asked for design sign-off in an issue
before we commit to a ~9k-line feature.

cc @leofang @mdboom @Andy-Jost @kkraus14 — you asked for a design pass; this is the home for it.

Proposed public surface (from #2095)

  • Array + ArrayFormat — opaque, hardware-laid-out GPU allocations backing textures/surfaces.
  • MipmappedArray — wraps CUmipmappedArray; get_level returns a non-owning Array kept alive
    by a strong ref to the parent.
  • TextureObject + TextureDescriptor — bindless texture handle + sampling state.
  • SurfaceObject — bindless surface handle; requires Array(surface_load_store=True).
  • ResourceDescriptor — factories from_array, from_mipmapped_array, from_linear, from_pitch2d.

Decisions to make

  1. Name of Array. ✅ Decided — rename Array to CUDAArray.

    This type is an opaque cudaArray_t — the GPU stores it in a scrambled, hardware-defined layout
    with no linear pointer, so it cannot expose __cuda_array_interface__ / DLPack and cannot
    share memory zero-copy with cupy / numba-cuda / torch. The name Array implies an n-dimensional
    array that participates in that ecosystem — it can't. CuPy names the identical type CUDAarray,
    and its whole cupy.cuda.texture module already matches this PR's surface 1:1.

    Resolution: use CUDAArray — the PEP 8 CapWords form (deliberately differing from CuPy's exact
    CUDAarray casing to follow Python's class-naming standard). The name signals "CUDA texture/surface
    backing store," not "n-dimensional array."

    Open detail resolved: keep ArrayFormat (do not rename to CUDAArrayFormat). The sibling
    enums in these modules — AddressMode, FilterMode, ReadMode — are all unprefixed, so
    ArrayFormat matches the established enum-naming pattern; and the "Array implies an
    ndarray/DLPack participant" concern that motivated CUDAArray does not apply to a format enum
    (nobody mistakes ArrayFormat for an n-dimensional array).

  2. Interop path. ✅ Decided — ship only copy_from / copy_to; no allocation helper.

    Zero-copy is impossible (opaque layout, no linear pointer), so copying is the only option —
    this was purely about how polished the path is. The copy path to/from linear cuda.core
    Buffers already exists: copy_from / copy_to accept a device Buffer or a host
    buffer-protocol object, in both directions. The only thing an extra helper would add is
    allocating the linear Buffer for the caller: folding mr.allocate(arr.size_bytes, stream=s) +
    arr.copy_to(buf, stream=s) into a one-liner, i.e. ~2 lines of convenience.

    Resolution: ship copy_from / copy_to only, and document the copy-only contract. We will
    not add an allocating convenience helper now. It is purely additive and non-breaking, so we can
    add one later if users request it.
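To make the size of the deferred convenience concrete, here is a minimal sketch built on pure-Python stand-ins (no GPU, and not the cuda.core classes). Only the names mr.allocate, size_bytes, and copy_to come from the discussion above; the StandIn* classes and copy_to_new_buffer are hypothetical and exist only for illustration.

```python
class StandInBuffer:
    """Hypothetical stand-in for a linear Buffer (backed by host bytes here)."""

    def __init__(self, size):
        self.data = bytearray(size)


class StandInMemoryResource:
    """Hypothetical stand-in for a memory resource with an allocate() method."""

    def allocate(self, size, stream=None):
        return StandInBuffer(size)


class StandInCUDAArray:
    """Hypothetical stand-in for the opaque array; the real layout is not linear."""

    def __init__(self, payload):
        self._payload = bytes(payload)

    @property
    def size_bytes(self):
        return len(self._payload)

    def copy_to(self, buf, stream=None):
        # The real call would de-swizzle into linear memory; here it is a plain copy.
        buf.data[:] = self._payload


def copy_to_new_buffer(arr, mr, stream=None):
    """The helper that is NOT being shipped: allocate, then copy."""
    buf = mr.allocate(arr.size_bytes, stream=stream)
    arr.copy_to(buf, stream=stream)
    return buf


arr = StandInCUDAArray(b"\x01\x02\x03\x04")
buf = copy_to_new_buffer(arr, StandInMemoryResource())
```

The helper body is exactly the two calls a caller would otherwise write, which is why deferring it costs users very little.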

  3. Factory set. ✅ Not a real decision — driver-mandated, all four required.

    A texture can be backed by four kinds of memory — the PR exposes one factory per kind:

    • from_array — texture over a CUDAArray (the headline feature)
    • from_mipmapped_array — texture over a MipmappedArray (the headline feature)
    • from_linear — texture over a plain 1D device buffer (ordinary linear memory, no CUDAArray)
    • from_pitch2d — texture over a plain 2D pitched buffer (ordinary linear memory, no CUDAArray)

    ResourceDescriptor binds CUDA_RESOURCE_DESC, a driver union whose resType is exactly one of
    ARRAY / MIPMAPPED_ARRAY / LINEAR / PITCH2D — one factory per union arm. A faithful binding
    of that type must cover all four; shipping only two would be an incomplete binding of a mandatory
    driver struct, not a smaller-but-valid surface. So there was no real optionality here — the CTK
    driver API dictates the set. (Listed only because it sat next to the genuine decisions.)

    Resolution: ship all four factories — required by the driver API, not a tradeoff.
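The "one factory per union arm" shape can be sketched in plain Python. This is an illustrative stand-in, not the PR's implementation: the enum values mirror the driver's CUresourcetype, while the class name ResourceDescriptorSketch and the keyword parameters on from_linear / from_pitch2d are assumptions modeled on the fields of the corresponding CUDA_RESOURCE_DESC arms.

```python
from enum import IntEnum


class ResourceType(IntEnum):
    """Mirrors the driver's CUresourcetype values."""

    ARRAY = 0x00
    MIPMAPPED_ARRAY = 0x01
    LINEAR = 0x02
    PITCH2D = 0x03


class ResourceDescriptorSketch:
    """Hypothetical tagged union: a kind tag, the backing resource, per-kind fields."""

    __slots__ = ("kind", "resource", "fields")

    def __init__(self, kind, resource, **fields):
        self.kind = kind
        self.resource = resource
        self.fields = fields

    @classmethod
    def from_array(cls, array):
        return cls(ResourceType.ARRAY, array)

    @classmethod
    def from_mipmapped_array(cls, mipmapped_array):
        return cls(ResourceType.MIPMAPPED_ARRAY, mipmapped_array)

    @classmethod
    def from_linear(cls, buffer, *, format, num_channels, size_bytes):
        # Parameter names are assumptions, not the PR's signature.
        return cls(ResourceType.LINEAR, buffer, format=format,
                   num_channels=num_channels, size_bytes=size_bytes)

    @classmethod
    def from_pitch2d(cls, buffer, *, format, num_channels, width, height, pitch_bytes):
        # Parameter names are assumptions, not the PR's signature.
        return cls(ResourceType.PITCH2D, buffer, format=format,
                   num_channels=num_channels, width=width, height=height,
                   pitch_bytes=pitch_bytes)


descs = [
    ResourceDescriptorSketch.from_array(object()),
    ResourceDescriptorSketch.from_mipmapped_array(object()),
    ResourceDescriptorSketch.from_linear(object(), format="FLOAT32",
                                         num_channels=1, size_bytes=4096),
    ResourceDescriptorSketch.from_pitch2d(object(), format="FLOAT32", num_channels=1,
                                          width=64, height=16, pitch_bytes=256),
]
covered = {d.kind for d in descs}
```

Dropping any factory would leave one resType value unreachable from Python, which is the sense in which the set is dictated rather than chosen.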

  4. Channel format. ✅ Decided — keep the folded format + num_channels parameters.

    Each array element has a component type (e.g. 8-bit uint, 32-bit float) and a channel count
    (1 = grayscale … 4 = RGBA). Two ways to surface that:

    • Folded (this PR): CUDAArray.from_descriptor(shape=..., format=ArrayFormat.FLOAT32, num_channels=4)
    • Separate (CuPy): one ChannelFormatDescriptor(...) object passed as a unit

    The driver descriptor cuda.core actually fills in (CUDA_ARRAY3D_DESCRIPTOR) already stores
    these as two separate fields — Format (a CUarray_format, mirrored 1:1 by ArrayFormat) and
    NumChannels. So the folded form maps straight onto the driver struct with no translation, and
    read-back is already exposed as two properties (.format, .num_channels). The bundled
    ChannelFormatDescriptor is the runtime API's (cudaChannelFormatDesc) modeling — the form
    CuPy wraps because its texture module sits on the runtime API. Adopting it in a driver-based
    library would mean a translation wrapper the underlying API doesn't use (and the shapes don't even
    map cleanly: the driver uses one uniform component format × channel count, while
    cudaChannelFormatDesc allows per-channel bit widths).

    Resolution: keep folded format + num_channels. It's the driver-faithful surface
    (consistent with decision #1 favoring correctness over CuPy parity and decision #3 following the driver API); the
    bundled form's only wins are CuPy look-alike and a single read-back object, neither worth a
    runtime-style wrapper here.
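A small sketch of why the folded form needs no translation while the bundled form does. The enum values mirror the driver's CUarray_format (CU_AD_FORMAT_*); the member names other than FLOAT32, and the helpers element_size_bytes and fold_channel_desc, are hypothetical and not part of the PR.

```python
from enum import IntEnum


class ArrayFormatSketch(IntEnum):
    """Subset of the driver's CUarray_format values; member names are assumptions."""

    UINT8 = 0x01
    UINT16 = 0x02
    UINT32 = 0x03
    INT8 = 0x08
    INT16 = 0x09
    INT32 = 0x0A
    FLOAT16 = 0x10
    FLOAT32 = 0x20


_COMPONENT_BYTES = {
    ArrayFormatSketch.UINT8: 1, ArrayFormatSketch.INT8: 1,
    ArrayFormatSketch.UINT16: 2, ArrayFormatSketch.INT16: 2,
    ArrayFormatSketch.FLOAT16: 2,
    ArrayFormatSketch.UINT32: 4, ArrayFormatSketch.INT32: 4,
    ArrayFormatSketch.FLOAT32: 4,
}


def element_size_bytes(format, num_channels):
    """Folded form: one uniform component format times a channel count."""
    if num_channels not in (1, 2, 4):  # the driver accepts 1, 2, or 4 channels
        raise ValueError("num_channels must be 1, 2, or 4")
    return _COMPONENT_BYTES[format] * num_channels


def fold_channel_desc(x, y, z, w, component_format):
    """Translate runtime-style per-channel bit widths into (format, num_channels)."""
    widths = [bits for bits in (x, y, z, w) if bits > 0]
    if len(set(widths)) != 1:
        raise ValueError("per-channel bit widths differ; no folded equivalent")
    if widths[0] != 8 * _COMPONENT_BYTES[component_format]:
        raise ValueError("bit width does not match the component format")
    element_size_bytes(component_format, len(widths))  # validates the channel count
    return component_format, len(widths)


rgba_float_size = element_size_bytes(ArrayFormatSketch.FLOAT32, 4)  # 16 bytes
folded = fold_channel_desc(32, 32, 32, 32, ArrayFormatSketch.FLOAT32)
```

The translation function has failure modes the folded form simply cannot express, which is the extra surface a bundled descriptor would bring into a driver-based library.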

  5. Descriptor type consistency. ✅ Not a real decision — divergence is intentional and harmless.

    Note: the original framing here was factually wrong. ResourceDescriptor is not a cdef class and holds no native C struct — it is a plain Python class with __slots__, storing a
    reference to the backing resource plus a few Python fields. The CUDA_RESOURCE_DESC struct is
    assembled later, in TextureObject.from_descriptor. So this was never a @dataclass-vs-cdef class / performance question. Both descriptors are pure Python.

    The genuine difference is only how you construct each, and it reflects what each type is:

    • TextureDescriptor — a flat bag of independent sampling settings, built directly with keyword
      args (@dataclass fits perfectly).
    • ResourceDescriptor — a "pick exactly one of four backings" union (array / mipmap / linear /
      pitch2d), built via from_* factories because each kind carries different fields. A single
      __init__ would be a pile of mutually-exclusive optional args plus a kind tag.

    Consistency is not a goal in itself — it only matters when inconsistency makes the API harder to
    learn or use, and here it doesn't: a user learns each type once and never has to reconcile them.
    The only behavioral gap is equality (TextureDescriptor compares by value; ResourceDescriptor
    by identity), which is essentially never exercised on these objects and is arguably correct since
    ResourceDescriptor wraps a live device resource. Forcing both to the same kind of type would
    be uniformity for its own sake and would make ResourceDescriptor's constructor worse.

    Resolution: keep the split — the divergence is intentional and does not hurt usability. Like
    decision #3, this resolves to a non-issue once examined (and on a mistaken premise to begin with).
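The equality gap described above can be reproduced with two small stand-ins. Both classes and all field names here are hypothetical; the sketch only assumes what the text states, namely a @dataclass for sampling settings and a plain __slots__ class holding a reference to the backing resource.

```python
from dataclasses import dataclass


@dataclass
class TextureDescriptorSketch:
    """Flat bag of independent settings; field names are assumptions."""

    filter_mode: str = "point"
    address_mode: str = "clamp"
    normalized_coords: bool = False


class ResourceDescriptorSketch:
    """Plain class with __slots__; no __eq__, so comparison falls back to identity."""

    __slots__ = ("kind", "resource")

    def __init__(self, kind, resource):
        self.kind = kind
        self.resource = resource

    @classmethod
    def from_array(cls, array):
        return cls("array", array)


# Two descriptors with the same settings compare equal by value.
t1 = TextureDescriptorSketch(filter_mode="linear")
t2 = TextureDescriptorSketch(filter_mode="linear")

# Two descriptors over the same backing object are still distinct objects.
backing = object()
r1 = ResourceDescriptorSketch.from_array(backing)
r2 = ResourceDescriptorSketch.from_array(backing)

same_settings = t1 == t2
same_descriptor = r1 == r2
same_backing = r1.resource is r2.resource
```

A caller who does need to know whether two resource descriptors refer to the same backing can compare the resource references directly, as the last line does.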

  6. Bool naming. ✅ Decided — adopt the is_<something> convention.

    surface_load_store is a boolean on Array: it records whether the array was created with the
    surface load/store capability (CUDA's CUDA_ARRAY3D_SURFACE_LDST), which a SurfaceObject
    requires. Exposed both as a constructor keyword (surface_load_store=True) and a read-only
    property (arr.surface_load_store).

    The repo convention for boolean properties is is_<something>, so a property named
    surface_load_store doesn't read as a boolean the way arr.is_managed does. Resolution: rename
    the property to follow the is_<x> convention (e.g. is_surface_load_store) for consistency with
    the cuda-python codebase.

    Open detail resolved: the property name is is_surface_load_store (already implemented), and
    the constructor keyword is renamed to match — from_descriptor(..., is_surface_load_store=False)
    so one symmetric name serves both set and read-back. This follows the existing
    StridedMemoryView(is_readonly=...) precedent in cuda.core, where an is_<x> boolean is used as
    both the constructor argument and the attribute. (The keyword rename is a small implementation
    follow-up in the PR; the property is already done.)
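A minimal sketch of the symmetric naming, using a stand-in class rather than the real CUDAArray. The names from_descriptor and is_surface_load_store come from the text above; CUDAArraySketch, make_surface_sketch, and the error message are hypothetical.

```python
class CUDAArraySketch:
    """Hypothetical stand-in showing one is_<x> name for both set and read-back."""

    def __init__(self, shape, is_surface_load_store):
        self._shape = tuple(shape)
        self._is_surface_load_store = bool(is_surface_load_store)

    @classmethod
    def from_descriptor(cls, *, shape, is_surface_load_store=False):
        return cls(shape, is_surface_load_store)

    @property
    def is_surface_load_store(self):
        # Read-only: there is no setter, the flag is fixed at creation.
        return self._is_surface_load_store


def make_surface_sketch(arr):
    """Stand-in for the surface constructor's precondition check."""
    if not arr.is_surface_load_store:
        raise ValueError("array was not created with is_surface_load_store=True")
    return ("surface", arr)


plain = CUDAArraySketch.from_descriptor(shape=(64, 64))
writable = CUDAArraySketch.from_descriptor(shape=(64, 64), is_surface_load_store=True)
surface = make_surface_sketch(writable)
```

Passing plain to make_surface_sketch raises ValueError, mirroring the requirement that a SurfaceObject needs an array created with the load/store capability.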

  7. Scope. ✅ Decided — split the examples into a follow-up PR.

    The nine gl_interop_*.py examples (~5k lines, not CI-wired, need a GL context CI lacks) are
    orthogonal to the core API. Resolution: drop them from this PR and land them in a separate
    follow-up PR once this core texture/surface PR merges, since the examples depend on the new API
    it introduces.

Labels: RFC, cuda.core, feature