Add Advanced Vulkan Compute tutorial #334

Open
gpx1000 wants to merge 6 commits into KhronosGroup:main from gpx1000:Vulkan-Compute-advanced-tutorial

gpx1000 commented Mar 16, 2026

Contributor

sections on memory models, OpenCL interoperation, and SYCL interoperation.

Add comprehensive documentation covering Vulkan Memory Model (availability/visibility/domain operations), shared memory (LDS) with bank conflict details, memory consistency with GroupMemoryBarrierWithGroupSync, OpenCL C to SPIR-V pipeline (clspv), kernel portability guidelines, clvk layering, and tutorial conclusion. Include navigation entries for all new compute architecture sections.

gpx1000 added 2 commits March 16, 2026 15:36
…L, and conclusion

Add comprehensive documentation covering Vulkan Memory Model (availability/visibility/domain operations), shared memory (LDS) with bank conflict details, memory consistency with GroupMemoryBarrierWithGroupSync, OpenCL C to SPIR-V pipeline (clspv), kernel portability guidelines, clvk layering, and tutorial conclusion. Include navigation entries for all new compute architecture sections.
Add missing blank lines after list introduction paragraphs to ensure proper Markdown rendering of bullet points in "Developing for advanced compute" and "Community and Resources" sections.

bashbaug left a comment

This is a nice tutorial but I think the way the "OpenCL on Vulkan" and "SYCL on Vulkan" topics are presented is a little confusing. If an OpenCL on Vulkan implementation has done its job, it would look and behave like any other OpenCL implementation. Note that we considered at one point whether OpenCL would need a "Vulkan profile" or similar for a layered implementation, but this does not seem to be required, and conformant implementations of OpenCL over Vulkan are shipping.

Would it make more sense to focus more on the advanced Vulkan compute features that these layered implementations are using, instead? The current chapter about unified shared memory is a good example. Are there other similar features that could be described in this tutorial?

gpx1000 commented Mar 25, 2026

Contributor Author

Hi Ben, thanks for the review. I can certainly remove the OpenCL and SYCL chapters and add more advanced features. We do have a VulkanML tutorial that covers some of the more advanced compute things that might go there instead (you can find that one in the GitLab MR for Vulkan-Tutorial).

I was hoping to provide a way of exposing developers learning Vulkan to OpenCL and SYCL. This seemed like the right topic area to provide such an intersection point. However, I don't feel like it would hurt Vulkan's compute tutorial as a concept to remove OpenCL and SYCL entirely. Alternatively, I could add more details about both and make it a longer tutorial. Do you have guidance for what you think would make sense?

gpx1000 commented Mar 25, 2026

Contributor Author

> This is a nice tutorial but I think the way the "OpenCL on Vulkan" and "SYCL on Vulkan" topics are presented is a little confusing. If an OpenCL on Vulkan implementation has done its job, it would look and behave like any other OpenCL implementation. Note that we considered at one point whether OpenCL would need a "Vulkan profile" or similar for a layered implementation, but this does not seem to be required, and conformant implementations of OpenCL over Vulkan are shipping.
>
> Would it make more sense to focus more on the advanced Vulkan compute features that these layered implementations are using, instead? The current chapter about unified shared memory is a good example. Are there other similar features that could be described in this tutorial?

I re-read this. I see what you're asking I think. Yes. Lemme iterate a bit and I'll come up with more things to add in for the similar feature set.

gpx1000 added 2 commits June 9, 2026 22:56
…an Compute chapter

Address all feedback received thus far.
Add accompanying samples for all chapters.
@gpx1000 gpx1000 requested review from EwanC and bashbaug June 11, 2026 04:35
@SaschaWillems

Collaborator

Can you sync with master so the changes from #388 are available here?

gpx1000 commented Jun 11, 2026

Contributor Author

Sorry didn't notify you that I did the merge :). It should be good to go for that issue.

@SaschaWillems

Collaborator

Thanks 👍🏻

Going through the samples (which are pretty interesting, btw), I noticed two issues.

The BVH one looks like geometry is missing (should be a full Cornell box, I guess):

image

The performance optimization sample seems to have issues properly displaying the time unit in the windows terminal:

image

…ce optimization to output ASCII instead of UTF-8 char.
gpx1000 commented Jun 12, 2026

Contributor Author

Both issues should be fixed.

@SaschaWillems

Collaborator

Can you also add .gitignore entries for "attachments\compute" (similar to the SGE folder)? Otherwise git wants to check in all the build files e.g. from Visual Studio.

@SaschaWillems

Collaborator

The BVH geometry is fixed, but it looks like that code has some sync issues. I can see randomly flickering artefacts depending on camera position:

image

Sync validation reports several write-after-read and write-after-write hazards.

@SaschaWillems

Collaborator

Most of the samples report similar sync errors.

.regionCount = 1,
.pRegions = &region,
.filter = vk::Filter::eNearest};
cb.blitImage2(blitInfo);

Collaborator

Afaik there is no guarantee that all image formats support blit (as dst), but I don't see code checking if the swapchain format supports blits. Applies to pretty much all of the compute samples.
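
A minimal sketch of the kind of check this comment asks for, not code from the PR. The two constants mirror the header values of `VK_FORMAT_FEATURE_BLIT_SRC_BIT` and `VK_FORMAT_FEATURE_BLIT_DST_BIT`; the helper name `canBlit` is hypothetical. In a sample, the feature masks would presumably come from `getFormatProperties(format).optimalTilingFeatures` on the physical device.

```cpp
#include <cstdint>

// Values mirror VK_FORMAT_FEATURE_BLIT_SRC_BIT / VK_FORMAT_FEATURE_BLIT_DST_BIT.
constexpr uint32_t kBlitSrcBit = 0x00000400;
constexpr uint32_t kBlitDstBit = 0x00000800;

// Hypothetical helper: true only if the source format can be read by a blit
// and the destination (swapchain) format can be written by one.
constexpr bool canBlit(uint32_t srcFeatures, uint32_t dstFeatures)
{
    return (srcFeatures & kBlitSrcBit) != 0 && (dstFeatures & kBlitDstBit) != 0;
}
```

When `canBlit` returns false, a sample would need another presentation path (for example a fullscreen pass); which fallback fits is left open here.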

.semaphoreCount = 1,
.pSemaphores = &*timelineSemaphore,
.pValues = &graphicsSignalVal};
auto r = device.waitSemaphores(swi, UINT64_MAX);

Collaborator

Doesn't that CPU wait cause a frame serialization, thus defeating any chance of graphics/compute overlap?

cb.bindPipeline(vk::PipelineBindPoint::eCompute, computePipeline);
cb.bindDescriptorSets(vk::PipelineBindPoint::eCompute,
computePipelineLayout, 0,
{computeDescSets[frameIndex]}, {});

Collaborator

Shouldn't this use the next frame index (the one after `frameIndex`) instead, for async/overlap?
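
To make the suggestion concrete, here is a small sketch of the index arithmetic, assuming the intent is for compute to fill the slot that graphics will consume next. `MAX_FRAMES` appears in the samples, but its value here is an assumption, and `FrameSlots` / `slotsFor` are hypothetical names.

```cpp
#include <cstdint>

constexpr uint32_t MAX_FRAMES = 2;  // assumed value; the sample defines its own

struct FrameSlots {
    uint32_t graphics;  // slot read by the graphics pass this frame
    uint32_t compute;   // slot written by the compute pass this frame
};

// Graphics uses the current slot while compute prepares the following one,
// so the two queues never touch the same per-frame resources.
constexpr FrameSlots slotsFor(uint32_t frameIndex)
{
    return {frameIndex % MAX_FRAMES, (frameIndex + 1) % MAX_FRAMES};
}
```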

vk::MemoryPropertyFlags props,
vk::raii::Buffer &buf, vk::raii::DeviceMemory &mem) const
{
buf = vk::raii::Buffer(device, {.size = size, .usage = usage, .sharingMode = vk::SharingMode::eExclusive});

Collaborator

This is using exclusive sharing mode, but I'm not seeing any queue ownership transfers. Either add them, or don't use exclusive sharing mode. Afaik it doesn't matter for buffers anyway.

auto &acqSem = acquireSemaphores[acquireSemIdx];
acquireSemIdx = (acquireSemIdx + 1) % (MAX_FRAMES + 1);

auto [acqResult, imageIndex] = swapChain.acquireNextImage(UINT64_MAX, *acqSem, nullptr);

Collaborator

Acquire result is never checked (e.g. out of date).
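
A sketch of the decision logic such a check could feed. `AcquireOutcome` is a hypothetical stand-in for the relevant `vk::Result` values (`eSuccess`, `eSuboptimalKHR`, `eErrorOutOfDateKHR`); how the out-of-date case is reported (return value or exception) depends on the Vulkan-Hpp configuration, so that mapping is left to the sample.

```cpp
// Hypothetical stand-in for the vk::Result values of interest.
enum class AcquireOutcome { Success, Suboptimal, OutOfDate, OtherError };

enum class FrameAction { Render, RenderThenRecreate, RecreateAndSkip, Abort };

// Decide what to do with the frame based on the acquire outcome.
inline FrameAction onAcquire(AcquireOutcome outcome)
{
    switch (outcome) {
    case AcquireOutcome::Success:
        return FrameAction::Render;
    case AcquireOutcome::Suboptimal:
        // An image was still acquired, so render it and recreate afterwards.
        return FrameAction::RenderThenRecreate;
    case AcquireOutcome::OutOfDate:
        // No image was acquired: recreate the swapchain and skip this frame.
        return FrameAction::RecreateAndSkip;
    default:
        return FrameAction::Abort;
    }
}
```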

// family is preferred; the code falls back to a second queue from the
// graphics family, or to sharing queue 0 if only one queue is available.
//
// - The GRAPHICS QUEUE renders a 64×64 cloth mesh as a lit triangle mesh

Collaborator

Minor, but the comment says 64x64 mesh, while the constant a few lines below uses 32x32.

constexpr uint32_t WIDTH = 1280;
constexpr uint32_t HEIGHT = 720;
// Cloth is 32×32 = 1024 vertices so the entire grid fits in ONE compute
// workgroup (maxComputeWorkGroupInvocations is guaranteed ≥ 1024 across all

Collaborator

Don't think that's true. Vulkan 1.0 core requires 128; Vk Roadmap 2022 / Vk 1.4 raise this to 256.
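
One way to avoid hard-coding the 1024 assumption is to clamp the workgroup size to the queried limit. This is only a sketch: `planDispatch` and `DispatchPlan` are hypothetical names, and `maxInvocations` stands for the device's `maxComputeWorkGroupInvocations` value.

```cpp
#include <algorithm>
#include <cstdint>

struct DispatchPlan {
    uint32_t localSize;   // invocations per workgroup
    uint32_t groupCount;  // workgroups needed to cover every vertex
};

// Clamp the desired workgroup size to the device limit, then round the
// group count up so that all vertices are covered.
inline DispatchPlan planDispatch(uint32_t vertexCount, uint32_t desiredLocalSize,
                                 uint32_t maxInvocations)
{
    uint32_t local = std::max<uint32_t>(1, std::min(desiredLocalSize, maxInvocations));
    uint32_t groups = (vertexCount + local - 1) / local;
    return {local, groups};
}
```

With the 1.0 minimum of 128, the 1024-vertex cloth needs 8 workgroups. A solver that relies on workgroup barriers alone would then need a different synchronization strategy, for example one dispatch per iteration.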

// =======================================================================
bool isDeviceSuitable(vk::raii::PhysicalDevice const &pd)
{
bool ok13 = pd.getProperties().apiVersion >= VK_API_VERSION_1_3;

Collaborator

This only checks for 1.3 (and up), while createInstance requests 1.4
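
A sketch of keeping the two checks in sync by deriving both from a single constant. `makeApiVersion` reproduces the bit layout of `VK_MAKE_API_VERSION`; `kRequiredApiVersion` and `deviceMeetsRequirement` are hypothetical names, and choosing 1.3 as the baseline is an assumption.

```cpp
#include <cstdint>

// Same bit layout as VK_MAKE_API_VERSION(variant, major, minor, patch).
constexpr uint32_t makeApiVersion(uint32_t variant, uint32_t major,
                                  uint32_t minor, uint32_t patch)
{
    return (variant << 29) | (major << 22) | (minor << 12) | patch;
}

// One constant used both for instance creation and for device selection,
// so the requested and the checked version cannot drift apart.
constexpr uint32_t kRequiredApiVersion = makeApiVersion(0, 1, 3, 0);

constexpr bool deviceMeetsRequirement(uint32_t deviceApiVersion)
{
    return deviceApiVersion >= kRequiredApiVersion;
}
```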

.semaphoreCount = 1,
.pSemaphores = &*timelineSema,
.pValues = &graphicsSignal};
if (device.waitSemaphores(swi, UINT64_MAX) != vk::Result::eSuccess)

Collaborator

Doesn't that host wait defeat frames in flight?
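
A sketch of one common alternative: wait on the host only for the frame whose resources are about to be reused. It assumes the graphics submission for frame N (counting from 1) signals timeline value N and that the semaphore starts at 0; `hostWaitValue` is a hypothetical helper and the `MAX_FRAMES` value is an assumption.

```cpp
#include <cstdint>

constexpr uint64_t MAX_FRAMES = 2;  // assumed number of frames in flight

// Before recording frame N, only frame N - MAX_FRAMES has to be finished,
// because that is the frame whose per-frame resources get reused.
// A result of 0 means the wait returns immediately (initial timeline value).
constexpr uint64_t hostWaitValue(uint64_t frameNumber)
{
    return frameNumber > MAX_FRAMES ? frameNumber - MAX_FRAMES : 0;
}
```

Waiting for the value that was just submitted instead would block until the current frame completes, which is the serialization the comment describes.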

.applicationVersion = VK_MAKE_VERSION(1, 0, 0),
.pEngineName = "No Engine",
.engineVersion = VK_MAKE_VERSION(1, 0, 0),
.apiVersion = vk::ApiVersion14};

Collaborator

This sample doesn't seem to use anything that requires Vulkan 1.4. I'd suggest lowering this to 1.3 to support a wider range of devices.

* **100% Occupancy**: The CU is completely packed with bundles. Whenever one waits for memory, there's almost certainly another one ready to go.
* **Low Occupancy**: Only a few bundles are active. If they all hit a memory fetch at the same time, the CU will stall.
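
As a rough illustration of how per-shader resource use limits occupancy, the sketch below uses a deliberately simplified model. All limits in `CuLimits` are hypothetical numbers, not the properties of any real GPU, and "bundle" follows the tutorial's term for a subgroup.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical per-CU limits for a simplified occupancy model.
struct CuLimits {
    uint32_t maxBundles;        // bundles the scheduler can keep resident
    uint32_t registersPerLane;  // register file capacity per lane
    uint32_t sharedBytes;       // shared memory (LDS) per CU
};

// Fraction of the CU's bundle slots that can be filled, given the shader's
// register use and the workgroup's shared-memory use.
inline double occupancy(const CuLimits &cu, uint32_t regsPerThread,
                        uint32_t sharedPerGroup, uint32_t bundlesPerGroup)
{
    uint32_t byRegs = regsPerThread ? cu.registersPerLane / regsPerThread : cu.maxBundles;
    uint32_t byShared = sharedPerGroup
                            ? (cu.sharedBytes / sharedPerGroup) * bundlesPerGroup
                            : cu.maxBundles;
    uint32_t resident = std::min({cu.maxBundles, byRegs, byShared});
    return static_cast<double>(resident) / cu.maxBundles;
}
```

In this model, a shader using 128 registers per thread on a CU with 512 registers per lane and 16 bundle slots reaches only 25% occupancy, regardless of how little shared memory it uses.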

=== The Resource Tug-of-War

Collaborator

I actually had to look up what "Tug-of-War" means. This, and other similar terms used throughout the text, make it hard to read for non-native speakers. Maybe replace them with more common or actual technical terms.

@@ -0,0 +1,67 @@
:pp: {plus}{plus}

Collaborator

This whole chapter feels overly generic and unspecific. What is a "modern AI assistant"? Which LLMs should be used?

> An AI assistant can instantly recognize this

This could be true, this could be false. Without knowing what AI assistant was used, this statement can't be verified. In my experience, the quality of LLM answers regarding Vulkan varies heavily.

Might be better to remove this for now and maybe add as a separate PR?


This is where **GPU-Assisted Validation** (GAV) comes in. Part of the standard Vulkan Validation Layers, GAV works by injecting small amounts of diagnostic code directly into your shaders at runtime. This **instrumentation** allows the layers to track and report errors that would otherwise be invisible.

=== Enabling GAV in C++

Collaborator

Should mention that GPU-AV can also be enabled via vkconfig.


Debugging a compute shader is notoriously difficult. Unlike CPU code, you can't easily set a breakpoint or step through your logic line by line. Most errors—like an out-of-bounds buffer access—will simply result in garbage data or, in the worst-case scenario, a "Device Lost" error that provides almost no information about what went wrong.

== GPU-Assisted Validation (GAV)

Collaborator

Minor, but the validation layer documentation and UI use GPU-AV as the abbreviation.

* **FP16 (IEEE 754)**: 1 sign bit, 5 exponent bits, 10 mantissa bits. This provides decent precision but a very limited range (max value ~65,504).
* **BFloat16 (Brain Float)**: 1 sign bit, 8 exponent bits, 7 mantissa bits. This has the *same range* as FP32 but much lower precision. It's often preferred for machine learning because it's more robust to overflows.

In Vulkan, FP16 is widely supported via the `VK_KHR_shader_float16_int8` extension, while BFloat16 is typically accessed through the `VK_KHR_shader_float_controls` or vendor-specific extensions.

Collaborator

BFloat16 is supported via `VK_KHR_shader_bfloat16`.
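
Independent of which extension exposes them, the range and precision figures quoted for the two formats can be checked with plain C++. This is a host-side sketch with hypothetical helper names; it assumes `float` is IEEE 754 binary32, and the conversion truncates rather than rounds.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Largest finite value of an IEEE-style binary format with the given
// exponent and mantissa widths.
inline double maxFinite(int exponentBits, int mantissaBits)
{
    int bias = (1 << (exponentBits - 1)) - 1;
    return (2.0 - std::ldexp(1.0, -mantissaBits)) * std::ldexp(1.0, bias);
}

// Truncating FP32 -> BFloat16: keep sign, 8 exponent bits, top 7 mantissa bits.
inline uint16_t toBFloat16Bits(float value)
{
    uint32_t bits;
    std::memcpy(&bits, &value, sizeof bits);
    return static_cast<uint16_t>(bits >> 16);
}

// BFloat16 -> FP32: the dropped mantissa bits come back as zeros.
inline float fromBFloat16Bits(uint16_t half)
{
    uint32_t bits = static_cast<uint32_t>(half) << 16;
    float value;
    std::memcpy(&value, &bits, sizeof value);
    return value;
}
```

`maxFinite(5, 10)` gives 65504 for FP16, while `maxFinite(8, 7)` for BFloat16 lands just below the FP32 maximum, which is the "same range, lower precision" trade-off described in the quoted text.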


If you've spent any time with Vulkan, you know the pain of **Descriptor Sets**. Managing layouts, updating pools, and binding sets before every draw or dispatch call is one of the most boilerplate-heavy parts of the API.

But what if you didn't have to bind anything? What if you could just pass a raw 64-bit address to your shader and have it access the memory directly, just like a pointer in C{pp}? This is what **Buffer Device Address (BDA)** allows.

Collaborator

While the name "buffer device address" might imply this, it might still be worth noting that this only replaces descriptors for buffers and not images.


== What is BDA?

**Buffer Device Address** (available since Vulkan 1.2 and core in 1.4) allows you to query a 64-bit GPU address for any `VkBuffer`. This address is a raw pointer that can be stored in other buffers, passed to shaders via push constants, or even used to build complex, linked data structures across different memory regions.

Collaborator

"available since Vulkan 1.2..." is a bit misleading. They're available with 1.1 using the KHR extension. They're core since 1.2 and mandatory since 1.3 (not 1.4).


* **Minimize Divergence**: If all 32 threads in a subgroup access the same texture, the hardware only needs to do one load. If all 32 threads access *different* textures, the load operation might take up to 32 times longer.
* **Subgroup Sorting**: If you have a large workload, consider sorting it so that threads in the same subgroup are more likely to access the same or nearby resources.
* **Vulkan 1.4 Features**: Modern Vulkan 1.4 hardware often has better support for non-uniform access, sometimes even avoiding the full scalarization loop for certain resource types.
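
The effect of the sorting suggestion can be sketched on the host by counting how many distinct resource indices each subgroup would touch before and after sorting. The helper names are hypothetical, and the subgroup size of 32 simply follows the text above; real subgroup sizes vary by device.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

constexpr std::size_t kSubgroupSize = 32;  // follows the text; varies by device

// Worst-case number of distinct resource indices touched by any one subgroup,
// assuming consecutive work items are packed into the same subgroup.
inline std::size_t maxDistinctPerSubgroup(const std::vector<uint32_t> &textureIndex)
{
    std::size_t worst = 0;
    for (std::size_t base = 0; base < textureIndex.size(); base += kSubgroupSize) {
        std::size_t end = std::min(base + kSubgroupSize, textureIndex.size());
        auto first = textureIndex.begin() + static_cast<std::ptrdiff_t>(base);
        auto last = textureIndex.begin() + static_cast<std::ptrdiff_t>(end);
        std::set<uint32_t> distinct(first, last);
        worst = std::max(worst, distinct.size());
    }
    return worst;
}

// Reorder work items so that items using the same resource sit next to each other.
inline std::vector<uint32_t> sortedByResource(std::vector<uint32_t> textureIndex)
{
    std::sort(textureIndex.begin(), textureIndex.end());
    return textureIndex;
}
```

For 64 work items that alternate between two textures, every subgroup touches both before sorting and only one after sorting.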

Collaborator

Is this true? If so, can this be a bit more specific? Like what is "Modern Vulkan 1.4 hardware"?


=== The Explicit Nature of Vulkan

On the CPU, you're used to a world where memory is generally **coherent**. If Thread A writes a value to a variable, Thread B can usually read it shortly after without any special ceremony because the hardware keeps the caches in sync automatically.

Collaborator

I don't think that's true. Coherency in the CPU world is more about what the different cores see, but you still need to properly sync writes and reads across CPU threads.
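
A small C++ sketch of that point: even with coherent caches, the hand-off below is only well defined because of the release/acquire pair on the atomic flag. The function name `handoff` is hypothetical and the example is not taken from the tutorial.

```cpp
#include <atomic>
#include <thread>

// Runs one producer/consumer hand-off and returns what the consumer observed.
inline int handoff()
{
    int payload = 0;                 // plain, non-atomic data
    std::atomic<bool> ready{false};  // synchronization flag
    int seen = -1;

    std::thread producer([&] {
        payload = 42;                                  // write the data
        ready.store(true, std::memory_order_release);  // publish it
    });
    std::thread consumer([&] {
        // The acquire load pairs with the release store above.
        while (!ready.load(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
        seen = payload;  // ordered after the producer's write
    });

    producer.join();
    consumer.join();
    return seen;
}
```

Replacing the atomic with a plain `bool` would make this a data race in the C++ memory model, which is the CPU-side analogue of skipping a barrier in Vulkan.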
