Add Advanced Vulkan Compute tutorial by gpx1000 · Pull Request #334 · KhronosGroup/Vulkan-Tutorial

gpx1000 · 2026-03-16T22:39:39Z

sections on memory models, OpenCL interoperation, and SYCL interoperation.

Add comprehensive documentation covering Vulkan Memory Model (availability/visibility/domain operations), shared memory (LDS) with bank conflict details, memory consistency with GroupMemoryBarrierWithGroupSync, OpenCL C to SPIR-V pipeline (clspv), kernel portability guidelines, clvk layering, and tutorial conclusion. Include navigation entries for all new compute architecture sections.

Add missing blank lines after list introduction paragraphs to ensure proper Markdown rendering of bullet points in "Developing for advanced compute" and "Community and Resources" sections.

bashbaug

This is a nice tutorial but I think the way the "OpenCL on Vulkan" and "SYCL on Vulkan" topics are presented is a little confusing. If an OpenCL on Vulkan implementation has done its job, it would look and behave like any other OpenCL implementation. Note that we considered at one point whether OpenCL would need a "Vulkan profile" or similar for a layered implementation, but this does not seem to be required, and conformant implementations of OpenCL over Vulkan are shipping.

Would it make more sense to focus more on the advanced Vulkan compute features that these layered implementations are using, instead? The current chapter about unified shared memory is a good example. Are there other similar features that could be described in this tutorial?

gpx1000 · 2026-03-25T21:55:03Z

Hi Ben, thanks for the review. I can certainly remove the OpenCL and SYCL chapters and add more advanced features. We do have a VulkanML tutorial that covers some of the more advanced compute things that might go there instead (you can find that one in the GitLab MR for Vulkan-Tutorial).

I was hoping to provide a way of exposing developers learning Vulkan to OpenCL and SYCL. This seemed like the right topic area to provide such an intersection point. However, I don't feel like it would hurt Vulkan's compute tutorial as a concept to remove OpenCL and SYCL entirely. Alternatively, I could add more details about both and make it a longer tutorial. Do you have guidance on what you think would make sense?

gpx1000 · 2026-03-25T23:08:41Z

> This is a nice tutorial but I think the way the "OpenCL on Vulkan" and "SYCL on Vulkan" topics are presented is a little confusing. If an OpenCL on Vulkan implementation has done its job, it would look and behave like any other OpenCL implementation. Note that we considered at one point whether OpenCL would need a "Vulkan profile" or similar for a layered implementation, but this does not seem to be required, and conformant implementations of OpenCL over Vulkan are shipping.
>
> Would it make more sense to focus more on the advanced Vulkan compute features that these layered implementations are using, instead? The current chapter about unified shared memory is a good example. Are there other similar features that could be described in this tutorial?

I re-read this, and I think I see what you're asking. Yes. Lemme iterate a bit and I'll come up with more things to add for a similar feature set.

SaschaWillems · 2026-06-11T19:37:23Z

Can you sync with master so the changes from #388 are available here?

gpx1000 · 2026-06-11T22:06:52Z

Sorry didn't notify you that I did the merge :). It should be good to go for that issue.

SaschaWillems · 2026-06-12T08:04:56Z

Thanks 👍🏻

Going through the samples (which are pretty interesting, btw), I noticed two issues.

The BVH one looks like geometry is missing (should be a full Cornell box, I guess):

The performance optimization sample seems to have issues properly displaying the time unit in the Windows terminal:

gpx1000 · 2026-06-12T08:54:50Z

Both issues should be fixed.

SaschaWillems · 2026-06-12T12:13:03Z

Can you also add .gitignore entries for "attachments\compute" (similar to the SGE folder)? Otherwise git wants to check in all the build files e.g. from Visual Studio.

SaschaWillems · 2026-06-12T12:16:28Z

The BVH geometry is fixed, but it looks like that code has some sync issues. I can see randomly flickering artefacts depending on camera position:

Sync validation reports several write-after-read and write-after-write hazards.

SaschaWillems · 2026-06-12T12:20:48Z

Most of the samples report similar sync errors.

SaschaWillems · 2026-06-12T12:52:18Z

+            .regionCount    = 1,
+            .pRegions       = &region,
+            .filter         = vk::Filter::eNearest};
+        cb.blitImage2(blitInfo);


Afaik there is no guarantee that all image formats support blit (as dst), but I don't see code checking if the swapchain format supports blits. Applies to pretty much all of the compute samples.
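
A minimal sketch of the kind of check being asked for here, not code from the PR. It assumes the caller has already queried the format properties of both images (for example through `vkGetPhysicalDeviceFormatProperties`) and passes in the `optimalTilingFeatures` bits. The `CopyPath` enum and `chooseCopyPath` helper are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Bit values mirror VkFormatFeatureFlagBits in the Vulkan headers.
constexpr uint32_t kBlitSrcBit = 0x00000400u; // VK_FORMAT_FEATURE_BLIT_SRC_BIT
constexpr uint32_t kBlitDstBit = 0x00000800u; // VK_FORMAT_FEATURE_BLIT_DST_BIT

// Hypothetical enum for this sketch; not part of the PR or of Vulkan.
enum class CopyPath { Blit, CopyImage, ShaderWrite };

// srcFeatures / dstFeatures stand in for the optimalTilingFeatures reported
// for the compute output format and the swapchain format.
inline CopyPath chooseCopyPath(uint32_t srcFeatures, uint32_t dstFeatures,
                               bool sameFormatAndExtent)
{
    const bool canBlit = (srcFeatures & kBlitSrcBit) != 0 &&
                         (dstFeatures & kBlitDstBit) != 0;
    if (canBlit)
        return CopyPath::Blit;
    // A plain image copy cannot scale or convert, so it is only chosen
    // when both images match.
    if (sameFormatAndExtent)
        return CopyPath::CopyImage;
    // Otherwise fall back to writing the swapchain image from a shader.
    return CopyPath::ShaderWrite;
}

// Small usage example for the helper above.
inline void chooseCopyPathExample()
{
    assert(chooseCopyPath(kBlitSrcBit, kBlitDstBit, false) == CopyPath::Blit);
    assert(chooseCopyPath(kBlitSrcBit, 0u, true) == CopyPath::CopyImage);
}
```

The fallback order is a design choice for this sketch; a sample could just as well fail device selection when the swapchain format lacks blit destination support.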

SaschaWillems · 2026-06-12T16:10:59Z

+                .semaphoreCount = 1,
+                .pSemaphores    = &*timelineSemaphore,
+                .pValues        = &graphicsSignalVal};
+            auto r = device.waitSemaphores(swi, UINT64_MAX);


Doesn't that CPU wait cause a frame serialization, thus defeating any chance of graphics/compute overlap?
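
One common way to avoid that serialization is to wait on the host only for the frame that previously used the same per-frame resources, rather than for the value the current frame signals. The sketch below shows just the arithmetic. It assumes a timeline semaphore that starts at 0 and a scheme where frame `n` signals value `n + 1`; both function names are hypothetical and do not come from the PR.

```cpp
#include <cassert>
#include <cstdint>

// Assumption for this sketch: frame n (0-based) signals timeline value n + 1
// when its GPU work completes, and the semaphore's initial value is 0.
inline uint64_t signalValueForFrame(uint64_t frameNumber)
{
    return frameNumber + 1;
}

// Value the host should wait for before reusing the per-frame resources of
// `frameNumber`. With N frames in flight, that is the value signaled by frame
// (frameNumber - N). The first N frames have nothing to wait for, so they
// wait on 0, which the initial semaphore value already satisfies.
inline uint64_t hostWaitValueForFrame(uint64_t frameNumber, uint64_t framesInFlight)
{
    assert(framesInFlight > 0);
    if (frameNumber < framesInFlight)
        return 0;
    return signalValueForFrame(frameNumber - framesInFlight);
}
```

With two frames in flight, frame 5 would wait for value 4 (signaled by frame 3) while frame 4 may still be executing on the GPU.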

SaschaWillems · 2026-06-12T16:13:48Z

+        cb.bindPipeline(vk::PipelineBindPoint::eCompute, computePipeline);
+        cb.bindDescriptorSets(vk::PipelineBindPoint::eCompute,
+                              computePipelineLayout, 0,
+                              {computeDescSets[frameIndex]}, {});


Shouldn't this use the next frame index (the one after frameIndex) instead, for async/overlap?

SaschaWillems · 2026-06-12T16:15:46Z

+                      vk::MemoryPropertyFlags props,
+                      vk::raii::Buffer &buf, vk::raii::DeviceMemory &mem) const
+    {
+        buf = vk::raii::Buffer(device, {.size = size, .usage = usage, .sharingMode = vk::SharingMode::eExclusive});


This is using exclusive sharing mode, but I'm not seeing any queue ownership transfers. Either add them, or don't use exclusive sharing mode. Afaik it doesn't matter for buffers anyway.
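
For illustration, a small sketch of the second option: pick the sharing mode from the queue family indices, so that exclusive mode is only used when a single family touches the buffer. The types here are stand-ins that mirror the Vulkan enum values; `chooseSharing` is a hypothetical helper, not code from the PR.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Values mirror VK_SHARING_MODE_EXCLUSIVE and VK_SHARING_MODE_CONCURRENT.
enum class SharingMode : uint32_t { Exclusive = 0, Concurrent = 1 };

// Stand-in for the sharing-related fields of a buffer create info.
struct SharingChoice
{
    SharingMode           mode;
    std::vector<uint32_t> queueFamilyIndices; // only meaningful for Concurrent
};

inline SharingChoice chooseSharing(uint32_t graphicsFamily, uint32_t computeFamily)
{
    // One family: exclusive is fine and needs no ownership transfers.
    if (graphicsFamily == computeFamily)
        return {SharingMode::Exclusive, {}};
    // Two distinct families: concurrent mode lists both, so no queue family
    // ownership transfer barriers are needed.
    return {SharingMode::Concurrent, {graphicsFamily, computeFamily}};
}

// Small usage example for the helper above.
inline void chooseSharingExample()
{
    assert(chooseSharing(0u, 0u).mode == SharingMode::Exclusive);
    assert(chooseSharing(0u, 2u).queueFamilyIndices.size() == 2u);
}
```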

SaschaWillems · 2026-06-12T16:17:24Z

+        auto &acqSem = acquireSemaphores[acquireSemIdx];
+        acquireSemIdx = (acquireSemIdx + 1) % (MAX_FRAMES + 1);
+
+        auto [acqResult, imageIndex] = swapChain.acquireNextImage(UINT64_MAX, *acqSem, nullptr);


Acquire result is never checked (e.g. out of date).
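
A sketch of what handling the result could look like, written against raw `VkResult` values so it stays self-contained. The `FrameAction` enum and `classifyAcquire` function are hypothetical. Note that, depending on how Vulkan-Hpp is configured, an out-of-date swapchain may surface as an exception instead of a return value, in which case the same decision would live in a `catch` block.

```cpp
#include <cassert>
#include <cstdint>

// Raw values mirror VkResult in the Vulkan headers.
constexpr int32_t kSuccess    = 0;           // VK_SUCCESS
constexpr int32_t kNotReady   = 1;           // VK_NOT_READY
constexpr int32_t kTimeout    = 2;           // VK_TIMEOUT
constexpr int32_t kSuboptimal = 1000001003;  // VK_SUBOPTIMAL_KHR
constexpr int32_t kOutOfDate  = -1000001004; // VK_ERROR_OUT_OF_DATE_KHR

// Hypothetical per-frame decision for this sketch.
enum class FrameAction { Render, RenderThenRecreate, RecreateAndSkip, SkipFrame, Fail };

inline FrameAction classifyAcquire(int32_t acquireResult)
{
    switch (acquireResult)
    {
        case kSuccess:
            return FrameAction::Render;
        case kSuboptimal:
            // An image was acquired and the semaphore will be signaled, so
            // render this frame and recreate the swapchain afterwards.
            return FrameAction::RenderThenRecreate;
        case kOutOfDate:
            // No image was acquired: recreate the swapchain and retry.
            return FrameAction::RecreateAndSkip;
        case kNotReady:
        case kTimeout:
            return FrameAction::SkipFrame;
        default:
            return FrameAction::Fail;
    }
}

// Small usage example for the helper above.
inline void classifyAcquireExample()
{
    assert(classifyAcquire(kOutOfDate) == FrameAction::RecreateAndSkip);
}
```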

SaschaWillems · 2026-06-12T16:20:26Z

+//     family is preferred; the code falls back to a second queue from the
+//     graphics family, or to sharing queue 0 if only one queue is available.
+//
+//   - The GRAPHICS QUEUE renders a 64×64 cloth mesh as a lit triangle mesh


Minor, but the comment says 64x64 mesh, while the constant a few lines below uses 32x32.

SaschaWillems · 2026-06-12T16:24:28Z

+constexpr uint32_t WIDTH           = 1280;
+constexpr uint32_t HEIGHT          = 720;
+// Cloth is 32×32 = 1024 vertices so the entire grid fits in ONE compute
+// workgroup (maxComputeWorkGroupInvocations is guaranteed ≥ 1024 across all


Don't think that's true. Vulkan 1.0 core requires 128; Vulkan Roadmap 2022 / Vulkan 1.4 raise this to 256.
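
To make the consequence concrete, here is a rough sketch of deriving the workgroup size from the queried limit instead of assuming 1024. `planDispatch` and `DispatchPlan` are hypothetical names; `maxInvocations` stands in for the device's `maxComputeWorkGroupInvocations`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical result type for this sketch.
struct DispatchPlan
{
    uint32_t localSize;  // invocations per workgroup (local_size_x)
    uint32_t groupCount; // number of workgroups to dispatch
};

inline DispatchPlan planDispatch(uint32_t itemCount, uint32_t preferredLocalSize,
                                 uint32_t maxInvocations)
{
    assert(preferredLocalSize > 0 && maxInvocations > 0);
    // Never exceed what the device reports.
    const uint32_t localSize = std::min(preferredLocalSize, maxInvocations);
    // Round up so that every item is covered by some invocation.
    const uint32_t groupCount = (itemCount + localSize - 1) / localSize;
    return {localSize, groupCount};
}
```

On a device that only exposes the 128 minimum, the 1024-vertex cloth would need 8 workgroups. A solver that relies on workgroup barriers to synchronize the whole grid would then need a different scheme, since those barriers do not synchronize across workgroups.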

SaschaWillems · 2026-06-12T16:26:03Z

+    // =======================================================================
+    bool isDeviceSuitable(vk::raii::PhysicalDevice const &pd)
+    {
+        bool ok13 = pd.getProperties().apiVersion >= VK_API_VERSION_1_3;


This only checks for 1.3 (and up), while createInstance requests 1.4
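
One way to keep the two checks from drifting apart is to derive both from a single constant. The sketch below re-implements the version packing used by `VK_MAKE_API_VERSION` so that it compiles without the Vulkan headers; `kRequiredApiVersion` and `deviceMeetsRequiredVersion` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

// Same bit layout as VK_MAKE_API_VERSION: variant | major | minor | patch.
constexpr uint32_t makeApiVersion(uint32_t variant, uint32_t major,
                                  uint32_t minor, uint32_t patch)
{
    return (variant << 29) | (major << 22) | (minor << 12) | patch;
}

// Single source of truth, used both when filling in the application info's
// apiVersion and when filtering physical devices.
constexpr uint32_t kRequiredApiVersion = makeApiVersion(0, 1, 3, 0);

inline bool deviceMeetsRequiredVersion(uint32_t deviceApiVersion)
{
    return deviceApiVersion >= kRequiredApiVersion;
}

// Small usage example for the helper above.
inline void versionCheckExample()
{
    assert(deviceMeetsRequiredVersion(makeApiVersion(0, 1, 4, 0)));
    assert(!deviceMeetsRequiredVersion(makeApiVersion(0, 1, 2, 198)));
}
```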

SaschaWillems · 2026-06-12T16:35:30Z

+                .semaphoreCount = 1,
+                .pSemaphores    = &*timelineSema,
+                .pValues        = &graphicsSignal};
+            if (device.waitSemaphores(swi, UINT64_MAX) != vk::Result::eSuccess)


Doesn't that host wait defeat frames in flight?

SaschaWillems · 2026-06-12T16:36:25Z

+            .applicationVersion = VK_MAKE_VERSION(1, 0, 0),
+            .pEngineName        = "No Engine",
+            .engineVersion      = VK_MAKE_VERSION(1, 0, 0),
+            .apiVersion         = vk::ApiVersion14};


This sample doesn't seem to use anything that requires Vulkan 1.4. I'd suggest lowering this to 1.3 to support a wider range of devices.

SaschaWillems · 2026-06-12T17:36:42Z

+*   **100% Occupancy**: The CU is completely packed with bundles. Whenever one waits for memory, there's almost certainly another one ready to go.
+*   **Low Occupancy**: Only a few bundles are active. If they all hit a memory fetch at the same time, the CU will stall.
+
+=== The Resource Tug-of-War


I actually had to look up what "Tug-of-War" means. This, and other similar terms used throughout the text, make it hard to read for non-native speakers. Maybe replace them with more common or actual technical terms.

SaschaWillems · 2026-06-12T17:41:09Z

@@ -0,0 +1,67 @@
+:pp: {plus}{plus}


This whole chapter feels overly generic and unspecific. What is a "modern AI assistant"? Which LLMs should be used?

An AI assistant can instantly recognize this

This could be true, this could be false. Without knowing which AI assistant was used, this statement can't be verified. In my experience, the quality of LLM answers regarding Vulkan varies heavily.

Might be better to remove this for now and maybe add as a separate PR?

SaschaWillems · 2026-06-12T17:44:24Z

+
+This is where **GPU-Assisted Validation** (GAV) comes in. Part of the standard Vulkan Validation Layers, GAV works by injecting small amounts of diagnostic code directly into your shaders at runtime. This **instrumentation** allows the layers to track and report errors that would otherwise be invisible.
+
+=== Enabling GAV in C++


Should mention that GPU-AV can also be enabled via vkconfig.

SaschaWillems · 2026-06-12T17:44:43Z

+
+Debugging a compute shader is notoriously difficult. Unlike CPU code, you can't easily set a breakpoint or step through your logic line by line. Most errors—like an out-of-bounds buffer access—will simply result in garbage data or, in the worst-case scenario, a "Device Lost" error that provides almost no information about what went wrong.
+
+== GPU-Assisted Validation (GAV)


Minor, but the validation layer documentation and UI use GPU-AV as the abbreviation.

SaschaWillems · 2026-06-12T17:55:05Z

+* **FP16 (IEEE 754)**: 1 sign bit, 5 exponent bits, 10 mantissa bits. This provides decent precision but a very limited range (max value ~65,504).
+* **BFloat16 (Brain Float)**: 1 sign bit, 8 exponent bits, 7 mantissa bits. This has the *same range* as FP32 but much lower precision. It's often preferred for machine learning because it's more robust to overflows.
+
+In Vulkan, FP16 is widely supported via the `VK_KHR_shader_float16_int8` extension, while BFloat16 is typically accessed through the `VK_KHR_shader_float_controls` or vendor-specific extensions.


BFloat16 is supported via `VK_KHR_shader_bfloat16`.
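
For readers comparing the two 16-bit formats described in the quoted text, a CPU-side sketch of the BFloat16 layout may help. It converts an FP32 value to BFloat16 bits by keeping the top 16 bits with round-to-nearest-even, which is one common conversion and not necessarily what any particular driver or shader compiler does. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(uint32_t), "sketch assumes 32-bit float");

// Largest finite FP16 value: (2 - 2^-10) * 2^15.
constexpr float kFp16Max = 65504.0f;

// FP32 -> BFloat16 bits (1 sign, 8 exponent, 7 mantissa), round-to-nearest-even.
inline uint16_t floatToBFloat16Bits(float value)
{
    uint32_t bits;
    std::memcpy(&bits, &value, sizeof bits);
    // NaN: truncate and force a mantissa bit so the result stays a NaN.
    if ((bits & 0x7FFFFFFFu) > 0x7F800000u)
        return static_cast<uint16_t>((bits >> 16) | 0x0040u);
    // Add half of the dropped range, plus the tie-breaking bit for "even".
    const uint32_t roundingBias = 0x7FFFu + ((bits >> 16) & 1u);
    return static_cast<uint16_t>((bits + roundingBias) >> 16);
}

// BFloat16 bits -> FP32: the 16 bits are simply the top half of the float.
inline float bfloat16BitsToFloat(uint16_t bf16)
{
    const uint32_t bits = static_cast<uint32_t>(bf16) << 16;
    float value;
    std::memcpy(&value, &bits, sizeof value);
    return value;
}

// Small usage example: 65536 exceeds the FP16 range but survives BFloat16.
inline void bfloat16Example()
{
    assert(65536.0f > kFp16Max);
    assert(bfloat16BitsToFloat(floatToBFloat16Bits(65536.0f)) == 65536.0f);
}
```

The trade-off shows up at the other end: with only 7 mantissa bits, `1.001f` rounds to exactly `1.0f` in BFloat16.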

SaschaWillems · 2026-06-12T18:03:46Z

+
+If you've spent any time with Vulkan, you know the pain of **Descriptor Sets**. Managing layouts, updating pools, and binding sets before every draw or dispatch call is one of the most boilerplate-heavy parts of the API.
+
+But what if you didn't have to bind anything? What if you could just pass a raw 64-bit address to your shader and have it access the memory directly, just like a pointer in C{pp}? This is what **Buffer Device Address (BDA)** allows.


While the name "buffer device address" might imply this, it might still be worth noting that this only replaces descriptors for buffers and not images.

SaschaWillems · 2026-06-12T18:06:34Z

+
+== What is BDA?
+
+**Buffer Device Address** (available since Vulkan 1.2 and core in 1.4) allows you to query a 64-bit GPU address for any `VkBuffer`. This address is a raw pointer that can be stored in other buffers, passed to shaders via push constants, or even used to build complex, linked data structures across different memory regions.


"available since Vulkan 1.2..." is a bit misleading. They're available with 1.1 using the KHR extension. They're core since 1.2 and mandatory since 1.3 (not 1.4).

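The version history in this comment can be summarized as a small decision function. This is only a sketch of the reviewer's description: the enum and function names are hypothetical, and treating Vulkan 1.0 devices as unsupported is a simplification made here for brevity.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical classification for this sketch.
enum class BdaAvailability { Unavailable, KhrExtension, CoreOptional, CoreRequired };

// featureReported: the device reports the bufferDeviceAddress feature.
// khrExtensionPresent: the device lists VK_KHR_buffer_device_address.
inline BdaAvailability bufferDeviceAddressAvailability(uint32_t major, uint32_t minor,
                                                       bool featureReported,
                                                       bool khrExtensionPresent)
{
    assert(major >= 1);
    if (major > 1 || minor >= 3)
        return BdaAvailability::CoreRequired; // mandatory since 1.3
    if (minor == 2)                           // core in 1.2, but still optional
        return featureReported ? BdaAvailability::CoreOptional
                               : BdaAvailability::Unavailable;
    if (minor == 1)                           // 1.1: only through the KHR extension
        return khrExtensionPresent ? BdaAvailability::KhrExtension
                                   : BdaAvailability::Unavailable;
    return BdaAvailability::Unavailable;      // simplification for 1.0
}
```
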
SaschaWillems · 2026-06-12T18:12:46Z

+
+*   **Minimize Divergence**: If all 32 threads in a subgroup access the same texture, the hardware only needs to do one load. If all 32 threads access *different* textures, the load operation might take up to 32 times longer.
+*   **Subgroup Sorting**: If you have a large workload, consider sorting it so that threads in the same subgroup are more likely to access the same or nearby resources.
+*   **Vulkan 1.4 Features**: Modern Vulkan 1.4 hardware often has better support for non-uniform access, sometimes even avoiding the full scalarization loop for certain resource types.


Is this true? If so, can this be a bit more specific? Like what is "Modern Vulkan 1.4 hardware"?
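
As background for the "up to 32 times longer" claim in the quoted text, here is a CPU-side simulation of a scalarization loop. It only counts how many iterations such a loop would take for a given set of per-lane resource indices; it says nothing about what specific hardware does. The function name is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Each iteration picks the index of the first unhandled lane (the role a
// "read first lane" operation plays on the GPU) and retires every lane
// that uses that same index.
inline uint32_t countScalarizationIterations(const std::vector<uint32_t> &laneResourceIndex)
{
    std::vector<bool> handled(laneResourceIndex.size(), false);
    uint32_t          iterations = 0;
    for (std::size_t first = 0; first < laneResourceIndex.size(); ++first)
    {
        if (handled[first])
            continue;
        const uint32_t uniformIndex = laneResourceIndex[first];
        for (std::size_t lane = first; lane < laneResourceIndex.size(); ++lane)
        {
            if (laneResourceIndex[lane] == uniformIndex)
                handled[lane] = true;
        }
        ++iterations;
    }
    return iterations;
}

// Small usage example: 32 lanes that all use resource 7 need one iteration.
inline void scalarizationExample()
{
    assert(countScalarizationIterations(std::vector<uint32_t>(32, 7u)) == 1u);
}
```

The iteration count equals the number of distinct indices in the subgroup, which is why sorting work so that neighbouring lanes share resources reduces the cost.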

SaschaWillems · 2026-06-12T18:14:59Z

+
+=== The Explicit Nature of Vulkan
+
+On the CPU, you're used to a world where memory is generally **coherent**. If Thread A writes a value to a variable, Thread B can usually read it shortly after without any special ceremony because the hardware keeps the caches in sync automatically.


I don't think that's true. Coherency in the CPU world is more about what the different cores see, but you still need to properly sync writes and reads across CPU threads.
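
A minimal C++ illustration of this point, not taken from the tutorial: even on cache-coherent hardware, the write to `payload` is only guaranteed to be visible to the other thread because of the release/acquire pair on `ready`. Without that pair (for example with a plain `bool`), the program would contain a data race. The function name is hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Runs a producer and a consumer thread and returns what the consumer read.
inline int publishAndConsume()
{
    int               payload = 0;      // plain, non-atomic data
    std::atomic<bool> ready{false};     // synchronization flag
    int               observed = -1;

    std::thread producer([&] {
        payload = 42;                                  // 1. write the data
        ready.store(true, std::memory_order_release);  // 2. publish it
    });
    std::thread consumer([&] {
        // 3. the acquire load pairs with the release store above
        while (!ready.load(std::memory_order_acquire))
            std::this_thread::yield();
        observed = payload;                            // 4. now safe to read
    });

    producer.join();
    consumer.join();
    assert(ready.load());
    return observed;
}
```

Vulkan's availability and visibility operations play a comparable role on the GPU, with the difference that they must be requested explicitly through barriers and memory semantics.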

gpx1000 added 2 commits March 16, 2026 15:36

Fix formatting in Advanced Vulkan Compute conclusion section

b7fc0c9

EwanC reviewed Mar 24, 2026

Comment thread en/Advanced_Vulkan_Compute/06_SYCL_and_Single_Source_CPP/02_setup_and_installation.adoc Outdated

bashbaug reviewed Mar 25, 2026

gpx1000 added 2 commits June 9, 2026 22:56

Remove SYCL and Single-Source C++ tutorial section from Advanced Vulkan Compute chapter

b96b2e7

Address all feedback received thus far.

Fix for antora navigation and all links.

e1eeaf0

Add accompanying samples for all chapters.

gpx1000 requested review from EwanC and bashbaug June 11, 2026 04:35

Merge branch 'upstream-main' into Vulkan-Compute-advanced-tutorial

b067454

Fix for the BVH cornell box missing geometry and update the performance optimization to output ASCII instead of UTF-8 char.

eca4ae0

SaschaWillems reviewed Jun 12, 2026


4 participants