Add Advanced Vulkan Compute tutorial #334
…L, and conclusion Add comprehensive documentation covering Vulkan Memory Model (availability/visibility/domain operations), shared memory (LDS) with bank conflict details, memory consistency with GroupMemoryBarrierWithGroupSync, OpenCL C to SPIR-V pipeline (clspv), kernel portability guidelines, clvk layering, and tutorial conclusion. Include navigation entries for all new compute architecture sections.
Add missing blank lines after list introduction paragraphs to ensure proper Markdown rendering of bullet points in "Developing for advanced compute" and "Community and Resources" sections.
bashbaug
left a comment
This is a nice tutorial but I think the way the "OpenCL on Vulkan" and "SYCL on Vulkan" topics are presented is a little confusing. If an OpenCL on Vulkan implementation has done its job, it would look and behave like any other OpenCL implementation. Note that we considered at one point whether OpenCL would need a "Vulkan profile" or similar for a layered implementation, but this does not seem to be required, and conformant implementations of OpenCL over Vulkan are shipping.
Would it make more sense to focus more on the advanced Vulkan compute features that these layered implementations are using, instead? The current chapter about unified shared memory is a good example. Are there other similar features that could be described in this tutorial?
|
Hi Ben, thanks for the review. I can certainly remove the OpenCL and SYCL chapters and add more advanced features. We do have a VulkanML tutorial that covers some of the more advanced compute things that might go there instead (you can find that one in the GitLab MR for Vulkan-Tutorial). I was hoping to provide a method of exposing developers learning Vulkan to OpenCL and SYCL. This seemed like the right topic area to provide such an intersection point. However, I don't feel like it would hurt Vulkan's compute tutorial as a concept to remove OpenCL and SYCL entirely. Alternatively, I could add more details about both and make it a longer tutorial. Do you have guidance for what you think would make sense? |
I re-read this. I see what you're asking I think. Yes. Lemme iterate a bit and I'll come up with more things to add in for the similar feature set. |
…an Compute chapter Address all feedback received thus far.
Add accompanying samples for all chapters.
|
Can you sync with master so the changes from #388 are available here? |
|
Sorry didn't notify you that I did the merge :). It should be good to go for that issue. |
…ce optimization to output ASCII instead of UTF-8 char.
|
Both issues should be fixed. |
|
Can you also add .gitignore entries for "attachments\compute" (similar to the SGE folder)? Otherwise git wants to check in all the build files e.g. from Visual Studio. |
|
Most of the samples report similar sync errors. |
| .regionCount = 1, | ||
| .pRegions = &region, | ||
| .filter = vk::Filter::eNearest}; | ||
| cb.blitImage2(blitInfo); |
Afaik there is no guarantee that all image formats support blit (as dst), but I don't see code checking if the swapchain format supports blits. Applies to pretty much all of the compute samples.
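One way to address this is to query the format features before recording the blit. The sketch below is only an illustration, not the samples' actual code: `choosePresentCopyPath` and `PresentCopyPath` are hypothetical names, and the two bit values are mirrored from the Vulkan headers so the snippet compiles without the SDK. In a sample, the inputs would presumably come from `physicalDevice.getFormatProperties(format).optimalTilingFeatures` for the source format and the swapchain format, assuming optimal tiling.

```cpp
#include <cassert>
#include <cstdint>

// Mirrored from VkFormatFeatureFlagBits so this compiles standalone.
constexpr std::uint32_t kFormatFeatureBlitSrcBit = 0x00000400u;  // VK_FORMAT_FEATURE_BLIT_SRC_BIT
constexpr std::uint32_t kFormatFeatureBlitDstBit = 0x00000800u;  // VK_FORMAT_FEATURE_BLIT_DST_BIT

// Hypothetical helper type: which path copies the compute output to the swapchain image.
enum class PresentCopyPath { Blit, Fallback };

// srcFeatures / dstFeatures are the format feature flags of the source image
// format and of the swapchain format, respectively.
PresentCopyPath choosePresentCopyPath(std::uint32_t srcFeatures, std::uint32_t dstFeatures)
{
    const bool srcOk = (srcFeatures & kFormatFeatureBlitSrcBit) != 0;
    const bool dstOk = (dstFeatures & kFormatFeatureBlitDstBit) != 0;
    return (srcOk && dstOk) ? PresentCopyPath::Blit : PresentCopyPath::Fallback;
}
```

What the fallback does is a design choice for the samples; a plain image copy (when formats and extents match) or a fullscreen pass are two possibilities.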
| .semaphoreCount = 1, | ||
| .pSemaphores = &*timelineSemaphore, | ||
| .pValues = &graphicsSignalVal}; | ||
| auto r = device.waitSemaphores(swi, UINT64_MAX); |
Doesn't that CPU wait cause a frame serialization, thus defeating any chance of graphics/compute overlap?
| cb.bindPipeline(vk::PipelineBindPoint::eCompute, computePipeline); | ||
| cb.bindDescriptorSets(vk::PipelineBindPoint::eCompute, | ||
| computePipelineLayout, 0, | ||
| {computeDescSets[frameIndex]}, {}); |
Shouldn't this use the next frame index (the one after `frameIndex`) instead, for async/overlap?
| vk::MemoryPropertyFlags props, | ||
| vk::raii::Buffer &buf, vk::raii::DeviceMemory &mem) const | ||
| { | ||
| buf = vk::raii::Buffer(device, {.size = size, .usage = usage, .sharingMode = vk::SharingMode::eExclusive}); |
This is using exclusive sharing mode, but I'm not seeing any queue ownership transfers. Either add them, or don't use exclusive sharing mode. Afaik it doesn't matter for buffers anyway.
| auto &acqSem = acquireSemaphores[acquireSemIdx]; | ||
| acquireSemIdx = (acquireSemIdx + 1) % (MAX_FRAMES + 1); | ||
|
|
||
| auto [acqResult, imageIndex] = swapChain.acquireNextImage(UINT64_MAX, *acqSem, nullptr); |
Acquire result is never checked (e.g. out of date).
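A minimal sketch of what such a check could decide is shown below. It is an assumption about the shape of the fix, not code from the PR: `AcquireResult`, `FrameAction`, and `actionForAcquire` are hypothetical names, and the numeric values are mirrored from `VkResult` so the snippet compiles without the SDK. With Vulkan-Hpp, an out-of-date swapchain may surface as a `vk::OutOfDateKHRError` exception rather than a return value, depending on configuration, so the real handling may need a `try`/`catch` as well.

```cpp
#include <cassert>
#include <cstdint>

// Mirrored VkResult values (assumption: only the codes relevant to acquire are listed).
enum class AcquireResult : std::int32_t {
    Success           = 0,            // VK_SUCCESS
    NotReady          = 1,            // VK_NOT_READY
    Timeout           = 2,            // VK_TIMEOUT
    SuboptimalKHR     = 1000001003,   // VK_SUBOPTIMAL_KHR
    ErrorOutOfDateKHR = -1000001004   // VK_ERROR_OUT_OF_DATE_KHR
};

// Hypothetical decision the render loop takes after acquireNextImage.
enum class FrameAction { Render, RenderThenRecreate, RecreateAndSkip, Abort };

FrameAction actionForAcquire(AcquireResult result)
{
    switch (result) {
        case AcquireResult::Success:
            return FrameAction::Render;
        case AcquireResult::SuboptimalKHR:
            // An image was still acquired, so the frame can be rendered
            // and the swapchain recreated afterwards.
            return FrameAction::RenderThenRecreate;
        case AcquireResult::ErrorOutOfDateKHR:
            // No image was acquired: recreate the swapchain and skip this frame.
            return FrameAction::RecreateAndSkip;
        default:
            // Not expected with an infinite timeout; treated as fatal in this sketch.
            return FrameAction::Abort;
    }
}
```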
| // family is preferred; the code falls back to a second queue from the | ||
| // graphics family, or to sharing queue 0 if only one queue is available. | ||
| // | ||
| // - The GRAPHICS QUEUE renders a 64×64 cloth mesh as a lit triangle mesh |
Minor, but the comment says 64x64 mesh, while the constant a few lines below uses 32x32.
| constexpr uint32_t WIDTH = 1280; | ||
| constexpr uint32_t HEIGHT = 720; | ||
| // Cloth is 32×32 = 1024 vertices so the entire grid fits in ONE compute | ||
| // workgroup (maxComputeWorkGroupInvocations is guaranteed ≥ 1024 across all |
Don't think that's true. Vulkan 1.0 core requires 128; Vulkan Roadmap 2022 / Vulkan 1.4 raise this to 256.
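Given those minimums, the sample could derive its workgroup size from the queried limit instead of assuming 1024. The following is a rough sketch under that assumption: `DispatchPlan` and `planDispatch` are hypothetical names, and `maxInvocations` stands in for `VkPhysicalDeviceLimits::maxComputeWorkGroupInvocations`. A real implementation would also need to respect `maxComputeWorkGroupSize`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical result type: local size for the shader and group count for the dispatch.
struct DispatchPlan {
    std::uint32_t localSizeX;
    std::uint32_t groupCountX;
};

// Clamp the preferred local size to the device limit, then round the
// group count up so every vertex is covered by at least one invocation.
DispatchPlan planDispatch(std::uint32_t vertexCount,
                          std::uint32_t maxInvocations,
                          std::uint32_t preferredLocalSize = 1024)
{
    const std::uint32_t local =
        std::max<std::uint32_t>(1u, std::min(preferredLocalSize, maxInvocations));
    const std::uint32_t groups = (vertexCount + local - 1) / local;
    return {local, groups};
}
```

For the 32x32 cloth, `planDispatch(1024, 128)` yields 8 workgroups of 128 invocations. If the solver relies on workgroup-level barriers across the whole grid, splitting it over several workgroups changes the algorithm, so the sample may instead need to require a device that reports at least 1024.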
| // ======================================================================= | ||
| bool isDeviceSuitable(vk::raii::PhysicalDevice const &pd) | ||
| { | ||
| bool ok13 = pd.getProperties().apiVersion >= VK_API_VERSION_1_3; |
This only checks for 1.3 (and up), while createInstance requests 1.4
| .semaphoreCount = 1, | ||
| .pSemaphores = &*timelineSema, | ||
| .pValues = &graphicsSignal}; | ||
| if (device.waitSemaphores(swi, UINT64_MAX) != vk::Result::eSuccess) |
Doesn't that host wait defeat frames in flight?
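One common way to keep frames in flight is to have the host wait on the timeline value signaled by an older frame rather than the value just submitted. The sketch below only shows that arithmetic and is an assumption about how the samples could be restructured: `kMaxFramesInFlight`, `signalValueForFrame`, and `hostWaitValueForFrame` are hypothetical names, and the timeline semaphore is assumed to start at 0 with frame N signaling N + 1.

```cpp
#include <cassert>
#include <cstdint>

// Assumed number of frames the CPU may record ahead of the GPU.
constexpr std::uint64_t kMaxFramesInFlight = 2;

// Timeline value signaled when frame `frameNumber` finishes
// (assumes the semaphore was created with initial value 0).
constexpr std::uint64_t signalValueForFrame(std::uint64_t frameNumber)
{
    return frameNumber + 1;
}

// Value the host waits for before recording frame `frameNumber`:
// the completion of the frame submitted kMaxFramesInFlight frames earlier.
// A wait value of 0 is already satisfied, so the first frames do not block.
constexpr std::uint64_t hostWaitValueForFrame(std::uint64_t frameNumber)
{
    return frameNumber < kMaxFramesInFlight
               ? 0
               : signalValueForFrame(frameNumber - kMaxFramesInFlight);
}
```

The value returned by `hostWaitValueForFrame` would be passed as `pValues` to `device.waitSemaphores`, so the CPU only blocks once it is a full `kMaxFramesInFlight` frames ahead.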
| .applicationVersion = VK_MAKE_VERSION(1, 0, 0), | ||
| .pEngineName = "No Engine", | ||
| .engineVersion = VK_MAKE_VERSION(1, 0, 0), | ||
| .apiVersion = vk::ApiVersion14}; |
This sample doesn't seem to use anything that requires Vulkan 1.4. I'd suggest lowering this to 1.3 to support a wider range of devices.
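To keep the instance request and the device suitability check from drifting apart, both could read from a single constant. This is a small sketch of that idea, not the sample's code: `makeApiVersion`, `kRequiredApiVersion`, and `deviceMeetsRequiredVersion` are hypothetical names, and the bit packing mirrors `VK_MAKE_API_VERSION` with variant 0 so that it compiles without the SDK.

```cpp
#include <cassert>
#include <cstdint>

// Same packing as VK_MAKE_API_VERSION(0, major, minor, patch).
constexpr std::uint32_t makeApiVersion(std::uint32_t major,
                                       std::uint32_t minor,
                                       std::uint32_t patch = 0)
{
    return (major << 22) | (minor << 12) | patch;
}

// Single source of truth: used for ApplicationInfo::apiVersion
// and for the physical device check.
constexpr std::uint32_t kRequiredApiVersion = makeApiVersion(1, 3);

constexpr bool deviceMeetsRequiredVersion(std::uint32_t deviceApiVersion)
{
    return deviceApiVersion >= kRequiredApiVersion;
}

static_assert(kRequiredApiVersion == 4206592u, "matches VK_API_VERSION_1_3");
```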
| * **100% Occupancy**: The CU is completely packed with bundles. Whenever one waits for memory, there's almost certainly another one ready to go. | ||
| * **Low Occupancy**: Only a few bundles are active. If they all hit a memory fetch at the same time, the CU will stall. | ||
|
|
||
| === The Resource Tug-of-War |
I actually had to look up what "Tug-of-War" means. This, and other similar terms used throughout the text, make it hard to read for non-native speakers. Maybe replace them with more common or actual technical terms.
| @@ -0,0 +1,67 @@ | |||
| :pp: {plus}{plus} | |||
This whole chapter feels overly generic and unspecific. What is a "modern AI assistant"? Which LLMs should be used?
> An AI assistant can instantly recognize this
This could be true, this could be false. Without knowing which AI assistant was used, this statement can't be verified. In my experience, the quality of LLM answers about Vulkan varies heavily.
Might be better to remove this for now and maybe add as a separate PR?
|
|
||
| This is where **GPU-Assisted Validation** (GAV) comes in. Part of the standard Vulkan Validation Layers, GAV works by injecting small amounts of diagnostic code directly into your shaders at runtime. This **instrumentation** allows the layers to track and report errors that would otherwise be invisible. | ||
|
|
||
| === Enabling GAV in C++ |
Should mention that GPU-AV can also be enabled via vkconfig.
|
|
||
| Debugging a compute shader is notoriously difficult. Unlike CPU code, you can't easily set a breakpoint or step through your logic line by line. Most errors—like an out-of-bounds buffer access—will simply result in garbage data or, in the worst-case scenario, a "Device Lost" error that provides almost no information about what went wrong. | ||
|
|
||
| == GPU-Assisted Validation (GAV) |
Minor, but the validation layer documentation and UI use GPU-AV as the abbreviation.
| * **FP16 (IEEE 754)**: 1 sign bit, 5 exponent bits, 10 mantissa bits. This provides decent precision but a very limited range (max value ~65,504). | ||
| * **BFloat16 (Brain Float)**: 1 sign bit, 8 exponent bits, 7 mantissa bits. This has the *same range* as FP32 but much lower precision. It's often preferred for machine learning because it's more robust to overflows. | ||
|
|
||
| In Vulkan, FP16 is widely supported via the `VK_KHR_shader_float16_int8` extension, while BFloat16 is typically accessed through the `VK_KHR_shader_float_controls` or vendor-specific extensions. |
BFloat16 is supported via `VK_KHR_shader_bfloat16`.
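The range versus precision trade-off described in the quoted text can be made concrete on the CPU. The sketch below is illustrative only and is independent of any Vulkan extension: `floatToBFloat16Truncate`, `bfloat16ToFloat`, and `kFp16Max` are hypothetical names. The conversion simply keeps the upper 16 bits of an IEEE 754 binary32 value (truncation); production converters usually round to nearest even and treat NaN specially.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Largest finite FP16 value: (2 - 2^-10) * 2^15.
constexpr float kFp16Max = 65504.0f;

// BFloat16 keeps the sign, the 8 exponent bits, and the top 7 mantissa bits
// of a 32-bit float, so conversion by truncation is a 16-bit shift.
std::uint16_t floatToBFloat16Truncate(float value)
{
    std::uint32_t bits = 0;
    std::memcpy(&bits, &value, sizeof(bits));
    return static_cast<std::uint16_t>(bits >> 16);
}

// Widening back to float restores the exponent exactly; the dropped
// mantissa bits come back as zeros.
float bfloat16ToFloat(std::uint16_t value)
{
    const std::uint32_t bits = static_cast<std::uint32_t>(value) << 16;
    float result = 0.0f;
    std::memcpy(&result, &bits, sizeof(result));
    return result;
}
```

For example, 100000.0f exceeds `kFp16Max` and would overflow in FP16, while the BFloat16 round trip above returns 99840.0f: the magnitude survives, but only about two to three decimal digits of precision remain.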
|
|
||
| If you've spent any time with Vulkan, you know the pain of **Descriptor Sets**. Managing layouts, updating pools, and binding sets before every draw or dispatch call is one of the most boilerplate-heavy parts of the API. | ||
|
|
||
| But what if you didn't have to bind anything? What if you could just pass a raw 64-bit address to your shader and have it access the memory directly, just like a pointer in C{pp}? This is what **Buffer Device Address (BDA)** allows. |
While the name "buffer device address" might imply this, it might still be worth noting that this only replaces descriptors for buffers and not images.
|
|
||
| == What is BDA? | ||
|
|
||
| **Buffer Device Address** (available since Vulkan 1.2 and core in 1.4) allows you to query a 64-bit GPU address for any `VkBuffer`. This address is a raw pointer that can be stored in other buffers, passed to shaders via push constants, or even used to build complex, linked data structures across different memory regions. |
"Available since Vulkan 1.2..." is a bit misleading. Buffer device addresses are available with 1.1 using the KHR extension. They're core since 1.2 and mandatory since 1.3 (not 1.4).
|
|
||
| * **Minimize Divergence**: If all 32 threads in a subgroup access the same texture, the hardware only needs to do one load. If all 32 threads access *different* textures, the load operation might take up to 32 times longer. | ||
| * **Subgroup Sorting**: If you have a large workload, consider sorting it so that threads in the same subgroup are more likely to access the same or nearby resources. | ||
| * **Vulkan 1.4 Features**: Modern Vulkan 1.4 hardware often has better support for non-uniform access, sometimes even avoiding the full scalarization loop for certain resource types. |
Is this true? If so, can this be a bit more specific? Like what is "Modern Vulkan 1.4 hardware"?
|
|
||
| === The Explicit Nature of Vulkan | ||
|
|
||
| On the CPU, you're used to a world where memory is generally **coherent**. If Thread A writes a value to a variable, Thread B can usually read it shortly after without any special ceremony because the hardware keeps the caches in sync automatically. |
I don't think that's true. Coherency in the CPU world is more about what the different cores see, but you still need to properly sync writes and reads across CPU threads.
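A small C++ illustration of that point: even on cache-coherent hardware, a value handed from one thread to another needs explicit synchronization, here a release store paired with an acquire load. The function name `publishAndConsume` is made up for this sketch.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Thread A writes a plain variable and publishes it with a release store;
// thread B spins on an acquire load before reading it. Without the
// release/acquire pair (or another form of synchronization) this would be
// a data race in the C++ memory model, regardless of cache coherency.
int publishAndConsume()
{
    int payload = 0;
    std::atomic<bool> ready{false};
    int observed = -1;

    std::thread producer([&] {
        payload = 42;                                  // plain write
        ready.store(true, std::memory_order_release);  // publish
    });

    std::thread consumer([&] {
        while (!ready.load(std::memory_order_acquire)) {
            std::this_thread::yield();                 // wait for publication
        }
        observed = payload;                            // ordered after the write above
    });

    producer.join();
    consumer.join();
    return observed;
}
```

The tutorial text could use an analogy along these lines: CPU code also needs explicit ordering, and Vulkan's availability and visibility operations make the cache side of that contract explicit as well.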



sections on memory models OpenCL interoperation and SYCL interoperation.