Drawing to Image from Compute Shader - Example ?

Hey
I’m quite an experienced programmer but pretty new to Vulkan & graphics programming in general.

I’m actually trying to draw images from a compute shader in realtime, but I can’t manage to find a good example.
The only one I’ve found is the Ray Tracer by Sascha https://github.com/SaschaWillems/Vulkan but there’s no explanation included & the example is based on a relatively huge code foundation with quite a lot of dependencies.

It would help me a lot if someone could provide an example or a source.

BTW, I didn’t expect too many sources for Vulkan in general, but there seems to be almost nothing except for this forum & Sascha’s repository, although the API has been public for around a year now :sad:

Rendering to an image using compute shaders is actually pretty straightforward in Vulkan, with the shader part being the same as with OpenGL. In the end it’s no more than setting up a storage image that is sampled somewhere (with a proper barrier).

The computeshader example (https://github.com/SaschaWillems/Vulkan/tree/master/computeshader) from my repository is a bit more basic (more so than the raytracing one), and if you know a bit about modern OpenGL it should be easy to understand. It’s also verbose (where relevant to the actual compute stuff) and commented where I thought it would make sense. But as with my other samples it’s not aimed at graphics newcomers.

The steps for rendering to an image from a compute shader:

  • Create your compute target image with the VK_IMAGE_USAGE_STORAGE_BIT usage flag (check if the format supports that first)

  • Create and assign a descriptor of type VK_DESCRIPTOR_TYPE_STORAGE_IMAGE for that image for the compute pass

  • Add a binding for that image in your compute shader that fits the descriptor (e.g. layout ( binding = 0, rgba8 ) uniform writeonly image2D resultImage)

  • Create and assign a descriptor of type VK_DESCRIPTOR_TYPE_COMBINED_IMAGE_SAMPLER (or a separate sampler if you want) for that image for the rendering (sampling) pass

  • Write to the image in a compute command buffer submitted to the compute queue

  • Do proper synchronization (ensure that compute shader writes are finished before sampling)

  • Sample from the image in your graphics command buffer


Well, the API is relatively spartan and stateless compared to OpenGL. I personally found it easier to learn. Once you get a triangle to screen, you know almost everything.

I too dislike that basic tutorials tend to make their own smart abstractions hiding things from the learner. I pretty much need to see everything to understand anything.

You may try my code example:
https://github.com/krOoze/Hello_Triangle/tree/queue_transfer

I try to keep it relatively flat and not too smart. The example’s purpose is to show EXCLUSIVE image transfer, but it does write “hello” on the image through a compute shader.
It is also forked from the canonical Hello Triangle style app, which is in master. I think it is nice to do a “compare” on GitHub to see only the things added/changed to make it a compute example.

Hey & thanks for answering.
I actually didn’t expect to meet you here, Sascha!

But as with my other samples it’s not aimed at graphics newcomers

Right, that wasn’t meant as criticism in any way.
However, thank you for listing the required steps. I’m going to try to set it up correctly this time following your advice.
Btw, best regards from Cologne.

I too dislike that basic tutorials tend to make their own smart abstractions hiding things from the learner.

I agree, that’s why I followed the tutorial by Alexander in the first instance. A lot of code for doing such a simple application in Vulkan (drawing a triangle), but it doesn’t hide anything that might be important.
Also definitely going to take a look at your sample code, thanks for that!

I just bought the official Vulkan Programming Guide so I guess I’m going to learn some more about computing with Vulkan anyways.

I just wanted to take a quick look at the result of your example; it crashes with the error that no queue family could be found that supports compute.
My Physical Device is a GTX 970 & Sascha’s compute samples just run fine.

Right, that comes back to that example being primarily about EXCLUSIVE image transfer.
To show that mechanism I need a compute queue family that is distinct from the graphics one. Alas, your driver does not expose a separate compute queue family. I guess I don’t have an example specific to you (yet) then, sorry.

Hey once again :slight_smile:

I’ve read a lot in the Programming Guide over the past few days & managed to get it to work, somehow.
It draws an image using a compute shader, but I can’t figure out how to do the synchronisation properly.

My compute shader writes to images from a swap chain, and I’ve been able to set up the presentation of swap chain images properly, so the next image isn’t presented until the first one has finished presenting.
But I don’t understand how to block presenting an image until the compute shader has finished writing.
My approach:


void Application::Draw()
{
    auto result = vkAcquireNextImageKHR(logicalDevice, swapChain, std::numeric_limits<uint64_t>::max(), imageAvailableSemaphore, VK_NULL_HANDLE, &curImageIndex);

    VkSubmitInfo computeSubmitInfo = {};
    // Stuff

    vkResetFences(logicalDevice, 1, &computeFence);

    // Submit with fence
    vkQueueSubmit(computeQueue, 1, &computeSubmitInfo, computeFence);

    // Wait for compute shader (compute pipeline) to finish work ?
    vkWaitForFences(logicalDevice, 1, &computeFence, VK_TRUE, UINT64_MAX);

    VkPresentInfoKHR presentInfo = {};
    // Stuff

    result = vkQueuePresentKHR(presentQueue, &presentInfo);
}

So, I thought that waiting on the fence after submitting would prevent Vulkan from presenting my image before the compute shader finished writing.
But the validation layers still say that I’m presenting an image before it has been filled with memory. :confused:

BTW:
I took another look at @Sascha Willems’ compute shader sample and I don’t really understand what the graphics pipeline is required for.
So, you draw a fullscreen quad just to draw the image onto it. Why don’t you just present the image directly?
I do this actually; there’s no graphics pipeline in my current implementation.

[QUOTE=Nasty Nas;41797]BTW:
I took another look at @Sascha Willems’ compute shader sample and I don’t really understand what the graphics pipeline is required for.
So, you draw a fullscreen quad just to draw the image onto it. Why don’t you just present the image directly?
I do this actually; there’s no graphics pipeline in my current implementation.[/QUOTE]

Exactly. The graphics pipeline is used to draw a fullscreen quad, nothing else. The reason I used a graphics pipeline was simply because there is no guarantee that all platforms supported by my examples support the VK_IMAGE_USAGE_STORAGE_BIT for the swap chain images.

As for your synchronisation:
A fence doesn’t suffice. You should have an image memory barrier with proper stage flags to synchronize. And depending on your queue setup you may also need to transfer image ownership in that barrier.

I really don’t get it.
I already had an image barrier before to transition the layout from VK_IMAGE_LAYOUT_GENERAL to VK_IMAGE_LAYOUT_PRESENT_SRC_KHR.
The Programming Guide describes the individual Pipeline stages in which VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT is described as

Compute shader invocations produced as the
result of a dispatch have completed.

Still throws the same error :confused:

A small snippet from my compute command buffer recording:


                
vkCmdDispatch(computeCommandBuffer, swapChainExtent.width, swapChainExtent.height, 1);

VkImageMemoryBarrier imageMemoryBarrier = {};
imageMemoryBarrier.sType = VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER;
imageMemoryBarrier.oldLayout = VK_IMAGE_LAYOUT_GENERAL;
imageMemoryBarrier.newLayout = VK_IMAGE_LAYOUT_PRESENT_SRC_KHR;
imageMemoryBarrier.image = swapChainImages[curImage];
imageMemoryBarrier.subresourceRange = { VK_IMAGE_ASPECT_COLOR_BIT, 0, 1, 0, 1 };
imageMemoryBarrier.srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT;
imageMemoryBarrier.dstAccessMask = VK_ACCESS_MEMORY_READ_BIT;

// Start when the command is executed, finish when compute shader finished writing (?)
vkCmdPipelineBarrier(computeCommandBuffer, VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT, VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,
                     0, 0, nullptr, 0, nullptr, 1, &imageMemoryBarrier);

You don’t seem to understand how barriers work. The barrier sits between the source operation (the one doing the writing) and the receiving operation. It must do so both logically and in terms of pipeline stages.

So if a compute shader is writing, then the source ought to be VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT, since that’s what is actually doing the writing. In fact, the source stage should never be “TOP_OF_PIPE”. The specification even warns you about this:

The destination stage mask may as well be VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT.

Thanks for the explanation.
I just changed that for the future, but that doesn’t seem to be what caused the error.

I just noticed that it is always the second image of the swap chain whose memory isn’t filled (I clamped maxImageCount to 2 for testing).
I believe it wasn’t a synchronization problem, but a problem with my setup. I guess the compute shader simply doesn’t write to any image but the first at all.

However, I don’t really know if my setup attempt makes sense at all.
I wasn’t able to find anything regarding descriptor set & swap chain setup.
So what I mean is, I’ve just allocated a descriptor set layout + descriptor set for each image in the swap chain. (Is this suitable?)
And the layout binding in each descriptor set has the same binding index. So the descriptor sets of swapChainImage[0] as well as swapChainImage[1] both have, for example, binding index 0; could that cause my problems?

Alright, so I just found out that creating a descriptor set for each image was the wrong way to go.
I didn’t notice that pImageInfo of VkWriteDescriptorSet accepts an array :doh:

So there’s no more error; every image seems to be filled with memory when presented.
Strangely, I still don’t receive the expected result though.
What I would expect is a continuous blue screen (the compute shader just returns blue); instead, this black image is constantly thrown in between.
No error, as you can see in the video below.
Oh, just to mention, this flickering from blue to black is way faster in realtime. It’s just slowed down in the video…
http://sendvid.com/3cs7yj6i

Sorry for triple posting, but I want to “complete” the thread, since there isn’t much to find about Vulkan on the web.

So, I finally solved my problem.
The problem was that I bound all descriptor sets in a single command buffer.
So, when I instead created a descriptor for every image in the swap chain, bound them all into a single descriptor set & then bound this set in a single command buffer, it no longer threw the error that the image isn’t filled with memory, but the image actually still wasn’t filled with the result of my compute shader. That’s why you can see the black screen flickering in between.

To solve this, I went back to allocating a descriptor set with a single descriptor for each image in the swap chain.
But instead of binding all sets in a single command buffer, I am now binding a single set to a single command buffer, resulting in as many command buffers + descriptor sets as there are images in the swap chain.
The reason for this seems to be that Vulkan (for some reason, I still don’t really know why) only returns the result of the compute shader to the descriptor, which means it only fills memory for the first image (since the descriptors bind the memory).

So, if you’re experiencing the same problem, you might want to allocate a separate descriptor set and command buffer for each image in the swap chain.

The reason for this seems to be that Vulkan (for some reason, I still don’t really know why) only returns the result of the compute shader to the descriptor, which means it only fills memory for the first image (since the descriptors bind the memory).

A compute shader can only read from or write to resources defined by descriptors. Unlike graphics pipeline operations, there is no default destination for the process.

As such, if you create a static command buffer that uses a particular image resource defined by a specific descriptor, then it will always use that particular resource defined by that particular descriptor. These things don’t change unless you change them.

And if you’re thinking that you can call vkUpdateDescriptorSets to change the descriptor used by a static command buffer, you can’t. Why? Because the command buffer is static. You cannot modify any of the commands in a command buffer after recording it. You can change what data they fetch from, but the specific images they use cannot be altered.

So if you want to work with static CBs (for some reason), then you have to have different CBs for different use cases. Like fetching to/from different images. If you want to use a single static CB, then you should have the CB write to a user-provided image, then use a copy command to copy that image into the specific swap chain image you intend to use. Obviously that will likely be slower than manipulating the swap chain image directly.

But it’s also something you need to be aware of. Because swap chain images don’t have to be able to be used for arbitrary read/write operations. The bare minimum of usage flags an implementation is required to support is VK_IMAGE_USAGE_COLOR_ATTACHMENT_BIT. So there’s no guarantee that you can even copy to one, let alone write one directly from a compute shader.

You got me wrong there.
I didn’t have a static command buffer.
And I didn’t update the sets to take more descriptors.

I allocated the single set directly with 3 descriptors.
So, at the time I was recording the commands for the command buffer (not a static command buffer, just a single command buffer), I directly bound one descriptor set containing three descriptors.

[QUOTE=Nasty Nas;41824]You got me wrong there.
I didn’t have a static command buffer.[/quote]

Then why do you need one command buffer for each descriptor set? If you’re re-recording the command buffer every frame, then you don’t need to have separate CBs for separate sets. You simply bind a different set when you render.

Unless your problem was that you’re not synchronizing your re-recording with the execution of the CB. That is, you try to record to a CB that’s currently being executed. But that’s a separate problem, and solving it does not require one CB for each descriptor set. It merely requires double-buffering and doing reasonable synchronization.

[QUOTE=Nasty Nas;41824]I allocated the single set directly with 3 descriptors.
So, at the time I was recording the commands for the command buffer (not a static command buffer, just a single command buffer), I directly bound one descriptor set containing three descriptors.[/QUOTE]

OK, but if your shader was not actually using “one descriptor set containing three descriptors”, why would you create that in C++? I mean, your shader would have to look something like this:


layout (set = 0, binding = 0, rgba8) uniform writeonly image2D firstImage;
layout (set = 0, binding = 1, rgba8) uniform writeonly image2D secondImage;
layout (set = 0, binding = 2, rgba8) uniform writeonly image2D thirdImage;

If your shader looks like this:


layout (set = 0, binding = 0, rgba8) uniform writeonly image2D theImage;

Then that’s what your VkDescriptorSetLayout needs to conform to.

Also, I have no idea how the validation layers missed this, since they usually check compatibility between the pipeline layout and the actual shader code.

No, I wasn’t re-recording at all (sounds like a bad idea to me).
My shader is configured the way you just wrote in your last code snippet, except for the set parameter.
I would have thought that since every descriptor set uses binding index 0, and there’s no set parameter (didn’t know that existed), the command buffer would automatically write to all 3 sets.

Also, I have no idea how the validation layers missed this

Don’t know either; maybe it’s because I’m still using SDK version 1.0.37.

Well, here I am once again… same topic.

So I was about to throw my little path tracer onto the GPU using Vulkan’s compute shaders & recalled everything I went through to actually be able to write to images via compute shader.
And…
It doesn’t seem to make sense. The answers I’ve received don’t seem to make sense. (No offense! I just really think we’re getting each other wrong…)
Just to state it, I actually managed to get it to work last time I wrote here. I just don’t think it’s a “clean” way at all.

What is currently happening:
-> I have a swapchain which consists of 3 Images on my Computer.
-> I have a compute shader listing a single Binding for output

layout (binding = 0, rgba8) uniform writeonly image2D resultImage;

-> I’m allocating a separate descriptor set, each with a single descriptor, for each image in the swapchain.


for (int i = 0; i < swapChainImages.size(); i++)
{
  VkDescriptorSetLayoutBinding imgBinding = {};
  imgBinding.descriptorType = VK_DESCRIPTOR_TYPE_STORAGE_IMAGE;
  imgBinding.stageFlags = VK_SHADER_STAGE_COMPUTE_BIT;
  imgBinding.binding = 0;
  imgBinding.descriptorCount = 1;

  VkDescriptorSetLayoutCreateInfo layoutInfo = {};
  layoutInfo.sType = VK_STRUCTURE_TYPE_DESCRIPTOR_SET_LAYOUT_CREATE_INFO;
  layoutInfo.bindingCount = 1;
  layoutInfo.pBindings = &imgBinding;

  // computeDescriptorSetLayouts is just a vector of VkDescriptorSetLayout;
  // the actual VkDescriptorSets are allocated from these layouts afterwards
  vkCreateDescriptorSetLayout(logicalDevice, &layoutInfo, nullptr, &computeDescriptorSetLayouts[i]);
}

-> Next, I’m creating a command buffer for each image in the swapchain (3 in my case, as mentioned) & recording each command buffer to bind the corresponding descriptor set.

So… What’s the problem ?
Say I have a collection of spheres & planes that I want to pass to my compute shader, just like Sascha does in his raytracing example.
This means, that for every image in the swapchain I’ll have to bind those 2 buffers in the corresponding descriptor sets.

Making a total (on my machine) of 3 descriptor sets, each containing a binding to the corresponding image, and 2 more bindings to the corresponding buffers.
Is this really the way to go ?
It just seems kind of… wrong to me.

Alright, let’s get to the part that doesn’t seem to make sense to me (sorry for this looong post…)
-> A single binding (say, for example, 0) can actually take multiple descriptors!


VkDescriptorSetLayoutBinding imgBinding = {};
imgBinding.binding = 0;            // Binding index
imgBinding.descriptorCount = 5;    // <= Can take multiple descriptors!

Which actually means I can bind multiple resources to it, right? (I’m going to leave unimportant fields out…)


std::vector<VkDescriptorImageInfo> imgInfos;
for (int i = 0; i < swapChainImages.size(); i++)
{
   VkDescriptorImageInfo imgInfo = {};
   imgInfo.imageView = swapChainImageViews[i];      // Target current image view !

   imgInfos.push_back(imgInfo);
}

VkWriteDescriptorSet imgWrite = {};
imgWrite.dstSet = computeDescriptorSet;             // The single set
imgWrite.dstBinding = 0;                            // Binding index is still 0, same as above
imgWrite.descriptorCount = imgInfos.size();         // Target 3 descriptors
imgWrite.pImageInfo = imgInfos.data();              // Take 3 different resources (3 different image views)

So, either I’m just getting something awfully wrong… or this is not quite true:

OK, but if your shader was not actually using “one descriptor set containing three descriptors”, why would you create that in C++? I mean, your shader would have to look something like this:


layout (set = 0, binding = 0, rgba8) uniform writeonly image2D firstImage;
layout (set = 0, binding = 1, rgba8) uniform writeonly image2D secondImage;
layout (set = 0, binding = 2, rgba8) uniform writeonly image2D thirdImage;

Because I can actually bind multiple resources to the same binding index of the same set, right?
In my last example, all 3 Images would actually correspond to


layout (set = 0, binding = 0, rgba8) uniform writeonly image2D allThreeImages;

But if I really am right, why doesn’t the compute shader fill all 3 images with memory then?
Once again, sorry for this looong post!

Is this really the way to go ?

You decided to use static command buffers. You decided to have these static CBs manifest data directly into swapchain images. Your current code is an unavoidable consequence of those decisions.

If you want to have fewer static objects, you need to use fewer static objects.

Which actually means, I can bind multiple resources to it, right ?

Regarding the correct usage of descriptorCount:

There’s also a convenient code example in the specification about its use.

The specification is actually quite readable in most places; you should take a look at it.

So, in order to have multiple descriptors in a single binding, that binding must be arrayed in the shader. Of course, the validation layers should have pointed this fact out to you.

Furthermore, indexing a descriptor array with a non-constant index requires enabling an explicit GPU feature. Without this feature, you would need to use a compile-time constant. Which means you would need separate pipelines, one for each swapchain image you want to render to.

So for hardware without the shaderStorageImageArrayDynamicIndexing feature, you would need one pipeline per image (probably using a specialization constant to pick which index to use).

Alright, so I probably got you wrong there the first time.
With

static command buffer
you actually meant command buffers that are only recorded once, right? I was thinking in the sense of C++ static :doh:
I thought about re-recording every frame a while ago; however, it sounded quite performance-heavy to me (I didn’t do any tests though), so I dismissed the idea immediately.
Guess I’ll go with that then.

The specification is actually quite readable in most places; you should take a look at it.

Most definitely!
To be honest, I feel quite bad now.
Since the Vulkan Programming Guide is almost nothing but a reference for the API, I just stuck with it the whole time, and I can tell you there has been no word mentioned about this array stuff :slight_smile:

So, in order to have multiple descriptors in a single binding, that binding must be arrayed in the shader. Of course, the validation layers should have pointed this fact out to you.

Nope, they didn’t throw anything, which makes sense now (I guess), since the specification states that the index can be omitted.

Thanks again for all the help (although I was acting quite dumb :smiley: )