Decoupling Compute Shader Rendering

Hi!

I am working on a Vulkan project where the render logic happens in a compute shader. I was inspired by this crazy post, but my rendering shader is a lot simpler.

I have it working right now in a single queue, but I’d like to improve performance by moving the compute rendering to a dedicated queue. Read on for details, and thanks in advance for wading through.

My compute shader writes to the image like so:

layout (binding = 0, rgba8) uniform image2D outImage;

void main() {
    vec4 outColor = ...;
    imageStore(outImage, ivec2(gl_GlobalInvocationID.xy), outColor);
}

And my command buffers (one for each swapchain image) look like the following (in Rust, not C++, but it should hopefully be obvious):

// start the buffer

// Bind to the compute pipeline.
self.device.cmd_bind_pipeline(...);

self.device.cmd_bind_descriptor_sets(...);

// This is an intermediate VkImage that the compute shader renders to. I have one for each swapchain image/command buffer.
let render_target = self.render_targets.get(i).unwrap().image;

// Transition the image to GENERAL for writing in the compute shader.
self.device.cmd_pipeline_barrier(
    *cb,
    vk::PipelineStageFlags::TOP_OF_PIPE, // We don't need to wait on anything.
    vk::PipelineStageFlags::COMPUTE_SHADER,
    vk::DependencyFlags::empty(),
    ...
    &[vk::ImageMemoryBarrier::builder()
        .image(render_target)
        .old_layout(vk::ImageLayout::UNDEFINED)
        .new_layout(vk::ImageLayout::GENERAL)
        .src_queue_family_index(vk::QUEUE_FAMILY_IGNORED)
        .dst_queue_family_index(vk::QUEUE_FAMILY_IGNORED)
        .src_access_mask(vk::AccessFlags::empty())
        .dst_access_mask(vk::AccessFlags::SHADER_WRITE)
        ...
        ],
);

self.device.cmd_dispatch(
    *cb,
    (self.dims.extent.width as f32 / TILE_SIZE as f32).ceil() as u32,
    (self.dims.extent.height as f32 / TILE_SIZE as f32).ceil() as u32,
    1,
);

// Transition the image to TRANSFER_SRC_OPTIMAL for blitting to the framebuffer.
self.device.cmd_pipeline_barrier(
    *cb,
    vk::PipelineStageFlags::COMPUTE_SHADER,
    vk::PipelineStageFlags::TRANSFER,
    ...
    &[vk::ImageMemoryBarrier::builder()
        .image(render_target)
        .old_layout(vk::ImageLayout::GENERAL)
        .new_layout(vk::ImageLayout::TRANSFER_SRC_OPTIMAL)
        .src_queue_family_index(vk::QUEUE_FAMILY_IGNORED)
        .dst_queue_family_index(vk::QUEUE_FAMILY_IGNORED)
        .src_access_mask(vk::AccessFlags::SHADER_WRITE)
        .dst_access_mask(vk::AccessFlags::TRANSFER_READ)
        ...
        ],
);

// This is the swapchain image with the index matching the command buffer index.
let framebuffer_image = ...

// Transition the swapchain image to TRANSFER_DST_OPTIMAL for blitting.
self.device.cmd_pipeline_barrier(
    *cb,
    vk::PipelineStageFlags::COLOR_ATTACHMENT_OUTPUT,
    vk::PipelineStageFlags::TRANSFER,
    ...
    &[vk::ImageMemoryBarrier::builder()
        .image(framebuffer_image)
        .old_layout(vk::ImageLayout::UNDEFINED)
        .new_layout(vk::ImageLayout::TRANSFER_DST_OPTIMAL)
        .src_queue_family_index(vk::QUEUE_FAMILY_IGNORED)
        .dst_queue_family_index(vk::QUEUE_FAMILY_IGNORED)
        .src_access_mask(vk::AccessFlags::empty())
        .dst_access_mask(vk::AccessFlags::TRANSFER_WRITE)
        ...
        ],
);

self.device.cmd_blit_image(...);

// Finally, transition the framebuffer image to PRESENT_SRC for
// presentation.
self.device.cmd_pipeline_barrier(
    *cb,
    vk::PipelineStageFlags::TRANSFER,
    vk::PipelineStageFlags::COLOR_ATTACHMENT_OUTPUT,
    ...
    &[vk::ImageMemoryBarrier::builder()
        .image(framebuffer_image)
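        // The old layout must be TRANSFER_DST_OPTIMAL here: transitioning from
        // UNDEFINED allows the blitted contents to be discarded.
        .old_layout(vk::ImageLayout::TRANSFER_DST_OPTIMAL)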
        .new_layout(vk::ImageLayout::PRESENT_SRC_KHR)
        .src_queue_family_index(vk::QUEUE_FAMILY_IGNORED)
        .dst_queue_family_index(vk::QUEUE_FAMILY_IGNORED)
        .src_access_mask(vk::AccessFlags::TRANSFER_WRITE)
        .dst_access_mask(vk::AccessFlags::COLOR_ATTACHMENT_READ)
        ...
        ],
);

// end the buffer
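The group counts passed to cmd_dispatch above round up so that a partial tile at the right or bottom edge still gets a workgroup. A minimal sketch of the same calculation in integer arithmetic, avoiding the round trip through f32. The helper name group_count is hypothetical, and it assumes the shader's local size is TILE_SIZE in both x and y:

```rust
/// Number of workgroups needed to cover `extent` pixels with tiles of
/// `tile_size` pixels, rounding up so a partial tile at the edge is included.
/// (Hypothetical helper, not part of ash or Vulkan.)
fn group_count(extent: u32, tile_size: u32) -> u32 {
    assert!(tile_size > 0, "tile size must be non-zero");
    // Whole tiles, plus one more if any pixels are left over.
    extent / tile_size + u32::from(extent % tile_size != 0)
}
```

With this, the dispatch would use group_count(self.dims.extent.width, TILE_SIZE) and group_count(self.dims.extent.height, TILE_SIZE). When the extent is not a multiple of the tile size, some invocations fall outside the image, so it is common to have the shader compare gl_GlobalInvocationID.xy against imageSize(outImage) and return early.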

My presentation code is bog-standard, with one semaphore before the submit (waiting on acquireNextImage) and one after, which signals a present. As I said before, everything is submitted to a single queue, which is selected for supporting graphics, compute, AND present. I have one command buffer for each swapchain image, and I also have one render_target (intermediate VkImage) for each swapchain image, so that I can effectively double/triple buffer them (this is one of the things I’m unsure about, however).

This is working well enough, but I’d like to understand how much work it would be to move the compute rendering to its own queue, rather than submitting everything on one queue; I’ve read that dedicated GPUs often have their own compute queue.

How would I structure the synchronization for that?

Whether or not I switch compute rendering to its own queue, do I need to be swapping my render images? It seems like it should be possible in principle to tie up the render target just for the blit, which would be fast, and then let the presentation engine wait for a vblank with the swapchain image while I start rendering again to the same memory.

Even more complicated: one thing I’d like to do is have different calls to dispatch render different parts of the screen, and then blit them onto different offsets. For that I’d need to somehow sync up all the “input” images and the output image, and then blit the inputs onto the output. I find that pretty hard to wrap my head around.

I’d really appreciate some guidance on how to proceed. Thanks so much for the help!

Why do you think that would improve performance? I mean, presumably you’re rendering stuff other than particles, yes? You can’t render to the same image from different queues at the same time. So what do you intend to gain by moving the CS to a different queue?

The point of compute-only queues is to be able to submit compute tasks whose results either are not used for rendering at all or whose results are not used for rendering the current frame being worked on for graphics (or sometimes for graphics operations that will execute later in the frame, but that requires lots of graphics operations that don’t depend on the CS results yet). If you have a compute queue batch that can’t start its work until a graphics operation is done, and a graphics batch that can’t start its work until a compute operation is done, they may as well be on the same queue.

For graphics-compute dependencies you just need semaphores instead of barriers. Resources used across those queue families also need either a queue family ownership transfer or VK_SHARING_MODE_CONCURRENT. As Alfonse says, the trick is not the naive translation of the code. The trick is to get some benefit out of it.
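To make the ownership transfer concrete, here is a rough sketch of the barrier pair for the render target from the question, assuming the image uses exclusive sharing and the two queues belong to different families. It is a plain-Rust model: the types and the ownership_transfer function are hypothetical stand-ins, not the ash or Vulkan API. The compute queue records the release barrier after the dispatch and signals a semaphore; the graphics queue waits on that semaphore and records the acquire barrier before the blit.

```rust
// Plain-Rust model of a queue family ownership transfer. These types are
// hypothetical stand-ins for illustration, not the ash or Vulkan API.

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Layout {
    General,
    TransferSrcOptimal,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Access {
    None,
    ShaderWrite,
    TransferRead,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct ImageBarrier {
    src_family: u32,
    dst_family: u32,
    old_layout: Layout,
    new_layout: Layout,
    src_access: Access,
    dst_access: Access,
}

/// Builds the (release, acquire) pair that hands the render target from the
/// compute queue family to the graphics queue family.
fn ownership_transfer(compute_family: u32, graphics_family: u32) -> (ImageBarrier, ImageBarrier) {
    // Recorded on the compute queue, after the dispatch.
    let release = ImageBarrier {
        src_family: compute_family,
        dst_family: graphics_family,
        old_layout: Layout::General,
        new_layout: Layout::TransferSrcOptimal,
        src_access: Access::ShaderWrite,
        dst_access: Access::None, // the destination access is ignored on a release
    };
    // Recorded on the graphics queue, before the blit. The family indices and
    // layouts must match the release; only the access masks differ.
    let acquire = ImageBarrier {
        src_access: Access::None, // the source access is ignored on an acquire
        dst_access: Access::TransferRead,
        ..release
    };
    (release, acquire)
}
```

If both queues come from the same family, or the image was created with VK_SHARING_MODE_CONCURRENT, this pair is unnecessary and the semaphore alone orders the work.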

I’m only rendering from the compute shader and presenting - no graphics pipeline at all, except that vkCmdBlitImage seems to need a queue with graphics support. So for a start, I’d expect it to improve performance for the same reason that standard graphics applications present on a different queue (which, tbh, I don’t completely understand).

“presumably you’re rendering stuff other than particles, yes?”

I don’t understand why that matters, but I’d like to. Could you elaborate?

“The point of compute-only queues is to be able to submit compute tasks whose results either are not used for rendering at all or whose results are not used for rendering the current frame being worked on for graphics.”

So is the distinction between simulation and rendering? How would you simulate something (let’s say the position of some particles) and write the result to a buffer without synchronizing access to that buffer from the graphics queue? Sorry if I’m being thick.

Thanks again for taking the time to respond.

… they do? I’ve never heard of programs doing that.

Because queues execute asynchronously, unless you explicitly synchronize operations between them. If both the rendering operation and your CS operation attempt to manipulate the same image, that will only work if you have synchronization between the two operations, to prevent them from both executing simultaneously. And if the two operations cannot both execute at the same time, there’s no point in submitting them to different queues, since they can’t overlap execution.

You would, but you would do the synchronization next frame. So while the CS operation is generating those positions, the graphics queue can be executing rendering commands that don’t use the data the CS operation is generating.

It’s exactly like double-buffering: you don’t render to the image currently being displayed.
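As a small sketch of that double-buffering idea, assuming two copies of the simulation buffer and a frame counter (the function names are hypothetical): in any given frame the compute queue writes one slot while the graphics queue reads the other, which holds the previous frame's result.

```rust
/// Slot the compute queue writes during `frame`.
fn write_slot(frame: u64) -> usize {
    (frame % 2) as usize
}

/// Slot the graphics queue reads during `frame`: the one the compute queue
/// wrote during the previous frame, never the one being written now.
fn read_slot(frame: u64) -> usize {
    1 - write_slot(frame)
}
```

The two queues still need a semaphore, but it only has to guarantee that the compute work for frame N has finished before the graphics work for frame N + 1 reads that slot, so within a single frame both queues can run at the same time.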