Hi!
I am working on a Vulkan project where the render logic happens in a compute shader. I was inspired by this crazy post, but my rendering shader is a lot simpler.
I have it working right now in a single queue, but I’d like to improve performance by moving the compute rendering to a dedicated queue. Read on for details, and thanks in advance for wading through.
My compute shader writes to the image like so:
// Workgroup size; this must match the TILE_SIZE constant used in the dispatch on the CPU side.
layout (local_size_x = TILE_SIZE, local_size_y = TILE_SIZE) in;
layout (binding = 0, rgba8) uniform image2D outImage;

void main() {
    vec4 outColor = ...;
    imageStore(outImage, ivec2(gl_GlobalInvocationID.xy), outColor);
}
And my command buffers (one per swapchain image) look like the following (in Rust, not C++, but hopefully it's clear):
// start the buffer
// Bind to the compute pipeline.
self.device.cmd_bind_pipeline(...);
self.device.cmd_bind_descriptor_sets(...);

// This is an intermediate VkImage that the compute shader renders to.
// I have one for each swapchain image/command buffer.
let render_target = self.render_targets.get(i).unwrap().image;

// Transition the image to GENERAL for writing in the compute shader.
self.device.cmd_pipeline_barrier(
    *cb,
    vk::PipelineStageFlags::TOP_OF_PIPE, // We don't need to wait on anything.
    vk::PipelineStageFlags::COMPUTE_SHADER,
    vk::DependencyFlags::empty(),
    ...
    &[vk::ImageMemoryBarrier::builder()
        .image(render_target)
        .old_layout(vk::ImageLayout::UNDEFINED)
        .new_layout(vk::ImageLayout::GENERAL)
        .src_queue_family_index(vk::QUEUE_FAMILY_IGNORED)
        .dst_queue_family_index(vk::QUEUE_FAMILY_IGNORED)
        .src_access_mask(vk::AccessFlags::empty())
        .dst_access_mask(vk::AccessFlags::SHADER_WRITE)
        ...
    ],
);

self.device.cmd_dispatch(
    *cb,
    (self.dims.extent.width as f32 / TILE_SIZE as f32).ceil() as u32,
    (self.dims.extent.height as f32 / TILE_SIZE as f32).ceil() as u32,
    1,
);

// Transition the image to TRANSFER_SRC_OPTIMAL for blitting to the framebuffer.
self.device.cmd_pipeline_barrier(
    *cb,
    vk::PipelineStageFlags::COMPUTE_SHADER,
    vk::PipelineStageFlags::TRANSFER,
    ...
    &[vk::ImageMemoryBarrier::builder()
        .image(render_target)
        .old_layout(vk::ImageLayout::GENERAL)
        .new_layout(vk::ImageLayout::TRANSFER_SRC_OPTIMAL)
        .src_queue_family_index(vk::QUEUE_FAMILY_IGNORED)
        .dst_queue_family_index(vk::QUEUE_FAMILY_IGNORED)
        .src_access_mask(vk::AccessFlags::SHADER_WRITE)
        .dst_access_mask(vk::AccessFlags::TRANSFER_READ)
        ...
    ],
);

// This is the swapchain image with the index matching the command buffer index.
let framebuffer_image = ...

// Transition the swapchain image to TRANSFER_DST_OPTIMAL for blitting.
self.device.cmd_pipeline_barrier(
    *cb,
    vk::PipelineStageFlags::COLOR_ATTACHMENT_OUTPUT,
    vk::PipelineStageFlags::TRANSFER,
    ...
    &[vk::ImageMemoryBarrier::builder()
        .image(framebuffer_image)
        .old_layout(vk::ImageLayout::UNDEFINED)
        .new_layout(vk::ImageLayout::TRANSFER_DST_OPTIMAL)
        .src_queue_family_index(vk::QUEUE_FAMILY_IGNORED)
        .dst_queue_family_index(vk::QUEUE_FAMILY_IGNORED)
        .src_access_mask(vk::AccessFlags::empty())
        .dst_access_mask(vk::AccessFlags::TRANSFER_WRITE)
        ...
    ],
);

self.device.cmd_blit_image(...);

// Finally, transition the framebuffer image to PRESENT_SRC for presentation.
// old_layout must be TRANSFER_DST_OPTIMAL here, not UNDEFINED, or the
// blitted contents may be discarded.
self.device.cmd_pipeline_barrier(
    *cb,
    vk::PipelineStageFlags::TRANSFER,
    vk::PipelineStageFlags::COLOR_ATTACHMENT_OUTPUT,
    ...
    &[vk::ImageMemoryBarrier::builder()
        .image(framebuffer_image)
        .old_layout(vk::ImageLayout::TRANSFER_DST_OPTIMAL)
        .new_layout(vk::ImageLayout::PRESENT_SRC_KHR)
        .src_queue_family_index(vk::QUEUE_FAMILY_IGNORED)
        .dst_queue_family_index(vk::QUEUE_FAMILY_IGNORED)
        .src_access_mask(vk::AccessFlags::TRANSFER_WRITE)
        .dst_access_mask(vk::AccessFlags::COLOR_ATTACHMENT_READ)
        ...
    ],
);

// end the buffer
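As an aside, the group-count math in the dispatch above is just a ceiling division, which can also be done in integer arithmetic without the float round trip. A quick sketch (the `TILE_SIZE` value here is a placeholder for whatever the real workgroup size is):

```rust
// Ceiling division: how many workgroups of TILE_SIZE pixels per axis
// are needed to cover `extent` pixels.
const TILE_SIZE: u32 = 16; // placeholder value

fn group_count(extent: u32) -> u32 {
    (extent + TILE_SIZE - 1) / TILE_SIZE
}

fn main() {
    // A 1920x1080 swapchain with 16x16 tiles needs 120x68 groups.
    assert_eq!(group_count(1920), 120);
    assert_eq!(group_count(1080), 68);
    println!("ok");
}
```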
My presentation code is bog-standard: the submit waits on one semaphore (signaled by acquireNextImage) and signals another, which the present waits on. As I said before, everything is submitted to a single queue, selected for supporting graphics, compute, AND present. I have one command buffer for each swapchain image, and one render_target (intermediate VkImage) for each swapchain image, so that I can effectively double/triple-buffer them (this is one of the things I'm unsure about, however).
This is working well enough, but I'd like to understand how much work it would take to move the compute rendering to its own queue, rather than submitting everything on one queue; I've read that discrete GPUs often expose a dedicated compute queue family.
How would I structure the synchronization for that?
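From what I've read so far, my guess is that each frame would be structured roughly like the sketch below, using a queue family ownership transfer (matching release/acquire barriers with explicit family indices instead of QUEUE_FAMILY_IGNORED) plus a semaphore between the two submissions. This is pseudocode, not working ash code, and the queue/semaphore names are made up; is this roughly the right shape?

```
// 1. Compute queue: render, then *release* the render target to the
//    graphics family. The release barrier names both families explicitly
//    and can also perform the GENERAL -> TRANSFER_SRC_OPTIMAL transition.
record compute_cb:
    barrier(render_target, UNDEFINED -> GENERAL)     // as today
    dispatch(...)
    release_barrier(render_target,
                    GENERAL -> TRANSFER_SRC_OPTIMAL,
                    src_family = compute_family,
                    dst_family = graphics_family)
submit(compute_queue, compute_cb, signal = render_done_semaphore)

// 2. Graphics queue: *acquire* the render target (a matching barrier
//    recorded on this queue), then blit and present as before.
record graphics_cb:
    acquire_barrier(render_target, same parameters as the release)
    barrier(swapchain_image, UNDEFINED -> TRANSFER_DST_OPTIMAL)
    blit(render_target -> swapchain_image)
    barrier(swapchain_image, TRANSFER_DST_OPTIMAL -> PRESENT_SRC_KHR)
submit(graphics_queue, graphics_cb,
       wait   = [image_available_semaphore, render_done_semaphore],
       signal = render_finished_semaphore)
present(wait = render_finished_semaphore)
```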
Whether or not I move compute rendering to its own queue, do I need to keep swapping my render images? It seems like it should be possible in principle to tie up the render target only for the blit, which is fast, and then start rendering to the same memory again while the presentation engine waits for a vblank with the swapchain image.
Even more complicated: one thing I'd like to do is have different dispatch calls render different parts of the screen, and then blit them to different offsets. For that I'd need to somehow sync up all the "input" images and the output image, then blit the inputs onto the output. I find that pretty hard to wrap my head around.
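To make that concrete, here's what I imagine the command buffer would look like (pseudocode; whether one pipeline barrier with N image barriers really covers all the tile images at once is part of what I'm unsure about):

```
// N dispatches, each rendering one tile into its own image, followed by
// N blits placing the tiles at different offsets in the swapchain image.
for tile in tiles:
    barrier(tile.image, UNDEFINED -> GENERAL)
    bind descriptor set pointing at tile.image
    dispatch(groups covering tile.extent)

// One cmd_pipeline_barrier with N image barriers:
// COMPUTE_SHADER -> TRANSFER, GENERAL -> TRANSFER_SRC_OPTIMAL for each tile.
barrier_all(tiles, COMPUTE_SHADER -> TRANSFER)

barrier(swapchain_image, UNDEFINED -> TRANSFER_DST_OPTIMAL)
for tile in tiles:
    blit(tile.image -> swapchain_image at tile.offset)
barrier(swapchain_image, TRANSFER_DST_OPTIMAL -> PRESENT_SRC_KHR)
```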
I’d really appreciate some guidance on how to proceed. Thanks so much for the help!