Uniform Buffer Memory Barriers

Silverlan · March 3, 2016, 2:41pm

I have a simple shader with an MVP matrix uniform and several objects. Each object has its own MVP matrix, and there is one buffer/memory for the MVP descriptor.
The problem is, if I render all objects, they all end up using the MVP matrix of the very last object in the render pass. My approach right now is this:

1) Bind the shader pipeline
2) Map the memory for the MVP buffer, write the object matrix, unmap the memory
3) Bind the descriptor set for the MVP uniform
4) Bind the object vertex buffer
5) Run the draw-call
6) Repeat for all objects from step 2)

I’m assuming that, because the draw-call is delayed, the MVP-buffer is overwritten with the MVP for the next object, before rendering of the previous object has finished.
How can I prevent that from happening? I’ve tried inserting a memory barrier right after the draw-call, with VK_ACCESS_UNIFORM_READ_BIT as the source mask and VK_ACCESS_HOST_WRITE_BIT as the destination mask (meaning “all uniform reads should finish before the host tries to write to the memory again”), but that didn’t seem to have any effect.
I’m basically trying to achieve what glUniform does in OpenGL.
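
The suspected hazard can be modelled without any Vulkan calls. The sketch below is only an illustration: `RecordedDraws` and `drawWithSharedUniform` are hypothetical names, and a plain `int` stands in for the MVP buffer. It assumes, as the post does, that draws are only recorded during the loop and executed later, while the host writes happen immediately.

```cpp
#include <functional>
#include <vector>

// Hypothetical stand-in for a command buffer: draws are recorded now and
// executed later, much like commands recorded before a queue submission.
struct RecordedDraws {
    std::vector<std::function<void()>> commands;
    void execute() {
        for (auto& command : commands) command();
    }
};

// Records one draw per object while overwriting a single shared "uniform"
// each time, then executes everything. Returns the value each draw saw.
std::vector<int> drawWithSharedUniform(const std::vector<int>& perObjectValues) {
    int sharedUniform = 0;  // one buffer reused for every object
    std::vector<int> seen;
    RecordedDraws cmd;
    for (int value : perObjectValues) {
        sharedUniform = value;  // the host write happens immediately
        // The read is deferred until execute() runs.
        cmd.commands.push_back([&] { seen.push_back(sharedUniform); });
    }
    cmd.execute();  // "submit": every read happens only now
    return seen;
}
```

In this model every draw observes the last value written, which matches the symptom described. It also suggests why a barrier recorded into the command buffer would not help: by the time the recorded commands run, the host writes have already taken place.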

ratchet_freak · March 4, 2016, 3:22am

Add all the matrices into the buffer and select from them at draw time. So your MVP buffer has several matrices one after the other.

When binding the descriptor set, add the offset of the current matrix.

OpenGL drivers most likely use ring buffers for the uniforms of draw calls (or the push constants depending on the hardware and amount of data).
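
A minimal sketch of the offset arithmetic this implies. The helper names `alignUp` and `matrixOffset` are hypothetical. The sketch assumes each matrix is placed at a multiple of the device's `VkPhysicalDeviceLimits::minUniformBufferOffsetAlignment`, which it takes as a plain parameter so the code does not depend on the Vulkan headers.

```cpp
#include <cassert>
#include <cstdint>

// Rounds `size` up to the next multiple of `alignment` (a power of two).
constexpr std::uint32_t alignUp(std::uint32_t size, std::uint32_t alignment) {
    return (size + alignment - 1) & ~(alignment - 1);
}

// Byte offset of the matrix for object `index` inside one shared buffer.
// `minAlignment` is assumed to come from the device limit
// minUniformBufferOffsetAlignment.
std::uint32_t matrixOffset(std::uint32_t index, std::uint32_t minAlignment) {
    assert(minAlignment != 0 && (minAlignment & (minAlignment - 1)) == 0);
    const std::uint32_t matrixSize = 16 * sizeof(float);  // one 4x4 float matrix
    return index * alignUp(matrixSize, minAlignment);
}
```

With an alignment of 256 bytes, for example, each 64-byte matrix occupies its own 256-byte stride, so the buffer has to be sized for the stride rather than for the raw matrix size.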

bsupnik · March 4, 2016, 6:45pm

Right … this is in the category of “hey, maybe that OpenGL thing wasn’t so bad after all…it sure did a lot of stuff for me.”

Seriously, the GL driver was probably “windowing” the uniform for you - that is, allocating a larger buffer and writing each successive uniform into sequential memory until the buffer is filled.

So what you need to do is:

1. As you write commands, maintain a pointer into your buffer for the next free 16-float “slot” for a matrix.
2. For each draw call, you need to provide a way for the shader to get at the right part of the buffer. It looks like there are two choices:
   2a. Use a single integer push constant to specify which matrix you want. The push constants can be updated between draw calls cheaply with vkCmdPushConstants.
   2b. Use a dynamic uniform descriptor in your descriptor set and use vkCmdBindDescriptorSets to rebind your descriptor sets with new dynamic offsets.
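
The slot bookkeeping from step 1 could look roughly like the sketch below. `MatrixSlotAllocator` is a hypothetical helper, not part of Vulkan. It only hands out slot indices; copying the matrix into the mapped buffer and passing the index to the shader (via 2a or 2b) is left to the caller.

```cpp
#include <cstddef>

// Hypothetical helper: hands out consecutive 16-float slots in one buffer.
class MatrixSlotAllocator {
public:
    explicit MatrixSlotAllocator(std::size_t slotCount) : capacity_(slotCount) {}

    // Writes the next free slot index to `slot` and returns true,
    // or returns false (leaving `slot` untouched) when the buffer is full.
    bool allocate(std::size_t& slot) {
        if (next_ == capacity_) return false;
        slot = next_++;
        return true;
    }

    // Offset of a slot in floats from the start of the mapped buffer.
    static std::size_t floatOffset(std::size_t slot) { return slot * 16; }

    // Makes all slots available again. Only safe once the GPU has finished
    // reading the draws that used them.
    void reset() { next_ = 0; }

private:
    std::size_t capacity_;
    std::size_t next_ = 0;
};
```

The returned slot index is what 2a would send as the integer push constant. For 2b it would be converted to a byte offset that also respects the device's uniform buffer offset alignment.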

I don’t know how fast 2b will be, but if I understand the push constant pathway, it is possibly the fastest way to get a single number to a single draw call.

(Theoretically you could dump a uniform in a push constant, but this is usually not a great idea for anything other than indices - I think the push constant budget is limited.)
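
To put a number on that budget: the device reports it as `VkPhysicalDeviceLimits::maxPushConstantsSize`, and the Vulkan specification requires at least 128 bytes. The sketch below checks candidate push constant blocks against that limit. The struct names and `fitsInPushConstants` are hypothetical, and the limit is passed as a plain integer so the code stands alone.

```cpp
#include <cstdint>

// Smallest maxPushConstantsSize an implementation may report, in bytes.
constexpr std::uint32_t kGuaranteedPushConstantBytes = 128;

// Hypothetical push constant block holding only a matrix index.
struct ObjectPushConstants {
    std::int32_t matrixId;
};

// Hypothetical blocks that push whole matrices instead of an index.
struct Mat4 {
    float m[16];
};
struct ThreeMatrices {
    Mat4 model, view, projection;
};

// True when a block of type T fits within `deviceLimit` bytes.
template <typename T>
constexpr bool fitsInPushConstants(
    std::uint32_t deviceLimit = kGuaranteedPushConstantBytes) {
    return sizeof(T) <= deviceLimit;
}

static_assert(fitsInPushConstants<ObjectPushConstants>(),
              "an index always fits in the guaranteed budget");
```

A single 64-byte matrix fits in the guaranteed minimum, but three of them (192 bytes) do not, which is one reason an index is the safer thing to push.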

There’s a second design decision once you get a single windowed matrix working. When you get a second uniform, what do you do?

1. Put both uniforms next to each other in a single buffer. You’ll need to copy both uniforms each time either one changes, but you only need one index.
2. Use two separate buffers, one for each uniform, and use two indices.

This is a trade-off between uniform update overhead and using more push constants; if you can group your uniforms by when they tend to change, you can get a win.

I don’t know the limits on how many separate dynamic offsets can be in a descriptor set or how expensive they are.

Cheers
Ben

Silverlan · March 4, 2016, 10:30pm

Thanks, I’ve already tried using a single buffer with method 2b, which worked quite well.
2a sounds interesting, I’ll give that a try and just compare performance.
Is there any way to find out how many push constants there can be?

Also, do I still need memory barriers with these methods?

Alfonse_Reinheart · March 4, 2016, 10:36pm

You do not need memory barriers, per se. But you do need to ensure memory coherency, as with any other mapping operation. So you either map coherently or you map incoherently and flush those memory ranges you’ve mapped.
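
For the non-coherent case, the range handed to `vkFlushMappedMemoryRanges` has alignment requirements: the offset and size of a `VkMappedMemoryRange` generally need to be multiples of `VkPhysicalDeviceLimits::nonCoherentAtomSize`. A minimal sketch of that rounding follows. `FlushRange` and `alignedFlushRange` are hypothetical names, and the atom size is passed in as a plain integer.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical mirror of the offset/size pair in VkMappedMemoryRange.
struct FlushRange {
    std::uint64_t offset;
    std::uint64_t size;
};

// Expands [offset, offset + size) outward so that both ends land on
// multiples of `atomSize` (assumed to be nonCoherentAtomSize).
FlushRange alignedFlushRange(std::uint64_t offset, std::uint64_t size,
                             std::uint64_t atomSize) {
    assert(atomSize != 0);
    const std::uint64_t begin = (offset / atomSize) * atomSize;  // round down
    const std::uint64_t end =
        ((offset + size + atomSize - 1) / atomSize) * atomSize;  // round up
    return {begin, end - begin};
}
```

The caller still has to clamp the result to the size of the memory allocation. Memory allocated with `VK_MEMORY_PROPERTY_HOST_COHERENT_BIT` skips the flush entirely.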

Silverlan · March 5, 2016, 7:32am

I’ve read up on push constants, however I’m unsure how I could use them to select the correct matrix.
The push constants can be accessed inside the shader using:


layout(push_constant) uniform pushBlock {
	int matrixId;
} pushConstantsBlock;

But where would I select the actual matrix from? I’d need multiple matrices in the shader (one for each object that needs to be rendered), and that’s just not feasible in my case, since there can be several hundred objects at once. Maybe I’m missing something here?

Anyway, method 2b works splendidly for generic object data (MVP matrix, color, etc.), however I’m unsure what to do about textures (Image + Image View + Sampler).
Basically, it’s the same issue: Two objects have different textures, both end up being rendered with the texture of the very last object in the render pass.
I’m using vkUpdateDescriptorSets to update the texture descriptor set with the object’s texture, right before each object is drawn.

I can’t use method 2b, since I can’t store images in a sequential buffer. I can’t find any examples in the Vulkan SDK with multiple textures either.
What now?

Alfonse_Reinheart · March 5, 2016, 8:23am

You would select the matrix from an array of all matrices you intend to use.

This is pretty standard stuff in high-performance rendering. You change a single index value per-object, which is used by the shader to fetch that object’s per-object data from buffers. It can also be used to fetch textures from array textures or arrays of samplers.

The InstanceIndex is a popular choice; you can set the base instance for each draw command in a multidraw indirect command. Or rather, it would have been in OpenGL if gl_InstanceID worked (in OpenGL, it doesn’t include the base instance). Fortunately, Vulkan’s InstanceIndex does work for this.
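
The difference between the two built-ins can be modelled in a few lines. Both function names below are hypothetical; they simply list the values a vertex shader would observe for one instanced draw, assuming Vulkan's InstanceIndex starts at the draw's first instance while OpenGL's gl_InstanceID always starts at zero.

```cpp
#include <cstdint>
#include <vector>

// Values of InstanceIndex seen by a Vulkan draw: they start at firstInstance.
std::vector<std::uint32_t> vulkanInstanceIndices(std::uint32_t firstInstance,
                                                 std::uint32_t instanceCount) {
    std::vector<std::uint32_t> indices;
    for (std::uint32_t i = 0; i < instanceCount; ++i) {
        indices.push_back(firstInstance + i);
    }
    return indices;
}

// Values of gl_InstanceID seen by an OpenGL draw: the base instance is
// ignored, so the sequence always starts at zero.
std::vector<std::uint32_t> openGlInstanceIds(std::uint32_t /*baseInstance*/,
                                             std::uint32_t instanceCount) {
    std::vector<std::uint32_t> ids;
    for (std::uint32_t i = 0; i < instanceCount; ++i) {
        ids.push_back(i);
    }
    return ids;
}
```

Drawing a single instance with the first instance set to the object's index therefore gives the Vulkan shader a value it can use directly to look up that object's entry in a per-object array.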

I’m using vkUpdateDescriptorSets to update the texture descriptor set with the object’s texture, right before each object is drawn.

OK, Vulkan is very well designed for performance. That is, the API makes it abundantly clear when you are about to do something that is not good for performance. Just look at the renderpass stuff; does it seem to you like it’s reasonably fast to change renderpasses frequently? Of course not; it’s harder to change renderpasses (requiring lots of pipelines, since they’re all linked to the renderpass) than it is to change subpasses.

Subpasses are cheap; renderpasses are expensive.

Similarly, which will probably take more time: vkUpdateDescriptorSets, which changes the entire descriptor set, or using a push constant, which the shader uses to select which texture to sample from, thus allowing you to have a single descriptor set which remains static across multiple rendering calls?

OpenGL makes the former (binding a texture) simpler and easier to use than the latter. But the latter is almost certainly faster. Vulkan makes it clear which is the better choice for performance.

You seem familiar with standard OpenGL-style development. You should take some time to look at the AZDO presentation, which effectively lays out how Vulkan-style development and structure ought to be done (by using OpenGL features to emulate it).

You should not attempt to code Vulkan applications the way you would OpenGL applications. Vulkan exists for performance; if you’re not willing to do things the fast way, you shouldn’t be using it at all.

I can’t use method 2b, since I can’t store images in a sequential buffer.

You don’t need to store them in a sequential buffer. You store them in an array of images. They all get one binding location, and your descriptor layout would contain an array of images for that location.

Alfonse_Reinheart · March 5, 2016, 8:56am

Similarly, which will probably take more time: vkUpdateDescriptorSets, which changes the entire descriptor set, or using a push constant, which the shader uses to select which texture to sample from, thus allowing you to have a single descriptor set which remains static across multiple rendering calls?

OpenGL makes the former (binding a texture) simpler and easier to use than the latter. But the latter is almost certainly faster. Vulkan makes it clear which is the better choice for performance.

FYI: if you can’t do that because the hardware lacks the resources (that is, it doesn’t let you make arrays big enough), then at least stop updating the descriptor sets. Give each object (or group of objects) its own set and switch between them.

Dynamically changing the contents of a descriptor set is not something you should frequently be doing.