vkCmdCopyBuffer details

Is it true there is no guarantee of the order in which CopyBuffer operations are actually carried out? Let’s say I need to resize a buffer and add more data to it. How can I ensure the data in the new buffer is added after the old buffer contents are first copied over? Would I use a memory barrier that uses VK_ACCESS_TRANSFER_WRITE_BIT for both the source and destination access mask?

Another approach could be to have a staging buffer associated with each GPU buffer. This would make it easier to keep data in sync but it would double the amount of VRAM a program uses. (Actually, since each command buffer needs a separate staging buffer, now we are talking about quadruple buffering…)

That’s true of pretty much everything in Vulkan. If the documentation for a command does not specify any ordering, and it’s not one of the few other cases of implicit ordering, then ordering is not guaranteed.

Why do you need to? The new data you’re adding doesn’t overlap with the old data, right?

It could. Let’s say the user does this:

  1. Adds a new mesh that makes the global vertex data bigger than the capacity of the buffer.
  2. Deletes a mesh that resides within the existing buffer size.
  3. Creates a new mesh, and the space from the deleted mesh is reallocated for the new mesh.

So what needs to happen is this:

  1. New (bigger) buffer is created.
  2. The contents of the old buffer are copied into the new buffer, and the new mesh that caused the resize is copied into the newly added area of the bigger buffer.
  3. The data for the second new mesh is copied into the space that was reallocated from the deleted mesh.

Any weird sequence of events like this could happen. It’s not an impossible problem, but I am trying to come up with a good general solution that is not too complicated and will give good results.
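For concreteness, the kind of recording I have in mind for steps 1–3 is roughly this (just a sketch; the handles, sizes, and offsets are placeholders, and the barrier is the transfer-to-transfer one I asked about above):

    #include <vulkan/vulkan.h>

    // Rough sketch of steps 1-3. Assumes newBuffer was already created bigger than
    // oldBuffer, and the two new meshes sit in a host-visible staging buffer.
    void recordResizeCopies(VkCommandBuffer cmd,
                            VkBuffer oldBuffer, VkBuffer newBuffer, VkBuffer staging,
                            VkDeviceSize oldSize,
                            VkDeviceSize stagingOffsetMesh1, VkDeviceSize sizeMesh1,
                            VkDeviceSize stagingOffsetMesh2, VkDeviceSize sizeMesh2,
                            VkDeviceSize freedOffset)
    {
        // Step 2a: copy the entire old contents to the front of the new buffer.
        VkBufferCopy oldContents{0, 0, oldSize};
        vkCmdCopyBuffer(cmd, oldBuffer, newBuffer, 1, &oldContents);

        // Step 2b: copy the mesh that triggered the resize into the newly added area.
        // Its destination does not overlap the first copy, so no barrier is needed yet.
        VkBufferCopy appended{stagingOffsetMesh1, oldSize, sizeMesh1};
        vkCmdCopyBuffer(cmd, staging, newBuffer, 1, &appended);

        // Step 3 overwrites part of the region written in step 2a (the freed space),
        // so the writes have to be ordered: a transfer-to-transfer barrier with
        // VK_ACCESS_TRANSFER_WRITE_BIT for both the source and destination access mask.
        VkBufferMemoryBarrier barrier{};
        barrier.sType               = VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER;
        barrier.srcAccessMask       = VK_ACCESS_TRANSFER_WRITE_BIT;
        barrier.dstAccessMask       = VK_ACCESS_TRANSFER_WRITE_BIT;
        barrier.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
        barrier.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
        barrier.buffer              = newBuffer;
        barrier.offset              = freedOffset;
        barrier.size                = sizeMesh2;
        vkCmdPipelineBarrier(cmd,
                             VK_PIPELINE_STAGE_TRANSFER_BIT, VK_PIPELINE_STAGE_TRANSFER_BIT,
                             0, 0, nullptr, 1, &barrier, 0, nullptr);

        // Step 3: the second new mesh goes into the space freed by the deleted mesh.
        VkBufferCopy reusedHole{stagingOffsetMesh2, freedOffset, sizeMesh2};
        vkCmdCopyBuffer(cmd, staging, newBuffer, 1, &reusedHole);
    }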

I suppose I could create a new queue for the buffer resize. This is a rare event. Are queues guaranteed to be executed in order?

That seems like a wasted opportunity to me. Let’s break down your example.

You have 3 meshes in the original buffer: A, B and C, in that order and tightly packed. Now, you want to add mesh D, but there’s not enough space for it. So you need a reallocation.

But “at the same time”, you independently decide that B is no longer needed.

The correct way to do this is to copy A, copy C, and copy D, so that the new buffer contains A, C, and D.

If you do it this way, you won’t have a B-shaped hole in the middle of your mesh data. Or more specifically, a sizeof(D) - sizeof(B) shaped hole. It’s all nice and compact, just like it was when you started.

Now, doing this if you’re just deleting B is a bad idea; it’s faster to just leave the hole there. And if you’ve already deleted B, and D will fit into B’s former storage, it’s faster to just copy D over B’s data.

But once you decide that you need to do reallocation, it’s best to get rid of any holes in your storage. You’re already having to pay the cost of waiting on a memory transfer operation, so it makes sense to take a little extra time and optimize your memory layout.

My main point is that, if you do things this way, there are no overlapping writes (that is, no attempts to write to the same memory). The copy of A doesn’t overlap with the copy of C, and neither overlaps with the copy of D. You can do GPU-to-GPU copies of A and C, then update the memory for D via a staging buffer or just mapped memory writes. There are no data races.
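To make that concrete, recording the compacting copies could look roughly like this (the mesh offsets and sizes are placeholders, and the whole thing is just a sketch):

    #include <vulkan/vulkan.h>

    // Sketch of the compacting reallocation: A and C are copied GPU-to-GPU into the
    // new buffer, tightly packed, and the new mesh D is uploaded right after them.
    void recordCompactingRealloc(VkCommandBuffer cmd,
                                 VkBuffer oldBuffer, VkBuffer newBuffer, VkBuffer staging,
                                 VkDeviceSize offsetA, VkDeviceSize sizeA,
                                 VkDeviceSize offsetC, VkDeviceSize sizeC,
                                 VkDeviceSize stagingOffsetD, VkDeviceSize sizeD)
    {
        // A stays at the front of the new buffer; C slides down so it sits directly
        // after A, which removes the B-shaped hole.
        VkBufferCopy copyA{offsetA, 0, sizeA};
        VkBufferCopy copyC{offsetC, sizeA, sizeC};
        VkBufferCopy regions[] = {copyA, copyC};
        vkCmdCopyBuffer(cmd, oldBuffer, newBuffer, 2, regions);

        // D is new data, uploaded from the staging buffer right after C. None of the
        // destination regions overlap, so no transfer-to-transfer barrier is needed
        // between these copies; a barrier is only needed before something reads newBuffer.
        VkBufferCopy copyD{stagingOffsetD, sizeA + sizeC, sizeD};
        vkCmdCopyBuffer(cmd, staging, newBuffer, 1, &copyD);
    }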

Well, there is also the scenario where you have A-E and then B and D are deleted, in which case you will need a memory barrier in between each “hole closure”. There is also the need to re-assign integer handles that have already been passed out. I could use an object that hides the internal memory position, but what I have is working now. The allocation routine searches for holes in the memory and tries to find a space big enough to fit new meshes before it tacks them onto the end. Over time, memory will develop small gaps between mesh data, but a scene change results in most of the data being freed, so the whole thing gets reset anyway.
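For illustration, the hole search is along these lines (a simplified sketch, not the actual code; the names are placeholders):

    #include <vulkan/vulkan.h>
    #include <cstddef>
    #include <vector>

    // Simplified first-fit hole search: blocks is kept sorted by offset. Reuse the
    // first gap big enough for the new mesh, otherwise append at the end.
    struct MeshBlock { VkDeviceSize offset; VkDeviceSize size; };

    VkDeviceSize allocateMeshSpace(std::vector<MeshBlock>& blocks, VkDeviceSize size)
    {
        VkDeviceSize cursor = 0;
        for (std::size_t i = 0; i < blocks.size(); ++i) {
            if (blocks[i].offset - cursor >= size) {             // gap before block i fits
                blocks.insert(blocks.begin() + i, MeshBlock{cursor, size});
                return cursor;
            }
            cursor = blocks[i].offset + blocks[i].size;          // skip past this block
        }
        blocks.push_back(MeshBlock{cursor, size});               // no gap found: append
        return cursor;                                           // caller checks capacity
    }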

You did, however, give me another idea. In my initial implementation, memory was allocated as commands came in from the main thread, in whatever order the programmer called them. Now I have deferred all the memory changes so that all deletions occur first, followed by all additions. So if the user adds 100 meshes and removes 100 meshes (assuming they are all the same size), then the memory buffer will not change size.
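Conceptually it is something like this (again a simplified sketch with placeholder names; the real code tracks more state):

    #include <cstdint>
    #include <vector>

    // Deferral queue: requests are collected in arrival order, but applied with every
    // deletion before every addition, so space freed this frame can be reused by this
    // frame's additions. MeshAllocator is a placeholder for whatever owns the buffer.
    struct PendingMeshOps {
        std::vector<uint32_t>           deletions;  // handles queued for removal
        std::vector<std::vector<float>> additions;  // vertex data queued for upload

        template <typename MeshAllocator>
        void flush(MeshAllocator& allocator) {
            for (uint32_t handle : deletions)       // all deletions first...
                allocator.freeMesh(handle);
            for (auto& vertices : additions)        // ...then all additions
                allocator.addMesh(vertices);
            deletions.clear();
            additions.clear();
        }
    };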

The memory barriers are working very well now and everything seems good.
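Roughly, the barriers are the transfer-to-transfer kind sketched earlier in the thread, plus something along these lines to make the copied vertex data visible to the vertex input stage before drawing (a simplified sketch with placeholder handles):

    #include <vulkan/vulkan.h>

    // After the transfer copies have been recorded, this barrier makes the written
    // vertex data visible to the vertex input stage before the buffer is drawn from.
    void barrierTransferToVertexInput(VkCommandBuffer cmd, VkBuffer vertexBuffer)
    {
        VkBufferMemoryBarrier barrier{};
        barrier.sType               = VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER;
        barrier.srcAccessMask       = VK_ACCESS_TRANSFER_WRITE_BIT;
        barrier.dstAccessMask       = VK_ACCESS_VERTEX_ATTRIBUTE_READ_BIT;
        barrier.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
        barrier.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
        barrier.buffer              = vertexBuffer;
        barrier.offset              = 0;
        barrier.size                = VK_WHOLE_SIZE;
        vkCmdPipelineBarrier(cmd,
                             VK_PIPELINE_STAGE_TRANSFER_BIT,
                             VK_PIPELINE_STAGE_VERTEX_INPUT_BIT,
                             0, 0, nullptr, 1, &barrier, 0, nullptr);
    }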