Staging texture with host-coherent memory not always copied to the target correctly

Hello, I’ll try to explain my setup here:

For all graphics commands, I have an API-agnostic interface that the application itself uses. For textures, two specific cases are relevant to this issue:

  1. One is ‘device local’, which resides in device memory and is either shader-read-only or general;
  2. The other is titled “CPU read-only”, which resides in host-coherent memory; its main use cases are client-side staging buffers and CPU read-back.

Despite the difference, both implement the same Map()/Unmap(write) interface. The host-coherent one just returns the mapped memory address on Map() and does pretty much nothing on Unmap(). All coherent memory stays mapped, because the larger allocations are potentially shared between buffers/textures. I have tried invoking vkFlushMappedMemoryRanges, even though it should not be needed.

On the other hand, whenever the client code ‘Maps’ the device-local variant, a new host-coherent buffer (not texture) is created under the hood and its mapped memory is returned. The Unmap(write) call then either discards the buffer if write is not requested, or creates a “one-time” command buffer, invokes vkCmdCopyBufferToImage and submits it to the queue without waiting on execution. Those one-off command buffers are later waited on by an external thread, and the staging buffer is discarded once execution is complete.
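To make the "waited on by an external thread" part concrete, here is a minimal sketch of such a deferred-release queue. It is not the original code: `DeferredReleaseQueue`, `Enqueue` and `Collect` are hypothetical names, and the completion check is an abstract callback (in Vulkan it could wrap `vkGetFenceStatus` on the fence passed to `vkQueueSubmit`).

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <iterator>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical helper: keeps staging resources alive until the one-time
// command buffer that reads from them has finished executing.
class DeferredReleaseQueue {
public:
    // isComplete: polls whether the GPU work has finished.
    // release:    frees the staging buffer and the one-time command buffer.
    void Enqueue(std::function<bool()> isComplete, std::function<void()> release) {
        std::lock_guard<std::mutex> lock(m_mutex);
        m_pending.push_back({std::move(isComplete), std::move(release)});
    }

    // Called periodically by the cleanup thread.
    // Returns the number of entries released by this call.
    std::size_t Collect() {
        std::vector<Entry> finished;
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            // Move completed entries to the tail, keeping submission order.
            auto firstDone = std::stable_partition(
                m_pending.begin(), m_pending.end(),
                [](const Entry& e) { return !e.isComplete(); });
            std::move(firstDone, m_pending.end(), std::back_inserter(finished));
            m_pending.erase(firstDone, m_pending.end());
        }
        // Run the release callbacks outside the lock.
        for (Entry& e : finished) e.release();
        return finished.size();
    }

private:
    struct Entry {
        std::function<bool()> isComplete;
        std::function<void()> release;
    };
    std::mutex m_mutex;
    std::vector<Entry> m_pending;
};
```

The point of the design is that the staging memory cannot be reused or rewritten by the CPU until the copy that reads it has completed.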

A similar interface also exists for buffers, although the project usually uses “manual” staging buffers where possible to avoid unnecessary command buffer creation. The exceptions are asynchronous texture loading and a few other one-off transactions that happen infrequently. Up until last week, I had not experienced any issues.

Having said that, here comes the problem I encountered recently:
For an atlas-type texture, I needed to occasionally update certain chunks.
For that, I create a staging texture, map it, fill the pixels, and unmap.
The next step is copying the relevant texture regions from the staging texture to the atlas using an external command buffer.
(Just to be thorough: I have layout transitions for the source/target textures before and after the region copy, and the transition to VK_IMAGE_LAYOUT_TRANSFER_SRC_OPTIMAL requests VK_PIPELINE_STAGE_HOST_BIT with VK_ACCESS_HOST_WRITE_BIT as the source stage and access mask, amongst others.)

The problem is that, for some strange reason, if the staging texture is allocated as host-coherent, roughly 1 in 10 times the regions are not copied in their entirety, as if the GPU was reading some cached data during the copy operation. If I create a “device-local” texture, the copy always works. The operation is infrequent enough that I do not mind a small memory and command buffer overhead, but I’d rather trust my API to do whatever it’s supposed to do…

Can anyone suggest what I might be missing here? Maybe there’s a specific barrier I need to add somewhere?

Are you doing proper synchronization? Specifically, are you doing anything to prevent writing to the memory from the CPU while the GPU is reading from it?

What do the validation layers say about your operations?

The staging buffer exists as long as the command buffer needs it. While it exists, no other buffer can access the same memory and no other part of the code has a reference to the buffer.
That makes it more or less impossible for any CPU thread to write to that chunk of memory.

And the way my allocator works, I can say with a fair bit of confidence that, regardless of whether I use an internal buffer or a staging texture, the actual memory allocated and manipulated will be the same (I wish it was not that easy to say that).

Now, what struck me is that right at the moment the textures are created, a one-time command buffer is executed that makes the initial layout transition from undefined. That’s all right for the device-resident ones, but for the host-coherent ones we may actually need to wait on that command buffer’s execution before it is safe to write to them. I’ll update the code and check.

Update: By the looks of it, this was the actual issue. It does not break any more after the change.
Unfortunately for me, that means using that staging buffer will be a better alternative in my case, since it does not require halting anything, but I’ll deal with it :smiley: .
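A minimal sketch of the fix, for anyone landing here later. This models the idea only: `HostStagingTexture` is a hypothetical stand-in for the host-coherent texture, and `waitForInit` is an assumed callback that blocks until the initial layout transition has executed (in Vulkan it could call `vkWaitForFences` on the fence used for that submission). A transition from VK_IMAGE_LAYOUT_UNDEFINED does not have to preserve the image contents, so pixels written before it executes may be lost.

```cpp
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-in for the host-coherent staging texture.
class HostStagingTexture {
public:
    // waitForInit blocks until the initial layout transition has executed.
    HostStagingTexture(std::size_t sizeInBytes, std::function<void()> waitForInit)
        : m_pixels(sizeInBytes), m_waitForInit(std::move(waitForInit)) {}

    // The first Map() waits for the transition; later calls return immediately.
    void* Map() {
        if (m_waitForInit) {
            m_waitForInit();
            m_waitForInit = nullptr;  // the wait is only needed once
        }
        return m_pixels.data();
    }

    // Coherent memory stays mapped, so there is nothing to do here.
    void Unmap(bool /*write*/) {}

private:
    std::vector<unsigned char> m_pixels;  // stands in for the mapped memory
    std::function<void()> m_waitForInit;
};
```

Deferring the wait to the first Map() rather than to creation keeps texture creation non-blocking. An alternative that may avoid the wait altogether for linear-tiled images is creating them with VK_IMAGE_LAYOUT_PREINITIALIZED as the initialLayout, which the Vulkan specification describes as intended for images whose contents are written by the host before any layout transition; whether that fits this particular allocator is an assumption on my part.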

Anyway, @Alfonse_Reinheart thanks for your response. Even if the CPU access was not the issue, the question led me down the correct rabbit hole :smiley: .