Clarification on coherent memory host access guarantees

I understand that vkQueueSubmit defines a memory dependency on prior host operations, but I am concerned about the implications around future host operations.

(7.9 Host Write Ordering Guarantees)

For example, say I update a vertex buffer each frame, with multiple frames in flight. Tutorials seem to recommend: maintain the “ground truth” data in host memory, and create a device vertex buffer for each frame in flight. While preparing each frame, copy the data into that frame’s buffer via a memory map. I understand the vkQueueSubmit guarantee, and since each frame has exclusive access to its own buffer, no future host write can collide with a read still in flight. (Alternatively, use a staging buffer per frame — the vkQueueSubmit guarantee applies to the staging buffer, and explicit memory barriers cover the buffer copy.)
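For concreteness, the recommended per-frame scheme looks roughly like this (pseudocode in the same style as below; `frameFence`, `mappedVertexBuffer`, and `MAX_FRAMES_IN_FLIGHT` are my own placeholder names):

    // one buffer + fence per frame in flight
    each frame:
        i = frameIndex % MAX_FRAMES_IN_FLIGHT
        vkWaitForFences(frameFence[i])                    // frame i's previous use is done
        memcpy(mappedVertexBuffer[i], groundTruth, size)  // safe: no reader in flight
        record draw commands reading vertexBuffer[i]
        vkQueueSubmit(..., signal: frameFence[i])         // implicit host-write visibility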

I’m curious how to do this correctly without the copies. I could create a single vertex buffer in host-visible, host-coherent memory. On a given frame I update that data through a memory map, and vkQueueSubmit implicitly guarantees those updates are available for rendering. However, on the next frame, how can I be sure my updates are not an access violation while the prior frame is still reading the buffer?
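The copy-free version I have in mind is roughly this (pseudocode; the names are mine):

    // single persistently mapped, HOST_VISIBLE | HOST_COHERENT buffer
    each frame:
        memcpy(mappedVertexBuffer, groundTruth, size)   // <- is this safe while the
                                                        //    previous frame still reads?
        record draw commands reading vertexBuffer
        vkQueueSubmit(...)   // guarantees the writes above are visible to the device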

Is it enough to include a buffer memory barrier recorded after all the draw calls? Is it necessary? Pseudocode:


vkCmdEndRendering (or vkCmdEndRenderPass)
vkCmdPipelineBarrier

    srcStageMask:  VERTEX_INPUT
    srcAccessMask: MEMORY_READ (or SHADER_READ)
    dstStageMask:  HOST
    dstAccessMask: HOST_WRITE


I tested with and without the above barrier, and things seem correct, with no validation layer complaints. However, the little toy project I’m using to learn doesn’t stress either my GPU or CPU at this point, so I don’t really trust “it seems fine” — surely the bottleneck is elsewhere in the driver and I’m simply never hitting these access violations.

If I misunderstand the barrier guarantees and this is not sufficient — what’s the right way to handle it? It seems the host would actually need to wait before updating the buffer on the next frame, say a fence on vkQueueSubmit that the next frame waits for and resets? Surely that’s where the performance is wasted, and that’s why this technique isn’t used?

I did find an answer, so I feel I should update this topic for others:

No, this barrier is not necessary or sufficient for the task. Split the submission and use a fence or timeline semaphore instead.

It is not necessary because memory barriers deal with making writes available and visible — they do nothing for reads. A memory barrier has no use in a write-after-read hazard, which this is. An execution dependency alone is what’s needed.

It is not sufficient because the execution dependency defined by vkCmdPipelineBarrier only orders operations on the device. If the write-after-read also happened on-device, a pipeline barrier would be sufficient; but since the write in my scenario occurs on the host, explicit host synchronization (a fence or semaphore wait) is needed.

Split the work into multiple submissions. With timeline semaphores, you can use a single vkQueueSubmit with multiple VkSubmitInfo structures. With fences, you need two vkQueueSubmit calls.

The first submission includes normal draw command buffers; its implicit host-to-device memory dependency is sufficient.

The second submission contains a single command buffer with a single vkCmdPipelineBarrier(srcStageMask: VERTEX_ATTRIBUTE_INPUT, dstStageMask: HOST) execution dependency, and should signal a timeline semaphore (or fence) that the host can explicitly wait on before updating the vertex data.
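Roughly, the timeline-semaphore variant looks like this (pseudocode; `framesInFlight` and the buffer names are mine):

    vkQueueSubmit with two VkSubmitInfo:
        #1: drawCommandBuffer                      // normal frame work
        #2: barrierCommandBuffer containing
                vkCmdPipelineBarrier
                    srcStageMask: VERTEX_INPUT
                    dstStageMask: HOST
            signal: timelineSemaphore -> frameIndex

    // host side, before updating vertex data for frame N:
    vkWaitSemaphores(timelineSemaphore, value: frameIndex - framesInFlight)
    memcpy(mappedVertexBuffer, groundTruth, size)   // safe: no device reader remains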