The software I’m working on does a lot of image compositing, some of it using Photoshop-like blend modes that both read and write the destination (read-modify-write). In the current OpenGL renderer, this is done by copying the destination texture before using it as a framebuffer attachment, then sampling the copy. This obviously isn’t great, but short of using extensions that allow reading the previous value of the destination pixel (and those extensions seem to rely on tiled renderers on mobile platforms), it still seems like the only way to achieve this.
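For reference, the sketch below models on the CPU what the current copy-then-sample path computes for a "multiply" layer. It is an illustration under stated assumptions, not renderer code: `Rgba`, `blend_pixel` and `composite_via_copy` are hypothetical names, colours are straight (non-premultiplied) alpha in [0, 1], and the per-channel formula follows the separable blend-mode compositing described in the W3C Compositing and Blending spec, with multiply defined as B(Cb, Cs) = Cb × Cs. The blend only ever reads the backdrop from `scratch`, which plays the role of the copied texture the shader samples.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical straight-alpha RGBA pixel, channels in [0, 1]. */
typedef struct { float r, g, b, a; } Rgba;

/* Separable "multiply" blend function B(Cb, Cs) = Cb * Cs. */
static float multiply(float cb, float cs) { return cb * cs; }

/* Mix the blend result with the source by backdrop alpha, then composite
   source-over. Returns the premultiplied output channel. */
static float blend_channel(float cs, float as, float cb, float ab) {
    float mixed = (1.0f - ab) * cs + ab * multiply(cb, cs);
    return as * mixed + ab * cb * (1.0f - as);
}

/* Blend one source pixel over one backdrop pixel; returns straight alpha. */
static Rgba blend_pixel(Rgba src, Rgba dst) {
    Rgba out;
    out.a = src.a + dst.a * (1.0f - src.a);
    if (out.a == 0.0f) { Rgba zero = {0, 0, 0, 0}; return zero; }
    out.r = blend_channel(src.r, src.a, dst.r, dst.a) / out.a; /* un-premultiply */
    out.g = blend_channel(src.g, src.a, dst.g, dst.a) / out.a;
    out.b = blend_channel(src.b, src.a, dst.b, dst.a) / out.a;
    return out;
}

/* Copy-then-sample: snapshot the destination, then read only from the snapshot
   while writing the original. `scratch` must hold at least n pixels. */
void composite_via_copy(Rgba *dst, const Rgba *src, Rgba *scratch, size_t n) {
    memcpy(scratch, dst, n * sizeof *dst);          /* "copy the destination texture" */
    for (size_t i = 0; i < n; ++i)
        dst[i] = blend_pixel(src[i], scratch[i]);   /* sample the copy, write the target */
}
```

The memcpy stands in for the GPU copy; the cost being discussed here is that this copy has to complete before the draw that samples it.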
We’ve been planning to migrate to a Vulkan architecture, which I thought would give better control over this, but I may have been operating under false premises. Given the level of control Vulkan allows over subresources, I thought it would be possible to write to one part of an image while a different part is being read from (i.e. copied). But no command appears to allow copying an image region inside a render pass, and using a shader to do so would require beginning a new render pass anyway to set up the load/store operations and barriers. The documentation for VkFramebufferCreateInfo as well as the discussion on Vulkan-Docs issue #299 seem to confirm this is impossible.
But because this strictly reads and writes the same fragment, it seems like it might be doable using attachments and a subpass that simply forwards the input attachment, so its value can be read while the same image is used as the output attachment? If anything, Alfonse_Reinheart’s reply to question #7035 in this forum seems to imply that it might even be doable within the same draw call?
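The intuition behind the input-attachment idea can be checked with a small CPU model. With hypothetical names (`Fragment`, `draw_from_copy`, `draw_in_place`) and a simplified multiply against an opaque grey backdrop, reading the backdrop from a snapshot and reading it in place give identical results as long as every pixel is touched at most once. They diverge as soon as two fragments hit the same pixel: the in-place loop sees the first fragment's write only because the CPU runs the fragments in order, and on the GPU that ordering is not something you get for free.

```c
#include <stddef.h>
#include <string.h>

#define MAX_PIXELS 64

/* Hypothetical fragment: the pixel it covers and its (opaque, grey) source value. */
typedef struct { size_t pixel; float source; } Fragment;

/* Simplified multiply blend against an opaque single-channel backdrop. */
static float multiply_opaque(float backdrop, float source) { return backdrop * source; }

/* Copy-then-sample: every fragment reads the backdrop from a snapshot taken
   before the draw. Assumes n <= MAX_PIXELS and every frags[i].pixel < n. */
void draw_from_copy(float *image, size_t n, const Fragment *frags, size_t count) {
    float snapshot[MAX_PIXELS];
    if (n > MAX_PIXELS) return;
    memcpy(snapshot, image, n * sizeof *image);
    for (size_t i = 0; i < count; ++i)
        image[frags[i].pixel] = multiply_opaque(snapshot[frags[i].pixel], frags[i].source);
}

/* In-place read-modify-write: each fragment reads the current value of its own
   pixel, like an input attachment that is also the colour attachment. */
void draw_in_place(float *image, const Fragment *frags, size_t count) {
    for (size_t i = 0; i < count; ++i)
        image[frags[i].pixel] = multiply_opaque(image[frags[i].pixel], frags[i].source);
}
```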
I don’t have code examples yet, let alone an actual Vulkan backend, but clearing this up would strongly inform the architecture work I’m putting together right now.
(Apologies for just referencing instead of linking: as a newly registered user, it appears I’m not able to post links.)
Texture barriers are pretty spot on, interesting! I’ve been glossing over anything more recent than OpenGL 4.2, due to having to support macOS. Simply reordering draw calls to stagger the writes and copies across different (scissored) parts of the image didn’t make much difference, most likely because the driver forces synchronization when a single framebuffer object is used, rather than multiple texture views of the same texture as described. I’ll give it a closer look if we want to further refine the OpenGL implementation.
My hope was that a modern Vulkan implementation would be mostly functional on macOS through MoltenVK, but even that’s not guaranteed. When you say only a single read/modify/write is possible, I take it you mean across multiple draw calls on the same fragment, within the same render pass (as opposed to writing multiple times in the same shader, which really just modifies a shader register rather than making a round trip to image memory)?
You can only do a single read/modify/write per-sample; you have to put a barrier between each read/modify/write in a subpass.
Thinking in terms of tile-based renderers and what input attachments translate to, I’d venture a barrier would also be necessary for a write followed by a read-modify-write? For instance, a normally blended image underneath a “multiply” blended image would first output a straight alpha-blended value using the blend unit, then output a value blended against the destination in the shader for the multiply. Without a barrier within the subpass, the input attachment read by the second draw would not be guaranteed to be up to date and might return the value from the beginning of the pass.
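The normal-then-multiply scenario can be modelled with a single sample. This is a toy model, not Vulkan semantics: `Sample`, `draw_multiply_via_input` and `barrier` are hypothetical, and the model takes the pessimistic case where an unsynchronized input-attachment read returns the value from the start of the pass (one possible outcome; nothing guarantees any particular value). Fixed-function blending reads the latest value because blending is ordered per sample, while the shader-side multiply only sees writes published by `barrier`, which stands in for a pipeline barrier recorded inside a subpass that has a self-dependency.

```c
#include <stdbool.h>

/* Hypothetical model of one sample during a subpass. */
typedef struct {
    float visible; /* value an input-attachment read is guaranteed to observe */
    float latest;  /* most recent colour-attachment write, possibly not yet visible */
} Sample;

static void begin_pass(Sample *s, float loaded) { s->visible = loaded; s->latest = loaded; }

/* Fixed-function "normal" blending: the blend unit always sees the latest value. */
static void draw_alpha_over(Sample *s, float src, float src_alpha) {
    s->latest = src * src_alpha + s->latest * (1.0f - src_alpha);
}

/* Shader-side multiply reading the input attachment: sees only `visible`. */
static void draw_multiply_via_input(Sample *s, float src) {
    s->latest = s->visible * src;
}

/* Stand-in for the in-subpass barrier: makes prior writes visible to reads. */
static void barrier(Sample *s) { s->visible = s->latest; }

/* Normal layer, optional barrier, then multiply layer; returns the final value. */
float composite(float loaded, bool with_barrier) {
    Sample s;
    begin_pass(&s, loaded);
    draw_alpha_over(&s, 1.0f, 0.5f);   /* normal layer underneath */
    if (with_barrier) barrier(&s);
    draw_multiply_via_input(&s, 0.5f); /* multiply layer on top */
    return s.latest;
}
```

With a black backdrop, the barrier version multiplies against the normal layer's result (0.5 × 0.5), while the unsynchronized version multiplies against the stale black and the normal layer is lost.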
To the extent that macOS still supports OpenGL, its implementation provides NV_texture_barrier, which is the same thing as the core texture barrier feature (glTextureBarrier).
I mean within the same subpass on the same sample, period. Different draw calls, the same draw call, it doesn’t matter; if you need visibility or ordering of writes and reads, then you need a barrier to establish that order.
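Since a barrier is only needed when a read depends on an earlier write to the same sample, one way to keep their number down (in the spirit of staggering work across scissored regions) is to track the bounds written since the last barrier and emit a barrier only when a destination-reading draw overlaps them. The sketch below just counts where barriers would go; in a real renderer that is where glTextureBarrier (GL 4.5 / ARB_texture_barrier) or an in-subpass vkCmdPipelineBarrier would be recorded. `Rect`, `Draw` and `count_barriers` are hypothetical, each draw is assumed to touch each pixel at most once, and only read-after-write hazards are considered.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_DIRTY 32

/* Hypothetical draw bounds in pixels, half-open: [x0, x1) x [y0, y1). */
typedef struct { int x0, y0, x1, y1; } Rect;

typedef struct {
    Rect bounds;
    bool reads_destination; /* true for read-modify-write modes such as multiply */
} Draw;

static bool overlaps(Rect a, Rect b) {
    return a.x0 < b.x1 && b.x0 < a.x1 && a.y0 < b.y1 && b.y0 < a.y1;
}

/* Bounding box of two rects, used when the dirty list is full (conservative). */
static Rect unite(Rect a, Rect b) {
    Rect r = {
        a.x0 < b.x0 ? a.x0 : b.x0, a.y0 < b.y0 ? a.y0 : b.y0,
        a.x1 > b.x1 ? a.x1 : b.x1, a.y1 > b.y1 ? a.y1 : b.y1
    };
    return r;
}

/* Counts the barriers needed so that every destination read sees all earlier writes. */
size_t count_barriers(const Draw *draws, size_t count) {
    Rect dirty[MAX_DIRTY]; /* regions written since the last barrier */
    size_t ndirty = 0, barriers = 0;
    for (size_t i = 0; i < count; ++i) {
        if (draws[i].reads_destination) {
            for (size_t j = 0; j < ndirty; ++j) {
                if (overlaps(draws[i].bounds, dirty[j])) {
                    ++barriers;  /* a barrier goes right before this draw */
                    ndirty = 0;  /* all earlier writes are now visible */
                    break;
                }
            }
        }
        if (ndirty < MAX_DIRTY)
            dirty[ndirty++] = draws[i].bounds;
        else
            dirty[MAX_DIRTY - 1] = unite(dirty[MAX_DIRTY - 1], draws[i].bounds);
    }
    return barriers;
}
```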
That’s not too surprising. I too have worked on drivers (one for a tile-based GPU) where the underlying submit, transform, and rasterization queues for a specific render target are tightly bound to the GL FBO/framebuffer that the work was submitted through. To avoid internal driver synchronization (full pipeline flushes) on these drivers, you had to round-robin across a set of FBOs.
The tradeoff is that FBOs aren’t free. There’s underlying driver memory associated with each, not to mention the cost of reconfiguring one. Some drivers do have relatively fast paths for just rebinding the attachments when the size and format of the FBO don’t change. But on tile-based hardware, I’d still expect a full pipeline flush if you’ve rendered to this FBO recently.
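A minimal sketch of the round-robin scheme, assuming framebuffer objects are identified by plain integer names (as GL FBO names are) and have been created and configured elsewhere. `FboRing` and `fbo_ring_acquire` are hypothetical. `POOL_SIZE` is the tradeoff knob described above: a deeper ring makes it less likely that an FBO is reused while its previous work is still in flight, at the cost of more driver memory per object.

```c
#include <stddef.h>

#define POOL_SIZE 3

/* Hypothetical pool of framebuffer object names (GLuint-like integers). */
typedef struct {
    unsigned int fbo[POOL_SIZE];
    size_t next; /* index of the FBO to hand out next */
} FboRing;

/* Hands out FBOs in round-robin order so consecutive passes use different objects. */
unsigned int fbo_ring_acquire(FboRing *ring) {
    unsigned int name = ring->fbo[ring->next];
    ring->next = (ring->next + 1) % POOL_SIZE;
    return name;
}
```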
Right, and I surmise this is why this was formalised in Vulkan by making the framebuffer an intrinsic part of the render pass. If the driver/hardware works like that under the hood, might as well make it explicit.
In this particular case, the FBO doesn’t even change: glCopyImageSubData just gets called while the texture is still attached to the FBO. This has been behaving normally, so what I meant by forcing synchronization is that this must trigger some implicit memory barrier (or I’ve been incredibly lucky).