Faster Readbacks of Atomics

:question: Can anyone think of some unconventional methods for reading back the final value of a GPU atomic counter to the CPU? I’m looking for alternatives.

Why? Well, the obvious method is slow (here, at least). That being:

  1. Update atomic on GPU side
  2. glMemoryBarrier( GL_BUFFER_UPDATE_BARRIER_BIT ) ← SLOW!
  3. Read the atomic counter buffer back to the CPU, either delayed or not (e.g. glGetBufferSubData()); a sketch of this path follows the list.
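
A minimal sketch of that three-step path, assuming a GL 4.x context (the buffer handle, binding index, and dispatch size below are illustrative, not from my actual code):

```cpp
GLuint counterBuf;   // buffer backing the atomic counter (assumed already created)
GLuint result = 0;

// 1. GPU work that bumps the atomic counter.
glBindBufferBase(GL_ATOMIC_COUNTER_BUFFER, 0, counterBuf);
glDispatchCompute(numGroupsX, 1, 1);

// 2. Make the incoherent atomic writes visible to buffer-readback commands.
//    This is the call that costs ~0.5-0.7 msec just to queue on the CPU.
glMemoryBarrier(GL_BUFFER_UPDATE_BARRIER_BIT);

// 3. Read the final counter value back to the CPU.
glGetBufferSubData(GL_ATOMIC_COUNTER_BUFFER, 0, sizeof(GLuint), &result);
```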

The problem is, #2 is very expensive to queue on the CPU side. Not #3, #2. I’m seeing 0.5-0.7 msec just to queue this one GL call. I have no idea what the driver’s doing within this glMemoryBarrier() call, but it’s probably way more work than I really need it to do. glMemoryBarrier() doesn’t give you fine-grained control.

Funny thing is, with many of the other barrier bits (the GPU-recipient bits), this glMemoryBarrier() call consumes almost no CPU time to queue. But those bits don’t get me access to the atomic value on the CPU.

Any suggestions?

Vulkan interop? Obscure NVIDIA extensions? Any other ways to make incoherent atomic writes visible besides glMemoryBarrier()?

This is not too surprising if you think about what “buffer update” really means. Buffer update is for CPU interactions with the buffer’s data. This means that after the call, any CPU code that touches the memory needs to be able to see the results. However, the CPU can access a buffer through a persistently mapped pointer rather than an OpenGL call, so there may be no place after the barrier for the implementation to insert a CPU wait for the operations to complete.

So the barrier call itself has to do that.
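
To make that concrete, here is a hypothetical sketch of the situation: if the counter buffer is persistently mapped, the CPU can read it through a raw pointer, leaving no later GL call where the driver could hide a wait (names below are illustrative):

```cpp
// Persistently mapped counter buffer.
GLuint buf;
glCreateBuffers(1, &buf);
glNamedBufferStorage(buf, sizeof(GLuint), nullptr,
                     GL_MAP_READ_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT);
GLuint *mappedPtr = (GLuint *)glMapNamedBufferRange(
    buf, 0, sizeof(GLuint),
    GL_MAP_READ_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT);

// ... GPU work updates the atomic counter backed by buf ...

glMemoryBarrier(GL_BUFFER_UPDATE_BARRIER_BIT); // the driver can't know whether the next
GLuint value = *mappedPtr;                     // CPU access is a GL call or this raw read
```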

OpenGL’s synchronization API is just not low-level enough to be able to accurately express what you need to in order to avoid stalls like this. Ideally, a fence sync object could do it, but the “buffer update” barrier has no idea what specific prior commands you’re talking about. Therefore, the implementation has to assume that it is all of them and do a full finish.

I don’t think there’s a faster way of doing this readback in GL. If that half-millisecond is important to you, I don’t see a better alternative than switching to Vulkan with its much finer-grained synchronization. A simple host event + memory barrier would be all you need as part of the batch issuing the compute operations.
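
For illustration, a rough Vulkan sketch of that idea, assuming a host-visible, host-coherent counter buffer and a pre-created VkEvent (all names here are hypothetical):

```cpp
// Recorded into the same command buffer as the compute dispatch.
VkMemoryBarrier toHost = {
    .sType         = VK_STRUCTURE_TYPE_MEMORY_BARRIER,
    .srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT,
    .dstAccessMask = VK_ACCESS_HOST_READ_BIT,
};

vkCmdDispatch(cmd, numGroupsX, 1, 1);

// Make the shader's atomic writes visible to host reads...
vkCmdPipelineBarrier(cmd,
                     VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,
                     VK_PIPELINE_STAGE_HOST_BIT,
                     0, 1, &toHost, 0, NULL, 0, NULL);

// ...then signal an event the CPU can poll cheaply.
vkCmdSetEvent(cmd, readbackEvent, VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT);

// CPU side, some time later:
//   if (vkGetEventStatus(device, readbackEvent) == VK_EVENT_SET)
//       uint32_t value = *mappedCounterPtr;
```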

Right! Any buffer object, anywhere (CPU or GPU mem), based on any past incoherent update. Sledgehammer synchronization.

Right, except that for this one specifically, my read is there’s an exception. Here you want GL_CLIENT_MAPPED_BUFFER_BARRIER_BIT + sync wait. Still, you need a barrier.
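
Something along these lines, assuming the counter buffer is persistently mapped for reads (hypothetical names again):

```cpp
// Make the atomic writes visible to client mapped-buffer reads, then fence.
glMemoryBarrier(GL_CLIENT_MAPPED_BUFFER_BARRIER_BIT);
GLsync fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);

// Later (ideally a frame or so on, to avoid stalling), wait and read the pointer.
glClientWaitSync(fence, GL_SYNC_FLUSH_COMMANDS_BIT, 1000000000 /* 1 sec timeout */);
GLuint value = *mappedCounterPtr;   // persistently mapped counter buffer
glDeleteSync(fence);
```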

I’m glad that I’m not the only one thinking this. I’m no Vulkan expert, but I cannot fathom how someone who knows the low-level synchronization details required in a modern GPU graphics driver, and who values high-performance results from their driver, could possibly have conceived of the glMemoryBarrier() interface. Not only conceived of it, but deployed it as the solution. Then again, I’m a mere app developer, so maybe I’m missing something obvious here.

Ok, thanks for your feedback. While sometimes I realize it’s overkill, here I definitely do wish for Vulkan-like fine-grained event synchronization.

I think it’s more a matter of being the easiest way to handle it from an API perspective.

Consider the API of vkCmdPipelineBarrier. It has dozens of things to play with: the source and destination stages and scopes, an arbitrary number of memory ranges defined by resources, etc.
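
For example, a single buffer barrier lets you name the exact stages, access types, queue families, and even the byte range of one specific buffer (the handles below are hypothetical):

```cpp
VkBufferMemoryBarrier counterBarrier = {
    .sType               = VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER,
    .srcAccessMask       = VK_ACCESS_SHADER_WRITE_BIT,
    .dstAccessMask       = VK_ACCESS_HOST_READ_BIT,
    .srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED,
    .dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED,
    .buffer              = counterBuffer,     // just this one buffer...
    .offset              = 0,
    .size                = sizeof(uint32_t),  // ...and just these 4 bytes
};

vkCmdPipelineBarrier(cmd,
                     VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,  // source scope
                     VK_PIPELINE_STAGE_HOST_BIT,            // destination scope
                     0, 0, NULL, 1, &counterBarrier, 0, NULL);
```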

The only way to make something like that work from an OpenGL API perspective would be to create a whole object that encapsulates a barrier’s contents, then fill in its multitude of possible parameters, and pass that to glMemoryBarrier. And yeah, that would have been really great. But I imagine that the current API was picked because it was less complex.

Also, it’s a lot harder to conceptualize the behavior of such things without Vulkan’s notion of pipeline execution stages (and its memory model). Note that the Vulkan spec dedicates an entire chapter to synchronization. The sledgehammer approach is obviously not ideal, but it is a tool that can be much more easily specified.

Also, image load/store is over 10 years old.