GLSL Compute Shader - synchronization

Example case: at the beginning of the shader, each thread writes some data to workgroup shared memory (for example, to an array). Next, I want to be sure that every thread has finished its writes, because I need to perform some computation in which the shared memory will be read. A correct way to achieve that looks like this:
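A minimal sketch of the pattern described above. The workgroup size of 64, the shared array `cache`, and the output buffer `result` are all hypothetical names chosen for illustration, and the order of the two calls is one common arrangement rather than the only valid one:

```glsl
#version 430
layout(local_size_x = 64) in;

// Hypothetical output buffer, present only so the result is observable.
layout(std430, binding = 0) buffer Output {
    float result[];
};

// Hypothetical shared array: one slot per invocation in the workgroup.
shared float cache[64];

void main() {
    uint lid = gl_LocalInvocationIndex;

    // 1. Each invocation writes its own slot.
    cache[lid] = float(gl_GlobalInvocationID.x);

    // 2. Make the shared-memory writes visible to the other invocations
    //    in the workgroup.
    memoryBarrierShared();

    // 3. Wait until every invocation in the workgroup has reached this point.
    barrier();

    // 4. Reading a slot written by a different invocation is now safe.
    uint neighbour = (lid + 1u) % 64u;
    result[gl_GlobalInvocationID.x] = cache[lid] + cache[neighbour];
}
```

Note that `barrier()` has to be called from uniform control flow, so both calls sit at the top level of `main()` rather than inside a branch that only some invocations take.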


Without memoryBarrierShared(), there is no guarantee that data written to shared memory by one invocation will be visible to another. Is that true for both Nvidia and AMD? I wonder how this relates to the hardware and to other GPGPU APIs. A similar fragment of code in CUDA requires only one barrier function:


Is that true for both Nvidia and AMD?

Yes; it’s part of the OpenGL specification. If it doesn’t work, then it’s a (rather serious) driver bug.

I wonder how this relates to the hardware and to other GPGPU APIs.

You’re thinking far too deeply about this. The OpenGL specification defines visible behavior. It defines that calling [var]barrier()[/var] in a compute shader will halt every executing work item in that group until all of them have reached that point. It defines that calling [var]memoryBarrierShared()[/var] will cause previously executed writes to shared memory to become visible to other work items in the same work group.
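To show why both guarantees matter in practice, here is a hedged sketch of a workgroup-level sum reduction, where each step reads values that other work items wrote in the previous step. The buffer names `values` and `partialSums`, the shared array `scratch`, and the workgroup size of 128 are assumptions made for the example; it also assumes the input length is a multiple of 128, so there is no bounds check:

```glsl
#version 430
layout(local_size_x = 128) in;

// Hypothetical input and output buffers.
layout(std430, binding = 0) readonly buffer Input {
    float values[];
};
layout(std430, binding = 1) writeonly buffer Output {
    float partialSums[];
};

// Hypothetical scratch space: one slot per work item in the group.
shared float scratch[128];

void main() {
    uint lid = gl_LocalInvocationIndex;

    // Every work item loads one element into shared memory.
    scratch[lid] = values[gl_GlobalInvocationID.x];
    memoryBarrierShared(); // make the writes visible within the group
    barrier();             // wait for the whole group to get here

    // Tree reduction: halve the number of active work items each step.
    for (uint stride = 64u; stride > 0u; stride >>= 1u) {
        if (lid < stride) {
            scratch[lid] += scratch[lid + stride];
        }
        // Both calls stay outside the `if`, so every work item reaches
        // barrier() in uniform control flow.
        memoryBarrierShared();
        barrier();
    }

    // Work item 0 holds the sum for this workgroup.
    if (lid == 0u) {
        partialSums[gl_WorkGroupID.x] = scratch[0];
    }
}
```

Without the synchronization inside the loop, a work item could read `scratch[lid + stride]` before the work item responsible for that slot had finished the previous step.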

It’s up to each implementation to implement this correctly. How the implementation does it for your specific hardware is irrelevant. If the implementation detects the pair of calls and folds them into one internally, it doesn’t matter to you as an OpenGL programmer. If the implementation has to do the equivalent of a [var]barrier()[/var] when you call [var]memoryBarrierShared()[/var], again, that doesn’t matter to you. All that matters is that the implementation does what OpenGL says it should: calling the functions the specification requires must produce the behavior the specification describes.

If you’re concerned about performance, that again is something drivers are largely responsible for. Unless you’ve measured a performance drop, let the optimizer do its job. NVIDIA is very aware of what the spec says, and they will have written the optimizer knowing that people will frequently use these two functions together.