OpenGL Compute Issue, Possibly Switching to Vulkan

Hi,

I’ve been having an issue with OpenGL where using a compute shader to copy a struct of around 300 bytes between buffers is hundreds of times slower than I expect. If I eliminate the double buffering and just update the struct in place, performance is fast enough for real time, though not setting the world on fire. If I instead copy the struct one line at a time, performance halves for each vec4 I write.

If this is a driver issue and the shader is not being optimized, switching to Vulkan, perhaps together with the SPIR-V optimizer, could solve my problem. But this is a large project, and I want to do some research before putting in all the leg work. Has anybody using Vulkan done anything similar and had performance issues?

Why are you using compute shaders to do simple copy between buffers? There is glCopyBufferSubData.
Similarly, Vulkan also has dedicated copy commands. There’s not much to optimize in such a shader anyway – copying 300 bytes is quite straightforward. You are probably looking at the overhead of setting up a compute pipeline, enqueueing the work onto the GPU queue, and so on.
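For reference, a buffer-to-buffer copy in OpenGL needs no shader at all. A minimal sketch, assuming a current GL 3.1+ context and two already-created buffer objects (the names `srcBuf` and `dstBuf` are illustrative):

```c
/* Copy 300 bytes from srcBuf to dstBuf without a compute shader. */
glBindBuffer(GL_COPY_READ_BUFFER, srcBuf);
glBindBuffer(GL_COPY_WRITE_BUFFER, dstBuf);
glCopyBufferSubData(GL_COPY_READ_BUFFER, GL_COPY_WRITE_BUFFER,
                    0,      /* read offset  */
                    0,      /* write offset */
                    300);   /* size in bytes */
```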

[QUOTE=krOoze;42815]Why are you using compute shaders to do simple copy between buffers? There is glCopyBufferSubData.
Similarly, Vulkan also has dedicated copy commands. There’s nothing much to optimize anyway in a shader – copying 300 bytes is quite straightforward.[/QUOTE]

Because I’ve stripped away all the actual code in the shader to find where the performance issue is. The goal is not to copy between buffers; that’s just what drops my application to <1 fps. So I cannot do any operation that requires double buffering without killing performance.

I’ve tested this, and it’s not the case: execution speed changes with the number of instructions used to copy the struct, and with whether the copy is between buffers.

[QUOTE]I’ve been having an issue with OpenGL where using a compute shader to copy a struct of around 300 bytes between buffers is 100s of times slower than I expect. If I eliminate the double buffering and just update the struct, it’s real-time performance fast, but not setting the world on fire. If I just copy the struct one line at a time, for each vec4 I write the performance is halved.[/QUOTE]

None of this is surprising.

Compute shaders are for computing, not shuffling data around. They’re pretty terrible at copying data. Having each compute shader invocation read 300 bytes of memory, then write 300 bytes of memory is going to be extremely slow.

Moving from reading and writing different locations to reading and writing the same location certainly should improve performance, because at least in that case the memory addresses you’re updating are already in the cache. Similarly, writing less data ought to improve performance, since the amount of data you write is what is driving your cost.

The best optimization you could do is to stop copying data. In your OpenGL thread, you mentioned you were implementing a sort algorithm. Well, you don’t have to copy data to do that. Your sort algorithm should sort indices, not the actual struct objects. That is, instead of copying a struct into its location in the sorted array, you copy an index to the struct in the sorted array of indices. And the CS doing the sort should only read the absolute minimum data from the struct that it needs to in order to do the comparison.

And the comparison data ought to be in an array by itself. That is, you should make your data a struct of arrays (and your CS doing sorting should only access the array(s) that it needs to), not an array of structs.
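In C terms, the layout change looks like this (field names and sizes are illustrative, chosen to add up to roughly the 300 bytes mentioned above):

```c
#define N 1024

/* Array of structs: every compare in the sort drags a whole
   ~300-byte record through the cache. */
struct ParticleAoS {
    float pos[4], vel[4], misc[68];  /* ~300 bytes per element */
};

/* Struct of arrays: the sorting pass reads only sort_key[] and
   writes only sorted_index[]; the bulky payload is never touched. */
struct ParticlesSoA {
    float    sort_key[N];
    unsigned sorted_index[N];
    float    pos[N][4];
    float    vel[N][4];
    float    misc[N][68];
};
```

The same split applies to SSBO declarations in the compute shader: bind the key and index arrays to the sorting pass, and the payload arrays only to the passes that need them.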

Great, this is what I needed to know. It might not be surprising to you, but it’s surprising to me, because this information is hard to find, and my data easily fits within the L1 cache, so everything looks fine on paper.

I’m sorting the structs so they’re in coherent read order for later steps, but if I can’t do that, that’s fine; it may be possible to find another way to mitigate the cache incoherence. The issue is more how much data my algorithm requires writing. I will redesign and hope I can find a way to write only a single vec4 each time.