Send data from GPU to CPU efficiently

I have a program with multiple passes where I’m trying to get information from the GPU back to the CPU. Specifically I want to have an array, with the size being the number of triangles in the scene, that specifies if that triangle overlaps or not (0 if it does not overlap other triangles, 1 otherwise).

I am currently using an SSBO to do so. It works fine in some cases, but when I increase the buffer size substantially (to 8k or 16k), or when I loop multiple times over small sections of the texture I’m checking, the process slows down considerably.

From doing some profiling it seems like a significant amount of the slow-down comes from the call:

glGetBufferSubData(GL_SHADER_STORAGE_BUFFER, 0, numTriangles * sizeof(int), overlapResults.data());

The only functionality I need is an array on the GPU, sized to the number of triangles and initialized to 0. In the fragment shader I set an entry to 1 every time there is a triangle overlap. Then I want to send that information to the CPU to be able to use it there.

I’m wondering if there is something simpler and more efficient that I could use for this case. I’m sending a struct containing the array, which I initialize on the CPU, to the SSBO; that seems like overkill if I could just initialize the array directly on the GPU. Also, the time taken by the glGetBufferSubData call mentioned above seems to depend on the width and height of the buffer and on the kernel size, which I don’t think should be the case, since I don’t use any of those values to initialize it.

I appreciate any ideas you might have!

Rendering on a GPU driven by the CPU is designed around the assumption that there is latency between the CPU submitting the work and the GPU executing it. That is, the CPU is “running ahead” of what the GPU is executing.

When you make the above call, you basically block the CPU until the GPU “catches up” executing all the queued work submitted by the CPU thus far and until the GPU/driver sends the results you requested back from the GPU to the CPU.

To improve the performance, don’t read back the results of commands submitted for this frame. Read back the results of commands submitted 1-3 frames earlier. The GPU should have finished them, so you won’t wait as long.
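To make the delayed readback concrete, here is a minimal sketch of the bookkeeping only. ReadbackRing and its methods are hypothetical names, not part of OpenGL, and the sketch assumes you have created one result buffer per slot yourself. It just answers two questions each frame: which buffer the shader should write to, and which older buffer is old enough to read.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not an OpenGL API): picks which of N result buffers
// the GPU writes this frame and which older one the CPU may read back.
class ReadbackRing {
public:
    explicit ReadbackRing(std::size_t slots) : slots_(slots) {
        assert(slots >= 2 && "need at least one frame of delay");
    }

    // Index of the buffer to bind as the shader's output this frame.
    std::size_t writeSlot() const { return frame_ % slots_; }

    // False during warm-up, when no buffer is (slots - 1) frames old yet.
    bool hasReadSlot() const { return frame_ + 1 >= slots_; }

    // Index of the buffer written (slots - 1) frames ago.
    // Only meaningful when hasReadSlot() returns true.
    std::size_t readSlot() const { return (frame_ + 1) % slots_; }

    // Call once per frame after submitting the frame's commands.
    void endFrame() { ++frame_; }

private:
    std::size_t slots_;
    std::size_t frame_ = 0;
};
```

In a render loop you would bind the SSBO at index writeSlot() before drawing, and, once hasReadSlot() is true, call glGetBufferSubData on the SSBO at index readSlot(). With 3 slots the data you read is 2 frames old, which is the trade-off this approach makes.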

Also, do your readback through a PBO which should help the OpenGL implementation pipeline the readback.

Even better: don’t do the readback at all. Redesign your algorithm so that a readback is not necessary.

The glGetBufferSubData call causes synchronisation; it blocks pending completion of any GPU commands which could affect the buffer’s contents. Almost any OpenGL function which returns data to the CPU has this issue. The effect is similar to glFinish, but the implementation is probably smart enough to return while other commands are still pending, provided those commands don’t affect the buffer’s contents, rather than just waiting for all pending commands to complete. So, the time taken doesn’t reflect the complexity of the copy operation but that of the GPU commands which generate the data being copied.

As DarkPhoton suggests, the usual solution is to have multiple buffers so the GPU can be filling the “newer” buffers while the “oldest” buffer is being read. You can use a fence object, created with glFenceSync and polled with glGetSynciv (or glClientWaitSync with a zero timeout), to determine when the GPU has completed execution, allowing the data to be retrieved without blocking.
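The polling pattern can be sketched roughly as follows. To keep the example runnable without a GL context, the two callables are stand-ins for the real calls named in the comments; PendingReadback and tryReadback are hypothetical names used only for illustration.

```cpp
#include <functional>
#include <vector>

// Hypothetical bookkeeping for one in-flight readback. The callables stand in
// for real OpenGL calls so the control flow is visible without a context:
//   isSignaled: glGetSynciv(fence, GL_SYNC_STATUS, 1, nullptr, &status),
//               then compare status against GL_SIGNALED
//   fetch:      glGetBufferSubData(...) on the buffer the fence guards
struct PendingReadback {
    std::function<bool()> isSignaled;
    std::function<std::vector<int>()> fetch;
};

// Non-blocking poll: copies the results only once the fence has signaled.
// Returns false and leaves `out` untouched if the GPU is still working.
bool tryReadback(const PendingReadback& pending, std::vector<int>& out) {
    if (!pending.isSignaled()) return false;
    out = pending.fetch();
    return true;
}
```

In real code the fence would be created with glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0) right after the commands that write the buffer, tryReadback would be called once per frame, and the fence would be released with glDeleteSync after a successful read.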