I’m currently working on a shader that emulates what a compute shader does, but for older versions of OpenGL (with OpenGL 2.0 as the lowest target).
Currently I’m just issuing draw commands and calling glReadPixels immediately afterwards to retrieve the result. I know this approach forces synchronization between the CPU and the GPU, and I’m going to rewrite it using PBOs as soon as possible.
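For reference, the PBO-based rewrite I have in mind would look roughly like this (just a sketch, assuming a current GL 2.1+ context, or GL 2.0 with GL_ARB_pixel_buffer_object; `pbo`, `width` and `height` are placeholder names of mine):

```cpp
// Sketch of asynchronous readback through a pixel pack buffer (PBO).
GLuint pbo;
glGenBuffers(1, &pbo);
glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
glBufferData(GL_PIXEL_PACK_BUFFER, width * height * 4, nullptr, GL_STREAM_READ);

// With a buffer bound to GL_PIXEL_PACK_BUFFER, glReadPixels returns
// immediately: the transfer into the buffer happens asynchronously,
// and the last argument is an offset into the PBO instead of a pointer.
glReadPixels(0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE, nullptr);

// ... issue other work here, or come back next frame ...

// Mapping is the point that may block, but only until this transfer
// has finished -- not until the whole pipeline has drained.
if (const void* pixels = glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY)) {
    // ... use the data ...
    glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
}
glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
```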
But now I really want to know why I’m getting such degraded performance with this approach.
Unfortunately I’m not able to reproduce the timings I see CPU-side using NVIDIA Nsight, but I’ll attach a screenshot to show the order in which the commands are issued (glFinish is there just for debugging purposes):
These are the results I see from the thread I’m calling GL functions from:
For glFinish(), I restart a timer just before the call and stop it immediately after it returns.
So my questions are:
How can it be that the CPU has to wait so long before glFinish() returns, given that it’s called only a few commands after SwapBuffers? (I’m using SwapBuffers as a delimiter in the screenshot.)
Swapping the back and front buffers on the Default Framebuffer may cause some form of synchronization (though the actual moment of synchronization event may be delayed until later GL commands), if there are still commands affecting the default framebuffer that have not yet completed. Swapping buffers only technically needs to sync to the last command that affects the default framebuffer, but it may perform a full glFinish.
The meaning of this is not entirely clear to me… should I expect that all the commands concerning the default framebuffer have been executed (not just issued) after a swap, or not?
And what exactly does this sentence mean?
Swapping buffers only technically needs to sync to the last command that affects the default framebuffer
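One way I thought of to check this myself is with a fence object (a diagnostic sketch only; it needs GL 3.2 or GL_ARB_sync, so it’s above my GL 2.0 target and only for debugging on capable hardware — `hdc` is my WGL device context):

```cpp
// Place a fence right before the swap, then poll it right after, to see
// whether the swap itself waited for all previously issued GPU commands.
GLsync fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
SwapBuffers(hdc); // platform swap call (WGL in my case)

// Poll instead of blocking: if the fence is already signaled immediately
// after the swap, the swap effectively behaved like a full glFinish.
GLint status = 0;
glGetSynciv(fence, GL_SYNC_STATUS, sizeof(status), nullptr, &status);
bool gpuDoneAtSwap = (status == GL_SIGNALED);
glDeleteSync(fence);
```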