Unexpected V-Sync behaviour

I wrote a simple Windows application which performs the following actions in a loop (a simplified sketch follows the list):

  • Fills the buffer with a random color using glClearColor and glClear
  • Performs a busy sleep for 20ms using std::chrono::high_resolution_clock
  • Swaps buffers
  • Calls glFinish
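
Roughly, the loop looks like this (a simplified sketch; window and GL context creation are omitted, and `frameLoop`/`hdc` are just placeholder names):

```cpp
#include <windows.h>
#include <GL/gl.h>
#include <chrono>
#include <random>

void frameLoop(HDC hdc)   // assumes a current OpenGL context on this HDC
{
    std::mt19937 rng{std::random_device{}()};
    std::uniform_real_distribution<float> dist(0.0f, 1.0f);

    for (;;)
    {
        // 1. Fill the back buffer with a random color.
        glClearColor(dist(rng), dist(rng), dist(rng), 1.0f);
        glClear(GL_COLOR_BUFFER_BIT);

        // 2. Busy sleep for 20 ms.
        const auto start = std::chrono::high_resolution_clock::now();
        while (std::chrono::high_resolution_clock::now() - start <
               std::chrono::milliseconds(20))
            ;  // spin

        // 3. Swap buffers, then 4. wait for the GPU.
        SwapBuffers(hdc);
        glFinish();
    }
}
```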

Given that my display has a refresh rate of 60 Hz, as far as I understand a buffer swap should be possible every ~16.6 ms. By introducing the 20 ms delay, I hoped to observe a new frame being displayed on every other V-Blank, giving me an effective framerate of 30 FPS. This, however, is not the case.

When my delay is below 16 ms, glFinish blocks, capping the FPS at 60. When my delay is above 16 ms, glFinish returns within 1 ms, and I get FPS values that are not integer fractions of 60. As far as I understand, this should not be possible without triple buffering, which I disabled in my GPU control panel.
Clearly I’m missing something, but what is it?

I know that puzzlement. I've been there too. :slight_smile:

Basically, your initial mental model of how the driver works (and mine, at first) isn't correct. For me, the key was seeing in Nsight Systems how the GPU frame execution timing shakes out relative to the VBlank clock. Details below.

Setup

First, some setup. I'm going to assume NVIDIA graphics drivers on Windows. Make sure you have these active in your NVIDIA driver settings:

  • Low Latency Mode = ON
  • Monitor Technology = Fixed Refresh
  • Power management mode = Prefer maximum performance
  • Threaded optimization = OFF
  • Triple buffering = OFF

Next, make sure your app is creating a Fullscreen window (i.e. Flip present mode is active when focused). This ensures that your app’s frame display will be synchronized with monitor scan-out … not some virtualized VBlank clock invented by the DWM compositor.

The above settings will ensure that your app, when rendering Fullscreen with Flip present mode, will be driving the GPU as close to directly as possible (…with stock GL), with a double-buffered swap chain (start with this and get it working first; then try variations), and displaying at a fixed display/scan-out rate (not some continuously variable scan-out rate invented by the graphics driver/display).
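
If it helps, here's a rough sketch of one way to create such a window with plain Win32: a borderless popup sized to the whole screen, which DWM will typically promote to Flip presentation once it has focus. This assumes a window class (here called "GLWindowClass") is registered elsewhere, and GL context creation (ChoosePixelFormat / wglCreateContext / etc.) is not shown:

```cpp
#include <windows.h>

HWND createFullscreenWindow(HINSTANCE instance)
{
    const int w = GetSystemMetrics(SM_CXSCREEN);
    const int h = GetSystemMetrics(SM_CYSCREEN);

    // WS_POPUP: no border/caption, so the client area exactly covers the primary monitor.
    HWND hwnd = CreateWindowExW(0, L"GLWindowClass", L"GL VSync Test",
                                WS_POPUP | WS_VISIBLE,
                                0, 0, w, h,
                                nullptr, nullptr, instance, nullptr);

    // Give it focus so Flip present mode can engage.
    SetForegroundWindow(hwnd);
    SetFocus(hwnd);
    return hwnd;
}
```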

App Frameloop

So with that preamble… What you really want isn’t:

  • Window glClear() + Draw Frame + SwapBuffers() + glFinish() + Capture time

but instead:

  • Draw Frame + SwapBuffers() + Window glClear() + glFinish() + Capture time

Insight

So what’s the difference? Both do Clear + Draw + Swap. The difference is that in the 2nd we wait for the GPU to finish execution after the Window Clear rather than after the Swap.

Why?

The 1st doesn’t do what we want. It blocks the CPU waiting on the GPU to finish rasterizing the current frame (which is somewhat variable frame-to-frame … so the timings are as well). These timings do not include the time spent, after that wait, obtaining a new swap chain image for the next frame.

The 2nd fixes this, blocking the CPU until the GPU has not only finished rendering the current frame but has also acquired an image from the Swap Chain to render the next frame into. And with a true double-buffered Swap Chain, this image acquisition will always happen on a VBlank clock tick. So this timer should give you the consistent multiples of the 60 Hz 16.6 msec frame period that you expect (60 Hz, 30 Hz, 20 Hz, 15 Hz, etc.). Assuming your CPU draw thread is getting all the CPU cycles it wants, that is.
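
To make that concrete, here's a rough sketch of the revised frameloop for your test app (placeholder names again; the "Draw Frame" step here is just the random-color clear from your original loop):

```cpp
#include <windows.h>
#include <GL/gl.h>
#include <chrono>
#include <cstdio>
#include <random>

void revisedFrameLoop(HDC hdc)   // assumes a current OpenGL context on this HDC
{
    using clock = std::chrono::high_resolution_clock;
    std::mt19937 rng{std::random_device{}()};
    std::uniform_real_distribution<float> dist(0.0f, 1.0f);
    auto lastTick = clock::now();

    for (;;)
    {
        // "Draw Frame": for this test, just fill the back buffer with a random color.
        glClearColor(dist(rng), dist(rng), dist(rng), 1.0f);
        glClear(GL_COLOR_BUFFER_BIT);

        // Present it.
        SwapBuffers(hdc);

        // "Window Clear": touch the image we just got back from the Swap Chain,
        // so the following glFinish() also waits for that acquisition.
        glClear(GL_COLOR_BUFFER_BIT);
        glFinish();

        // "Capture time": intervals should now land on ~16.6 msec multiples.
        const auto now = clock::now();
        std::printf("frame interval: %6.2f ms\n",
                    std::chrono::duration<double, std::milli>(now - lastTick).count());
        lastTick = now;
    }
}
```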

Forcing a Frame Miss

As for your adding that 20 msec “sleep” to force a frame miss for testing…

With the revised frameloop, add that here, as follows:

  • Draw Frame + SwapBuffers() + Window glClear() + glFinish() + Capture time + Sleep 20 ms

This will cause the CPU to waste 20 msec immediately after the VBlank clock tick that displays the last frame. During that sleep, both the CPU and GPU will deliberately miss the next VBlank clock tick. Only after the sleep ends will your CPU draw thread start queuing the next frame for GPU rendering.
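
In terms of the sketch above, that just means appending the busy wait after the "Capture time" step, something like:

```cpp
        // ...continuing the revisedFrameLoop() sketch, right after "Capture time":
        // burn ~20 msec on the CPU, deliberately missing the next VBlank tick.
        const auto spinStart = clock::now();
        while (clock::now() - spinStart < std::chrono::milliseconds(20))
            ;  // busy wait
```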

Thank you for explaining it so clearly. I did what you recommended and my program now behaves as expected. The troubling thing is that now I can't get it to work like it did before without enabling triple buffering. Which is good, and it should be that way; it's what I was expecting all this time. But still, I'm really curious. Maybe fiddling with the control panel initialized something that wasn't there before, and now it will always work correctly?

Anyway, thanks again for making it clear for me.

That’s great! And sure thing!

What’s happening now?

If you mean that, with the changes I suggested above, you're now consistently missing frames (e.g. dropping to 30 Hz), then that is good, because now you can tell on the CPU whether the CPU+GPU time for a single frame is overrunning a single 60 Hz 16.66 msec frame window.

With this, you can determine what is causing the biggest bottleneck, fix that, and re-check to see if your CPU+GPU time for a frame is now fitting within a single 60Hz window.
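
As a crude illustration of that check (the names and the 1.5× threshold are just my choice), you could flag any frame whose measured interval spills well past one refresh period:

```cpp
// Rough check: did the measured CPU+GPU frame time overrun one 60 Hz window?
constexpr double kFramePeriodMs = 1000.0 / 60.0;   // ~16.67 msec

bool missedFrame(double frameIntervalMs)
{
    // Anything well past one refresh period means the frame slipped to a later
    // VBlank (e.g. ~33.3 msec => effectively running at 30 Hz).
    return frameIntervalMs > kFramePeriodMs * 1.5;
}
```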

Failing that, you can just punt: disable the glFinish(), enable triple buffering if needed, and “hope for the best”. That is, allow more CPU / GPU overlap. This is a risky business though, because all it takes is one implicit sync triggered in the GL driver, and the driver queue-ahead you're depending on for good performance comes to a screeching halt, and you end up missing 1+ frames, resulting in stuttering.

Not knowing what you're seeing, I'm not sure what you mean.

If you were careful about changing settings, then you should be able to flip the settings back, disable the glFinish(), and get back to “the way things were”.

What I meant is that, after restoring my program to the state it was in before the suggested changes and resetting the NVIDIA settings to their default values, it still works just as you described, whereas before it behaved as if the triple buffering setting was enabled, regardless of whether or not it actually was. I have since tested this on multiple machines, and on all of them v-sync was working correctly. Given that, I think something was wrong with the way I was testing the program, like it failing to acquire exclusive fullscreen, even though I was pretty sure I checked that with PresentMon.

Hmmm. That’s interesting. If you did truly restore all of your NVIDIA settings back to their prior values…

Here’s one idea. Are you still creating a Fullscreen window and giving it window/mouse focus, to get Fullscreen Flip Presentation mode?

If not, then you aren’t getting driver queue-ahead behavior. Failure to get this prevents frame overlap and can appear as if triple buffering isn’t enabled or the CPU can’t queue as far ahead of the GPU as before (which it can’t).

That’s definitely worth checking! In PresentMon if you can, but also by giving your window focus and then moving mouse focus to/from your window (via mouse-clicking or Alt-Tab). If you don't see a full-screen "flash" as DWM switches focus to/from your window, then you aren't getting Fullscreen Flip present mode, and thus aren't getting the driver queue-ahead that allows CPU/GPU overlap.

Also, those latency and triple buffering settings in the NVIDIA control panel IIRC control how "much" the driver is allowed to queue ahead. If you haven't reverted them to their prior values, then you may be getting less queue-ahead than before.