Understanding the OpenGL main loop / SwapBuffers

I’m trying to learn OpenGL through a tutorial, and there is typically a main loop like this:

do {
  glClear(...);
  render_scene_with_lots_of_opengl_commands();
  glfwSwapBuffers(...);
} while(true);

As I understand it, there are two buffers - one currently shown and one you draw on. The buffers cannot be swapped at any time because that would cause tearing, so swapbuffers() will wait for vertical blanking, a short time interval that occurs once every 1/60 of a second (if that’s your screen’s refresh rate) during which the swap can take place without tearing.
Now, let’s say I also want to add a somewhat time-consuming calculation to be done for every drawn frame that does not involve drawing into the current draw buffer. I’d expect the code to look like so:

do {
  glClear(...);
  render_scene_with_lots_of_opengl_commands();
  glReadyToSwapBuffers();
  doCalculations();
  glfwSwapBuffers(...);
} while(true);

The extra GL call I added, glReadyToSwapBuffers(), may not exist or be called something else. It is supposed to tell the OpenGL framework that I’m finished drawing the scene and that I want it to be displayed at the next moment when that can be done (vertical blanking). The difference between this and glfwSwapBuffers() is supposed to be that glReadyToSwapBuffers() should simply schedule the switch (perhaps to be completed in an interrupt) and return immediately. The point is that I could then doCalculations() in parallel with waiting for vblank instead of doing these two serially, i.e. if the entire loop takes a little more than 1/60 second I could get, say, 50 fps instead of 30.

My question: is there a call like my suggested glReadyToSwapBuffers() and if not - why not?

The nearest thing to such a call is glFlush, which tells the driver to start issuing buffered-up commands but can return immediately, so that you can do other work in parallel.
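
For example, something like this (just a sketch, reusing the pseudo-functions from your question) lets the calculation overlap with the GPU working through the frame:

do {
  glClear(...);
  render_scene_with_lots_of_opengl_commands();
  glFlush();            // ask the driver to start pushing the queued commands to the GPU now
  doCalculations();     // CPU work overlaps with the GPU working through those commands
  glfwSwapBuffers(...); // queue the swap; whether this blocks depends on the driver
} while (true);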

You will have noticed that OpenGL itself does not actually have a SwapBuffers call. This is because OpenGL delegates this responsibility to the underlying windowing system, or a framework built on top of the underlying windowing system - in your example GLFW. GLFW is not part of OpenGL, nor are other frameworks such as GLUT or SDL.

If you have useful work to do which you wish to run in parallel with a buffer swap, why not just run it in another thread?
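
A minimal sketch of that, assuming the results of doCalculations() aren’t needed until the next iteration (std::thread is just one way to express it; in a real program you’d keep a persistent worker thread rather than spawning one per frame):

#include <thread>

do {
  glClear(...);
  render_scene_with_lots_of_opengl_commands();
  std::thread worker(doCalculations);  // run the calculation on a second thread
  glfwSwapBuffers(...);                // this thread may block here waiting for the swap/vblank
  worker.join();                       // by now the calculation has (hopefully) finished too
} while (true);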

Why should there be such a function? What should it do / return?

Should it wait until OpenGL has finished rendering the frame?
Not necessary, because that’s what glfwSwapBuffers() does.

Should it return a bool just to check whether OpenGL has finished rendering the frame?
You can do that by using a “query object”, but how would you check that?


while ( openglstilldoessomework() )
{
  ... do other stuff ...
}

The problem with that would be that “… do other stuff” could take more than 1/60 of a second, which would slow down the entire main loop.
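
For completeness, here is roughly what such polling could look like with a fence sync object (GL 3.2+ / ARB_sync) rather than a query object; this is only a sketch, window is your GLFW window handle and do_a_small_chunk_of_other_stuff() is made up:

// after submitting all draw commands for the frame:
GLsync fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
glFlush();                                    // make sure the fence actually reaches the GPU

GLint status = GL_UNSIGNALED;
while (status != GL_SIGNALED) {
  do_a_small_chunk_of_other_stuff();          // made up: keep the chunks short!
  glGetSynciv(fence, GL_SYNC_STATUS, sizeof(status), NULL, &status);  // non-blocking poll
}
glDeleteSync(fence);
glfwSwapBuffers(window);                      // GPU has finished the frame; now queue the swap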

The best way (imho) is to do all non-graphics-related work in another thread (or threads); the main loop only does graphics.

If you don’t care about VSync (i.e. it’s disabled in the graphics card control panel, or the equivalent OpenGL/window-system setting is honoured by the driver), then SwapBuffers and subsequent GL calls will not wait. The reasoning is much the same if the monitor’s refresh rate is different (50, 60, 100 Hz), or if for some reason you fall behind and the effective vsync rate gets divided by 2 (or even more). So you cannot know in advance how much work you can fit in.

Doing it in another thread is something you can try. However, if what is sent to GL depends on what is calculated in that other thread, synchronization might become an issue. For example, if you have to wait for the whole calculation to be done in order to have something ‘showable’ (for example complete, or coherent…) on the screen, you’ll lose one frame. You might still render at 60 fps (if that is your monitor setting), but you might only actually put new content on the display half of the time…

Finally, since you said you are learning OpenGL, I would suggest you stick to a single thread. If your computations start to slow the rendering, just as john_connor said, then try disabling vsync. You might then get a decent framerate without cutting it in half or worse.
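
In GLFW that is a single call, made once while the context is current (note that a “force vsync” setting in the driver control panel can still override it):

glfwMakeContextCurrent(window);
glfwSwapInterval(0);   // 0 = swap without waiting for vblank (tearing possible); 1 = normal vsync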

I expect glReadyToSwapBuffers() to return immediately, with no value, after having scheduled an extra command to be run when all the GPU’s drawing commands currently queued for the drawing-buffer are completed, which would:

  • check if we are in the vblank region right now, and if so swap buffers immediately
  • otherwise set a flag to inform an interrupt that runs every time the vblank starts that it is supposed to swap buffers
  • ensure that a flag is set if the buffers are swapped by either of the two means above, which will tell the ordinary glfwSwapBuffers() that it doesn’t need to do anything as the job is already done.

Of course I can let other threads do calculations as well, I just wanted the first thread to be able to work while waiting for vblank.

Yes, two if double-buffering is enabled. 3 or more if triple/multi-buffering is enabled. (It sounds like you may want to read up on that by the way: Triple buffering).

The buffers cannot be swapped at any time because that would cause tearing, so swapbuffers() will wait for vertical blanking, a short time interval that occurs once every 1/60 of a second (if that’s your screen’s refresh rate) during which the swap can take place without tearing.
Now, let’s say I also want to add a somewhat time-consuming calculation to be done for every drawn frame that does not involve drawing into the current draw buffer.

As mhagain said, you can run it in another thread, possibly pipelined with the draw thread. Or you can run it synchronously on your draw thread, but that means you’ll have less time to submit draw commands to the GPU (assuming you don’t break a frame; i.e. drop to < 60 Hz), cutting into the max complexity of content you can render.

The extra GL call I added, glReadyToSwapBuffers(), may not exist or be called something else. It is supposed to tell the OpenGL framework that I’m finished drawing the scene and that I want it to be displayed at the next moment when that can be done (vertical blanking). The difference between this and glfwSwapBuffers() is supposed to be that glReadyToSwapBuffers() should simply schedule the switch (perhaps to be completed in an interrupt) and return immediately.

That’s what SwapBuffers pretty much does. On drivers I’m familiar with, it merely queues an “I’m done; you can swap when ready” event on the command queue. If the command queue isn’t full (by driver-internal criteria), then you get the CPU back and can do what you want – before the Swap has occurred. Some drivers will buffer up as much as another frame or two of commands before they block the draw thread in a GL call because their “queue full” criteria is met.

To force the driver “not” to do that (on a desktop discrete GPU only; i.e. sort-last architecture) and instead wait for the swap, you put a glFinish() right after SwapBuffers. When that returns, you know that the SwapBuffers has occurred, which means you’ve synchronized your draw thread with the VSync clock.
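
In code that is simply (with the caveat above about where this actually holds):

glfwSwapBuffers(window);
glFinish();   // when this returns, the swap has occurred: we are aligned with the VSync clock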

Re driver queuing, if you use triple buffering, that allows you and the driver to get further ahead on drawing the next frame before VSync even when VSync is on. This even allows the driver to start rasterizing the next frame before the Swap actually occurs when the buffer pipeline would otherwise be full.

The point is that I could then doCalculations() in parallel with waiting for vblank instead of doing these two serially, i.e. if the entire loop takes a little more than 1/60 second I could get, say, 50 fps instead of 30.

You can still do this, even in the same thread if you’re careful. But again, it’s going to eat into how much time you have to submit draw work. Keep in mind that instead of doing these calculations on the draw thread, you could be submitting draw work for the next frame.
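
One way to be “careful” about it on a single thread is to queue the swap first and then do the calculations for the next frame, so the CPU work overlaps with the driver/GPU finishing and presenting the current one. A sketch (SimState, initial_state and draw_scene are placeholder names):

SimState state = initial_state();   // placeholder: whatever your calculations produce
do {
  glClear(...);
  draw_scene(state);                // submit frame N using already-computed data
  glfwSwapBuffers(...);             // queue the swap; it often returns before the swap occurs
  state = doCalculations(state);    // compute frame N+1 while frame N is finished and presented
} while (true);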

[QUOTE=drhexgl;1286277]I expect glReadyToSwapBuffers() to return immediately, with no value, after having scheduled an extra command to be run when all the GPU’s drawing commands currently queued for the drawing-buffer are completed,
[/QUOTE]
That’s what the underlying function (glXSwapBuffers() on X11, SwapBuffers() on Windows) does.

However, it necessarily first issues an implicit glFlush(), and if there isn’t enough space to flush pending commands from CPU memory to GPU memory, it may have to block until there is. Performing an explicit glFlush() first reduces the chances of the buffer-swap function blocking.

Furthermore, there’s a limit to how many frames can be enqueued. If you render frames faster than the monitor’s refresh rate (and triple buffering isn’t enabled), eventually OpenGL commands will start blocking. Specifically, attempts to flush pending commands will block until there’s a draw buffer available on which those commands can be executed. Almost any command can cause a flush if the command buffer is full, but the buffer swap command always flushes.

It isn’t possible to poll the state of the command pipeline. If you want to interleave rendering with other operations, use threads.

It’s worth adding to this discussion that Direct3D has a D3DPRESENT_DONOTWAIT option on its equivalent call, which will cause it to return immediately (setting the appropriate return value) if the hardware is either busy processing or waiting for a vsync interval. The theory is that you can then do some other useful work and try again later. In practice, however, it’s difficult to know how much later to try again or how much other work to do, and the option is not well supported by hardware or drivers.

[QUOTE=GClements;1286279]That’s what the underlying function (glXSwapBuffers() on X11, SwapBuffers() on Windows) does.

eventually OpenGL commands will start blocking…[/QUOTE]

Aha! I thought glfwSwapBuffers() would always block waiting for vblank as neither of the two buffers can be drawn to. But of course it doesn’t need to block unless I start issuing more drawing commands before the switch has happened.

In the Wikipedia page about the current “Geforce 10 series” graphics cards from Nvidia, it says that one of its features is that it has “Triple buffering implemented in the driver level”. As my card is much older (500-series), I thought it therefore wouldn’t have triple buffering.

Triple buffering can be implemented on practically anything; the only exception is ancient hardware where framebuffer memory and texture memory are distinct and there simply isn’t enough framebuffer memory for more than two framebuffers.

But the driver may still limit triple buffering to the refresh rate, in which case buffer swaps can still block.

[QUOTE=drhexgl;1286285]In the Wikipedia page about the current “Geforce 10 series” graphics cards from Nvidia, it says that one of its
features is that it has “Triple buffering implemented in the driver level”.[/QUOTE]

This is a bit confusing. NVidia’s had a “Triple Buffering” setting in their GL driver for many years. This feature they’re advertising for newer GPUs (which they’re calling Fast Sync) is a minor variant of that which allows for skipping buffers, and is what many folks mean when you just say “triple-buffered” without any context.

As my card is much older (500-series), I thought it therefore wouldn’t have triple buffering.

Your GPU likely does support the variant of triple buffering NVidia’s supported for years. If you’re on Windows, see the NVidia Control Panel -> Manage 3D Settings for the enable. If you’re on Linux, check out the “TripleBuffer” setting (documented in /usr/share/doc/NVIDIA_GLX-1.0/README.txt, among other places).
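
On Linux that ends up as a driver option in xorg.conf, something like the following (check the README shipped with your driver version; the Identifier is just an example):

Section "Device"
    Identifier "NVIDIA Card"
    Driver     "nvidia"
    Option     "TripleBuffer" "True"
EndSection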

Now the question is: what does that setting do?

Based on info from NVidia (multiple sources, including posts back to 2002) and user benchmarks with the GL driver Triple Buffering setting enabled, it sounds like the triple buffering support they’ve had for years is possibly a simple FIFO flip-queue (called the 3-long swap chain method in the Wikipedia page). That is, a 3-long FIFO of rendered buffers which need to be displayed sequentially – no skipping rendered frames allowed. This can let the CPU start submitting subsequent frames earlier, but in so doing increases end-to-end frame latency. However, if the CPU draw thread is much faster at submitting frames than the GPU is at displaying them (scanning them out) with VSync, then the CPU draw thread blocks when this FIFO of rendered-but-not-displayed buffers is full, effectively limiting it to the VSync rate just like double-buffering. So: smart use of power, doesn’t burn down the CPU, can allow CPU frame draw to start early, but does increase end-to-end latency over double-buffering, which is a con.

What NVidia is calling Fast Sync is this exact same concept, but the list of buffers isn’t a FIFO anymore. Skipping rendered frames is allowed. That is, on SwapBuffers, the fast CPU draw thread never blocks waiting on a new render buffer to rasterize into. It always gets a new buffer immediately to start scribbling on. And on a swap event, the GPU back-end frame scan-out processor starts scanning out “the newest completely rendered but-not-yet-displayed” frame as opposed to the “oldest completely rendered but-not-yet-displayed” frame (as with the 3-long FIFO swap chain method).

This triple buffering method does reduce latency compared to the 3-long FIFO swap chain method that they called “triple buffering” before. However, it’s a great way to burn down the CPU, wasting cycles and power rendering frames that are never displayed. It allows gamerz to run the game draw thread at 10,000 fps, even though only a small subset of those frames will actually be displayed (60 of them per second, with VSync, to avoid tearing). For the rest of us designing well-behaved OpenGL applications time-synced to VSync, I’m not sure what this really buys us over double-buffering.

Here’s one good read on this topic at Beyond3D:

A related question:

Can the monitor display frames faster than its refresh rate (let’s say 60/sec)?
If not, then the whole point of “fast sync” is to have low frame times + no tearing, right?

The monitor has to 1) receive new frames, and 2) display them. We’re mainly talking about #1 here: how fast the GPU scans out newly rendered frames, and thus how fast the monitor receives genuine new frames based on actual game state. #2 is limited by that.

Yes, monitor tech in #2 may have to do crazy things to support higher refresh rates in #1 like generate and display interpolated frames so the results on monitor type X (e.g. LCDs) don’t look really bad. But that’s a side issue.

If not, then the whole point of “fast sync” is to have low frame times + no tearing, right?

Close. No tearing, yes. But not really low frame times (low frame times is an input assumption of fast sync, not a result). Low end-to-end frame latency is the goal. Specifically for games that run much faster than the VSync rate (which is typically 60Hz, or 16.666ms per frame).

This latency = the difference between 1) when rendering of a frame first starts on the CPU and 2) when that frame is displayed on the monitor.

EXAMPLE #1: Suppose you have an app that can render (i.e. submit draw frames on the CPU) at 240 fps, and the monitor’s scan rate is 60 fps. So if we enable this frame-skipping version of triple buffering NVidia’s calling fast sync, we have:

X X X D | X X X D | X X X D | X X X D |

where:
X = frame rendered but never scanned out to monitor (discarded)
D = frame rendered and scanned out to monitor
| = VSync clock (60Hz; every 4th rendered frame if frames drawn at 240 fps)

Note that game state is sampled before each frame render (X and D blocks) on the draw thread.

Here you can see that the total end-to-end latency of a “D” frame is at least ( 1/240 = 4.166ms ) + 16.666ms (the time needed to scan out and send the frame to the monitor over the video cable), i.e. roughly 20.8ms. I say “at least” because the monitor internally could add additional latency in its display circuitry.

EXAMPLE #2: Now consider the case of a naive double-buffered application:

D . . . | D . . . | D . . . | D . . . |

where:
D = frame rendered and scanned out to monitor
. = CPU waiting for VSync

Here you can see that the total latency is 16.666ms (render + wait) + 16.666ms (scan-out) + the display latency of the monitor, i.e. roughly 33.3ms. So we’ve added ~12.5ms of latency compared to EXAMPLE #1.

EXAMPLE #3: That said, a smarter double-buffered application could look like something like this:

. . . D | . . . D | . . . D | . . . D |

where:
D = frame rendered and scanned out to monitor
. = CPU sleep

This gives you the same (or very similar) end-to-end latency as fast sync, and no tearing, but without wasting CPU cycles and power generating frames you know will never be displayed.
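
A rough sketch of such a loop, building on the SwapBuffers + glFinish pattern above (the 0.004 s render budget is an assumption you would have to measure for your own app, and sample_game_state_and_render() is a placeholder):

#include <chrono>
#include <thread>

const double vsyncPeriod  = 1.0 / 60.0;   // assumes a 60 Hz display
const double renderBudget = 0.004;        // assumed worst-case CPU submit time + safety margin

do {
  // the previous SwapBuffers + glFinish left us roughly at a vblank
  std::this_thread::sleep_for(std::chrono::duration<double>(vsyncPeriod - renderBudget));
  sample_game_state_and_render();         // placeholder: sample input as late as possible, then draw
  glfwSwapBuffers(window);
  glFinish();                             // re-sync with the vblank, as described earlier
} while (true);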

You can get a similar, and in many cases even better (i.e. apparently lower-latency), result than EXAMPLE #3 using the method of EXAMPLE #2 by extrapolating game state forward by 16.6-33.3ms. In other words, use prediction to try to pre-compensate for the latency in the system. However, prediction doesn’t always match the future (e.g. objects do change direction), so this isn’t always better. That said, in a game which emulates real-world physics, objects have inertia and can’t just stop and reverse direction on a dime. So in those games, extrapolation over such short time periods should work well.
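
As a trivial sketch of that kind of extrapolation (linear motion only; Vec3, obj and draw_object_at are placeholders, and latency is whatever end-to-end latency you have estimated):

const float latency = 0.025f;                             // e.g. ~25 ms end-to-end (assumed)
Vec3 predicted = obj.position + obj.velocity * latency;   // extrapolate along the current velocity
draw_object_at(predicted);                                // draw where the object will be when displayed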