Where in the pipeline does NVIDIA's "Low Latency Mode" kick in?

I’m talking about the “Low Latency Mode” in the NVIDIA Control Panel, which has been around for quite a while (not the new Reflex stuff).

This mode is described as “limiting queued frames to 1”, but I don’t get which queue they are talking about.

Is that some hidden queue in the driver that comes after the swapchain set up by the game engine?

Or how does it interact with the swapchain, whose image count (including all the dedicated command pool setup) is configured by the developer?

In the meantime I found out that “Low Latency Mode” has no effect in DX12 or Vulkan: in DirectX 12 and Vulkan, the engine decides when to queue the frame, and the NVIDIA graphics drivers have no control over this.

No. The swap chain is near the tail end of the pipe, downstream of the GPU. The queue you’re asking about is the application command queue upstream from the GPU. Think of it like this:

App CPU thread -> Graphics Language Commands -> "Prerender queue" -> GPU -> Rendered Images -> Swap Chain -> Image Compositing/Display

Basically, “Low Latency Mode” (previously the “Max Prerendered Frames” setting) sets the number of frames’ worth of graphics commands that can be queued up between the app CPU thread and the GPU/back-end graphics driver. It’s how far the driver lets the app CPU thread “queue ahead” before applying explicit backpressure up the pipe to slow it down.

The length of this Prerender Queue, the length of the image swap chain (and swap chain selection mode), as well as the video display rate are normally the main drivers of latency through this entire pipeline.

Thanks for the reply! Is this “prerender queue” in DX9/DX11/OpenGL set up by the engine? I guess not; otherwise the NVCP could not simply force it to 1 without breaking engine functionality. Also, I can’t remember ever having to create a per-frame command queue in OpenGL 3.2. So does this mean the driver maintains an internal queue of command frames? If the driver maintains this queue, how does, for example, the OpenGL driver know when a command frame starts and when it is finished? And how does it interact with things like SwapBuffers and glFinish? :thinking:

OpenGL was designed to allow for the possibility of asynchronous operation from the very beginning. Commands behave as if they were executed synchronously, but the implementation can do whatever it wants within that “as if”.

To facilitate asynchronous implementations, OpenGL recognizes a distinction not just between when a command is given to OpenGL and when it is completed, but also a third state between the two: being issued but not yet complete. That is, the command is going to be completed, but isn’t completed yet.

Hardware-wise, you can look at it like this. GPUs have a FIFO queue of commands that they execute. The CPU’s job is to feed the FIFO. But there’s a problem: feeding the FIFO is very expensive (and in older days, the FIFO wasn’t arbitrarily large, so it could only store so many commands). As such, you don’t shove each individual GL rendering call into the FIFO.

Instead, you stick it in memory somewhere. At some point in the future, you bundle together a bunch of commands and shove them into the FIFO.

This is why glFlush and glFinish are distinct functions. glFlush puts all commands into the FIFO and will not return until they are all there. glFinish does that, but also waits until those commands have completed execution.

Within these boundaries, implementations have the freedom to do whatever they want.

A driver can define a “frame” however it wants. Within this discussion, a “frame” beginning would generally be the point when the CPU will feed the FIFO the data for the previous frame. A common definition for a “frame” of this sort is all commands between any two swap buffers calls. So typically, the call to swap buffers, or the first GL function called after swapping buffers, will shove a bunch of stuff into the GPU’s FIFO.

Of course, there are also things that can force the CPU to flush the FIFO.

This setting definitely affects OpenGL programs, but I can’t speak to DX.

As to whether the engine would set this up or not… There isn’t a GL/WGL/GLX API (AFAIK) that controls this behavior. It’s up in the realm of “driver settings” state. So for an engine to set this up, it would have to be monkeying with the driver settings. Besides the NVCP, you can also get to this setting via NVAPI. See PRERENDERLIMIT_ID.

You don’t. It’s created implicitly for you in the driver when you create a GL context.

Look at how Vulkan manages GPU interaction for clues :slight_smile:

A stock OpenGL app is given some visibility into this driver behavior (up to the swap-chain insertion point) via Timer Queries and Sync Objects.

The Max Pre-rendered Frames setting basically sets the max number of queued SwapBuffers() calls that are allowed to exist in this queue at once. This is the “length of the rope” the app gives the GPU (with the GPU driver trailing along behind the app). If the GPU (back-end driver) gets too far behind, the app has to wait for it to catch up before moving on. That “wait” happens in the front-end driver, and causes the app CPU submission thread to block.

glFinish() should result in the app being put to sleep until all commands in this “Pre-render queue” have been executed by the GPU/driver. Basically after returning from this, the queue should be empty.

Thanks a lot guys for taking the time to explain this! :+1: