Bad performance (possibly FBO related) on NVIDIA

I’m developing a 3D engine which is both OpenGL & Direct3D capable.

I’m seeing quite bad OpenGL performance (possibly pipeline stalls) when running Linux with the NVIDIA driver (up to 50 ms/frame). When the same machine is booted into Windows, performance is as expected (below 10 ms/frame).

I’m seeing this both on a laptop with a GeForce GT 540M and on a desktop machine with a GTX 580.

On Mac OS X, the same OpenGL rendering code also works without performance issues on NVIDIA hardware. Also, Linux + AMD hardware seems to work fine.

The performance hit seems to be proportional to the number of times I change the surfaces bound to the FBO (I use a single FBO object). Forward rendering without shadows therefore works fine, but adding shadows or post-processing, or doing deferred rendering, starts to drag performance down.

Anyone else seen something like this?

(BTW, the engine code is public on Google Code.)

Have you tested with multiple FBOs, one for each surface?

Not yet, I plan to.

Actually, I’ve narrowed things down a bit: it is not the number of surface changes after all, but rather the number of draw calls that go to the FBO instead of the backbuffer.

For example, a forward-rendered complex scene without a bloom post-effect has no problems, as it goes directly to the backbuffer. But the same scene with bloom on must be rendered to the FBO first so that it can be operated on, and for a complex scene that causes a >20 ms performance hit.

Only in one specific scenario. I use a number of FBOs to render frames with NVIDIA on Linux (and have for many years), on GTX 580s, GTX 480s, GTX 285s (and others), just like you are.

The only time I’ve seen anything like this is when you’re hitting (or flat-out blowing past) GPU memory capacity. When that happens, the driver can start tossing textures and other resources off the board to make room for everything it needs to render the current batches, and that can result in massive frame-time hits as it frantically plays musical chairs with CPU and GPU memory to render your frame. This includes your shadow textures, which may be swapped off the board to make room for other things while you’re not rendering to them.

So check how much memory you’re using; NVX_gpu_memory_info makes this trivial and is well worth your while. In my experience, you should never see the “evicted” count go above 0 (on Linux). If you do, you’re blowing past GPU memory. Shut down/restart X via logout/login or Ctrl-Alt-Backspace (or just reboot) to reset the count to 0.
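For reference, a minimal sketch of such a check. The enum values come from the GL_NVX_gpu_memory_info extension (all quantities are in KiB); the `getInteger` callback stands in for `glGetIntegerv` so the logic can be shown (and exercised) without a live GL context — in your engine you would just pass `glGetIntegerv` itself:

```cpp
#include <functional>

// Query enums from the GL_NVX_gpu_memory_info extension (values in KiB).
constexpr unsigned GPU_MEMORY_INFO_TOTAL_AVAILABLE_MEMORY_NVX   = 0x9048;
constexpr unsigned GPU_MEMORY_INFO_CURRENT_AVAILABLE_VIDMEM_NVX = 0x9049;
constexpr unsigned GPU_MEMORY_INFO_EVICTION_COUNT_NVX           = 0x904A;
constexpr unsigned GPU_MEMORY_INFO_EVICTED_MEMORY_NVX           = 0x904B;

struct GpuMemoryStatus {
    int totalKiB = 0;
    int availableKiB = 0;
    int evictionCount = 0;
    int evictedKiB = 0;
    // Per the advice above: any eviction at all means you've overflowed VRAM.
    bool overcommitted() const { return evictionCount > 0; }
};

// 'getInteger' would be glGetIntegerv in a live GL context.
GpuMemoryStatus queryGpuMemory(const std::function<void(unsigned, int*)>& getInteger) {
    GpuMemoryStatus s;
    getInteger(GPU_MEMORY_INFO_TOTAL_AVAILABLE_MEMORY_NVX, &s.totalKiB);
    getInteger(GPU_MEMORY_INFO_CURRENT_AVAILABLE_VIDMEM_NVX, &s.availableKiB);
    getInteger(GPU_MEMORY_INFO_EVICTION_COUNT_NVX, &s.evictionCount);
    getInteger(GPU_MEMORY_INFO_EVICTED_MEMORY_NVX, &s.evictedKiB);
    return s;
}
```

Check `overcommitted()` once per frame (or once per scene load) during development; if it ever flips to true, you know textures are being kicked off the board.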

Also, if you’ve got one of those GPU-memory- and performance-wasting desktop compositors enabled, disable it (for KDE, use the kcontrol GUI to disable effects/compositing, or just hit Shift-Alt-F12).

As for controlling which textures get evicted first, glPrioritizeTextures is generally described as a no-op. And while NVIDIA hasn’t updated their GPU Programming Guide in a good while (three years), the advice there may offer some clue as to how to influence texture/render-target GPU residency. But the best advice is simply to never fill up GPU memory; then you don’t have to worry about any of this.

The problem indeed seems to be using a single FBO. As a test, I switched to a separate FBO for shadow-map rendering (switching between shadow maps and the main view is the most frequent render-target change for me), and most of the “unexpected” performance hit went away. Rendering as a whole is still a constant factor slower than OpenGL on Windows, but it’s much more consistent now.

Now just to implement the multiple-FBO mechanism properly and transparently to the caller :)

Thanks to all who replied!

Over the course of a frame, in rebinding different render targets to the FBO, do you ever change the resolution and/or pixel format of the FBO?

I don’t know if it still is, but this used to be a slow path in the NVIDIA driver (circa GeForce 7 days). And yeah, the solution was to avoid doing that: use multiple FBOs.

Yes, the shadow maps (and possibly the post-processing buffers) are of different sizes and formats.

Would it possibly be that the Linux driver is still using older code?

The implication of your statement is that there’s a newer, improved version. However, IIRC from the NVIDIA post, it’s not that this path was written inefficiently, just that it is inherently a slow path. It said that reconfiguring the resolution or internal format of an FBO was expensive, and to avoid doing that a lot.
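To make the cost model concrete, here is a toy simulation — not driver code, just an illustration of the claim: a single FBO that must “revalidate” every time the newly attached target differs in resolution or format from the previous one. Alternating between a shadow map and the main view on one FBO then pays the reconfiguration cost on every single switch:

```cpp
struct Target {
    int width;
    int height;
    unsigned format;  // stands in for a GL internal format enum
};

// Toy model of one FBO: counts how many attaches force a "reconfigure"
// because the new attachment's size or format differs from the last one.
// This models the expensive driver path described above; a real engine
// would be calling glBindFramebuffer/glFramebufferTexture2D here.
class SingleFbo {
    int lastW = -1, lastH = -1;
    unsigned lastFmt = 0;
public:
    int reconfigures = 0;
    void attach(const Target& t) {
        if (t.width != lastW || t.height != lastH || t.format != lastFmt) {
            ++reconfigures;  // resolution/format changed: the slow path
            lastW = t.width;
            lastH = t.height;
            lastFmt = t.format;
        }
    }
};
```

With 10 frames alternating shadow map and main view through one FBO you hit the slow path 20 times; a dedicated FBO per configuration hits it once, ever.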

I can confirm a further improvement (on Linux) from implementing a map of FBOs, where the resolution and format form the search key.
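Such a cache can be sketched roughly like this (the `creator` callback stands in for the engine's actual `glGenFramebuffers` + attachment setup, which needs a live GL context; the key structure is the point here):

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <tuple>

// Key: width, height, internal format. Each distinct configuration gets
// its own FBO, so no FBO is ever reconfigured after creation.
using FboKey = std::tuple<int, int, unsigned>;

class FboCache {
    std::map<FboKey, unsigned> fbos;  // key -> FBO handle
    // Would wrap glGenFramebuffers + glFramebufferTexture2D in a real engine.
    std::function<unsigned(int, int, unsigned)> create;
public:
    explicit FboCache(std::function<unsigned(int, int, unsigned)> creator)
        : create(std::move(creator)) {}

    // Return the FBO for this resolution/format, creating it on first use.
    unsigned get(int w, int h, unsigned fmt) {
        auto key = FboKey{w, h, fmt};
        auto it = fbos.find(key);
        if (it != fbos.end())
            return it->second;         // cache hit: reuse, no reconfiguration
        unsigned fbo = create(w, h, fmt);
        fbos.emplace(key, fbo);
        return fbo;
    }

    std::size_t size() const { return fbos.size(); }
};
```

The caller keeps asking for “an FBO of this size and format” and stays oblivious to how many FBOs exist underneath — which matches the “transparent to the caller” goal mentioned earlier in the thread.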

However, what is curious is that on Windows performance was always fine with the same hardware, and it did not improve over the initial code, which simply bound all surfaces to the same FBO.

That is interesting. Wonder if the Windows driver is doing the FBO virtualization thing under the covers that we’re both doing in the app.

Perhaps the Windows driver has to have a similar mechanism anyway to support Direct3D’s SetRenderTarget / SetDepthStencilSurface API efficiently, and it just reuses it for OpenGL.

That would mean the FBO in fact exposes functionality closer to the hardware, while the Direct3D rendertarget setup is a further abstraction.

This topic was automatically closed 183 days after the last reply. New replies are no longer allowed.