SwapBuffers taking 96% of my rendering time


I am getting only 29 FPS from my OpenGL application on Windows XP SP2, on a P4 2.6 GHz with an NVIDIA FX5200. The code is based on an infinite loop that (tries to) generate one rendering event every 20 msec (with Sleep() in between). I turned vertical sync OFF via the NVIDIA control panel, and here are the time measurements over 1 second:

Misc: 1 usec (microsec)
Picking: 0 usec (I wasn’t moving the mouse)
Rendering: 30378 usec
Timer use: 25 usec
Swap time: 954166 usec
(the other 1543 usec are outside my measuring block)

Does anyone have any guesses as to why SwapBuffers() might spend so much time?

Here are some other strange things:

  1. The same code runs fine and yields 49 FPS (as intended) on another PC of a similar class.

  2. When I move focus from my GL application to another window (like the Firefox window I am using right now), the swap time goes down to about 4800 usec and the frame rate goes up to 32 FPS. (I noticed that I still have a timer skew of 360 msec, which eats up 30% of my 1 second, about 20 missed frames…) This doesn’t happen on my other machine, where the timer skew is only tens of msec.

Thanks in advance!!

By the way, the window size is 1024 x 768. One thing I noticed is that “the bigger the window size, the lower the frame rate becomes”. This was what was happening when I obscured a portion of the window with Firefox.

So I tried setting WGL_SWAP_METHOD_ARB to WGL_SWAP_EXCHANGE_ARB in my pixel format, and this doesn’t seem to help either (and neither does PFD_SWAP_EXCHANGE).

I’ve seen a higher frame rate on my machine with different software… What could I be doing wrong?

Not sure this will help, but for this scenario you should probably call:
glFlush() (<- vital to do before sleeping, so queued commands start executing while you sleep)
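A minimal sketch of that ordering (the helper names here are my own illustration, not the poster’s actual code):

```c
/* Suggested frame ordering: render, flush, sleep, swap.
 * draw_scene() and remaining_budget_ms() are hypothetical stand-ins
 * for the application's rendering and 20 ms frame-budget bookkeeping. */
#include <windows.h>
#include <GL/gl.h>

extern void  draw_scene(void);           /* the app's rendering (assumed) */
extern DWORD remaining_budget_ms(void);  /* time left of the 20 ms slice (assumed) */

void frame(HDC hdc)
{
    draw_scene();
    glFlush();                     /* push queued commands so the GPU works while we sleep */
    Sleep(remaining_budget_ms());  /* sleep out the rest of the frame budget */
    SwapBuffers(hdc);              /* swap last, after the GPU has had time to draw */
}
```

This way the sleep overlaps with GPU work instead of delaying it.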

On ATI I’ve seen swap results that get worse, way worse, the closer the window is to the bottom edge of the screen. I don’t know if NVIDIA has similar issues.

SwapBuffers takes most of the time because in OpenGL the rendering commands don’t actually execute immediately. They are only enqueued for the GPU, and the GPU processes the queue asynchronously.

When you call SwapBuffers, you have to wait for an earlier frame to finish drawing.

When the frame rate varies depending on the window size, it means you are fill-rate limited. The fragment processor on the GPU is the slowest part of the whole pipeline (or the part that has the most work to do in your scene :wink: ).

Of course this means that any faster part of your system (for example the CPU) has to wait for the fragment processor to catch up, and this is happening in SwapBuffers.

Hi advil_user:

You can use gDEBugger to help you find the bottleneck.

gDEBugger’s profiling views contain performance counter graphs from Win32, NVIDIA, etc., including CPU/GPU idle, graphics memory consumption, vertex and fragment processor utilization, …

The Performance Analysis Toolbar enables you to disable stages of the graphics pipeline one by one. If the performance metrics in the profiling views improve while a certain stage is turned off, you have found a graphics pipeline bottleneck!

A 30-day trial version of gDEBugger comes with NVIDIA’s NVPerfKit. It is available at: http://developer.nvidia.com/object/nvperfkit_home.html

Let us know if you need any further assistance,

The gDEBugger team

How exactly are you using the sleep function? If you do something like Sleep(20), it will end up slow anyway.
Also, issue glFinish to get correct measurements (as already said).

Shameless plugs for commercial products aside, are you using GetTickCount() at all?

I am not using GetTickCount() but I am using QueryPerformanceCounter(), before and after each region that I am measuring.

I did place a glFinish() before SwapBuffers, and indeed this seems to be the thing that eats up most of my time (96%+).
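For reference, a sketch of this kind of region timing, with glFinish() pulled in front of the measured call so pending GPU work is charged where it belongs (assumes a valid current GL context; the function name is just illustrative):

```c
/* Time a region with QueryPerformanceCounter. glFinish() first drains
 * the GPU command queue, so the measured interval reflects the swap
 * itself rather than earlier queued rendering. */
#include <windows.h>
#include <GL/gl.h>

double measure_swap_usec(HDC hdc)
{
    LARGE_INTEGER freq, t0, t1;
    QueryPerformanceFrequency(&freq);

    glFinish();                   /* wait for all queued GPU work first */
    QueryPerformanceCounter(&t0);
    SwapBuffers(hdc);             /* the region being measured */
    QueryPerformanceCounter(&t1);

    return (double)(t1.QuadPart - t0.QuadPart) * 1e6 / (double)freq.QuadPart;
}
```

Without the glFinish(), whatever call happens to block on the queue (often SwapBuffers) absorbs all the accumulated GPU time.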

Here is what I am doing:

  • Create a compressed texture from a 1024 x 1024 x 32-bit image.

On each rendering:

  • Map a portion of the texture onto the 1024 x 768 window as a background, with multisample OFF.
  • Draw 3 semi-transparent rectangles on top of this background, some of them displaying text, with multisample ON.

Does this configuration sound like something that would take so much time that glFinish would allow only 30 FPS?

I’ll do some more experiments with my other machines…

On each rendering:

  • Map a portion of the texture onto the 1024 x 768 window as a background, with multisample OFF.
  • Draw 3 semi-transparent rectangles on top of this background, some of them displaying text, with multisample ON.

Does this configuration sound like something that would take so much time that glFinish would allow only 30 FPS?
Yes. Your card, the NVIDIA FX5200, is the kind of card that has quite a lot of features but very limited memory bandwidth. You are clearly fill limited.

Try to see if you can replace the semi-transparent quads containing text with alpha-tested text.
More details or a screenshot of what you want to render may help us to find solutions.

But it comes down to:

  1. buy a faster card.
  2. or lower your rendering resolution/number of passes/disable blending.

An FX5200 is capable of about 1.3 billion texels/s.

I wouldn’t think you could draw that many texels in a 1024x768 window with 4 quads at 30 FPS, unless you’re using a fairly high multisample level.

Advil_user, do you have multisample turned on at all? If so, at what level? What happens if you turn it off (i.e. don’t create a context with multisample)?
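If it helps, one way to be certain multisample is off is to go through the plain PIXELFORMATDESCRIPTOR path, since multisampled formats can only be requested via wglChoosePixelFormatARB with the WGL_ARB_multisample attributes (a sketch; the field values are just typical choices):

```c
/* Select a plain, non-multisampled pixel format. The classic
 * ChoosePixelFormat path cannot produce a multisampled format,
 * so this guarantees multisample is off for the context. */
#include <windows.h>

int set_plain_pixel_format(HDC hdc)
{
    PIXELFORMATDESCRIPTOR pfd = {0};
    pfd.nSize      = sizeof(pfd);
    pfd.nVersion   = 1;
    pfd.dwFlags    = PFD_DRAW_TO_WINDOW | PFD_SUPPORT_OPENGL | PFD_DOUBLEBUFFER;
    pfd.iPixelType = PFD_TYPE_RGBA;
    pfd.cColorBits = 32;
    pfd.cDepthBits = 24;

    int fmt = ChoosePixelFormat(hdc, &pfd);   /* no multisample possible here */
    return fmt != 0 && SetPixelFormat(hdc, fmt, &pfd);
}
```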

A Sleep call can be a reason for the slowdown. Depending on the underlying hardware, OS and drivers, Sleep(1) may take from 5 to 30 ms. Better to use multimedia timers.
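For example, raising the timer resolution through the winmm multimedia timer API makes Sleep() behave much closer to its argument (a sketch; the rendering call is a placeholder):

```c
/* Request 1 ms scheduler granularity so Sleep(1) is close to 1 ms
 * instead of the default 10-15 ms timer tick. Link against winmm.lib. */
#include <windows.h>
#include <mmsystem.h>

void run_frames(int nframes)
{
    timeBeginPeriod(1);              /* request 1 ms timer granularity */
    for (int i = 0; i < nframes; ++i) {
        /* render + SwapBuffers would go here */
        Sleep(1);                    /* now close to 1 ms, not 10-15 ms */
    }
    timeEndPeriod(1);                /* always pair with timeBeginPeriod */
}
```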


I think I am creating a multisample rendering context with 4 samples (although I need to verify that on my home PC later tonight).

I have multisample ON when I draw characters, but not when I am mapping the texture onto the 1024x768 background. Let me try not printing any characters and see if I get the same refresh rate. (Again, I will try this later at home… :slight_smile: )

I think I understand my problem better now.

It seems that multisampling in particular is taking a toll on my graphics card (NVIDIA FX5200).

Regardless of the pixel format I choose (even one that supports multisample), if I turn multisample off and just use GL_POLYGON_SMOOTH, I can easily get 50 FPS on my home PC displaying the same text as intended.
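For anyone trying the same thing, the classic GL_POLYGON_SMOOTH setup looks roughly like this (a sketch; GL_SRC_ALPHA_SATURATE is the textbook blend for edge antialiasing, and quality is driver-dependent):

```c
/* Classic polygon edge antialiasing without multisample.
 * Requires blending; the saturate blend accumulates edge coverage. */
#include <GL/gl.h>

void enable_polygon_smoothing(void)
{
    glEnable(GL_POLYGON_SMOOTH);
    glHint(GL_POLYGON_SMOOTH_HINT, GL_NICEST);
    glEnable(GL_BLEND);
    glBlendFunc(GL_SRC_ALPHA_SATURATE, GL_ONE);  /* classic edge-AA blend */
    glDisable(GL_DEPTH_TEST);  /* polygon smooth does not mix well with depth testing */
}
```

Note that this technique assumes roughly front-to-back ordering and no depth buffer, which is why it works for overlaid text but not for general 3D scenes.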

Another note of interest: on my home machine (with the NVIDIA FX5200), the anti-aliasing quality of the text displayed with just GL_POLYGON_SMOOTH is almost as good as that produced by multisampling.

It seems that I can live with just disabling multisample support on my home machine.

On the other hand, there is a visible difference between GL_POLYGON_SMOOTH and multisampling on my work machine… it seems that the ATI card is better at multisampling than at polygon smoothing.

By the way, I changed my code to use tamlin’s render, glFlush, sleep, swap suggestion, and this seems to be a reasonable approach!

Thanks for all the replies on this thread.

Good luck with z-buffering. Multisample is essential for depth-tested AA.

You should also realize that the time taken by swap may include idle time waiting for the next refresh, or blocking on stuff in the command queue to clear (swap blocks if another swap is already pending).

So you may not be measuring what you’re actually drawing. Remember that multisample also has more area to fill, depending on the implementation, so by going to alpha-based antialiasing you may have reduced the pixel fill requirements considerably, eliminating a big part of why you were blocking on swap and changing your numbers considerably.

I doubt you’ve fully understood what’s happening, or appreciate the implications of going with polygon smooth.