glGetXXX costs 15 milliseconds! Why?

I recently wrote a test application to profile my OpenGL render engine's performance.
The test application draws a mob model 1000 times, using a GLSL vertex shader to calculate the skeletal animation. The output is correct, but the FPS is low.

After doing some profiling, I found that a glGetIntegerv(GL_VIEWPORT, vp) call cost me 15 milliseconds, so I searched the internet and found someone saying that:
glGetIntegerv(GL_VIEWPORT, vp) causes the CPU to wait for the GL command buffer to drain.

But after deleting this glGetIntegerv call the FPS was still low, and this time I found that a glGetFloatv(GL_TRANSPOSE_MODELVIEW_MATRIX, v) call cost me 15 milliseconds. This is driving me crazy.

Is there something tricky about glGetXXX that I must be careful of? Why does such a trivial call cost 15 milliseconds?
I do not use multithreading in this application, so why does a glGetXXX call cause the CPU to wait?

Test environments:
1. Pentium 4 D 2.8 GHz, GeForce 7950 GT, driver 169.21, WinXP SP2
2. Core 2 6300 1.8 GHz, ATI 1650XT, driver 8.3, WinXP SP2

Any tip would be appreciated.

I don’t know if there’s a trick to make the call return faster (I doubt it). A good idea is to avoid those calls whenever possible (NVIDIA and ATI mention this in their performance papers).

However, I am sure your CPU can do it faster, and if you only need the viewport and the transposed modelview matrix, the code is very simple and would run lightning fast. If you need the inverse it gets a bit worse, but it is still much better than 15 ms. Look up Gauss-Jordan elimination for computing the inverse of a matrix; you can probably also find source code in your language.
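For example, the transpose and the Gauss-Jordan inverse can be done on the CPU in a few lines of portable C. This is only a sketch (the mat4_* names are my own); matrices are column-major float[16], as OpenGL stores them:

```c
#include <math.h>

/* Transpose a 4x4 matrix -- what GL_TRANSPOSE_MODELVIEW_MATRIX returns,
   computed here from a locally cached modelview matrix instead. */
void mat4_transpose(const float m[16], float out[16])
{
    for (int r = 0; r < 4; ++r)
        for (int c = 0; c < 4; ++c)
            out[c * 4 + r] = m[r * 4 + c];
}

/* Invert a 4x4 matrix by Gauss-Jordan elimination with partial pivoting.
   Returns 0 if the matrix is singular, 1 on success. */
int mat4_invert(const float m[16], float out[16])
{
    float a[4][8]; /* augmented matrix [m | I], row-major while we work */
    for (int r = 0; r < 4; ++r)
        for (int c = 0; c < 4; ++c) {
            a[r][c]     = m[c * 4 + r];          /* column-major -> row-major */
            a[r][c + 4] = (r == c) ? 1.0f : 0.0f;
        }
    for (int col = 0; col < 4; ++col) {
        int pivot = col;                          /* partial pivoting */
        for (int r = col + 1; r < 4; ++r)
            if (fabsf(a[r][col]) > fabsf(a[pivot][col])) pivot = r;
        if (fabsf(a[pivot][col]) < 1e-8f) return 0;
        if (pivot != col)
            for (int c = 0; c < 8; ++c) {
                float t = a[col][c]; a[col][c] = a[pivot][c]; a[pivot][c] = t;
            }
        float inv = 1.0f / a[col][col];           /* normalize pivot row */
        for (int c = 0; c < 8; ++c) a[col][c] *= inv;
        for (int r = 0; r < 4; ++r) {             /* eliminate other rows */
            if (r == col) continue;
            float f = a[r][col];
            for (int c = 0; c < 8; ++c) a[r][c] -= f * a[col][c];
        }
    }
    for (int r = 0; r < 4; ++r)                   /* row-major -> column-major */
        for (int c = 0; c < 4; ++c)
            out[c * 4 + r] = a[r][c + 4];
    return 1;
}
```

On hardware of that era this runs in well under a microsecond, so it is never the bottleneck the glGet stall is.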

You can try calling glFlush() once in a while, e.g.

for (unsigned int i = 0; i < 1000; ++i)
{
    DrawModel(i);                 /* stands in for your per-model draw batch */
    if (i % 100 == 0)
        glFlush();                /* hand queued commands to the driver early */
}

The answer is simple: the Get operations are slow and should not be used. There is no reason to use them in the first place anyway (if you want performance); you can track the data you need (like matrices) yourself. The call has to wait because all pending operations must finish before the correct state can be returned.

Could it be that the glGetXXX() you are calling implicitly implies a glFinish()? Then the driver is stuck until the GPU is back in sync with the CPU. In my own engine I record all my matrices & render states locally, so I never have to call any glGetXXX() function. Also, driver calls are bad for CPU performance :o)
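The "record matrices locally" idea can be sketched as follows (a minimal sketch; the mv_* names are hypothetical). The CPU keeps a shadow copy of the modelview matrix so glGetFloatv is never needed at draw time:

```c
#include <string.h>

/* Shadow copy of the modelview matrix, column-major as in OpenGL. */
static float g_modelview[16];

void mv_load_identity(void)
{
    static const float I[16] = { 1,0,0,0, 0,1,0,0, 0,0,1,0, 0,0,0,1 };
    memcpy(g_modelview, I, sizeof I);
    /* a real engine would also call glLoadIdentity() here */
}

void mv_translate(float x, float y, float z)
{
    /* post-multiply by a translation, exactly as glTranslatef does:
       new column 3 = x*col0 + y*col1 + z*col2 + col3 */
    for (int r = 0; r < 4; ++r)
        g_modelview[12 + r] += g_modelview[r]     * x
                             + g_modelview[4 + r] * y
                             + g_modelview[8 + r] * z;
    /* a real engine would also call glTranslatef(x, y, z) here,
       or upload g_modelview with glLoadMatrixf */
}

/* Instead of glGetFloatv(GL_MODELVIEW_MATRIX, out): no driver call, no stall. */
const float *mv_current(void) { return g_modelview; }
```

The same shadowing works for the viewport, bound textures, blend state, and so on; the only rule is that every state change must go through the wrapper.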

In my experience this cost was introduced when NVIDIA put their driver in its own thread. Every GL call is pushed onto the driver thread's queue, including all glGets. So you will be waiting for all pending commands to be flushed before it gets to your glGet, at which point the result is passed back to the app's blocked thread. Only then can you continue to add more GL commands to the driver queue; meanwhile the GPU is completely idle.

Nice. Where is that official list of deprecated OpenGL commands? I’m reading this twice a week here: “should not be used (anymore)”.


List of deprecated OpenGL features :

indexed color mode
single buffered rendering
gluBuildMipmaps (anything glu* should not be used in production anyway)
feedback / selection modes
accum buffer
pbuffers (use FBO instead)
pre-shader systems such as texture combiners, etc.
anything not VBO for vertex operations
anything not PBO for pixel operations

Anything left ?

Nice again. Now someone please go to that “SDK” and add a

** DO NOT USE **

tag to all related entry points.


The accum buffer is alright; it has been hardware accelerated for a few years now.
I would add glDrawPixels, glBitmap, wglUseFontBitmaps, and glaux.

I would add parts of the imaging subset (histogram, convolution, color matrix?).

I am sure that not all Get commands are slow.
I was not able to use glPushAttrib to store the currently bound FBO, so I am using:
glGetIntegerv(GL_FRAMEBUFFER_BINDING_EXT, &oldFrameBuffer);

and I tested it quite a lot to be sure it doesn't slow down the rendering. I found it takes 0.003 ms, measured by performance counters.
(NVIDIA, WinXP)
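For reference, the save/restore pattern being described looks like this (a sketch; myFbo is a hypothetical FBO id, and the pre-3.0 GL_EXT_framebuffer_object names are assumed):

```c
/* Save the currently bound FBO, do some offscreen work, restore it.
   This particular glGetIntegerv is cheap because the binding is
   plain driver-side state; no GPU synchronization is required. */
GLint oldFrameBuffer = 0;
glGetIntegerv(GL_FRAMEBUFFER_BINDING_EXT, &oldFrameBuffer);

glBindFramebufferEXT(GL_FRAMEBUFFER_EXT, myFbo);   /* render to texture... */

glBindFramebufferEXT(GL_FRAMEBUFFER_EXT, (GLuint)oldFrameBuffer); /* restore */
```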

No, I can confirm that glGetIntegerv(GL_VIEWPORT, vp) at least is prohibitively slow. I was trying to make a simple GUI library that could be dropped into any GL app completely transparently. Alas, it was not to be: the user has to pass this information to the GUI library whenever the viewport changes. What a shame :wink:

glGetXXX can be quite useful at startup to query the driver for constants (and, of course, extensions). But at that stage things are not really performance bound…

a) Don’t rely on GL_PROXY_TEXTURE
b) I would remove “anything not VBO…” and add “immediate mode for vertex operations” (because display lists can be the right solution in some cases)

Thank you all so much for your replies; I learned a lot.
I will try caching the render state locally from now on.

I tried removing all the glGetFloatv/glGetIntegerv calls and caching the matrices I need, but the call to SwapBuffers costs me 15 ms now; this 15 ms cost is like a ghost haunting this application.

I tried calling glFlush before SwapBuffers; no difference.

According to my profiling data, each animated model draw batch costs only 0.004 ms (just a post to the driver thread's queue?), and the final 15 ms is probably the presentation done by the multithreaded driver.

Any idea?

It sounds like you are using Vertical Synchronisation.

So how would you create mipmaps then?

It sounds like you are using Vertical Synchronisation.

Nope, I turn off vertical synchronisation with wglSwapIntervalEXT(0), and if I draw only 10 mob models the FPS reaches more than 1000 on my machine. If I draw 1000, the FPS drops to 21 and SwapBuffers costs 15 ms.

So how would you create mipmaps then?

I create the model's mipmapped texture by feeding each mipmap level's image data with a call to
glCompressedTexImage2DARB(GL_TEXTURE_2D, i, srcImageFormat, nWidth, nHeight, 0, psize[i], pdata[i]);
and I am sure nothing calls gluBuild2DMipmaps.

I added a glFinish call just before SwapBuffers and found that glFinish costs 15 ms now. It seems the GPU is slow at calculating the bone-animation vertex shader asynchronously, and when I call glFinish the CPU waits for it.

I also tested this app with my DirectX render engine drawing the same 1000 models; it reaches 30 FPS. It seems DirectX does not have this 15 ms problem, so it can reach 1000 / (1000 / 21 - 15) ≈ 30 FPS.
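That estimate can be spelled out: at 21 FPS a frame takes 1000/21 ≈ 47.6 ms; removing the 15 ms stall leaves ≈ 32.6 ms per frame, which is ≈ 30.7 FPS. A quick check (the numbers are the ones measured above):

```c
/* Frame-time arithmetic: convert FPS to ms per frame, subtract the
   measured stall, convert back to FPS. */
double fps_without_stall(double fps_measured, double stall_ms)
{
    double frame_ms = 1000.0 / fps_measured;   /* 21 FPS -> ~47.6 ms */
    return 1000.0 / (frame_ms - stall_ms);     /* ~32.6 ms -> ~30.7 FPS */
}
```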

So how would you create mipmaps then?

Hardware accelerated :
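The post breaks off here, but the hardware-accelerated paths of that era would look like this (a sketch; it assumes GL_EXT_framebuffer_object for glGenerateMipmapEXT, or the older GL_SGIS_generate_mipmap extension, and tex, w, h, pixels are placeholders):

```c
/* Option 1: ask the driver to (re)generate the whole mipmap chain on
   the GPU.  Requires GL_EXT_framebuffer_object. */
glBindTexture(GL_TEXTURE_2D, tex);
glGenerateMipmapEXT(GL_TEXTURE_2D);

/* Option 2: have the driver rebuild mipmaps automatically whenever the
   base level is uploaded.  Requires GL_SGIS_generate_mipmap. */
glTexParameteri(GL_TEXTURE_2D, GL_GENERATE_MIPMAP, GL_TRUE);
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, w, h, 0,
             GL_RGBA, GL_UNSIGNED_BYTE, pixels);
```

Either way avoids gluBuild2DMipmaps, which resamples every level on the CPU.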