glGetXXX costs 15 milliseconds! Why?


linghuye March 10, 2008, 4:53am 1

I recently wrote a test application to profile the performance of my OpenGL render engine.
The test application draws a mob model 1000 times, using a GLSL vertex shader for the skeletal animation calculations. The output is correct, but the FPS is low.

After some profiling, I found that a single glGetIntegerv(GL_VIEWPORT, vp) call costs me 15 milliseconds, so I searched the internet and found someone saying that
glGetIntegerv(GL_VIEWPORT, vp) causes the CPU to wait for the GL command buffer to clear.

But after deleting this glGetIntegerv call, the FPS is still low, and this time I found that a glGetFloatv(GL_TRANSPOSE_MODELVIEW_MATRIX, v) call costs me 15 milliseconds. This is driving me crazy.

Is there something tricky about glGetXXX that I must be careful about? Why would such a call cost 15 milliseconds?
I do not use multithreading in this application, so why does a trivial glGetXXX call cause the CPU to wait?

Test environments
1. Pentium4 D CPU 2.8G, Geforce 7950GT, driver 169.21, WinXP SP2
2. Core2 6300 1.8G, ATI 1650XT, driver 8.3, WinXP SP2

Any tip would be appreciated.

Nicolai_de_Haan March 10, 2008, 5:19am 2

I don’t know if there’s a trick to make the call return faster (I doubt it). A good idea is to avoid those calls whenever possible (nvidia and ati mention that in their performance papers).

However I am sure your CPU(s) can do it faster, and if you only need the viewport and the transpose MV the code is very simple and would run lightning fast. If you need inverse, it gets a bit worse but still much better than 15 ms. I think you can look for Gauss-Jordan elimination for computing the inverse of a matrix but you can probably also find source code in your language.
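As a rough illustration of how cheap the CPU-side version is, here is a minimal sketch of a transpose for a 4x4 matrix stored in OpenGL's column-major layout. It assumes the application already keeps its own copy of the modelview matrix; the names `Mat4` and `transposed` are made up for this example and do not come from any library.

```cpp
#include <array>
#include <cstddef>

// 4x4 matrix stored as 16 floats in OpenGL's column-major order:
// element (row r, column c) lives at index c * 4 + r.
using Mat4 = std::array<float, 16>;

// Hypothetical helper: returns the transpose of m, so the application
// does not need glGetFloatv(GL_TRANSPOSE_MODELVIEW_MATRIX, ...).
Mat4 transposed(const Mat4& m) {
    Mat4 t{};
    for (std::size_t r = 0; r < 4; ++r)
        for (std::size_t c = 0; c < 4; ++c)
            t[r * 4 + c] = m[c * 4 + r];  // swap row and column
    return t;
}
```

This is 16 copies in ordinary memory and never touches the driver.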

NiCo1 March 10, 2008, 5:42am 3

You can try calling glFlush() once in a while e.g.

for (unsigned int i = 0; i < 1000; ++i)
{
    drawModel();
    glFlush();
}

Zengar March 10, 2008, 6:09am 4

The answer is simple: the Get operations are slow and should not be used. There is no reason to use them in the first place anyway (if you want performance). You can track data you need (like matrices) yourself. The GPU has to wait for the call because all current operations have to be finished in order to get the correct state.

despoke March 10, 2008, 6:23am 5

Could it be that the glGetXXX() you are calling implicitly triggers a glFinish()? Thus the driver gets stuck until the GPU is back in sync with the CPU. In my own engine, I record all my matrices & render states locally so I never have to call any glGetXXX() function. Also, driver calls are bad for CPU performance :o)
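Recording state locally, as suggested above, might look roughly like the following for the viewport. This is only a sketch: `ViewportCache` is a hypothetical class, and the `apply` callback stands in for the real GL call so that the example compiles without an OpenGL context.

```cpp
#include <array>
#include <cassert>

// Hypothetical helper: remembers the last viewport the application set,
// so the value never has to be read back from the driver.
class ViewportCache {
public:
    // `apply` stands in for the real GL call. In an OpenGL program it could be
    // [](int x, int y, int w, int h) { glViewport(x, y, w, h); }.
    template <class ApplyFn>
    void set(int x, int y, int w, int h, ApplyFn&& apply) {
        apply(x, y, w, h);     // forward to the driver
        rect_ = {x, y, w, h};  // keep a local copy
        valid_ = true;
    }

    // Replacement for glGetIntegerv(GL_VIEWPORT, vp): a plain memory read.
    const std::array<int, 4>& get() const {
        assert(valid_ && "viewport queried before it was ever set");
        return rect_;
    }

private:
    std::array<int, 4> rect_{};
    bool valid_ = false;
};
```

The cache is only correct if every viewport change in the program goes through `set`; any code that calls glViewport directly would leave it stale.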

knackered March 10, 2008, 6:37am 6

In my experience this cost was introduced when nvidia put their driver in its own thread. Every GL call is pushed onto the driver thread’s queue, including all glGets. So you’ll be waiting for all pending commands to be flushed before it gets to your glGet, at which point the result will be passed back to the app’s blocked thread. Only then can you continue to add more GL commands to the driver queue; meanwhile the GPU is completely idle.

CatDog March 10, 2008, 10:48am 7

Nice. Where is that official list of deprecated OpenGL commands? I’m reading this twice a week here: “should not be used (anymore)”.

CatDog

ZbuffeR March 10, 2008, 11:00am 8

List of deprecated OpenGL features :

indexed color mode
single buffered rendering
gluBuildMipmaps (anything glu* should not be used in production anyway)
feedback selection mode
accum buffer
pbuffers (use FBO instead)
pre-shader systems such as texture combiners, etc.
anything not VBO for vertex operations
anything not PBO for pixel operations
glGet*

Anything left ?

CatDog March 10, 2008, 11:07am 9

Nice again. Now someone please go to that “SDK” and add a

** DO NOT USE **

tag to all related entry points.

CatDog

system March 10, 2008, 11:17am 10

Accum buffer is alright. It has been hw accelerated for a few years now.
I would add glDrawPixels, glBitmap, wglUseFontBitmaps, glaux.

NiCo1 March 10, 2008, 11:21am 11

I would add parts from the imaging subset (histogram,convolution,colormatrix?)

mfort March 10, 2008, 11:27am 12

I am sure that not all Get commands are slow.
I was not able to use glPushAttrib to save the currently bound FBO, so I am using:
glGetIntegerv(GL_FRAMEBUFFER_BINDING_EXT, &oldFrameBuffer);

and tested it quite a lot to be sure it doesn’t slow down the rendering. I found that it takes 0.003 ms, measured with performance counters.
(NVidia, WinXP)
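A measurement like this can be approximated with a small timing helper. The sketch below uses `std::chrono::steady_clock` in place of whatever performance counters were used in the post above; `averageMs` is a made-up name, and the callable passed in would wrap the GL call under test.

```cpp
#include <chrono>

// Hypothetical helper: runs `fn` `iterations` times and returns the
// average wall-clock cost of one call in milliseconds.
template <class Fn>
double averageMs(Fn&& fn, int iterations) {
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    for (int i = 0; i < iterations; ++i)
        fn();
    const std::chrono::duration<double, std::milli> elapsed = clock::now() - start;
    return elapsed.count() / iterations;
}
```

Note that this measures CPU-side time only. A get that forces the CPU to wait for pending GL work will show that wait in the result, so the same call can look cheap or expensive depending on how much work is queued when it runs.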

knackered March 10, 2008, 12:39pm 13

No, I can confirm that glGetIntegerv(GL_VIEWPORT, vp) at least is prohibitively slow. I was trying to make a simple gui library that could be dropped into any GL app completely transparently. Alas, it was not to be: the user has to pass this information to the GUI library whenever they change the viewport. What a shame.

thinks March 10, 2008, 5:09pm 14

glGetXXX can be quite useful at startup to query the driver for constants (and, of course, extensions). But at that stage things are not really performance bound…

Nicolai_de_Haan March 10, 2008, 6:24pm 15

a) Don’t rely on GL_PROXY_TEXTURE
b) I would remove “anything not VBO…” but add “immediate mode for vertex operations” (because DL can be the right solution in some cases)

linghuye March 10, 2008, 6:52pm 16

Thank you all so much for your replies. I learned a lot.
I will try caching the render state locally from now on.

linghuye March 10, 2008, 10:32pm 17

I tried removing all the glGetFloatv/glGetIntegerv calls and caching the matrices I need, but the call to

::SwapBuffers(m_hDeviceDC);

costs me 15 ms now. This 15 ms cost is like a ghost haunting this application.

I tried calling glFlush before SwapBuffers; no difference.

According to my profile data, each animated model draw batch costs only 0.004 ms (just a post to the driver thread?), and the final 15 ms may be the multithreaded driver presenting the frame.

Any idea?

lodder March 11, 2008, 1:04am 18

It sounds like you are using Vertical Synchronisation.

So how would you create mipmaps then?

linghuye March 11, 2008, 2:19am 19

It sounds like you are using Vertical Synchronisation.

Nope, I turned off Vertical Synchronisation with wglSwapIntervalEXT(0), and if I draw only 10 mob models the FPS can reach more than 1000 on my machine. If I draw 1000, the FPS drops to 21 and SwapBuffers costs 15 ms.

So how would you create mipmaps then?

I create the mipmapped model textures by uploading the image data for each mipmap level with
glCompressedTexImage2DARB(GL_TEXTURE_2D, i, srcImageFormat, nWidth, nHeight, 0, psize[i], pdata[i]);
And I am sure nothing calls gluBuild2DMipmaps.

I added a glFinish call just before SwapBuffers and found that glFinish now costs the 15 ms. It seems the GPU is slow at running the bone animation vertex shader asynchronously, and when glFinish is called, the CPU waits for it.

I also tested this app with my DirectX render engine, drawing the same 1000 models, and it can reach 30 FPS. It seems DirectX does not have this 15 ms problem, so it can reach about 1000 / (1000 / 21 - 15) = 30 FPS.
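The estimate above converts the measured frame rate to a frame time, subtracts the 15 ms stall, and converts back. A small sketch of that arithmetic, with `fpsWithoutStall` as a made-up helper name:

```cpp
#include <cassert>

// Hypothetical helper: frame rate that would result if `stallMs`
// were removed from every frame.
double fpsWithoutStall(double measuredFps, double stallMs) {
    const double frameMs = 1000.0 / measuredFps;  // 21 FPS -> about 47.6 ms
    assert(frameMs > stallMs);                    // the stall cannot exceed the frame
    return 1000.0 / (frameMs - stallMs);          // about 32.6 ms -> about 30.7 FPS
}
```

With the numbers from the post, `fpsWithoutStall(21.0, 15.0)` comes out a little under 31, consistent with the roughly 30 FPS reported for the DirectX build.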

ZbuffeR March 11, 2008, 2:52am 20

So how would you create mipmaps then?

Hardware accelerated :
glTexParameteri( GL_TEXTURE_2D, GL_GENERATE_MIPMAP_SGIS, GL_TRUE );