I recently wrote a test application to profile the performance of my OpenGL render engine.
The test application draws a mob model 1000 times, using a GLSL vertex shader to calculate the skeletal animation. The output is correct, but the FPS is low.
After doing some profiling, I found that a glGetIntegerv(GL_VIEWPORT, vp) call costs me 15 milliseconds, so I searched the internet and found someone saying that:
glGetIntegerv(GL_VIEWPORT, vp) causes the CPU to wait for the GL command buffer to clear.
But after deleting this glGetIntegerv call, the FPS was still low, and this time I found that a glGetFloatv(GL_TRANSPOSE_MODELVIEW_MATRIX, v) call costs me 15 milliseconds. This is driving me crazy.
Are there some tricky things about glGetXXX that I must be careful of? Why does it cost as much as 15 milliseconds?
I did not use multithreading in this application, so why does it cause the CPU to wait for a trivial glGetXXX call?
1. Pentium 4 D 2.8 GHz, GeForce 7950GT, driver 169.21, WinXP SP2
2. Core 2 6300 1.8 GHz, ATI 1650XT, driver 8.3, WinXP SP2
I don’t know if there’s a trick to make the call return faster (I doubt it). A good idea is to avoid those calls whenever possible (NVIDIA and ATI both mention that in their performance papers).
However, I am sure your CPU(s) can do it faster, and if you only need the viewport and the transposed modelview matrix, the code is very simple and would run lightning fast. If you need the inverse, it gets a bit worse but is still much better than 15 ms. You can look up Gauss-Jordan elimination for computing the inverse of a matrix, but you can probably also find source code in your language.
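For example, a minimal sketch of the transpose, assuming you already keep your own copy of the modelview matrix as a column-major float[16] (the layout GL uses):

// mv is the modelview matrix you track yourself (column-major, like GL).
// Writes its transpose into out - the same values glGetFloatv(GL_TRANSPOSE_MODELVIEW_MATRIX, out)
// would return, but without stalling the driver.
void TransposeMatrix4(const float mv[16], float out[16])
{
    for (int c = 0; c < 4; ++c)
        for (int r = 0; r < 4; ++r)
            out[c * 4 + r] = mv[r * 4 + c];
}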
The answer is simple: the Get operations are slow and should not be used. There is no reason to use them in the first place anyway (if you want performance). You can track the data you need (like matrices) yourself. The call has to wait because all pending operations have to be finished in order to return the correct state.
Could it be that the glGetXXX() you are calling implicitly implies a glFinish()? Then the driver gets stuck until the GPU is back in sync with the CPU. In my own engine I record all my matrices & render states locally, so I never have to call any glGetXXX() function. Also, driver calls are bad for CPU performance :o)
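To illustrate what I mean by recording state locally, something like this (just a sketch, the names are made up): wrap the state-setting call, cache the values, and read the cache instead of asking the driver.

// Hypothetical wrapper: cache the viewport as you set it, so you never need
// glGetIntegerv(GL_VIEWPORT, ...) again.
static GLint g_viewport[4];

void MyViewport(GLint x, GLint y, GLsizei w, GLsizei h)
{
    g_viewport[0] = x; g_viewport[1] = y;
    g_viewport[2] = w; g_viewport[3] = h;
    glViewport(x, y, w, h);
}

const GLint* MyGetViewport() { return g_viewport; }   // no driver round-trip, no stall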
In my experience this cost was introduced when NVIDIA put their driver in its own thread. Every GL call is pushed onto the driver thread's queue, including all glGets. So you'll be waiting for all pending commands to be flushed before it gets to your glGet, at which point the result is passed back to the app's blocked thread. Only then can you continue to add more GL commands to the driver queue; meanwhile the GPU is completely idle.
Other slow or legacy paths to avoid:
indexed color mode
single buffered rendering
gluBuild2DMipmaps (anything glu* should not be used in production anyway)
feedback selection mode
pbuffers (use FBO instead)
pre-shader systems such as texture combiners, etc.
anything not VBO for vertex operations (see the sketch after this list)
anything not PBO for pixel operations
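For reference, a minimal VBO setup looks roughly like this (a sketch only; it assumes GL 1.5 entry points are available, e.g. via GLEW or wglGetProcAddress, and that vertexData / vertexCount are your own position-only vertex array):

// Create and fill a static vertex buffer once, at load time.
GLuint vbo = 0;
glGenBuffers(1, &vbo);
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glBufferData(GL_ARRAY_BUFFER, vertexCount * 3 * sizeof(float), vertexData, GL_STATIC_DRAW);

// Each frame: draw straight from the buffer instead of client-side arrays.
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glEnableClientState(GL_VERTEX_ARRAY);
glVertexPointer(3, GL_FLOAT, 0, (const void*)0);
glDrawArrays(GL_TRIANGLES, 0, vertexCount);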
No, I can confirm that glGetIntegerv(GL_VIEWPORT, vp) at least is prohibitively slow. I was trying to make a simple GUI library that could be dropped into any GL app completely transparently. Alas, it was not to be - the user has to pass this information to the GUI library whenever they change the viewport. What a shame.
It sounds like you are using Vertical Synchronisation.
Nope, I turn off vertical synchronisation with wglSwapIntervalEXT(0), and if I draw only 10 mob models the FPS can reach more than 1000 on my machine. If I draw 1000, the FPS drops down to 21 and the SwapBuffers call costs 15 ms.
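For completeness, this is the usual way that call is set up on Windows (a sketch; it assumes the WGL_EXT_swap_control extension is supported and a GL context is current):

#include <windows.h>

typedef BOOL (WINAPI *PFNWGLSWAPINTERVALEXTPROC)(int interval);

PFNWGLSWAPINTERVALEXTPROC pwglSwapIntervalEXT =
    (PFNWGLSWAPINTERVALEXTPROC)wglGetProcAddress("wglSwapIntervalEXT");
if (pwglSwapIntervalEXT)
    pwglSwapIntervalEXT(0);   // 0 = no vsync wait, present as fast as possible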
So how would you create mipmaps then?
I create the model's mipmapped textures by uploading each mipmap level's image data with a call to
glCompressedTexImage2DARB(GL_TEXTURE_2D, i, srcImageFormat, nWidth, nHeight, 0, psize[i], pdata[i]);
and I am sure no one calls gluBuild2DMipmaps.
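The whole upload loop looks roughly like this (a sketch; texId, nMipLevels, psize, pdata and srcImageFormat are whatever the model loader provides, and width/height halve each level):

glBindTexture(GL_TEXTURE_2D, texId);
int w = nWidth, h = nHeight;
for (int i = 0; i < nMipLevels; ++i)
{
    // Upload pre-compressed data for mip level i (no gluBuild2DMipmaps involved).
    glCompressedTexImage2DARB(GL_TEXTURE_2D, i, srcImageFormat, w, h, 0, psize[i], pdata[i]);
    w = (w > 1) ? w / 2 : 1;
    h = (h > 1) ? h / 2 : 1;
}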
I added a glFinish call just before SwapBuffers and found that glFinish now costs 15 ms. It seems the GPU is slow at calculating the bone-animation vertex shader asynchronously, and when I call glFinish the CPU has to wait for it.
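Roughly how I measured it (a sketch using a high-resolution Windows timer; hDC is the app's device context):

#include <windows.h>

LARGE_INTEGER freq, t0, t1;
QueryPerformanceFrequency(&freq);
QueryPerformanceCounter(&t0);
glFinish();                                  // block until the GPU has drained its queue
QueryPerformanceCounter(&t1);
double waitMs = 1000.0 * (t1.QuadPart - t0.QuadPart) / freq.QuadPart;
// ~15 ms here means the CPU is simply waiting on the GPU: the frame is GPU-bound.
SwapBuffers(hDC);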
I also tested this app with my DirectX render engine, drawing the same 1000 models, and it can reach 30 FPS. It seems DirectX does not have this 15 ms problem, which matches: 1000 / (1000 / 21 - 15) ≈ 30 FPS.