Avoiding "round-trip" API calls and performance

It is said that calls such as glGetFloatv, glGetIntegerv, glIsEnabled, glGetError and glGetString require a slow round-trip transaction between the application and the renderer. Currently I'm doing frustum culling with spheres, and before rendering each object I extract the modelview matrix from OpenGL:

GLfloat CurrentModelView[16];
glGetFloatv( GL_MODELVIEW_MATRIX, CurrentModelView );

(only the modelview, since the projection is changed outside of the main loop)

Is this OK in general, or is it better to track the modelview matrix in the application (by multiplying the transformation matrices myself)?

I would do all the matrix math myself and load the result into GL via glLoadMatrix. The glGetFloatv will almost certainly stall the CPU.
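
A minimal sketch of that approach (the helper and variable names here are mine, not from the thread): do the multiply on the CPU and only ever push the result into GL, so nothing has to be read back.

#include <GL/gl.h>

/* Column-major 4x4 multiply, out = a * b (OpenGL convention). */
static void mat4_mul(GLfloat out[16], const GLfloat a[16], const GLfloat b[16])
{
    for (int col = 0; col < 4; ++col)
        for (int row = 0; row < 4; ++row)
            out[col*4 + row] = a[0*4 + row] * b[col*4 + 0]
                             + a[1*4 + row] * b[col*4 + 1]
                             + a[2*4 + row] * b[col*4 + 2]
                             + a[3*4 + row] * b[col*4 + 3];
}

static GLfloat g_view[16];  /* camera transform, tracked by the application */

void draw_object(const GLfloat model[16])
{
    GLfloat mv[16];
    mat4_mul(mv, g_view, model);   /* CPU-side product, no readback */
    glMatrixMode(GL_MODELVIEW);
    glLoadMatrixf(mv);             /* upload, never glGet */
    /* mv is now also available for sphere frustum culling, for free */
}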

I have about 100 objects in the scene, and the scene-graph transform hierarchy for each object can be up to 3 matrices deep. That means up to 300 matrix multiplications per frame. Is that still OK? I'm really worried about it.

Represent your rotation as a quaternion, your translation as a vec3f, and your scale as a vec3f.
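
As a sketch of what that compact form expands to (type and function names here are hypothetical, and the quaternion is assumed to be unit-length), M = T * R * S in column-major GL layout:

typedef struct { float x, y, z, w; } quat;
typedef struct { float x, y, z; } vec3f;

typedef struct {
    quat  rot;    /* rotation, assumed unit-length */
    vec3f pos;    /* translation */
    vec3f scale;  /* per-axis scale */
} transform;

/* Expand to a column-major OpenGL matrix, M = T * R * S. */
void transform_to_mat4(const transform *t, float m[16])
{
    const quat q = t->rot;
    float xx = q.x*q.x, yy = q.y*q.y, zz = q.z*q.z;
    float xy = q.x*q.y, xz = q.x*q.z, yz = q.y*q.z;
    float wx = q.w*q.x, wy = q.w*q.y, wz = q.w*q.z;

    m[0]  = (1 - 2*(yy + zz)) * t->scale.x;  /* column 0: scaled X axis */
    m[1]  = (    2*(xy + wz)) * t->scale.x;
    m[2]  = (    2*(xz - wy)) * t->scale.x;
    m[3]  = 0;
    m[4]  = (    2*(xy - wz)) * t->scale.y;  /* column 1: scaled Y axis */
    m[5]  = (1 - 2*(xx + zz)) * t->scale.y;
    m[6]  = (    2*(yz + wx)) * t->scale.y;
    m[7]  = 0;
    m[8]  = (    2*(xz + wy)) * t->scale.z;  /* column 2: scaled Z axis */
    m[9]  = (    2*(yz - wx)) * t->scale.z;
    m[10] = (1 - 2*(xx + yy)) * t->scale.z;
    m[11] = 0;
    m[12] = t->pos.x;                        /* column 3: translation */
    m[13] = t->pos.y;
    m[14] = t->pos.z;
    m[15] = 1;
}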

Well, those multiplications were being done before anyway, just by GL.
When you do them yourself, add some sort of dirty flag, so that if things remain static you don't recalculate everything each frame. That would actually be faster than relying on the GL matrices…
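
Something like this rough sketch (it reuses the hypothetical transform_to_mat4 and mat4_mul helpers from the snippets above):

#include <string.h>

typedef struct node {
    transform    local;      /* compact local transform */
    struct node *parent;     /* NULL for the root */
    float        world[16];  /* cached world matrix */
    int          dirty;      /* set when local (or any ancestor) changes */
} node;

const float *node_world(node *n)
{
    if (n->dirty) {
        float local[16];
        transform_to_mat4(&n->local, local);
        if (n->parent)
            mat4_mul(n->world, node_world(n->parent), local);
        else
            memcpy(n->world, local, sizeof local);
        n->dirty = 0;
    }
    return n->world;
}

/* Caveat: whatever setter changes 'local' must also mark every
   descendant dirty, or the cached world matrices will go stale. */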

I guess that matrix operations in the driver are done on the CPU anyway, so you don't lose anything by doing them yourself. And 300 matrix multiplications is nothing: a 4x4 multiply is 64 multiplications and 48 additions, so that's well under 50,000 floating-point operations per frame. Modern CPUs can do so much more.

I actually don't believe that these "get" functions stall anything. I mean, they are just retrieving CPU-side state (occlusion queries excepted).

Obviously a local math class would be faster, but for 100 objects I would not worry about it.

Yes, but IMHO they probably must wait for pending operations to complete. If you issue several matrix operations, glGet will return the result of the last one, but that operation is not guaranteed to have completed at that point. I can imagine glGet effectively calling glFinish.

Originally posted by CrazyButcher:
well those multiplications were done before, just by GL?

Yep. I packed the matrices onto the stack with glPushMatrix() and glMultMatrix(), and then at the proper points extracted the modelview matrix for the current object.
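
That is, roughly this pattern (a sketch; cull_and_draw is a stand-in for the sphere test and render):

void draw_node(const GLfloat nodeMatrix[16])
{
    GLfloat mv[16];
    glPushMatrix();
    glMultMatrixf(nodeMatrix);             /* GL does the multiply... */
    glGetFloatv(GL_MODELVIEW_MATRIX, mv);  /* ...the costly round trip */
    cull_and_draw(mv);                     /* sphere test, then render */
    glPopMatrix();
}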

For performance, avoid ever asking GL questions. Strive to make all of your calls to GL ones that return 'void'.

If that means forgoing conveniences such as the matrix stack, and potentially having to keep shadow copies of certain bits of state that you sent to GL so you can recall them cheaply later, then that's what you have to do.
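
For example, a tiny shadow-state sketch (names here are hypothetical): instead of ever calling glIsEnabled or glGetIntegerv, the application just remembers what it last told GL, and as a bonus can skip redundant state changes.

#include <GL/gl.h>

typedef struct {
    int    blend_enabled;
    GLenum blend_src, blend_dst;
} gl_shadow;

static gl_shadow g_shadow;  /* must be seeded with GL's initial state */

void set_blend(int enable, GLenum src, GLenum dst)
{
    if (enable != g_shadow.blend_enabled) {
        if (enable) glEnable(GL_BLEND); else glDisable(GL_BLEND);
        g_shadow.blend_enabled = enable;
    }
    if (enable && (src != g_shadow.blend_src || dst != g_shadow.blend_dst)) {
        glBlendFunc(src, dst);
        g_shadow.blend_src = src;
        g_shadow.blend_dst = dst;
    }
}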

I imagine that most glslang hardware no longer has an actual matrix stack in hardware (if any implementation ever did have one). So asking for the matrix probably won’t hurt anything.

You could always profile it to see for yourself.
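
A crude way to do that (a sketch; clock() is coarse, so use a large iteration count, and try it both with an empty pipeline and mid-frame):

#include <stdio.h>
#include <time.h>
#include <GL/gl.h>

/* Time n modelview readbacks with the standard C clock. */
double time_gets(int n)
{
    GLfloat m[16];
    clock_t t0 = clock();
    for (int i = 0; i < n; ++i)
        glGetFloatv(GL_MODELVIEW_MATRIX, m);
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}

/* e.g. printf("100000 gets: %.3f s\n", time_gets(100000)); */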

If a display list is somehow queued in the pipeline, then getting the current matrix, or any other state, is going to have to call glFinish. Virtually any state can be stored in a display list.

Calling glFinish is not necessary for the driver. The driver could just keep track of the matrices in a large array on the CPU side. The problem is that glGet functions are simply slow.

If you are going to use your own matrix code, it had better be SSE-optimized.
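
For what it's worth, here is a sketch of a 4x4 multiply with SSE intrinsics (column-major, unaligned loads so it works on any float[16]). Whether it actually beats a plain scalar version for only 300 multiplies per frame is worth measuring.

#include <xmmintrin.h>

/* c = a * b, all column-major 4x4 matrices (OpenGL convention). */
void mat4_mul_sse(float *c, const float *a, const float *b)
{
    __m128 a0 = _mm_loadu_ps(a + 0);   /* the four columns of a */
    __m128 a1 = _mm_loadu_ps(a + 4);
    __m128 a2 = _mm_loadu_ps(a + 8);
    __m128 a3 = _mm_loadu_ps(a + 12);
    for (int i = 0; i < 4; ++i) {      /* one output column per pass */
        __m128 col = _mm_add_ps(
            _mm_add_ps(_mm_mul_ps(a0, _mm_set1_ps(b[i*4 + 0])),
                       _mm_mul_ps(a1, _mm_set1_ps(b[i*4 + 1]))),
            _mm_add_ps(_mm_mul_ps(a2, _mm_set1_ps(b[i*4 + 2])),
                       _mm_mul_ps(a3, _mm_set1_ps(b[i*4 + 3]))));
        _mm_storeu_ps(c + i*4, col);
    }
}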

Also, D3D has something called a pure device: the D3D layer doesn't store any state, and any calls to Get functions are illegal. It exists purely for better performance.

With my own transformation stack implemented, I got about a 5-7% FPS increase, but GPU idle dropped to almost zero and driver sleep time went up to 20-23 ms. It seems I still have problems somewhere.

Originally posted by Sergey K.:
With my own transformation stack implemented, I got about a 5-7% FPS increase, but GPU idle dropped to almost zero and driver sleep time went up to 20-23 ms.
It is possible that the GPU speed is now the bottleneck.

GPU idle dropped to almost zero
That's usually a good sign. Now you are rendering at full speed, without any nasty bottlenecks at the CPU or data-submission level.

Chances are that the driver sleep time will decrease when you give the CPU more to do. The driver only queues a few frames; after that it sleeps in SwapBuffers. That sleep time is freely available to do whatever you like on the CPU.

To further increase performance, you have to look for bottlenecks on the GPU side (geometry, fragment, …).

Originally posted by V-man:
Calling glFinish is not necessary for the driver. The driver could just keep track of the matrices in a large array on the CPU side.
Yes, it could run through the display list at compile time and store the resulting matrix state.
But I can't see a driver being optimized enough to do that just to handle the special case of someone calling glGetFloatv. It is more likely that it would call glFinish for every get.

Why is it that nobody (besides Mesa3D) has a nice, GL-compliant, open-source (MIT or BSD license) matrix stack implementation out there?
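
It isn't much code, to be fair. Here's a bare-bones client-side stack as a sketch (not GL-compliant: no overflow or underflow checking; mat4_mul is the assumed helper from above; GL requires at least 32 modelview stack entries):

#include <string.h>

#define STACK_DEPTH 32

typedef struct {
    float m[STACK_DEPTH][16];  /* column-major matrices */
    int   top;
} mat_stack;

void stack_init(mat_stack *s)  /* start with identity on top */
{
    memset(s->m[0], 0, sizeof s->m[0]);
    s->m[0][0] = s->m[0][5] = s->m[0][10] = s->m[0][15] = 1.0f;
    s->top = 0;
}

void stack_push(mat_stack *s)  /* like glPushMatrix: duplicate the top */
{
    memcpy(s->m[s->top + 1], s->m[s->top], sizeof s->m[0]);
    s->top++;
}

void stack_pop(mat_stack *s)   /* like glPopMatrix */
{
    s->top--;
}

void stack_mult(mat_stack *s, const float m[16])  /* like glMultMatrixf */
{
    float tmp[16];
    mat4_mul(tmp, s->m[s->top], m);
    memcpy(s->m[s->top], tmp, sizeof tmp);
}

const float *stack_top(const mat_stack *s) { return s->m[s->top]; }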

Originally posted by Sergey K.:
With my own transformation stack implemented, I got about a 5-7% FPS increase, but GPU idle dropped to almost zero and driver sleep time went up to 20-23 ms. It seems I still have problems somewhere.
Sorry for the off-topic question, but how did you measure these parameters?

Using NVAPI and NVPerfSDK from NVIDIA.

Look here: http://developer.nvidia.com/object/nvperfsdk_home.html