What is the best mix of CPU/GPU optimization?

I’m writing an application-specific 3D engine for PCs using relatively recent OpenGL hardware. The application requires scenes containing roughly a million polygons. I’ve written many OpenGL apps in the past but never one with an optimized rendering pipeline.

When I write a nice scene graph that organizes polygons by state (textures/materials) and then by depth, and then apply occlusion and/or frustum culling (LOD is next on my list), I can’t figure out a way to take advantage of display lists, which I’ve used in the past for a performance increase, except at the lowest level, meaning one DL for each polygon (which seems like a waste). Also, I need to transform the polygons on the CPU to build the optimized list or graph.

Because the application is interactive, every frame has a different viewpoint and potentially more or fewer polygons in the scene, so a vastly different set of polygons may make it through the pipeline for display on each frame.

Is there a suggested best practice for deciding how much optimization to do on the CPU vs. the OpenGL-compatible GPU?

Am I missing some feature that helps make better use of the graphics hardware when building an optimized list/graph of polygons to finally render?

Thank you.

In general, the idea is to make the GPU do as much work as possible, to a limit.

Try to build your meshes in strips as long as possible (using degenerate strips where needed). Occasionally, you can break this rule for culling purposes, but only if the strip is running for a significant distance.
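As a rough illustration of the degenerate-strip idea, here is a minimal C++ sketch that joins two indexed triangle strips into one by repeating indices, so the extra triangles have zero area. The function name `appendStrip` is made up for this example, and it assumes both strips use the same vertex buffer and are drawn as a single GL_TRIANGLE_STRIP.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Append strip `next` to `out`, linking them with degenerate (zero-area)
// triangles so both can be submitted as one GL_TRIANGLE_STRIP.
// Hypothetical helper, not part of any library.
void appendStrip(std::vector<std::uint32_t>& out,
                 const std::vector<std::uint32_t>& next)
{
    assert(next.size() >= 3);  // a strip needs at least one triangle
    if (!out.empty()) {
        out.push_back(out.back());    // repeat last index of the previous strip
        out.push_back(next.front());  // repeat first index of the next strip
        // Strip winding alternates per triangle, so the next strip must
        // begin at an even position to keep its original facing.
        if (out.size() % 2 != 0)
            out.push_back(next.front());
    }
    out.insert(out.end(), next.begin(), next.end());
}
```

The repeated indices cost a few extra entries in the index list but no extra draw call, which is the trade being suggested above.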

Do not cull polygons; cull objects. This means don’t waste time culling anything smaller than 500 to 1,000 polys or so. In general, try to avoid having independent meshes that small. Coarse culling is the key here; don’t bother with fine culling.
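To make "cull objects" concrete, a per-object test against the view frustum could look something like the sketch below. The `Plane` and `Sphere` types and the `sphereVisible` function are names invented for this example; it assumes the six frustum planes have unit-length normals pointing inward and are expressed in the same space as the sphere.

```cpp
#include <array>
#include <cassert>

// Plane with unit normal (nx, ny, nz); points with n.p + d >= 0 are inside.
struct Plane  { float nx, ny, nz, d; };
struct Sphere { float x, y, z, radius; };

// Coarse visibility test: one sphere per object, six plane checks.
// Returns false only when the sphere lies entirely outside some plane.
bool sphereVisible(const std::array<Plane, 6>& frustum, const Sphere& s)
{
    assert(s.radius >= 0.0f);
    for (const Plane& p : frustum) {
        const float dist = p.nx * s.x + p.ny * s.y + p.nz * s.z + p.d;
        if (dist < -s.radius)
            return false;  // completely behind this plane
    }
    return true;  // inside or intersecting (conservative)
}
```

The test is conservative: a sphere near a frustum corner can pass even though it is not visible, which is fine for coarse culling.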

State changes are your principal bottleneck. Whether it’s shaders or regular GL state, any state changes will impose a penalty on your performance.
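One way to keep state changes down is to sort the visible objects by state before drawing them. The sketch below assumes each draw is described by a hypothetical `DrawItem` holding a shader id and a texture id; the helper `countStateChanges` only exists to show the effect of the sort.

```cpp
#include <algorithm>
#include <cstddef>
#include <tuple>
#include <vector>

// Hypothetical per-object draw record; the ids stand in for GL object names.
struct DrawItem { unsigned shader; unsigned texture; unsigned mesh; };

// Group items so that equal shader/texture pairs end up adjacent.
void sortByState(std::vector<DrawItem>& items)
{
    std::stable_sort(items.begin(), items.end(),
        [](const DrawItem& a, const DrawItem& b) {
            return std::tie(a.shader, a.texture) < std::tie(b.shader, b.texture);
        });
}

// Count how many shader or texture switches a draw order would need,
// including the initial binds for the first item.
std::size_t countStateChanges(const std::vector<DrawItem>& items)
{
    std::size_t changes = 0;
    for (std::size_t i = 0; i < items.size(); ++i) {
        if (i == 0 || items[i].shader  != items[i - 1].shader)  ++changes;
        if (i == 0 || items[i].texture != items[i - 1].texture) ++changes;
    }
    return changes;
}
```

Sorting by shader first and texture second is an assumption about which switch is more expensive; the right key order depends on the hardware and driver.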

Second to state changes are batches of primitives. You’re going to be relying on VBOs, so try to avoid doing lots of buffer binds and gl*Pointer calls (glVertexPointer, glNormalPointer, and so on). Because the VBO itself isn’t really in use until you call a gl*Pointer function, those calls cost more than binds, so try to use indices to reduce the number of gl*Pointer calls you make. Try to render long strips, and use glDrawRangeElements where applicable. Consider glMultiDrawElements as well, though this is less important. Keep your vertex attribute data as small as possible, but remember that ATi has some pretty strict alignment requirements (components need to start on 4-byte boundaries, etc.).
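For the "small but aligned" point, a compact interleaved vertex might be laid out as in the following sketch. The `PackedVertex` struct and its field choices (byte normals, 16-bit texture coordinates) are assumptions made for illustration; the only rule taken from the post is that each attribute starts on a 4-byte boundary.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical interleaved vertex: 20 bytes instead of the 32 that
// three float attributes (3 + 3 + 2 floats) would take.
struct PackedVertex {
    float        position[3];  // offset 0,  12 bytes
    std::int8_t  normal[4];    // offset 12, xyz plus one padding byte
    std::int16_t texcoord[2];  // offset 16, two 16-bit coordinates
};

// Compile-time checks that every attribute starts on a 4-byte boundary.
static_assert(offsetof(PackedVertex, position) % 4 == 0, "position misaligned");
static_assert(offsetof(PackedVertex, normal)   % 4 == 0, "normal misaligned");
static_assert(offsetof(PackedVertex, texcoord) % 4 == 0, "texcoord misaligned");
static_assert(sizeof(PackedVertex) % 4 == 0, "stride must keep alignment");
```

The padding byte in `normal` is what keeps `texcoord` aligned; dropping it would save a byte in theory but break the 4-byte rule.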

Do not do software T&L, except maybe for matrix palette skinning (and probably not even then). Your graphics card will be perfectly happy to do T&L for you.

Thank you for the response. You’ve given me many good ideas to run with. Until your post, I had never realized just how many things can be built out of long strips.

I also didn’t realize until now that I can do coarse culling with bounding spheres using just a world object position and a bounding sphere radius without the need to use a software transform as part of the culling (which I was going to have to do with individual polygons).

Thank you!

You’ll definitely need to transform the center of your sphere in software for culling purposes. And if your matrix contains a scale, then the radius will have to be transformed on the CPU too.
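A minimal sketch of that per-object transform is shown below. It assumes an OpenGL-style column-major 4x4 affine matrix made of rotation, scale and translation only (no shear or projection); the names `Sphere` and `transformSphere` are invented for the example. The radius is scaled by the largest axis scale, which keeps the sphere conservative under non-uniform scaling.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

struct Sphere { float x, y, z, radius; };

// Transform a bounding sphere by a column-major 4x4 affine matrix `m`.
// Assumption: m is rotation * scale plus translation, with no shear.
Sphere transformSphere(const float m[16], const Sphere& s)
{
    Sphere out;
    // Center: one matrix * point multiply per object, not per vertex.
    out.x = m[0] * s.x + m[4] * s.y + m[8]  * s.z + m[12];
    out.y = m[1] * s.x + m[5] * s.y + m[9]  * s.z + m[13];
    out.z = m[2] * s.x + m[6] * s.y + m[10] * s.z + m[14];
    // Radius: the length of each basis column is that axis's scale factor.
    const float sx = std::sqrt(m[0] * m[0] + m[1] * m[1] + m[2]  * m[2]);
    const float sy = std::sqrt(m[4] * m[4] + m[5] * m[5] + m[6]  * m[6]);
    const float sz = std::sqrt(m[8] * m[8] + m[9] * m[9] + m[10] * m[10]);
    out.radius = s.radius * std::max({sx, sy, sz});
    return out;
}
```

This is a handful of multiplies per object, which is negligible next to transforming every polygon on the CPU.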

And, as has been said again and again, tri strips can either boost your performance or decrease it. Ideally, you would measure whether they are a real improvement in your case and then choose whether to use them.

In either case, optimizing for the pre-T&L cache will help as much as tri strips (actually, even more), if you’re not T&L limited (and if you are, my hat’s off to you). You can use both, of course, but since they both reduce vertex data upload bandwidth (among other things), you won’t benefit from both on that specific part.
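One common way to make vertex fetches friendlier to the pre-T&L cache is to store vertices in the order the index list first uses them, so the GPU reads the vertex buffer roughly front to back. The sketch below is only an illustration of that idea, not a tuned optimizer; `reorderByFirstUse` is a made-up name, and vertices that no index references are simply dropped.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Rewrites `indices` in place so vertices are numbered in order of first
// use, and returns newToOld: newToOld[n] is the original index of the
// vertex that should be stored in slot n of the reordered vertex buffer.
std::vector<std::uint32_t> reorderByFirstUse(std::vector<std::uint32_t>& indices,
                                             std::size_t vertexCount)
{
    const std::uint32_t unassigned = UINT32_MAX;
    std::vector<std::uint32_t> oldToNew(vertexCount, unassigned);
    std::vector<std::uint32_t> newToOld;
    newToOld.reserve(vertexCount);
    for (std::uint32_t& idx : indices) {
        assert(idx < vertexCount);
        if (oldToNew[idx] == unassigned) {
            // First time this vertex is referenced: give it the next slot.
            oldToNew[idx] = static_cast<std::uint32_t>(newToOld.size());
            newToOld.push_back(idx);
        }
        idx = oldToNew[idx];
    }
    return newToOld;
}
```

After the call, the caller would copy vertex `newToOld[n]` of the old buffer into slot `n` of the new one before uploading it.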