I am working on improving my existing application’s performance. It currently draws large CAD models in immediate mode using the fixed pipeline, and I would like to increase the frame rate, but I have had limited success. The bottleneck seems to be the millions of glVertex calls, sending all the vertices and normals one by one. I am trying to find: 1) the best method in terms of performance, 2) the best short-term option for improving performance with a minimum amount of changes from immediate mode (not sure I am ready to rewrite all the graphics). Here is what I have found so far. Does this sound accurate, and do you have any better suggestions? Thanks!
Display listing provided a decent frame rate increase for rotate/pan/zoom but it does not help for animation. I also noticed the application memory would go up significantly on machines with NVIDIA cards while the ones with ATI cards varied (some went up slightly and some seemed to duplicate the entire display list).
Using VBOs to push all the vertices and normals to the card in large blocks did not provide much of a performance gain over immediate mode. I only updated the buffers when the positions changed, but because of the current graphics structure I had to use many (~2000) VBOs for a model with a little over a million primitives. I seemed to get the best results by using a VBO/DrawElements only if the VBO would hold at least 500 primitives, and otherwise continuing with immediate mode drawing for that set of primitives. The performance increase was only about 5% though.
The other problem I had with VBOs is that I needed to create one normal per vertex where I am currently using flat shading in immediate mode. I only generate the extra normals on start up but it increases start up time and memory usage. Is there a way to use VBOs and only keep one normal per primitive? Maybe something like writing a vertex shader that generates the additional normals?
If your vertex data does not change too often, using display lists would be the logical choice and it would require the minimum amount of effort. You could also expect the most gain in terms of speed.
However, if your vertex data changes often and as you wrote your models are huge, the only way to go is using VBOs. I do not quite understand why you have to use so many of them for one model and I definitely do not understand why the speed increase is so little. Something must be wrong there.
Thanks for the reply. I do not necessarily need to use that many VBOs but I chose to for testing to prevent major modifications to the existing structure. I wanted to hack in a prototype to try to see what kind of performance gain I could get but obviously I need to spend more time doing it correctly. I downloaded a sample app of a similar drawing scale that got much better results with VBOs on my system so it must be my implementation.
Is it better to create one big VBO or several of them? Can I get around the one normal per vertex requirement and only store one normal per primitive?
Also, is it common for display lists to duplicate memory? Aren’t they supposed to be stored in graphics memory if there is enough available?
I don’t think there’s ever been that kind of guarantee. However, driver developers do weird juju with display lists and I’m sure some data gets moved over to the graphics card if possible on nVidia and ATI’s drivers.
What are you animating that requires changing the display lists? Are you deforming them with the CPU?
In theory it should be better to use fewer VBOs (less overhead) but the difference should not be too big. Your negligible speed increase compared to immediate mode cannot be explained by the large number of VBOs. Immediate mode is by far the slowest method to draw geometry, especially geometry with lots of vertices.
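To illustrate the "fewer VBOs" idea: a rough sketch, in plain C with no GL calls, of packing many small parts into one vertex array while remembering where each part starts. The names Part, DrawRange and merge_parts are made up for this example and are not from the original poster's code. It assumes non-indexed geometry with three floats per vertex.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical description of one small part of the model. */
typedef struct {
    const float *positions;  /* vertex_count * 3 floats (x, y, z) */
    size_t vertex_count;
} Part;

/* Where a part ended up inside the merged array. */
typedef struct {
    size_t first;  /* index of the part's first vertex */
    size_t count;  /* number of vertices in the part */
} DrawRange;

/* Packs all parts back to back into `merged` and records one range per
   part. Returns the total number of vertices written, or 0 if `merged`
   (capacity given in vertices) is too small or there is nothing to pack. */
size_t merge_parts(const Part *parts, size_t part_count,
                   float *merged, size_t capacity_vertices,
                   DrawRange *ranges)
{
    assert(parts != NULL && merged != NULL && ranges != NULL);
    size_t cursor = 0;
    for (size_t i = 0; i < part_count; ++i) {
        size_t n = parts[i].vertex_count;
        if (n > capacity_vertices - cursor)
            return 0;  /* would overflow the merged array */
        if (n > 0)
            memcpy(&merged[cursor * 3], parts[i].positions,
                   n * 3 * sizeof(float));
        ranges[i].first = cursor;
        ranges[i].count = n;
        cursor += n;
    }
    return cursor;
}
```

The merged array could then be uploaded once into a single buffer, and each part drawn by passing its first/count pair to glDrawArrays, so the buffer is bound once instead of ~2000 times. Whether this helps in practice depends on where the real bottleneck is.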
Display lists are the best because they are stored in a format (only the developer of the driver knows what) that is the best for the hardware. But they are also inflexible. To change any data you need to recreate the whole display list.
VBOs should also store all data in GPU memory, but you can map these memory buffers to make modifications. AFAIK drawing with VBOs should be only 10-15% slower than using display lists, and much, much faster than immediate mode.
I don’t think you can get around the one normal/vertex requirement with a VBO. You can use the same normal data for every vertex of the whole primitive to get flat shading. By primitive you mean every triangle or quad?
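For what it is worth, repeating the normal is a small loop. Below is a minimal sketch, assuming non-indexed triangles and three floats per normal; expand_face_normals is a made-up name for this example. It only does the CPU-side copy and makes no GL calls.

```c
#include <assert.h>
#include <stddef.h>

/* Copies each triangle's single normal to all three of its vertices.
   face_normals:   tri_count * 3 floats (one x, y, z per triangle)
   vertex_normals: output, tri_count * 9 floats (one x, y, z per vertex) */
void expand_face_normals(const float *face_normals, size_t tri_count,
                         float *vertex_normals)
{
    assert(face_normals != NULL && vertex_normals != NULL);
    for (size_t t = 0; t < tri_count; ++t) {
        const float *n = &face_normals[t * 3];
        for (size_t v = 0; v < 3; ++v) {
            float *out = &vertex_normals[(t * 3 + v) * 3];
            out[0] = n[0];
            out[1] = n[1];
            out[2] = n[2];
        }
    }
}
```

The output has the same layout as a per-vertex normal array, so it can go into a VBO as-is. The cost is the one the original poster mentioned: three times the normal data for triangles, plus the time to build it at start up.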
Display lists can in theory be faster than VBOs, especially when accounting for all the state changes that cannot be precomputed with a VBO.
But that depends on the GL implementation: NVIDIA's display lists are said to be the best, while this is not so much the case for ATI.
Hmm, yeah, I wrote my UI system using display lists because it’s very heavy on state change (constantly switching images or changing blend mode or scissor zone for scrolling panels) and it seems fairly fast.
Thanks for the responses. I support animation where the vertex data changes once per frame. Nonetheless, it looks like display lists would be a good start for an FPS improvement, and I could fall back to immediate mode when animating. I’ll revisit VBOs afterwards to see if I can improve animation performance.
When you get to that point, post back here to get some tips on making VBOs as fast as possible. There are ways to make them very fast, but the naive approach to using them is in my experience pretty slow … slower than client arrays.
But yeah, display lists on NVidia are usually a good high-end measure for performance baselining, and give you something to shoot for. I would recommend only putting batch data, not state changes, in the display list. That is, use them as a “fast draw call”.
If flipping to display lists from immediate mode doesn’t radically improve your draw times, then more than likely you’re not bound by batch submission or batch rendering but by pipeline state changes or something else, indicating that you would be better to focus your optimization efforts elsewhere first (always optimize the biggest bottleneck).
The cons of using display lists for your final solution are that compiling them is expensive (may cause frame breakage in your app), they take up space on the GPU (like VBOs), and apparently only NVidia really excels with display list performance (or that’s the scuttlebutt around here at least).