I’m trying to write a viewer for 3D scan data, but when I load a high-resolution scan, it’s very slow.
I’m using one VBO for the complete scan, one FBO with multiple render targets (2 at the moment), and a shader for lighting and picking. I don’t know where the bottleneck is. Or is it simply not possible to render 2 million triangles in real time without optimization techniques such as level-of-detail objects or special storage layouts in the VBO (fans/strips)?
My question is: what are the main bottlenecks, so that I can check whether my program has one of them? Or is there a special OpenGL state I can activate, something like glEnable(GL_FAST_RENDERING)?
I’ve read a bit about this and I think the problem could be:
no level-of-detail objects
too much data for one VBO
too many API calls for VBO usage
bad structure in the vertex array
Before I change too much code, I’d like your opinion on whether I’m on the right track to solve the problem, or whether I’m forgetting an important step.
When I reduce the size of the window there is no performance boost.
I know that 2 million triangles are a lot, but I can display the object in real time with programs like Geomagic. My test system has a GeForce GTX 275. A colleague has written a program with DirectX and XNA and has no problem displaying 2 million triangles.
For the 2 million scan my program needs about 500 ms.
I think either there has to be a huge bottleneck or XNA/Geomagic use some optimization techniques.
I want every VBO to be able to use different attribute arrays, for example one VBO uses only the vertex and color arrays and another also uses the normal array (I didn’t put that in the code, because that would make it too complex).
I’ve tried display lists, but the result wasn’t any better.
I’m using C#, but I don’t think that using a VM language like Java or C# is the problem, because my colleague also uses C#.
Why not change your shaders to simple pass-through shaders and see if that has any effect on performance? At the very least it will help a lot with isolating potential causes of performance bottlenecks here.
Agreed that the EnableClientState/DisableClientState calls should go outside the for loop, but I don’t think they’re going to be a huge issue here (although that largely depends on how many times you’re going through the loop).
I’m a little suspicious about that “if” in the fragment shader too; that could be moved to the vertex shader or possibly removed altogether which wouldn’t hurt either way. Your IAmbient calculation could also move to the vertex shader or preferably become a uniform - that’s unnecessary overhead.
Yes, definitely. The vertex pointer and so on stay attached to the VBO that was bound when they were specified. Doing it the way you do results in unknown behaviour:
The array’s buffer binding is set when the array pointer is specified. Using the vertex array as an example, this is when VertexPointer is called. At that time, the current array buffer binding is used for the vertex array. The current array buffer binding is set by calling BindBufferARB with a <target> of ARRAY_BUFFER_ARB. Changing the current array buffer binding does not affect the bindings used by already established arrays.
BindBufferARB(ARRAY_BUFFER_ARB, 1);
VertexPointer(…); // vertex array data points to buffer 1
BindBufferARB(ARRAY_BUFFER_ARB, 2); // vertex array data still points to buffer 1
My suggestion is to actually disable the shaders and see what happens.
Another test I would try is to use a small mesh for test purposes, like the Stanford Bunny or Dragon. Another thing is to disable vertical sync in order to let OpenGL render as many frames as it can. Even with such a low frame rate, I would try disabling vsync to be sure it is not a bottleneck.
When I disable the shader, the program runs a little bit faster, but not fast enough. The scene still stutters.
For testing I’m using some smaller meshes (40,000-400,000 triangles), and with those the program is fast enough.
Disabling VSync also has no effect.
You’re sending full floats for everything, including RGBA twice!
There is no indication you’re using indices in any way. You’re just using DrawArrays.
If this is a poly soup model (triangles not tristrip) you’re even worse off because primitive type is wasteful, in your case it’s data dependent (that’s a bad thing because some data will be really slow in your software and faster in better software).
Tristrip will get you 3X performance, indexes exploiting cache coherence can easily double that or more on big meshes. Not sending a whole load of data per vertex you don’t need as full float will give you a further possible bandwidth boost but it depends at that time where you’re bottlenecked.
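To make the 3X figure concrete, here is a rough sketch of the vertex counts involved. It assumes the idealized case where the whole mesh fits into one long strip, which real scan data will not; the function name is made up for illustration.

```python
def vertices_submitted(num_triangles: int) -> dict:
    """Vertices the GPU has to process for each primitive layout.

    Idealized: assumes the whole mesh is a single strip. Real meshes need
    several strips (or degenerate triangles), so the gain is smaller.
    """
    return {
        "triangle_list": 3 * num_triangles,  # GL_TRIANGLES via DrawArrays
        "single_strip": num_triangles + 2,   # GL_TRIANGLE_STRIP, best case
    }


counts = vertices_submitted(2_000_000)
ratio = counts["triangle_list"] / counts["single_strip"]
print(counts, round(ratio, 2))
```

For the 2 million triangle scan that is 6,000,000 vertices as a plain triangle list against roughly 2,000,002 in the best-case strip, which is where the factor of about 3 comes from.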
Currently you have very basic “get it on the screen” code. You can easily get 3x - 6X (or more) the performance through indexed cache coherent tristrips and data packing improvements.
Index it, rationalize the verts and send it through nvtristrip, then re-sort and re-index for VBO access order and render it with DrawElements. Then clean up your packed vertex data to reduce the in-memory vertex size.
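The "index it" step could look roughly like the sketch below. It is written in Python only to keep it short; the function name and the tuple-based vertex format are assumptions, and it merges vertices only when they match exactly. Scan data with slightly differing duplicates would need quantization or a tolerance first.

```python
def build_index_buffer(soup):
    """Turn a triangle soup into (unique_vertices, indices).

    soup: flat list of hashable vertex tuples, three per triangle.
    Vertices are merged only if they compare exactly equal.
    """
    unique = []     # becomes the vertex buffer
    index_of = {}   # vertex tuple -> position in `unique`
    indices = []    # becomes the index buffer for DrawElements
    for vertex in soup:
        i = index_of.get(vertex)
        if i is None:
            i = len(unique)
            index_of[vertex] = i
            unique.append(vertex)
        indices.append(i)
    return unique, indices


# Two triangles forming a quad; they share one edge (two vertices).
soup = [
    (0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0),
    (0.0, 0.0, 0.0), (1.0, 1.0, 0.0), (0.0, 1.0, 0.0),
]
vertices, indices = build_index_buffer(soup)
print(len(vertices), indices)
```

The six soup vertices collapse to four unique ones, and the index list still describes the same two triangles.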
Most of the suggestions in this thread miss the core problem with your graphics code. You WILL get major improvements implementing my suggestions. Fiddling around elsewhere will have only a marginal impact on performance. Unfortunately restructuring to incorporate an indexed tristriper takes a bit of work and incurs a startup cost for optimizing the data.
P.S. it has been observed that simply sorting indexed triangles to exploit cache residence can have a massive performance boost and may even be faster than tristripping due to fewer ordering constraints. You could index, sort for adjacency (vertex order cache residency) and sequencing and draw as indexed triangles. This is all the more likely to work for you because your per-vertex data payload is massive compared to the index overhead.
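The effect of triangle order on cache residence can be estimated offline. The sketch below counts vertex transforms per triangle (often called ACMR) under an assumed 16-entry FIFO cache; the cache size and the FIFO policy are modelling assumptions, not a description of any particular GPU, and the grid mesh is just test data.

```python
import random
from collections import deque


def acmr(indices, cache_size=16):
    """Vertex-cache misses per triangle under a simple FIFO cache model."""
    cache = deque(maxlen=cache_size)  # the oldest entry drops out when full
    misses = 0
    for i in indices:
        if i not in cache:
            misses += 1
            cache.append(i)
    return misses / (len(indices) // 3)


def grid_triangles(quads_per_side):
    """Indexed triangles of a square grid, emitted row by row."""
    row = quads_per_side + 1  # vertices per grid row
    tris = []
    for y in range(quads_per_side):
        for x in range(quads_per_side):
            a = y * row + x
            b, c, d = a + 1, a + row, a + row + 1
            tris += [(a, b, c), (b, d, c)]
    return tris


def flatten(tris):
    return [i for t in tris for i in t]


ordered = grid_triangles(32)
shuffled = ordered[:]
random.Random(0).shuffle(shuffled)  # same triangles, incoherent order

acmr_ordered = acmr(flatten(ordered))
acmr_shuffled = acmr(flatten(shuffled))
print(round(acmr_ordered, 2), round(acmr_shuffled, 2))
```

Both index lists draw exactly the same mesh. Under this model the row-by-row order needs about one vertex transform per triangle, while the shuffled order approaches three, the same cost as non-indexed triangles.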
YMMV from platform to platform though depending on implementation, indexed tri strips may still be preferred on some targets.
P.P.S. “BeginMode.Triangles” suggests you’re drawing this as triangles (I first assumed this was data driven but it’s probably a C# wrapper definition of GL_TRIANGLES), so this confirms that the code is doing at least 3X the work it needs to, and a lot worse as described above. All advice above still stands; this just reinforces the observation w.r.t. primitive type.
The problem is your primitive type with DrawArrays rather than DrawElements with cache coherent indices and all the implications that has for the way you need to structure your data to make the latter work. You will need to dispatch with DrawElements and tristrip with a cache coherent tri-stripper like nvtristrip. Secondarily your per vertex payload is unnecessarily large storing 2 vec4 floats purely for color. i.e. 32 bytes per vertex just for color.
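A quick back-of-the-envelope check of the payload point. Only the two float vec4 colors (32 bytes) come from the code discussed here; the position and normal sizes below are assumptions for illustration, as is the packed layout using unsigned bytes for color.

```python
FLOAT, UBYTE = 4, 1  # sizes in bytes


def vertex_size(layout):
    """Bytes per vertex for a {attribute: (component_count, component_size)} layout."""
    return sum(count * size for count, size in layout.values())


# Assumed layout: position and normal are guesses, the two colors are from the thread.
full_float = {
    "position": (3, FLOAT),
    "normal": (3, FLOAT),
    "color": (4, FLOAT),
    "secondary_color": (4, FLOAT),
}
# Same attributes with both colors packed as 4 unsigned bytes each.
packed = dict(full_float, color=(4, UBYTE), secondary_color=(4, UBYTE))

num_vertices = 3 * 2_000_000  # non-indexed triangle list
for name, layout in (("full float", full_float), ("packed", packed)):
    size = vertex_size(layout)
    print(name, size, "bytes/vertex,", size * num_vertices / 1e6, "MB total")
```

Under these assumptions the vertex shrinks from 56 to 32 bytes, and the non-indexed buffer from 336 MB to 192 MB, before any savings from indexing.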
VBOs are simply a graphics storage mechanism that hints at graphics residence and/or non volatility to the driver after dispatch. Useful and important, but I think OK in your code.
I have now implemented an algorithm that deletes duplicate vertices and creates an index buffer, and the program runs really fast. But now my picking doesn’t work anymore. I’m using color picking, so I have to render every triangle in a unique color. I used the secondary color for this information, but now I can only hold one color per vertex. So how can I render every triangle in a unique color with an index array?
I could do it in the shader with some bit shifting, or with a 1D texture that holds the index encoded as an RGBA value.
Or is there a better way?
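For the bit-shifting variant, the encoding itself is simple; here is a minimal sketch of packing a triangle index into four 8-bit channels and recovering it from a read-back pixel. The function names are made up. One option worth checking: GLSL 1.50 and later expose gl_PrimitiveID in the fragment shader, which would supply the per-triangle index without any per-vertex color, but whether that is available in your setup is an assumption on my part.

```python
def id_to_rgba(triangle_id):
    """Pack a triangle index into four 8-bit channels (R is the high byte)."""
    if not 0 <= triangle_id < 2**32:
        raise ValueError("triangle id does not fit into RGBA8")
    return (
        (triangle_id >> 24) & 0xFF,
        (triangle_id >> 16) & 0xFF,
        (triangle_id >> 8) & 0xFF,
        triangle_id & 0xFF,
    )


def rgba_to_id(r, g, b, a):
    """Inverse of id_to_rgba, applied to the pixel read back under the cursor."""
    return (r << 24) | (g << 16) | (b << 8) | a


last_triangle = 2_000_000 - 1
rgba = id_to_rgba(last_triangle)
print(rgba, rgba_to_id(*rgba))
```

This only round-trips if the picking target is an 8-bit-per-channel RGBA buffer and nothing alters the value on the way, so blending, lighting and multisampling would have to be off for the picking output.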