Triangle strip VS list, and other

I have been testing different primitive drawing methods with an 18000-polygon mesh, which gives me about 2000 triangle strips. I tried two approaches: one is a single glDrawElements(GL_TRIANGLES…) call that draws all the polygons, the other is a loop of 2000 glDrawElements(GL_TRIANGLE_STRIP…) calls.
I disabled vertical sync on my NV5700 so the frame rate is not limited by my monitor's refresh rate. The triangle list version got about 780fps, and the triangle strip version only got 70fps. Is the cost of calling the GL API really that bad?

I also found another interesting result. I decided to use a triangle list to draw the primitives. Since I already had all the strip vertex indices, I had the idea of building the triangle list in the polygon order of the strips. Neighbouring triangles in a strip share two vertices, so I thought the list might be more efficient this way because some vertices could be cached. But the result was not what I expected: the rebuilt triangle list only got 560fps, slower than the triangle list dumped directly from the original model data. I wonder why this happens.
It is quite interesting to benchmark these methods. I hope someone can give me more advice.

The reason for your performance drain is that on a 1GHz CPU you can issue around 500 draw calls without stalling the GPU, and you have around 2000 draw calls, which is pretty bad. Batch as much as you can into fewer draw calls.

Zodiac, this sounds like it was taken from a statement about CPU limitations in D3D. OpenGL works differently.

Nil_z, there are two ways to remove the for-loop. Either use glMultiDrawElements or, even better, use glPrimitiveRestartIndexNV to specify one unused index as a restart command for the next strip, and put all indices in one list for a single glDrawElements call. If you know your range of indices, use glDrawRangeElements.
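
As a rough sketch of the data preparation for both options (not the poster's code), the helpers below pack a set of strips either into the count/offset arrays that glMultiDrawElements expects or into one index list separated by a restart index. The names MultiDrawArgs, packForMultiDraw and packWithRestart are hypothetical, 32-bit indices are assumed, and the GL calls appear only in comments because they need a live context.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Strip = std::vector<std::uint32_t>;

// Hypothetical container for the arguments of one multi-draw call.
struct MultiDrawArgs {
    std::vector<std::uint32_t> indices;    // all strips stored back to back
    std::vector<int> counts;               // number of indices in each strip
    std::vector<std::size_t> byteOffsets;  // byte offset of each strip in `indices`
};

// Option 1: keep the strips separate but submit them with one call, e.g.
// glMultiDrawElements(GL_TRIANGLE_STRIP, counts, GL_UNSIGNED_INT, pointers, n),
// where each pointer is derived from the matching byte offset.
MultiDrawArgs packForMultiDraw(const std::vector<Strip>& strips) {
    MultiDrawArgs args;
    for (const Strip& s : strips) {
        args.byteOffsets.push_back(args.indices.size() * sizeof(std::uint32_t));
        args.counts.push_back(static_cast<int>(s.size()));
        args.indices.insert(args.indices.end(), s.begin(), s.end());
    }
    return args;
}

// Option 2: one index list with a restart marker between strips, drawn with a
// single glDrawElements(GL_TRIANGLE_STRIP, ...) after
// glPrimitiveRestartIndexNV(restartIndex) and
// glEnableClientState(GL_PRIMITIVE_RESTART_NV).
std::vector<std::uint32_t> packWithRestart(const std::vector<Strip>& strips,
                                           std::uint32_t restartIndex) {
    std::vector<std::uint32_t> out;
    for (const Strip& s : strips) {
        if (!out.empty()) out.push_back(restartIndex);
        for (std::uint32_t i : s) {
            assert(i != restartIndex);  // the marker must not be a real vertex index
            out.push_back(i);
        }
    }
    return out;
}
```

Either way the 2000-iteration loop collapses into one API call; which of the two is faster on a given driver is something only a benchmark can tell.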

One small additional remark: by specifying the same index twice or three times (which results in empty triangles) you can also glue different strips together without using the proposed glPrimitiveRestartIndexNV extension, which might not be available on graphics cards other than NVidia's.

You only have to take care to insert the correct number of empty triangles, so that the attached strip keeps the correct winding direction, which is important for culling.
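
A minimal sketch of that gluing step, assuming 32-bit indices and strips of at least three indices each; the function name stitchStrips is hypothetical. The last index of the strip built so far and the first index of the next strip are each repeated once, and one extra repeat is added when the strip built so far has an odd length, so that the attached strip always starts at an even position and keeps its winding.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: join several triangle strips into one strip by
// inserting degenerate (empty) triangles between them.
std::vector<std::uint32_t> stitchStrips(
        const std::vector<std::vector<std::uint32_t>>& strips) {
    std::vector<std::uint32_t> out;
    for (const auto& s : strips) {
        assert(s.size() >= 3);  // each input must contain at least one triangle
        if (!out.empty()) {
            const bool oddLength = out.size() % 2 == 1;
            out.push_back(out.back());  // repeat last index of previous strip
            out.push_back(s.front());   // repeat first index of next strip
            // With an odd length so far, the next strip would start on an odd
            // position and flip its winding; one more repeat fixes the parity.
            if (oddLength) out.push_back(s.front());
        }
        out.insert(out.end(), s.begin(), s.end());
    }
    return out;
}
```

For example, joining {0, 1, 2, 3} and {4, 5, 6} gives 0 1 2 3 3 4 4 5 6: the triangles (2,3,3), (3,3,4), (3,4,4) and (4,4,5) have zero area and are discarded by the hardware, and (4,5,6) starts at position 6, an even position.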


I think the difference lies elsewhere. There are mainly two types of cache on a GPU. The first one (pre-transformation) caches VBO memory, the second one (post-transformation) caches transformed vertex data. Which cache helps you more depends on the order of the vertices and the complexity of the vertex program.

For example: you create a plane with n*m vertices. Let the order of vertices in the VBO be n rows, each containing m vertices (array[m][n]). If you draw the rows with strips, the first cache is going to be your friend, because the GPU reads more than one vertex per memory access, and it is very likely that the rest of the data read in that block will be used soon. The other case is drawing the columns with strips. The differences between vertex indices are going to be huge in this case (values around m), and you'll need many more memory accesses to draw the plane. A sphere is a typical object where the second case is easier, but slower.
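
To make the row versus column argument concrete, here is a toy model, not a description of any real GPU: vertices are stored row-major, memory is assumed to be fetched in blocks of a fixed number of vertices, and the code counts how many distinct blocks one strip touches. The names rowStrip, columnStrip and blocksTouched, as well as the block size, are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// Grid vertices are stored row-major: index = row * cols + col.

// Strip running along a row, between `row` and `row + 1`.
std::vector<std::uint32_t> rowStrip(std::uint32_t row, std::uint32_t cols) {
    std::vector<std::uint32_t> s;
    for (std::uint32_t c = 0; c < cols; ++c) {
        s.push_back(row * cols + c);
        s.push_back((row + 1) * cols + c);
    }
    return s;
}

// Strip running down a column, between `col` and `col + 1`.
std::vector<std::uint32_t> columnStrip(std::uint32_t col, std::uint32_t rows,
                                       std::uint32_t cols) {
    std::vector<std::uint32_t> s;
    for (std::uint32_t r = 0; r < rows; ++r) {
        s.push_back(r * cols + col);
        s.push_back(r * cols + col + 1);
    }
    return s;
}

// Count distinct memory blocks touched, assuming (hypothetically) that one
// memory access fetches `verticesPerBlock` consecutive vertices.
std::size_t blocksTouched(const std::vector<std::uint32_t>& indices,
                          std::uint32_t verticesPerBlock) {
    assert(verticesPerBlock > 0);
    std::set<std::uint32_t> blocks;
    for (std::uint32_t i : indices) blocks.insert(i / verticesPerBlock);
    return blocks.size();
}
```

On an 8x8 grid with 4 vertices per block, a row strip of 16 indices touches 4 blocks while a column strip of 16 indices touches 8, which is the effect described above under this simplified model.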

Of course an n*m plane is a very simple example. If you use complex vertex programs, more will depend on the post-transformation cache. If you don't, it's better to optimize for the pre-transformation cache in most cases.
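
One way to compare index orderings for the post-transformation cache offline is to simulate it. The sketch below assumes a simple FIFO cache with a configurable number of entries; the real cache size and replacement policy of a given GPU are not stated in this thread, so both are assumptions, and the function name countCacheMisses is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

// Count how many vertices would need to be transformed for an index stream,
// assuming a FIFO post-transformation cache holding `cacheSize` vertices.
std::size_t countCacheMisses(const std::vector<std::uint32_t>& indices,
                             std::size_t cacheSize) {
    assert(cacheSize > 0);
    std::deque<std::uint32_t> fifo;
    std::size_t misses = 0;
    for (std::uint32_t idx : indices) {
        // Hit: the transformed vertex is still in the cache, nothing to do.
        if (std::find(fifo.begin(), fifo.end(), idx) != fifo.end()) continue;
        // Miss: the vertex is transformed and pushed, evicting the oldest entry.
        ++misses;
        fifo.push_back(idx);
        if (fifo.size() > cacheSize) fifo.pop_front();
    }
    return misses;
}
```

Dividing the miss count by the number of triangles gives an average number of transforms per triangle, which can be compared between the original list, the list rebuilt from strips, and the stitched strip without running the GPU benchmark.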

I have got another interesting result. Stitching all the strips together with the NvTriStrip lib, I got one big strip about half the length of the triangle list (29000 vs 56000 indices), but the fps is still lower than the triangle list version (660 vs 760). I can't understand why.
The PrimitiveRestartIndexNV method gives me only about 180fps. I am completely lost now…

Jimmiwalker2, I don't know what you are talking about. As far as I know, memory on the GPU is uncached, unlike on the CPU. There is a cache for post-transformed vertices, but you can only make use of it if you use indexed primitives. The complexity of a vertex program has nothing to do with how efficient the cache is.


Have you tried different cache sizes in nvtristrip?

Have you used the remapindices function in nvtristrip?

It might be worth uploading the benchmark so we can try it out. If you have nowhere to upload it, email it to me; my address is in my profile.

What hardware are you using?

To Ysaneya: please read the other thread about the vertex cache. You will find out that there is another cache, the pre-cache, which mirrors the memory.