I achieve this figure (on a GF2 GTS) as well, with simpler, very long triangle strips compiled into a display list (instead of interleaved VAR). The numbers agree with the 250/200 core clock difference between the GTS and the Ultra.
But my GF2 GTS can do it with VAR. It’s still strange that using the same rendering mode (VAR), my GF2 (200 core) renders at >24MT/s, and your GF2U (250 core) only at 22MT/s…
What does the CPU have to do with it? With VAR or a display list, it’s being ‘read’ from video memory, not over AGP, when it’s drawn.
Well, the CPU still spends a lot of time in the driver with indexed primitives. I don’t know why, but it does. If you use glDrawArrays, this gets much, much less, but then you can’t use the vertex cache. But you could try using glDrawArrays for the independent triangles with non-shared verts…
But my original benchmarks were using display lists, for exactly that reason. Matt’s dictum “use VAR” moved me off.
There are two sides to this. First, Matt or Cass once said that for display lists they store geometry in AGP memory, not in video memory. Second, display lists can usually surpass VAR because the driver can do other optimizations as well. But VAR can be faster because you can store data in video memory, and it’s more flexible because you can change data at runtime, and with VAR you (almost) always know what you get. If you have a large amount of geometry, it may stay in system memory with display lists, and that could be very inefficient.
What is the expected vertex rate?
I don’t see why 4., 5. and 6. should be slower than 3. This actually shouldn’t have anything to do with the vertex cache, since even with 3., the geometry engine could handle the number of triangles transformed without a vertex cache (at least for strips). So it’s something else at work here. And I’m not sure about being setup limited - this doesn’t sound logical to me.
How did you do the vertex cache simulation?
There’s code (it’s actually very simple) for that in the NvTriStrip lib at developer.nvidia.com. You just put the indices onto a fifo and for every index you send, you only count it if it’s not already in the cache.
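For what it’s worth, here is a rough sketch of that counting in C. This is not the NvTriStrip code, just an illustration of the idea described above: the function name `count_cache_misses`, the `MAX_CACHE` limit and the cache size you pass in are all assumptions of mine, and the real cache depth depends on the chip.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CACHE 64  /* arbitrary upper bound for this sketch, not a hardware value */

/* Simulate a FIFO post-transform vertex cache and return how many
 * indices miss it, i.e. how many vertices would have to be transformed.
 * A hit does not move the entry (plain FIFO, not LRU). */
static size_t count_cache_misses(const unsigned *indices, size_t count,
                                 size_t cache_size)
{
    unsigned fifo[MAX_CACHE];
    size_t used = 0;    /* number of valid entries in the cache */
    size_t head = 0;    /* slot that the next miss will overwrite */
    size_t misses = 0;

    assert(cache_size > 0 && cache_size <= MAX_CACHE);

    for (size_t i = 0; i < count; ++i) {
        int hit = 0;
        for (size_t j = 0; j < used; ++j) {
            if (fifo[j] == indices[i]) {
                hit = 1;
                break;
            }
        }
        if (!hit) {
            /* Only count the index if it is not already in the cache,
             * then push it, evicting the oldest entry once full. */
            ++misses;
            fifo[head] = indices[i];
            head = (head + 1) % cache_size;
            if (used < cache_size)
                ++used;
        }
    }
    return misses;
}
```

Feed it your index list and divide the result by the triangle count to get transformed verts per triangle. For a quad drawn as two indexed triangles (0,1,2, 2,1,3) you get 4 misses for 6 indices.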
Do you see any disadvantage of using display list instead of VAR?
I mean, if the driver implementation is good (and I think it is), then display list allows for no data to be sent over AGP at all (except the small glCallList token).
As said above, no dynamic meshes, bad for very large meshes, no control. But if you have a medium number of very small meshes with state changes in between, you can’t beat display lists.
730 MHz P3 for the GF2 Ultra
With AGP 2x, I guess… Well, that might explain the slowdown for the larger meshes with independent triangles on the GF2 - lots of index traffic!
1. indep tris, non-shared verts, 6x32 -> 35:12 134:44
2. VAR indep tris, non shared verts, 64x64 GF2U->32:11 GF4->47:16
I don’t understand - why is the GF4 suddenly so slow for independent tris? Are you sure you really tested non-shared vertices the first time round?
And no, on a PIV I don’t think that index traffic can be a problem…
Michael