136 M verts/sec on GeForce4 Ti ?

I achieve this figure (on a GF2 GTS) too, with simpler very-long triangle strips compiled into a display list (instead of interleaved VAR). The numbers agree with the 250/200 core clock difference between the Ultra and the GTS.

But my GF2 GTS can do it with VAR. It’s still strange that using the same rendering mode (VAR), my GF2 (200 core) renders at >24MT/s, and your GF2U (250 core) only at 22MT/s…

What does the CPU have to do with it? With VAR or a display list, the data is being ‘read’ from video memory, not over AGP, when it’s drawn.

Well, the CPU still spends a lot of time in the driver with indexed primitives. I don’t know why, but it does. If you use glDrawArrays, this drops a great deal, but then you can’t use the vertex cache. Still, you could try glDrawArrays for the independent triangles with non-shared verts - something like the sketch below…
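
Roughly this (untested sketch, the function and parameter names are made up):

    /* Independent triangles with non-shared verts, drawn without indices,
       so the driver never has to walk an index list on the CPU side. */
    #include <GL/gl.h>

    void drawIndependentTris(const float *verts, int numTris)
    {
        glEnableClientState(GL_VERTEX_ARRAY);
        glVertexPointer(3, GL_FLOAT, 0, verts);      /* 3 floats per vertex, tightly packed */
        glDrawArrays(GL_TRIANGLES, 0, numTris * 3);  /* 3 unique verts per triangle */
        glDisableClientState(GL_VERTEX_ARRAY);
    }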

But my original benchmarks were using display lists, for exactly that reason. Matt’s dictum “use VAR” moved me off.

There are two sides to this. First, Matt or Cass once said that they are not storing geometry in video memory, but in AGP for display lists. Second, display lists can usually surpass VAR because the driver can do other optimizations as well. But VAR can be faster because you can store data in video memory, and it’s more flexible because you can change data during runtime, and with VAR you (almost) always know what you get. If you have a large amount of geometry, this may stay in system memory with display lists, and could be very inefficient.
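
For reference, VAR setup boils down to roughly this (sketch only - the entry points come from wglGetProcAddress, and how the priority parameter maps to video vs. AGP memory is driver dependent):

    #include <windows.h>
    #include <GL/gl.h>

    #define GL_VERTEX_ARRAY_RANGE_NV 0x851D

    typedef void * (APIENTRY *PFNWGLALLOCATEMEMORYNVPROC)(GLsizei size, GLfloat readFreq, GLfloat writeFreq, GLfloat priority);
    typedef void   (APIENTRY *PFNGLVERTEXARRAYRANGENVPROC)(GLsizei length, const void *pointer);

    /* Allocate a vertex array range; priority near 1.0 asks for video memory,
       around 0.5 for AGP memory. Returns the pointer to copy vertex data into. */
    void *setupVAR(GLsizei bytes)
    {
        PFNWGLALLOCATEMEMORYNVPROC  wglAllocateMemoryNV  =
            (PFNWGLALLOCATEMEMORYNVPROC) wglGetProcAddress("wglAllocateMemoryNV");
        PFNGLVERTEXARRAYRANGENVPROC glVertexArrayRangeNV =
            (PFNGLVERTEXARRAYRANGENVPROC)wglGetProcAddress("glVertexArrayRangeNV");

        void *mem = wglAllocateMemoryNV(bytes, 0.0f, 0.0f, 1.0f);
        if (!mem)
            return 0;
        glVertexArrayRangeNV(bytes, mem);               /* tell GL where the range lives   */
        glEnableClientState(GL_VERTEX_ARRAY_RANGE_NV);  /* and enable pulling from it      */
        return mem;
    }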

What is the expected vertex rate?

I don’t see why 4., 5. and 6. should be slower than 3. This actually shouldn’t have anything to do with the vertex cache, since even with 3., the geometry engine could handle the number of triangles transformed without a vertex cache (at least for strips). So it’s something else at work here. And I’m not sure about being setup limited - this doesn’t sound logical to me.

How did you do the vertex cache simulation?

There’s code for that (it’s actually very simple) in the NvTriStrip lib at developer.nvidia.com. You just push the indices onto a FIFO, and for every index you send, you only count it if it’s not already in the cache - roughly like the sketch below.
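
In C it comes down to something like this (simplified sketch; the real cache size is chip dependent, often quoted as 16 entries on GF2 and 24 on GF3/GF4):

    /* Count how many vertices actually get transformed: an index only
       costs a transform if it is not already in the FIFO cache. */
    int countTransforms(const unsigned short *indices, int numIndices, int cacheSize)
    {
        int cache[64];          /* assumes cacheSize <= 64 */
        int i, j, head = 0, transforms = 0;

        for (i = 0; i < 64; ++i)
            cache[i] = -1;

        for (i = 0; i < numIndices; ++i) {
            int hit = 0;
            for (j = 0; j < cacheSize; ++j)
                if (cache[j] == indices[i]) { hit = 1; break; }
            if (!hit) {
                ++transforms;                 /* cache miss -> vertex is transformed      */
                cache[head] = indices[i];     /* push onto the FIFO, evicting the oldest  */
                head = (head + 1) % cacheSize;
            }
        }
        return transforms;
    }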

Do you see any disadvantage of using display list instead of VAR?
I mean, if the driver implementation is good (and I think it is), then a display list means no data has to be sent over AGP at all (except the small glCallList token).

As said above, no dynamic meshes, bad for very large meshes, no control. But if you have a medium number of very small meshes with state changes in between, you can’t beat display lists.
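
That is, the whole thing - state changes included - gets replayed with a single glCallList. A rough sketch (the texture id and geometry are just placeholders):

    #include <GL/gl.h>

    GLuint buildMeshList(void)
    {
        GLuint list = glGenLists(1);
        glNewList(list, GL_COMPILE);
            glBindTexture(GL_TEXTURE_2D, 1);   /* placeholder texture id          */
            glBegin(GL_TRIANGLE_STRIP);        /* tiny placeholder strip (2 tris) */
                glVertex3f(0.0f, 0.0f, 0.0f);
                glVertex3f(1.0f, 0.0f, 0.0f);
                glVertex3f(0.0f, 1.0f, 0.0f);
                glVertex3f(1.0f, 1.0f, 0.0f);
            glEnd();
        glEndList();
        return list;
    }

    /* per frame: glCallList(list); - only the call token crosses AGP */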

730 MHz P3 for the GF2 Ultra

with AGP 2x, I guess… Well, that might explain the slowdown for the larger meshes with independent triangles on the GF2 - lots of index traffic!

1. indep tris, non-shared verts, 6x32 -> 35:12 134:44
2. VAR indep tris, non shared verts, 64x64 GF2U->32:11 GF4->47:16

I don’t understand - why is the GF4 suddenly so slow for independent tris? Are you sure you really tested non-shared vertices the first time round?

And no, on a PIV I don’t think that index traffic can be a problem…

Michael


But my GF2 GTS can do it with VAR. It’s still strange that using the same rendering mode (VAR), my GF2 (200 core) renders at >24MT/s, and your GF2U (250 core) only at 22MT/s…

At least it renders at 31 MT/s with a display list…

I wonder what happens if I plug the DrawElements in a display list. Will it ‘remember’ the “vertex repeats” and utilize the vertex cache when I call the list?


There are two sides to this. First, Matt or Cass once said that they are not storing geometry in video memory, but in AGP for display lists

I don’t think this can be true anymore.
I don’t think you can push 47M verts/sec over AGP, even if the verts are only 2 floats, and the AGP is x4

But VAR can be faster because you can store data in video memory,
Again, I think display lists are also stored in video memory

and it’s more flexible because you can change data during runtime,
This is true

If you have a large amount of geometry, this may stay in system memory with display lists, and could be very inefficient.
That’s one reason the GF4 has 128MB :wink: It’s not only for textures…

And I’m not sure about being setup limited - this doesn’t sound logical to me.

Why not?

with AGP 2x, I guess…
No, sorry … :wink:
It reports successfully setting AGP 4x.
It’s a VIA Apollo chipset.

1. indep tris, non-shared verts, 6x32 -> 35:12 134:44
2. VAR indep tris, non shared verts, 64x64 GF2U->32:11 GF4->47:16

I don’t understand - why is the GF4 suddenly so slow for independent tris? Are you sure you really tested non-shared vertices the first time round?

I don’t understand either. The difference is the mesh size.
Yes, I’m sure it is independent tris in both cases. I even repeated the tests.
To add to the mystery, with display list, it is 73 regardless of grid size, while with VAR it jumps to 133 with a 6x32 mesh and decreases as N increases (Nx32 mesh)

I wonder what happens if I plug the DrawElements in a display list. Will it ‘remember’ the “vertex repeats” and utilize the vertex cache when I call the list?

I see no reason the vertex cache shouldn’t be active in a display list. The only requirement is that the number of vertices is known and they reside in some memory where the GPU can pull the vertices itself - and that should be the case for a DL…
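
That test is easy enough to set up - per the GL spec the arrays are dereferenced when the list is compiled, so whether the driver keeps the indexing (and with it the vertex cache reuse) internally is exactly what the experiment would show. Untested sketch:

    #include <GL/gl.h>

    GLuint compileIndexedDraw(const float *verts, const unsigned short *indices, int numIndices)
    {
        GLuint list = glGenLists(1);

        /* client state must be set up at compile time - the vertex data is
           captured into the list, not referenced again later */
        glEnableClientState(GL_VERTEX_ARRAY);
        glVertexPointer(3, GL_FLOAT, 0, verts);

        glNewList(list, GL_COMPILE);
            glDrawElements(GL_TRIANGLES, numIndices, GL_UNSIGNED_SHORT, indices);
        glEndList();

        glDisableClientState(GL_VERTEX_ARRAY);
        return list;            /* later: glCallList(list); */
    }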

I don’t think you can push 47M verts/sec over AGP, even if the verts are only 2 floats, and the AGP is x4

Why not? 4 bytes per float * 2 floats * 47 million = 376MB/s, easy for AGP 4x (1024MB/s). Even 77MVert/s is only 616MB/s, something the GPU can certainly achieve (I can push about 920MB/s into AGP if the GPU doesn’t use it at the same time).
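
The same back-of-the-envelope numbers as plain arithmetic:

    #include <stdio.h>

    int main(void)
    {
        double bytesPerVert = 2 * sizeof(float);                         /* 2 floats = 8 bytes */
        printf("47 MVert/s -> %.0f MB/s\n", bytesPerVert * 47e6 / 1e6);  /* ~376 MB/s */
        printf("77 MVert/s -> %.0f MB/s\n", bytesPerVert * 77e6 / 1e6);  /* ~616 MB/s */
        return 0;
    }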

Why not?

If you are setup limited, why should performance decrease with a larger mesh? Setup overhead shouldn’t change with mesh size. And the vertex cache shouldn’t have anything to do with it, as the GF2U can transform 31 million vertices/s for triangle strips even without the vertex cache… See what I mean?

I don’t understand either. The difference is the mesh size. To add to the mystery, with display list, it is 73 regardless of grid size, while with VAR it jumps to 133 with a 6x32 mesh and decreases as N increases (Nx32 mesh)

Two things come to mind here:

  • a jump from 47 MVert/s with the larger mesh to 133 MVert/s with the smaller one sounds like the effect of the vertex cache kicking in (although this is not possible, since you are not sharing any vertices)
  • on the other hand, going down from 133 MVert/s with VAR to 73 with the DL could well be explained by the DL being in AGP and VAR being in video memory…

Michael

Ok, I give up for the moment. Didn’t see anything wrong with the code at first sight.

Another data point: On a GF3 Ti500, I can do 37 million triangles/s on a 3440 triangle regular square grid, after applying nvtristrip, using strips. 2691 vertices actually get transformed, so that’s a consistent 29 million vertices/s. That’s no textures, no materials.

Michael