136 M verts/sec on GeForce4 Ti ?

I suppose independent triangles w/ no vertex reuse must be the way to go, then – I was a bit confused. This might easily be a setup bottleneck you’re measuring, not a transform one.

There is more in the chip that’s changed than just the number of pipelines and the clock speed.

Also, even 100 Mvertices/s with 3F vertices is more than AGP 4x allows (100M * 12 bytes = 1.2GB/s, more than the ~1GB/s AGP 4x provides). Use 2F or 2S vertices. Video memory is also good.

  • Matt

Originally posted by mcraighead:
I suppose independent triangles w/ no vertex reuse must be the way to go, then

I also tried independent triangles


This might easily be a setup bottleneck you’re measuring, not a transform one.

This is exactly my suspicion. I don’t think there is a way to separate the two. This is also why I think you cannot spec the first (the transform rate) if the second (the setup rate) is lower.


Also, even 100 Mvertices/s with 3F vertices is more than AGP 4x allows. Use 2F or 2S vertices.

But I thought VAR took care of that. The only thing that is transferred is the index list, which is shorts. Pity I can’t pack it into a display list (as spec’ed, a display list pulls the vertex data at compile time).
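
For reference, a minimal sketch of that compile-time behavior (variable names are placeholders, not code from my benchmark):

/* Per the GL spec, glDrawElements inside glNewList dereferences the
   arrays immediately, so the 2F vertex data is captured into the list
   (and may be moved to fast memory by the driver), but the index list
   is frozen along with it. */
GLuint list = glGenLists(1);
glEnableClientState(GL_VERTEX_ARRAY);
glVertexPointer(2, GL_FLOAT, 0, vertexData);   /* 2F vertices */
glNewList(list, GL_COMPILE);
glDrawElements(GL_TRIANGLE_STRIP, numIndices, GL_UNSIGNED_SHORT, indices);
glEndList();
/* per frame, little more than the glCallList token crosses the bus: */
glCallList(list);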

Anyway, I did use 2F vertices.

Matt, do you have a benchmark program that demonstrates this performance?

[b]
Video memory is also good.

[/b]
This is why I started out with display lists (of 2F vertex data). Actually, this is what gave me peak performance on previous hardware, and on the GF3 it also gets the fastest time (although only by a small lead over VAR).

Can you explain how you test this?

Well, I have two different memcpy routines: one from the AMD processor optimization guide, and one very simple SIMD one from the net. The AMD one gives me about 930MB/s for AGP, and about 560MB/s for vidmem. The simple SIMD one gives me about 730MB/s for AGP, and 750MB/s for vidmem. This is all tested on an idle GPU.
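
For the curious, the simple SIMD routine is roughly of this shape (a sketch only, not the actual code used; it assumes SSE, 16-byte-aligned buffers, and a size that is a multiple of 64 bytes):

#include <stddef.h>
#include <xmmintrin.h>   /* SSE intrinsics */

/* Streaming SIMD copy: the non-temporal stores bypass the cache,
   which is what you want when writing to AGP or video memory. */
void simd_memcpy(float *dst, const float *src, size_t bytes)
{
    size_t i, n = bytes / sizeof(float);
    for (i = 0; i < n; i += 16) {                 /* 64 bytes per iteration */
        __m128 a = _mm_load_ps(src + i);
        __m128 b = _mm_load_ps(src + i + 4);
        __m128 c = _mm_load_ps(src + i + 8);
        __m128 d = _mm_load_ps(src + i + 12);
        _mm_stream_ps(dst + i,      a);
        _mm_stream_ps(dst + i + 4,  b);
        _mm_stream_ps(dst + i + 8,  c);
        _mm_stream_ps(dst + i + 12, d);
    }
    _mm_sfence();                                 /* flush the write-combining buffers */
}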

It gets interesting when the GPU is busy rendering (i.e., pulling vertex data out of the same memory). Then, the AMD-AGP memcpy drops from 930MB/s to 510MB/s, which is reasonable, since now the GPU and CPU have to share the 1024MB/s bandwidth you get with AGP 4x. However, if I use vidmem under load, the SIMD-vidmem memcpy only drops from 730MB/s to about 700MB/s, which is reasonable as well, since video memory is >2GB/s, and the simultaneous GPU/CPU access is barely noticed.

This carries through to all other tests where there is a difference between AGP and vidmem as well. So, when I test the memory I get from wglAllocateMemoryNV, I compare its speed with the speeds above (which were obtained using a small requested memory size and no textures loaded, so I am sure I got the correct memory); sometimes the characteristics exactly match the AGP case, sometimes the vidmem case.

Whether the data still ends up in vidmem if I request vidmem but the memory has AGP speed characteristics, I can’t say, but I doubt it… (btw, you need fastwrites enabled for the vidmem-speeds above)
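
For concreteness, the allocation side of the test looks roughly like this (a sketch; per the NV_vertex_array_range spec, the priority parameter steers the driver toward AGP or video memory, and the entry point comes from wglGetProcAddress):

/* Ask for AGP vs. video memory; whether you really got what you asked
   for is what the memcpy timings above are meant to reveal. */
void *agpMem = wglAllocateMemoryNV(size, 0.0f, 0.0f, 0.5f);  /* priority ~0.5: AGP */
void *vidMem = wglAllocateMemoryNV(size, 0.0f, 0.0f, 1.0f);  /* priority 1.0: video memory */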

About the GF 4: I haven’t got one yet, unfortunately

Michael

[This message has been edited by wimmer (edited 03-21-2002).]

I suppose independent triangles w/ no vertex reuse must be the way to go, then – I was a bit confused.

Yes, I think testing indices/s (by using the vertex cache) is quite silly if you want to test transform speed.

Testing independent triangles as Matt suggests will give you the “real” speed of the transform engine, independent of your geometry.

Then the number of indices reflects the actual number of vertices transformed. What speed did you get with independent triangles (i.e., triangles which DON’T share vertices)?

This is exactly my suspicion. I don’t think there is a way to separate the two. This is also why I think you cannot spec the first (the transform rate) if the second (the setup rate) is lower.

If you use independent triangles, you need to transform 3 vertices per triangle, not 1, so the bottleneck should go back to transformation and not setup. Still assuming you want to test vertices/s, not indices/s…
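
In code, the non-shared case just means duplicating every referenced vertex so the cache can never hit (a sketch, with placeholder names):

/* Flatten an indexed mesh into independent, non-shared triangles:
   every vertex is unique, so each one really has to be transformed. */
unsigned i;
for (i = 0; i < numIndices; i++)
    flatVerts[i] = verts[indices[i]];        /* duplicate shared vertices */
glEnableClientState(GL_VERTEX_ARRAY);
glVertexPointer(2, GL_FLOAT, 0, flatVerts);  /* 2F vertices */
glDrawArrays(GL_TRIANGLES, 0, numIndices);   /* numIndices = 3 * numTris */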

BTW, you can get the 30+ number on GeForce2 Ultra too…

Obviously, since the Ultra is clocked higher than the GF3…

One more thing about vertices vs. indices: If I use a regular grid and render it as individual triangles in strip order on the GF3 Ti 500, I achieve 28.5MVertices/s (that means actually transformed vertices, counted by simulating the vertex cache in software as in the NvTriStrip example), but 86MIndices/s (i.e., Million Indices sent over the bus!). So there doesn’t seem to be a setup bottleneck in this case, and you should achieve at least as much on the GF4…
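
If anyone wants to reproduce that count, a minimal FIFO simulator along the lines of the NvTriStrip one looks like this (cache size and replacement policy are assumptions):

#include <string.h>

#define CACHE_SIZE 18   /* effective entries on GF3/GF4; use 10 for GF2 */

/* Returns how many vertices of an index stream actually get
   transformed, assuming a FIFO post-T&L cache. */
unsigned simulateVertexCache(const unsigned short *indices, unsigned count)
{
    int cache[CACHE_SIZE];
    int head = 0;
    unsigned i, transformed = 0;
    memset(cache, -1, sizeof(cache));          /* -1 = empty slot */

    for (i = 0; i < count; i++) {
        int j, hit = 0;
        for (j = 0; j < CACHE_SIZE; j++)
            if (cache[j] == indices[i]) { hit = 1; break; }
        if (!hit) {
            transformed++;                     /* miss: this vertex is transformed */
            cache[head] = indices[i];          /* FIFO: evict the oldest entry */
            head = (head + 1) % CACHE_SIZE;
        }
    }
    return transformed;
}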

This is why I started out with display list (of 2F vertex data).

I think Cass or Matt once stated that vertices are kept in AGP memory for display lists… But they may include other optimizations, e.g. also keeping the indices in fast memory…

Michael

[This message has been edited by wimmer (edited 03-21-2002).]

Originally posted by wimmer:
[b]
Testing independent triangles as Matt suggests will give you the “real” speed of the transform engine, independent of your geometry.

Then the number of indices reflects the actual number of vertices transformed. What speed did you get with independent triangles (i.e., triangles which DON’T share vertices)?
[/b]

Indeed with independent tris I get 167 Mverts/sec. It looks like the setup was the bottleneck.
But then with this method of counting, the GeForce3 (or 2) achieves 73… higher than the stated 32.

It looks like not only the performance changed from GF2/3 to GF4, but also the measured ‘entity’ (comparing apples to oranges…).
The ratio when comparing the same things is the expected 2.3 (dual vs. single geometry pipeline, plus the increased core clock).

Correction
The 178 Mverts/sec was achieved with independent triangles with shared vertices, so I guess I am seeing the effect of the “post-T&L vertex cache” (plus the fact that it is not a tri strip, so triangle setup happens only once every 3 vertices).

With independent triangles with no shared vertices, I get 134 Mverts/sec.
At last, close to the nVidia stated number…

so I guess I am seeing the effect of the “post-T&L vertex cache”

Yes, exactly, that’s what I’ve been trying to say all along! You need to distinguish between vertices/s and indices/s! If vertices are shared, you are counting indices/s and, most likely, some kind of setup overhead. With independent triangles, as in “not-sharing-any-vertex” triangles, you get the real speed of the transform engine.

But wow, if you really achieve 134Mvert/s, then they really improved the vertex engine a lot!

Michael

Originally posted by wimmer:

But wow, if you really achieve 134Mvert/s, then they really improved the vertex engine a lot!
Michael

It’s the dual pipeline vs. the single one in GF2/GF3, plus the core clock increase.
No more, no less.

If you went by the previously claimed numbers (for the GF2; I think with the GF3 they didn’t state anything), you would have been led to believe the increase is more dramatic - from 32 to 136. But the point is that those numbers measure different things.
Back then they measured transform + setup.
Now they measure transform without setup.
If you use 1-vertex triangles, you’re fine :wink:

so what’s the real speedup?

what do you get on a GF2/3 with independent tris with no shared vertices?

Michael

Originally posted by wimmer:
[b]so what’s the real speedup?

what do you get on a GF2/3 with independent tris with no shared vertices?

Michael[/b]

  1. tri strip, 6xM mesh (good caching) - 47
  2. indep tris, shared verts, 6xM mesh - 93
  3. indep tris, shared verts, 5xM mesh - 114
  4. indep tris, non-shared verts - 28

All numbers are million vertices per second.

With the GF4, I think I got 178 with (3), and 134 with (4).
But I will have access to the machine only on Sunday to repeat the test exactly.

I know, this gives very different factors.
I think what accounts for this is a greater improvement in the GF4 in transformation than in setup (and both do “post-T&L caching”).

I still don’t understand how setup comes into play when there is no rasterization (glCullFace…). Is the facingness computation so intensive? The GL spec talks about computing the projected triangle’s area and looking at its sign to determine facingness, but I thought it could be done without the full exact area computation.
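
For what it’s worth, the spec’s computation boils down to a single cross product in window space, so it should be cheap (a sketch):

/* Facingness as the GL spec describes it: the sign of the signed area
   of the window-space triangle. Only the sign is needed, so no 1/2
   factor and no exact area computation are required. */
float signedArea2(float x0, float y0, float x1, float y1,
                  float x2, float y2)
{
    return (x1 - x0) * (y2 - y0) - (x2 - x0) * (y1 - y0);
}
/* with GL_CCW front faces: front-facing iff signedArea2(...) > 0 */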

How come triangle strips are so slow???

I mean, they’re the best case for drawing triangles, and yet you manage to get better results with independent triangles?! OK, they are shared vertices, but still, shouldn’t tri strips lead to better performance?

What do you mean by “6xM” and “5xM” mesh?

Originally posted by GPSnoopy:
[b]How come triangle strips are so slow???

I mean, they’re the best case for drawing triangles, and yet you manage to get better results with independent triangles?! OK, they are shared vertices, but still, shouldn’t tri strips lead to better performance?
[/b]

With a triangle strip, you activate the triangle setup engine for every vertex you send.
With independent triangles, you activate it only once every three vertices.

If the triangle setup is the bottleneck, indep tris will be faster.
Of course this is for indep tris sharing vertices, and hardware capable of post-transform result caching.


What do you mean by “6xM” and “5xM” mesh?

6xM means a grid of 6 columns and M rows,
so 6M squares and 6M*2 = 12M triangles.
The triangles are traversed along the rows, back and forth (not in ‘raster order’), to make better use of the vertex cache.
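
A sketch of such an index ordering (placeholder names; it builds independent-triangle indices for a grid of C columns and R rows of squares):

#include <stdlib.h>

/* Vertex index = row * (C+1) + col; even rows are walked left-to-right,
   odd rows right-to-left, so neighboring triangles stay in the
   post-T&L cache. Emits 6 indices (2 triangles) per grid square. */
unsigned short *buildGridIndices(int C, int R, unsigned *outCount)
{
    unsigned short *idx = malloc((size_t)C * R * 6 * sizeof *idx);
    unsigned n = 0;
    int r, i;
    for (r = 0; r < R; r++) {
        for (i = 0; i < C; i++) {
            int c = (r & 1) ? (C - 1 - i) : i;   /* serpentine column order */
            unsigned short v00 = (unsigned short)(r * (C + 1) + c);
            unsigned short v01 = v00 + 1;                        /* right neighbor */
            unsigned short v10 = (unsigned short)(v00 + C + 1);  /* next row */
            unsigned short v11 = v10 + 1;
            idx[n++] = v00; idx[n++] = v10; idx[n++] = v01;      /* lower tri */
            idx[n++] = v01; idx[n++] = v10; idx[n++] = v11;      /* upper tri */
        }
    }
    *outCount = n;
    return idx;
}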

Wait, wait! That’s very misleading!

Triangle strips are by no means slower than independent triangles in his case! Also, I don’t really see a setup bottleneck in this data.

Consider this:
In 1. (strips), Moshe is pushing 47 million indices/s. Since it is one large strip, this corresponds to about the same number of triangles, i.e., 47 million triangles/s.

In 2. (tris, 6xM), Moshe is pushing 93 million indices, but each triangle needs 3 indices instead of just 1, so the actual triangle rate is 31 million triangles/s.

In 3. (tris, 5xM), it is 114 million indices with 3 indices/tri, so we get 38 million triangles/s.

So you see that triangle strips actually give you the very best performance for this mesh, and independent triangles are way slower. Obviously, this cannot be explained by a setup bottleneck, because setup is per triangle, and so for strips he is doing 47 Million setups/s, and for independent triangles only 31 or 38…

I wonder why the 6xM mesh is slower than 5xM. If you send the triangles indices in the correct order on a Geforce 3 (which has 18 effective cached vertices), you get maximum reuse, i.e., you transform each vertex in the whole mesh exactly once, even in the 6xM mesh (well, actually, there is exactly one vertex you need to transform twice). This should go for both strips and independent triangles.

Moshe: the dual pipeline + clock speed increase lets me get from 30 million vertices/s to 75 million vertices/s, but not to 134! There’s something wrong here.

Ok, I think this topic is getting very confusing, especially for other readers.

What everybody has to keep in mind is that we are measuring three different entities here:

  1. actual transformed vertices/s (i.e., a vertex taken from the vertex cache does NOT count here)

  2. triangles/s (this could show up setup bottlenecks), to keep in mind how much geometry you are actually creating

  3. sent indices/s (this basically measures how effectively your geometry is organized and how well you exploit the vertex cache). This can range from a maximum of 3 times the triangle rate (if you use independent triangles) down to essentially the triangle rate (if you use one long strip).

Now for the meshes discussed here, we have about twice as many triangles as vertices, but counting it exactly, you find that you send about 5 indices per vertex if you use independent triangles, and 2 indices per vertex if you use strips. Actually, in the example above, 47 * 5/2 ≈ 118 is not so far from the 114, so you can see where this comes from…
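
In terms of actual calls, for T triangles the difference is just the count parameter (a sketch):

/* "indices/s" is driven by the second parameter of glDrawElements: */
glDrawElements(GL_TRIANGLES, 3 * T, GL_UNSIGNED_SHORT, triIndices);        /* 3 indices per tri */
glDrawElements(GL_TRIANGLE_STRIP, T + 2, GL_UNSIGNED_SHORT, stripIndices); /* ~1 index per tri */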

I hope this clarifies a bit, and please, let’s talk about indices/s from now, not vertices/s, if we talk about the second parameter to glDrawElements…

Michael

I was referring to ‘speed’ when counting vertices (or indices), not triangles. This is what nVidia seems to be measuring in their spec. So in this ‘definition’ of speed, independent triangles are faster.
Of course, ultimately triangle strips are more efficient, because they really produce a visible (= useful) triangle for every vertex.

Maybe I can clarify a bit by summarizing the measurements, GF2 against GF3. I will give numbers for both triangles/second (in millions) and vertices/second. I always refer to vertices, but I say when vertices are shared and when they are not. Of course, in a strip, vertices are shared by definition.

The GF2 is an Ultra with 250Mhz core clock
The GF3 is with 300Mhz core clock

X:Y means X million vertices/second and Y million triangles/second.
The first number is the result for the GF2, the second for the GF3.

MxN means a mesh of M squares by N squares, which has (M+1)(N+1) points and 2M*N triangles.

If you like, you can replace “vertices” by “indices”.

VAR is always used.

  1. indep tris, non-shared verts, 6x32 -> 35:12 134:44
  2. indep tris, shared verts, 6x32 -> 90:30 180:60
  3. tri strip, 6x32 -> 31:31 60:60
  4. indep tris, shared verts, 8x32 -> 70:23 180:60
  5. indep tris, shared verts, 9x32 -> 67:22 177:59
  6. indep tris, shared verts, 12x32 -> 67:22 150:50

My conclusions:

a. The GF3’s transform increased more over the GF2 than the setup did. This is why with indep. tris (1) it gets 134 over the GF2’s 35 (3.8x), while with the tri strip (3) it ‘only’ doubles, from the GF2’s 31 to 60.

b. The GF4 has a bigger post-T&L vertex cache. This is why, when going from the ‘tight’ 6x32 mesh (2) to 8x32 (4), the GF3 maintains performance while the GF2 goes down from 90 to 70.

c. nVidia’s stated “136 million vertices per second” measures performance unhindered by setup, and with the post-T&L vertex cache (with some typical statistics).

BTW, I am in no way trying to diminish the GF4, just to understand what is going on…

Could you send me your test application (tfautre@pandora.be), or can we download it somewhere?

I wanna test it 'cause I can’t seem to hit 31MTris/s with my GF2U, no matter what I do with my programs. So I wanted to see if it’s because of my config or because of my programs.

I suspect that I’m never able to use video memory and that I’m at best using AGP mem, probably because I use large objects (~60K tris). I get about 10MTris/s at best using triangle strips. I conclude that my AGP bus is the limiting factor ’cause the amount of data transferred is about the speed of AGP 2X.

Originally posted by GPSnoopy:
Could you send me your test application (tfautre@pandora.be), or can we download it somewhere?

I am using Linux.
If that is useful to anybody, please let me know and I will post it.

I was referring to ‘speed’ when counting vertices (or indices), not triangles. This is what nVidia seems to be measuring in their spec.

No, Nvidia doesn’t state the number of “indices” transformed, but the raw number of vertices their transformation engine can actually compute. I.e., this doesn’t take the vertex cache into account at all! You are not actually measuring the spec Nvidia gives. For this, you have to do a vertex cache simulation and find out how many vertices actually have to be transformed (vs. being taken from the cache).

c. nVidia’s stated “136 million vertices per second” measures performance unhindered by setup, and with post-T&L vertex cache

As above, no, it measures performance without the vertex cache (obviously, because you achieve the 134 MVerts/s on a mesh where the vertex cache is never active!).

BTW, whenever you say GF3, you actually mean GF4, right?

The Geforce 2 has a vertex cache of 16 entries, with 10 entries actually usable due to pipelining. The GF3 and GF4 have 24 entries, with 18 being usable.

Another thing which strikes me is that your mesh is way too small. You are not even pushing 1000 triangles here, so you might be far from peak performance.

Just to give you an indication:

I can achieve >23 million triangles/s on a Geforce 2 GTS (slower than your Ultra), with one texture applied, standard 3-float vertices, and non-short (i.e., integer) indices, in a small window. The mesh used for that is a simple heightfield of 86400 triangles in 600 strips and is totally vertex-cache unfriendly (almost no cache reuse). I use indexed triangle strips with interleaved arrays and VAR in video memory for that, and I can achieve this figure on a Celeron 433!
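
Roughly, that setup looks like this (a sketch with placeholder names; the NV entry points come from wglGetProcAddress):

/* Interleaved T2F_V3F data in VAR video memory, drawn as indexed
   strips; the memory is filled once and never touched again. */
GLsizei bytes = numVerts * sizeof(TexVertex);               /* 2 floats tex + 3 floats pos */
void *var = wglAllocateMemoryNV(bytes, 0.0f, 0.0f, 1.0f);   /* ask for video memory */
memcpy(var, vertexData, bytes);                             /* upload once */
glVertexArrayRangeNV(bytes, var);
glEnableClientState(GL_VERTEX_ARRAY_RANGE_NV);              /* turn VAR on */
glInterleavedArrays(GL_T2F_V3F, 0, var);
glDrawElements(GL_TRIANGLE_STRIP, stripLen, GL_UNSIGNED_INT, stripIndices);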

If I leave out the texture, I get >24.5 million triangles/s, but at this stage I start to get CPU limited, so the GF2 might be able to do more. Likewise, I can’t test independent triangles, because then the index traffic kills my CPU.

So try a larger mesh and see whether you get higher performance - otherwise your results look quite strange to me, because you are never achieving anything like the expected vertex rate with your mesh! Please also try tri strips for the other mesh sizes.

If you have a look at the learning_VAR demo from Nvidia, you will also see that the performance drops a lot if you reduce the number of triangles. In your test, in 3. the geometry engine is actually slower (about 18.5 million transformations/s if you do the vertex cache simulation) than in 6. (22 million transformations/s), so what you might be seeing here is the effect of using a larger mesh in 6. than in 3.

What CPU do you have for those tests?

Michael

GSnoopy:

There is no way to keep a GF2 Ultra busy with AGP2x if you need to transfer geometry every frame.

The only chance you have is

  • making really sure you have video memory with VAR (I explained above how you do that)
  • putting as much of your geometry into video memory and NOT TOUCHING it afterwards.

If you have more geometry in a frame than fits into VAR memory, then you are out of luck. The best you can try is probably an intelligent caching scheme.

If your geometry fits, then you should be able to achieve about 19 million vertices/s with one infinite light, one texture applied, a very small viewport, and a large mesh, provided you use interleaved arrays. If you have long strips, this equates to 19 million triangles/s as well.

Michael

As above, no, it measures performance without the vertex cache (obviously, because you achieve the 134 MVerts/s on a mesh where the vertex cache is never active!).

Yes, you are right, sorry. The 180 (like in (2)) is where the vertex cache is evident.


BTW, whenever you say GF3, you actually mean GF4, right?

Wow! What a typo! At least I was consistent …
You are right of course. Please read GF4 where I wrote GF3 throughout.


Another thing which strikes me is that your mesh is way too small. You are not even pushing 1000 triangles here, so you might be far from peak performance.

Of course I am repeating the drawing between time measurements. But why should it matter?
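
For reference, the measurement loop is essentially this (a sketch; getSeconds and REPS are placeholders):

glFinish();                            /* drain pending work first */
double t0 = getSeconds();
int i;
for (i = 0; i < REPS; i++)
    glDrawElements(GL_TRIANGLES, numIndices, GL_UNSIGNED_SHORT, indices);
glFinish();                            /* wait until the GPU is really done */
/* note: this counts indices/s unless you divide by a cache simulation */
double mPerSec = (double)numIndices * REPS / (getSeconds() - t0) / 1e6;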


I can achieve >23 million triangles/s on a Geforce 2 GTS, with one texture applied, standard 3-float vertices, and non-short (i.e., integer) indices, in a small window.

I achieve this figure (on a GF2 GTS) too, with simpler very-long triangle strips compiled into a display list (instead of interleaved VAR). The numbers agree with the 250/200 core clock difference between the Ultra and the GTS.


… and I can achieve this figure on a Celeron 433!

What does the CPU have to do with it? With VAR or a display list, it’s being ‘read’ from video memory, not sent over AGP, when it’s drawn.


Likewise, I can’t test independent triangles, because then the index traffic kills my CPU.

This is a limitation of VAR. If I read the spec correctly, you can’t keep the index list in a display list (a display list ‘pulls’ the vertex data at compile time), so with VAR the indices are always sent over AGP. With this, the CPU indeed matters. But my original benchmarks used display lists for exactly that reason. Matt’s dictum “use VAR” moved me off them.


So try a larger mesh and see whether you get higher performance

I did, but I will try again with good old display lists.

[b]
… otherwise your results look quite strange to me, because you are never achieving anything like the expected vertex rate with your mesh!
[/b]

What is the expected vertex rate?
134 without tri setup, and 60 with, seems right to me. Do you expect something different?


In your test, in 3. the geometry engine is actually slower (about 18.5 million transformations/s if you do the vertex cache simulation) than in 6. (22 million transformations/s), so what you might be seeing here is the effect of using a larger mesh in 6. than in 3.

How did you do the vertex cache simulation?

Do you see any disadvantage of using display lists instead of VAR?
I mean, if the driver implementation is good (and I think it is), then a display list allows for no data to be sent over AGP at all (except the small glCallList token). I think in this case mesh size shouldn’t matter either, unless I get to a really small number of tris per glCallList call.


What CPU do you have for those tests?

730Mhz P3 for the GF2 Ultra
1700Mhz P4 for the GF4 (four :wink: )

some more results:
(DL = display list)

  1. DL indep tris, non shared verts, 64x64 GF2U->35:12 GF4->77:26
  2. VAR indep tris, non shared verts, 64x64 GF2U->32:11 GF4->47:16
  3. DL tri strip, 64x64 GF2U->31:31 GF4->59:59
  4. VAR tri strip, 64x64 GF2U->31:31 GF4->59:59

I guess 1 vs. 2 shows that AGP can be a limiting factor in VAR due to index transfer (BTW, I am using shorts)

The triangles have an area of 1 pixel.