Post-TnL Vertex Cache use on nVidia?

Can anybody give me a hint as to when the vertex cache is actually used? When running my tests with an OpenGL library wrapper I can see that glDrawRangeElements is just using the immediate mode API. I’m using CVA and DrawRangeElements with 32-bit indices on an FX 5800, drivers 43.63 on Linux.

Do I need to go VAR to get the cache? That would be a lot of work, which I would rather avoid if it doesn’t help. VBO looks like a more interesting alternative right now, but I’d like some confirmation before going in that direction. Or is it enough to just use 16-bit indices (when possible)? Any experience?

Thanks

Dirk

CVA isn’t going to do it for you.

You will need to use VAR or VBO. It may be that nVidia’s current VBO extension isn’t sufficiently optimized to use the post-T&L cache, but even if it isn’t yet, it will be soon enough.

IIRC you don’t need to use anything but indexed vertex arrays to get use of the post-T&L cache.

Are you actually referencing identical vertices at all? It only works if the indices of the vertices are the same.

Any indexed draw call (DrawElements, DrawRangeElements) should take advantage of the post-T&L cache. Cache usage is based on the index.

If you’re using a wrapper library, it may be expanding your Draw calls, but the driver should not be.

Thanks -
Cass
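
To make "cache usage is based on the index" concrete, here is a minimal sketch that counts how many vertices would need transforming under an assumed FIFO cache keyed on index value. The FIFO policy, the `cache_size` of 16 and the name `count_transforms` are illustration-only assumptions, not documented driver or hardware behavior.

```python
from collections import deque

def count_transforms(indices, cache_size=16):
    """Count indices that miss an assumed FIFO cache keyed on index value."""
    cache = deque(maxlen=cache_size)  # oldest entry drops out when full
    transforms = 0
    for i in indices:
        if i not in cache:
            # Miss: the vertex would have to be transformed again.
            transforms += 1
            cache.append(i)
    return transforms

# Two triangles sharing an edge, referencing the shared corners by the same index.
shared = [0, 1, 2, 2, 1, 3]
# Same shape, but the shared corners were duplicated under new indices (4 and 5).
duplicated = [0, 1, 2, 4, 5, 3]

shared_cost = count_transforms(shared)          # 4 transforms
duplicated_cost = count_transforms(duplicated)  # 6 transforms
```

In this model, two vertices with identical positions but different indices never hit the cache, which is the point made above about identical indices.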

Cass,

Matt has previously posted here saying that you won’t get T&L cache utilization for plain DrawElements calls without VAR or similar extensions. My take on that is that the driver doesn’t scan the entire index array to calculate min/max index value, but instead just expands the array as it finds it, and does a DrawArrays() equivalent.

It may be that this has changed, which I believe would be good news for many common usage patterns.
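
As a rough illustration of the expansion described in this post (a hypothesis about driver behavior, not something confirmed here), the sketch below dereferences every index into a flat vertex stream, the way a DrawArrays()-equivalent would consume it. The helper name `expand_indexed` and the sample data are made up for the example.

```python
def expand_indexed(vertices, indices):
    """Dereference each index, producing a non-indexed vertex stream."""
    return [vertices[i] for i in indices]

# Four unique vertices forming a quad, drawn as two indexed triangles.
vertices = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
indices = [0, 1, 2, 2, 1, 3]

stream = expand_indexed(vertices, indices)

unique_count = len(set(indices))  # 4 distinct vertices in the indexed form
stream_count = len(stream)        # 6 vertices once expanded
```

Once expanded, the stream carries no indices at all, so an index-keyed cache would have nothing to match on and every vertex in the stream would be processed.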

Jon,

I could be wrong here, but I don’t think so. I’m 99% positive that DrawElements works fine with CVA, VAR, and VBO. I’m not so sure about plain vertex arrays though.
I’ll check.

Thanks -
Cass

[This message has been edited by cass (edited 05-21-2003).]

JWatte, are we talking about the same thing? I was under the impression the post-T&L cache was fixed in size, at around 20 vertices on GF4. When the next index is fetched, it is compared to the indices in the post-T&L cache, and if it is found there, the already transformed vertex is used instead.

I’m pretty sure this is how it worked on GF3/4 etc. It shouldn’t make any difference where the vertex/index data is being held.

Or have I lost the plot?

I have never been able to get any performance boost from trying to use the vertex cache, and I’ve never been able to get implicit strips to work either. My explicit strips give a tremendous boost to performance, but when I use GL_TRIANGLES it simply doesn’t matter what order I give the triangles/vertices in.
I always used standard vertex arrays until VBO (I avoid vendor-specific extensions like the plague). This post made me think that may have been the cause, but still no cigar. I’ve never heard of such a requirement before.
I must be missing something; I’ve heard of implicit stripping too often to think it just doesn’t work. Are there other strict requirements?

You got me worried for a sec, so I pulled out an old benchmark and just checked: at least on a GF3 with 44.03 drivers, the cache is still active via a simple glDrawElements, both in immediate mode CVAs and through display lists (I did not check VBO or VAR, though I guess it’s active for these too).

By simply reordering triangles, I’m getting roughly a 50% framerate increase on my small test model (5k triangles).

At the time, that convinced me not to bother with full-blown stripification and just go for increased vertex index coherency: the algorithm is simple and fast, and on large models you can get quite close to what NVTriStrip gives, in just a few seconds (vs. hours). The post-TnL cache is great.
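
The reordering algorithm itself is not given in this post, so the following is only a hypothetical sketch of what "increasing vertex index coherency" could look like: a greedy pass that always emits the remaining triangle with the most vertices in a simulated FIFO cache. The function names, the FIFO model and the cache size of 4 are all assumptions, and the search is quadratic, so it is meant to show the idea rather than to be fast.

```python
from collections import deque

def count_misses(triangles, cache_size):
    """Count vertex misses under an assumed FIFO cache keyed on index."""
    cache = deque(maxlen=cache_size)
    misses = 0
    for tri in triangles:
        for v in tri:
            if v not in cache:
                misses += 1
                cache.append(v)
    return misses

def reorder_for_cache(triangles, cache_size):
    """Greedy reorder: emit the triangle with the most cached vertices next."""
    remaining = list(triangles)
    cache = deque(maxlen=cache_size)
    ordered = []
    while remaining:
        # Score = number of this triangle's vertices already in the cache.
        # max() keeps the earliest candidate on ties.
        best = max(range(len(remaining)),
                   key=lambda k: sum(v in cache for v in remaining[k]))
        tri = remaining.pop(best)
        ordered.append(tri)
        for v in tri:
            if v not in cache:
                cache.append(v)
    return ordered

# Six triangles of a strip over vertices 0..7, given in scrambled order.
scrambled = [(0, 1, 2), (4, 3, 5), (2, 1, 3), (6, 5, 7), (2, 3, 4), (4, 5, 6)]
reordered = reorder_for_cache(scrambled, cache_size=4)

before = count_misses(scrambled, cache_size=4)  # 14 misses
after = count_misses(reordered, cache_size=4)   # 8 misses
```

On this tiny input the greedy pass recovers the strip order, so each vertex is transformed exactly once in the model. Real meshes and real cache sizes will behave differently.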

Ok, I just did some further tests and saw something like a 500% performance increase with VBO and 100% with standard vertex arrays. The test data I was using happened to already be in pretty optimal strip order, so I wasn’t seeing an improvement from my rearrangement of it.

Still, strips seem to work well, but some approximate further vertex cache awareness doesn’t seem to do much at all. I guess you have to do it properly, like NVTriStrip, to get some results there.

And still, explicit joined primitives give me far better performance. I can only guess that this is due to the bandwidth savings for indices? VBO indices don’t buy me much, but 16-bit over 32-bit does.
Possibly it’s only because I have very high polygon counts.
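
For a feel of the index bandwidth involved, here is a small arithmetic sketch. The helper names and the one-million-triangle figure are hypothetical. In OpenGL terms, 16-bit indices correspond to GL_UNSIGNED_SHORT and 32-bit indices to GL_UNSIGNED_INT.

```python
def index_bytes(index_count, bits):
    """Size in bytes of an index array with the given index width."""
    return index_count * bits // 8

def fits_in_16_bits(indices):
    """True if every index can be stored as an unsigned 16-bit value."""
    return all(0 <= i <= 0xFFFF for i in indices)

# Hypothetical mesh: one million triangles drawn as a GL_TRIANGLES list.
index_count = 3 * 1_000_000

bytes_32 = index_bytes(index_count, 32)  # 12,000,000 bytes
bytes_16 = index_bytes(index_count, 16)  # 6,000,000 bytes
```

A mesh with more than 65,536 vertices would have to be split into chunks before 16-bit indices are an option, which `fits_in_16_bits` can be used to check per chunk.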

> JWatte, Are we talking about the same thing?

Nutty: yes, we are. Basically, what I remember hearing, and what would make sense given the other reports below your posts in this thread, is that plain vertex array DrawElements() calls will be turned into DrawArrays() inside the driver, and thus there will be no index sent to the card, and thus no cache utilization.

It seems from reading the other posts that display lists, CVA and VAR all allow the driver to send indices and thus take advantage of the cache, so it would seem that only plain, unextended vertex arrays have this problem.

Are you saying that 2 or 3 vertices will hang around (enough for strips) but not more? At least 2 must be exploiting reindexing or I wouldn’t see a 100% increase from triangle arrangement on standard vertex arrays.
I seem to remember that the last two vertices are kept by triangle setup or something like that, regardless of the vertex cache. But the indices would still need to be considered.

Maybe I’ll give CVA another shot too, I never got anything out of it on NVIDIA boards but I did on some other implementations.
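
To test the "2 or 3 vertices hanging around" idea on paper, this sketch converts a strip into the equivalent GL_TRIANGLES index list and counts misses under an assumed 2-entry FIFO keyed on index. The model and the function names are assumptions for illustration and say nothing definite about what the hardware or triangle setup actually does.

```python
from collections import deque

def strip_to_triangles(strip):
    """Expand a triangle strip into a list of triangles with consistent winding."""
    tris = []
    for k in range(len(strip) - 2):
        a, b, c = strip[k], strip[k + 1], strip[k + 2]
        # Odd-numbered triangles swap their first two vertices to keep winding.
        tris.append((b, a, c) if k % 2 else (a, b, c))
    return tris

def count_misses(triangles, cache_size):
    """Count vertex misses under an assumed FIFO cache keyed on index."""
    cache = deque(maxlen=cache_size)
    misses = 0
    for tri in triangles:
        for v in tri:
            if v not in cache:
                misses += 1
                cache.append(v)
    return misses

strip = list(range(8))            # 8 indices, 6 triangles
tris = strip_to_triangles(strip)

strip_index_count = len(strip)    # 8 indices sent as a strip
list_index_count = 3 * len(tris)  # 18 indices sent as GL_TRIANGLES
misses = count_misses(tris, cache_size=2)  # 8: one miss per unique vertex
```

In this model a strip-ordered triangle list needs only two retained vertices to avoid retransforming anything, which would be consistent with seeing a gain from strip order even if no larger cache were in play. The list form still sends more than twice as many indices as the strip.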

Originally posted by cass:
[b]
Any indexed draw call (DrawElements, DrawRangeElements) should take advantage of the post-T&L cache. Cache usage is based on the index.

If you’re using a wrapper library, it may be expanding your Draw calls, but the driver should not be.
[/b]

Thanks Cass, a little more testing narrowed it down: I had the display list cacher active, and everything was put into a dlist. In that case the driver unrolls it (the wrapper lib is really just a 1:1 wrapper, it doesn’t do anything but logging). When not using the dlist, no immediate mode commands show up, so I guess the cache is used.

That would also nicely explain why dlist mode is slower… :-/

Thanks and sorry for the confusion. But judging from the other posts I was not the only confused one.

Dirk