triangle strip cache size

Hi.

What do the cache sizes mean for triangle strips?

As written by Nvidia:

On cache sizes:
Note that it’s better to UNDERESTIMATE the cache size instead of OVERESTIMATING.
So, if you’re targetting GeForce1, 2, and 3, be conservative and use the GeForce1_2 cache
size, NOT the GeForce3 cache size.
This will make sure you don’t “blow” the cache of the GeForce1 and 2.
Also note that the cache size you specify is the “actual” cache size, not the “effective”
cache size you may have heard about. This is 16 for GeForce1 and 2, and 24 for GeForce3.

I have never had problems rendering triangle strips, but I have not used them much.
Does the cache size correspond to the number of consecutive calls to DrawArrays? Or am I wrong?

Do all graphics cards have such a cache?

The cache size determines how far back in the index stream a lookup can reach and “hit” a previously transformed vertex with the same index, which saves transforming that vertex again.

This makes it matter how you order your triangles within a single index list; grouping uses of the same index together will likely lead to higher vertex transform throughput, if you’re vertex transform limited.

The vertex cache cannot re-use data between successive calls to DrawElements() (or any other call), only data within a single indexed call.

Thank you.

If I understood correctly (with the help I got), the cache helps when rendering meshes that share the same transformation matrix, and only if they are rendered without switching to a new transformation matrix in between. (Sorry for the clumsy phrasing; I cannot put it better at the moment.)

Is the cache size related to how many vertices an array can contain?

And what about graphics cards other than Nvidia’s? Where can I find documentation about that?

This cache is not meant to speed up rendering the same mesh multiple times. It is not big enough for that; a typical post-T&L vertex cache holds about 16 entries (it depends on the GPU).

Its purpose is to improve vertex transform throughput. For example, if you specify two connected triangles as a triangle list, the indices will look something like 1,2,3 and 2,3,4.

As you can see in that example, without a vertex cache the 3D card has to transform vertices 2 and 3 twice.

Ordering your vertex indices with the vertex cache in mind can really improve the speed of a 3D scene that is T&L limited.
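
To make the 1,2,3 / 2,3,4 example concrete, here is a minimal sketch that counts vertex transforms for one index list. It assumes a plain FIFO cache, which is only a model (real GPUs may use a different replacement policy), and the name count_transforms is made up for this illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Counts how many vertices must be transformed for one indexed draw call,
// assuming a FIFO post-T&L cache that holds `cache_size` entries.
std::size_t count_transforms(const std::vector<unsigned>& indices,
                             std::size_t cache_size)
{
    assert(cache_size > 0);
    std::deque<unsigned> cache;   // oldest entry at the front
    std::size_t transforms = 0;
    for (unsigned index : indices) {
        if (std::find(cache.begin(), cache.end(), index) != cache.end())
            continue;             // hit: the transformed vertex is reused
        ++transforms;             // miss: the vertex is transformed again
        if (cache.size() == cache_size)
            cache.pop_front();    // evict the oldest entry
        cache.push_back(index);
    }
    return transforms;
}
```

With this model, the index list 1,2,3,2,3,4 costs 4 transforms with a 16-entry cache, versus 6 when the cache is too small to help (for example a single entry).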

“Note that it’s better to UNDERESTIMATE the cache size instead of OVERESTIMATING.”
In theory, yes. In practice, not really.
Vertex cache optimizers and triangle stripifiers, such as NvTriStrip and Tri Stripper, are not perfect and will rarely use the cache at 100%. So in practice, underestimating or overestimating will lead to worse performance than giving the right cache size, but the actual performance loss depends on the GPU and the optimizer being used so you cannot say which one will be worse.

Originally posted by jwatte:
The vertex cache cannot re-use data between successive calls to DrawElements() (or any other call), only data within a single indexed call.
I think it can re-use data between successive calls to DrawElements() but only if you do not change the vertex pointers.
After all, DrawElements() only supplies new indices to be drawn, and these remain compatible with the previous indices as long as you did not change the vertex pointers.

I remember reading a document somewhere about optimizing a mesh in such a way that theoretically it would be optimal for any cache size…
Can’t remember the link though :-/

From what I remember, it’s not optimal for any cache size, but scales well for different cache sizes.

http://citeseer.ist.psu.edu/bogomjakov01universal.html

BTW, is anyone using this reordering scheme successfully?

Cheers.

So I was still misunderstanding, even with the explanations… Nevertheless, I now understand the purpose of this cache better.

LogicalError, I have only ever seen the Nvidia program for building triangle strips from plain triangles. It is completely free and comes with no license. Its Readme file explains the caches, but it is only intended to optimize for NV cards.
Maybe the link given is what you were trying to remember.

Nicolas, I’ll look at this link and hopefully post here later about it.

Thank you all.

Originally posted by Nicolas Lelong:
http://citeseer.ist.psu.edu/bogomjakov01universal.html
Ah, yes, that’s the one.
Not ‘optimal’ but ‘scales well’, my mistake.

Originally posted by jwatte:
The vertex cache cannot re-use data between successive calls to DrawElements() (or any other call), only data within a single indexed call.
Can anybody confirm this? That would be a serious limitation, and a very strong argument for linking strips with degenerate triangles.

I read somewhere that NVIDIA and ATI cards use different types of cache. NVIDIA cards can benefit from the post-T&L cache only when indexed primitives are used, which suggests that NVIDIA’s cache uses indices to look up transformed vertices. ATI cards, on the other hand, use something like pointers (maybe derived from the vertex pointer) for the lookup, so they benefit with non-indexed primitives too. All of this means (it is only my opinion) that when you draw an indexed primitive more than once with the same pointers (vertex, normal, …), you can get a greater performance boost on ATI cards, of course only if the primitive has just a few vertices.

Originally posted by dirk:
Can anybody confirm this? That would be a serious limitation, and a very strong argument for linking strips with degenerate triangles.
Serious limitation only if you use many very small strips, so that the vertices that need to be reshaded are a significant part of the total strip size, in which case using degenerate triangles to connect strips would already be highly recommended.
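
For reference, linking two strips with degenerate triangles can be sketched as below. This is a minimal illustration, not taken from any stripifier: join_strips and count_real_triangles are hypothetical helper names. A degenerate triangle here means one that repeats an index, so it has zero area and draws nothing.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Joins two triangle strips into one index list by repeating the last index
// of `a` and the first index of `b`. The repeated indices form zero-area
// (degenerate) triangles between the two strips.
std::vector<unsigned> join_strips(const std::vector<unsigned>& a,
                                  const std::vector<unsigned>& b)
{
    if (a.empty()) return b;
    if (b.empty()) return a;

    std::vector<unsigned> out = a;
    out.push_back(a.back());
    // Winding alternates along a strip, so `b` must start at an even
    // position; pad once more when `a` has an odd number of indices.
    if (a.size() % 2 != 0)
        out.push_back(a.back());
    out.push_back(b.front());
    out.insert(out.end(), b.begin(), b.end());

    assert((out.size() - b.size()) % 2 == 0); // `b` starts at an even position
    return out;
}

// Counts the triangles in a strip that use three distinct indices.
std::size_t count_real_triangles(const std::vector<unsigned>& strip)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i + 2 < strip.size(); ++i) {
        const unsigned x = strip[i], y = strip[i + 1], z = strip[i + 2];
        if (x != y && y != z && x != z)
            ++count;
    }
    return count;
}
```

Joining 0,1,2,3 and 4,5,6,7 this way gives 0,1,2,3,3,4,4,5,6,7: ten indices in a single call, of which only the four original triangles have any area.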

Originally posted by Matt Zamborsky:
I read somewhere that NVIDIA and ATI cards use different types of cache. NVIDIA cards can benefit from the post-T&L cache only when indexed primitives are used, which suggests that NVIDIA’s cache uses indices to look up transformed vertices. ATI cards, on the other hand, use something like pointers (maybe derived from the vertex pointer) for the lookup, so they benefit with non-indexed primitives too. All of this means (it is only my opinion) that when you draw an indexed primitive more than once with the same pointers (vertex, normal, …), you can get a greater performance boost on ATI cards, of course only if the primitive has just a few vertices.
ATI cards have both a pre-T&L and a post-T&L cache. The post-T&L cache works on indices; the pre-T&L cache works like a regular cache mirroring memory.

Yes, I didn’t say that NVIDIA or ATI have only a pre- or a post-T&L cache. Both NVIDIA and ATI have the caches you described, Humus. But the post cache on NVIDIA cards works differently from the one on ATI cards.

Originally posted by Humus:
Originally posted by dirk:
Can anybody confirm this? That would be a serious limitation, and a very strong argument for linking strips with degenerate triangles.
Serious limitation only if you use many very small strips, so that the vertices that need to be reshaded are a significant part of the total strip size, in which case using degenerate triangles to connect strips would already be highly recommended.

It would be an optimization if the cache were valid across consecutive draw calls.

NVIDIA can profit from T&L post-cache only if indexed primitive is used.
A single glDrawArrays call won’t reuse the same vertices, but a consecutive call to glDrawArrays can, so the post-cache could remain valid.
I really don’t know what they do.

If Nvidia does use indices and they do prefer ushort and have a limit on max indices and max vertices, maybe this hints at something.

Well, many thanks for all of this!

So the post-cache works on indices and the pre-cache on memory ‘mirroring’, regardless of the graphics card.

Does this mean that if you don’t use indices with your vertex arrays, you get no cache optimization?

Does this mean that if you don’t use small triangle strips (with or without degenerate triangles?), you also get no cache optimization?

So, is a degenerate triangle just a triangle whose facing is inverted relative to the mesh (i.e. the degenerate triangle gets front-face culled while the mesh doesn’t)?

– always plenty of questions

Does this mean that if you don’t use indices with your vertex arrays, you get no cache optimization?

Yes and no :P

If you don’t use indices on NVIDIA cards, you have only the pre-cache, which hides memory latency. On ATI cards, on the other hand, you have both caches.

And of course, if you use glBegin/glEnd you have neither the pre-cache nor the post-cache.

Yes, sorry for my poor wording. It should have been:

Does this mean that if you don’t use indices with your vertex arrays, you get no post-cache optimization?

And likewise for the second question (pre-cache).

Well, to avoid misunderstandings, which indices are we talking about? Those we can pass to DrawArrays, or those we give to IndexPointer?
Personally, I don’t like calling IndexPointer: it makes me think the faces would be better off properly ordered rather than left in a ‘rubbish’ order, which would avoid jumping around in memory many times during one draw call.

Now that you have brought this up, do those caches work only for strips or for all arrays? (I think for all arrays, in fact.)

The indices passed to DrawElements; DrawArrays takes no index list, and IndexPointer supplies color indices, not vertex indices. And the answer to your second question is YES. It can be used for any primitive.

I think it can re-use data between successive calls to DrawElements() but only if you do not change the vertex pointers.
This is not correct. Even if you don’t change any of the pointers, you could change the data that gets pointed at – the GL specification allows it (and programs do this for streaming draws). Thus, the post-T&L cache is only valid for a single invocation of DrawElements().

The LockArraysEXT() extension COULD be modified or interpreted in this context to allow for this semantic, but I don’t think any driver implements this optimization, and it probably won’t ever be done.

It’s interesting to note that VBO may allow the IHV to actually optimize and retain the post-T&L cache, because they can detect modification of the data through the VBO API. However, this would probably be complicated enough that I don’t think any of them do this, either. The gain would be pretty small, compared to the function call overhead of each DrawElements() call.
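
The per-call behaviour described above can be modelled with a small sketch. It assumes a FIFO post-T&L cache that is emptied at the start of every draw call, which is a simplification for illustration rather than a description of any particular driver; count_transforms_across_calls is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Each inner vector holds the indices of one DrawElements-style call.
// The modelled cache is emptied between calls, so nothing carries over.
std::size_t count_transforms_across_calls(
    const std::vector<std::vector<unsigned>>& draw_calls,
    std::size_t cache_size)
{
    assert(cache_size > 0);
    std::size_t transforms = 0;
    for (const std::vector<unsigned>& indices : draw_calls) {
        std::deque<unsigned> cache;   // fresh (empty) cache for every call
        for (unsigned index : indices) {
            if (std::find(cache.begin(), cache.end(), index) != cache.end())
                continue;             // hit within the current call
            ++transforms;             // miss: transform the vertex
            if (cache.size() == cache_size)
                cache.pop_front();    // evict the oldest entry
            cache.push_back(index);
        }
    }
    return transforms;
}
```

Under this model, drawing 0,1,2 and 1,2,3 as two separate calls costs 6 transforms, while the same six indices submitted in one call cost 4.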