# Open source Tri Stripper (vs NvTriStrip)

I’ve written a brute-force stripper that needs about a minute for 1400 triangles… it’s designed for pre-calculating static objects.

How do I optimize it for a specific vertex cache?

Originally posted by HamsterofDeath:
[b]I’ve written a brute-force stripper that needs about a minute for 1400 triangles… it’s designed for pre-calculating static objects.

How do I optimize it for a specific vertex cache?[/b]

Well, I don’t know exactly how your stripper works, so I can’t say precisely what you’d have to change.

For the vertex cache part, Tri Stripper selects the triangles that are neighbours of the last strip built.
It then tries to build every possible strip from those triangles and keeps the one with the most cache hits.

> How do I optimize it for a specific vertex cache?

By reordering triangles so that vertices of triangle N are also vertices of triangle N-1, N-2, etc., until N-x falls out of the cache range.
Obviously that won’t be possible for all triangles, but if each new triangle introduces only 1 new vertex on average, your performance will be as good as a strip or fan.
Since in some cases you can introduce less than 1 new vertex per triangle, pure vertex cache optimization can theoretically end up faster than pure stripification.

If you’re not afraid of Pascal, you can find a fast (greedy) algorithm for vertex cache optimization in GLScene (http://glscene.org, function IncreaseCoherency in MeshUtils.pas; it optimizes a 2500-triangle mesh in less than 9 ms on an AXP 1800+).
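The greedy idea can be sketched in C++ (a minimal sketch, not GLScene’s actual Pascal code; `Tri`, `increase_coherency`, and the FIFO cache model are my own assumptions): repeatedly pick the unused triangle that has the most vertices still sitting in a simulated FIFO cache.

```cpp
#include <algorithm>
#include <cstddef>
#include <deque>
#include <vector>

struct Tri { unsigned v[3]; };

// Greedy cache-coherency reordering (a sketch): at each step, pick the
// remaining triangle with the most vertices in a simulated FIFO cache.
std::vector<Tri> increase_coherency(std::vector<Tri> tris, std::size_t cache_size)
{
    std::deque<unsigned> cache;                 // FIFO post-T&L cache model
    std::vector<bool> used(tris.size(), false);
    std::vector<Tri> out;

    auto in_cache = [&](unsigned v) {
        return std::find(cache.begin(), cache.end(), v) != cache.end();
    };

    for (std::size_t n = 0; n < tris.size(); ++n) {
        // Pick the unused triangle with the most cached vertices.
        std::size_t best = 0;
        int best_score = -1;
        for (std::size_t i = 0; i < tris.size(); ++i) {
            if (used[i]) continue;
            int score = in_cache(tris[i].v[0])
                      + in_cache(tris[i].v[1])
                      + in_cache(tris[i].v[2]);
            if (score > best_score) { best_score = score; best = i; }
        }
        used[best] = true;
        out.push_back(tris[best]);
        for (unsigned v : tris[best].v) {       // misses enter the FIFO cache
            if (!in_cache(v)) {
                cache.push_back(v);
                if (cache.size() > cache_size)
                    cache.pop_front();          // evict the oldest vertex
            }
        }
    }
    return out;
}
```

As written this is O(n²) per mesh; a fast implementation like GLScene’s would index candidate triangles by vertex instead of scanning every triangle at each step.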

I bet my algo gives a better result :-P, though it will take a few minutes…

I am very interested in this triangle stripper, since it uses a nicer algorithm than NvTriStrip (the NV one searches the entire edge list every time it looks for a matching edge).

However, I think you can achieve better cache coherency if you optimize only for lists, since you don’t have to share an edge with the current triangle when selecting the next one. Also, strip restarts might end up “flushing” the cache, whereas a list can go back and revisit older vertices before they drop out of the cache; a strip would have to restart to get near them, introducing a degenerate or a restart. You also wouldn’t have to worry about vertex ordering when constructing lists, since triangles are rearranged as a whole instead of being inserted into a strip, which is sensitive to clockwise/counterclockwise winding. If you run the Direct3D optimized mesh sample, you can see that lists give a slight advantage over strips.

Also, instead of using a sorted array, couldn’t you use a hash table to store and look up matching edges, bringing that operation from O(n log n) to O(n)?
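For illustration, that edge lookup could look like this with a hash table (a sketch using C++11’s std::unordered_multimap, which wasn’t available at the time of this thread; `make_edge` and `build_edge_map` are made-up names, not Tri Stripper’s actual code):

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

using Edge = std::uint64_t;

// Pack a directed edge into one 64-bit key. Direction matters for winding,
// so the pair is deliberately not sorted.
inline Edge make_edge(std::uint32_t a, std::uint32_t b)
{
    return (static_cast<Edge>(a) << 32) | b;
}

// Map each directed edge to the index of the triangle that owns it.
// Building is O(n); each lookup is expected O(1).
std::unordered_multimap<Edge, std::size_t>
build_edge_map(const std::vector<std::uint32_t>& indices)   // 3 indices per tri
{
    std::unordered_multimap<Edge, std::size_t> edges;
    for (std::size_t t = 0; t * 3 < indices.size(); ++t) {
        std::uint32_t a = indices[t*3], b = indices[t*3+1], c = indices[t*3+2];
        edges.emplace(make_edge(a, b), t);
        edges.emplace(make_edge(b, c), t);
        edges.emplace(make_edge(c, a), t);
    }
    return edges;
}
```

To find the neighbour of a triangle across edge (a, b), you look up the opposite directed edge (b, a), since a consistently wound neighbour stores it in that direction.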

I’m very interested in integrating this code into my engine as a run-time step when loading large meshes, such as terrain or animated models. I hope you can take it that extra step and optimize for lists rather than just doing stripification and combining the strips in an end-to-end fashion to construct a list.

[This message has been edited by Assassin (edited 01-19-2003).]

When optimizing the tri stripper for the vertex cache, I wondered whether it wouldn’t be better to generate an optimized triangle list instead.

But triangle lists don’t really have that many advantages over triangle strips.
Actually, from what I’ve heard, triangle strips are slightly faster in OpenGL, so… The thing is that with lists you still have to send all the indices, which isn’t the case with strips.
Also, it doesn’t seem that the vertex cache is flushed at a strip restart. If that were the case, the NvTriStrip and Tri Stripper vertex cache optimizers wouldn’t work so well.

I might extend the tri stripper in the future so that it can optimize lists too, but it’s not a priority right now.

However, I’ve been thinking about how to optimize the whole algorithm.
As you pointed out, a hash table would be better suited for the first part of the algorithm.
I’m also looking at how to replace the heap in the second part, to bring that part from O(n log n) to O(n) as well.
There are a few details to be sorted out, but I’m hoping to take the algorithm from O(n(log n + c)) to O(nc) in theory, and of course to see this reflected as a serious speed boost in practice.

Edit: There are a lot of things to do, and so little time to do them. I can’t really tell when this new version will come out, but I’m working on it.

[This message has been edited by GPSnoopy (edited 01-19-2003).]

Well, the reason I say that strip restarts will “flush” the cache is that you seem to select a new strip-start triangle by criteria unrelated to the cache contents. You appear to pick the “loneliest” triangle, without any check for a triangle that still has vertices in the cache when doing a restart/jump. I couldn’t quite decipher what’s happening in the code, but it seems that when you run out of triangles for the current strip, the selection of a new start triangle doesn’t include any cache-coherency checks.

Also, if you make effective use of vertex shaders on new hardware, you only need to upload most of the data once, mitigating the benefit of the strips’ smaller index list, and making the lists’ (potentially) higher performance more relevant.

And I just thought I’d mention that the “perfect” vertex cache miss rate is 0.5 - so for every 2 triangles you introduce 1 new vertex. This isn’t possible with a finite cache, but I think it’s possible to get below 0.7 with a good algorithm.
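That miss rate (often called ACMR, average cache misses per triangle) is easy to measure against a simulated FIFO cache. A sketch, with the function name `acmr` being my own: 3.0 means no vertex reuse at all, and the theoretical floor is the 0.5 mentioned above.

```cpp
#include <algorithm>
#include <cstddef>
#include <deque>
#include <vector>

// Average cache miss rate: cache misses divided by triangle count, measured
// against a simulated FIFO post-T&L cache (hits do not refresh an entry).
double acmr(const std::vector<unsigned>& indices, std::size_t cache_size)
{
    if (indices.empty())
        return 0.0;
    std::deque<unsigned> cache;
    std::size_t misses = 0;
    for (unsigned idx : indices) {
        if (std::find(cache.begin(), cache.end(), idx) == cache.end()) {
            ++misses;                       // vertex not in cache: transform it
            cache.push_back(idx);
            if (cache.size() > cache_size)
                cache.pop_front();          // evict the oldest vertex
        }
    }
    // indices.size() / 3 triangles, so misses per triangle is:
    return static_cast<double>(misses) * 3.0 / indices.size();
}
```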

You’re right, there is a case where I don’t check cache to select the start triangle.

Normally I use the neighbouring triangles of the latest strip to find the start triangle for the next strip; this way I know I have a good chance of making “cache happy” strips.
But it can happen that all the neighbours have already been taken. When that happens, I pick a start triangle from the heap (i.e. the loneliest triangle) without checking the cache.

Hopefully this case doesn’t happen often, and the triangle at the top of the heap has a good chance of being cache-friendly anyway.

Thanks for pointing out that problem.
The new structure I’m working on, which could replace the current triangle heap, should in theory offer better performance and be a little more cache friendly.

PS: An annoying thing: std::hash_multimap is exactly what I need to build the hash table. Unfortunately it won’t be standardized before the next C++ standard in 2004 or 2005, which means that currently each compiler has its own (unportable) version. Damn…
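Until then, a portable fallback is std::multimap: it has the same multi-valued interface, just with O(log n) lookups instead of expected constant time, which still beats rescanning the whole edge list. A sketch (`EdgeMap` and `triangles_on_edge` are made-up names for illustration):

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Ordered multimap from a packed edge key to triangle indices: a portable
// stand-in for hash_multimap, trading O(1) lookups for O(log n).
using EdgeMap = std::multimap<std::uint64_t, std::size_t>;

// Collect every triangle registered under one edge key.
std::vector<std::size_t> triangles_on_edge(const EdgeMap& edges, std::uint64_t key)
{
    std::vector<std::size_t> result;
    auto range = edges.equal_range(key);    // all entries sharing this key
    for (auto it = range.first; it != range.second; ++it)
        result.push_back(it->second);
    return result;
}
```

Swapping back to a hashed container later would only change the `EdgeMap` typedef, since multimap and hash_multimap expose near-identical interfaces.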

[This message has been edited by GPSnoopy (edited 01-20-2003).]