tri strips/fans hardware side: reset, degenerates, and anything else

The remapping of the indices is for reorganizing the vertex arrays (and not the indices themselves). The goal is to improve the memory accesses by improving the cache coherency (but do not mistake this for the post-T&L cache, which is something completely different).
For example, a vertex array that is accessed sequentially is faster than one that is indexed randomly.
1, 2, 3, 2, 3, 4, 3, 4, 5, etc. will be faster than 1, 456456, 2547, 84125, etc.

IMHO, one good way to reorganize the vertex array is to take the indices one by one and to reconstruct the vertex array in the same order (it's bound to work fine with triangle strips generated by NvTriStrip or Tri Stripper, because of the post-T&L cache algorithms used there).
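A minimal sketch of that reconstruction (hypothetical Python, not actual Tri Stripper or NvTriStrip code; the function name is made up):

```python
# Rebuild a vertex array so that vertices appear in the order the index
# buffer first references them. Sequential indices then touch sequential
# memory, which helps the pre-T&L (memory) cache.

def remap_vertices(vertices, indices):
    """Return (new_vertices, new_indices) with vertices reordered by first use."""
    remap = {}            # old index -> new index
    new_vertices = []
    new_indices = []
    for i in indices:
        if i not in remap:
            remap[i] = len(new_vertices)
            new_vertices.append(vertices[i])
        new_indices.append(remap[i])
    return new_vertices, new_indices

# Example: a strip-like index order over an arbitrary vertex array.
verts = ['a', 'b', 'c', 'd', 'e']
new_v, new_i = remap_vertices(verts, [1, 2, 3, 2, 3, 4])
print(new_i)   # [0, 1, 2, 1, 2, 3] -- now mostly sequential
print(new_v)   # ['b', 'c', 'd', 'e']
```

Note that the indices themselves still refer to the same triangles; only the storage order of the vertices changes.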

The triangle stripifier with the red lettered logo is Stripe, not Tri Stripper. :wink:

On the Tri Stripper webpage you can find a comparison with NvTriStrip, although it's a bit outdated now because Tri Stripper has improved since then (and probably NvTriStrip too).

The main advantage of NvTriStrip is that it outputs better results, but the difference compared to Tri Stripper is minimal. That said, I've had people tell me Tri Stripper gave better results for them, though I could not benchmark this myself.

Tri Stripper is similar to NvTriStrip in that it takes a triangle list and only outputs triangle strips. It can take the presence of a T&L cache into consideration, but you can also disable that (which is an advantage over NvTriStrip).
Tri Stripper is also more robust than NvTriStrip: it can handle not-so-common geometry (e.g. a triangle that has more than 3 neighbours). But it's also much faster (about 1 second vs. minutes for tens of thousands of triangles).

NvTriStrip is not slow because of a lack of connectivity information (Tri Stripper also has to compute it), but IIRC it uses some algorithms that have a complexity of O(n^2).

I suggest you look at the Tri Stripper webpage for more info on the algorithms that are used; again, it's a bit outdated and my English is not so great, but it'll give you a good overview.

Originally posted by GPSnoopy:
The remapping of the indices is for reorganizing the vertex arrays (and not the indices themselves). The goal is to improve the memory accesses by improving the cache coherency (but do not mistake this for the post-T&L cache, which is something completely different).
For example, a vertex array that is accessed sequentially is faster than one that is indexed randomly.
1, 2, 3, 2, 3, 4, 3, 4, 5, etc. will be faster than 1, 456456, 2547, 84125, etc.

i definitely would never have guessed that. it seems odd to me that the hardware would be organized that way. can you or anyone make an architecture argument for this?


The triangle stripifier with the red lettered logo is Stripe, not Tri Stripper. :wink:

hmmm, maybe i overlooked yours then. i promise to give it a look as soon as i've posted this. i think i looked at two, one 'atc' or something, and probably stripe.


The main advantage of NvTriStrip is that it outputs better results, but the difference compared to Tri Stripper is minimal. That said, I've had people tell me Tri Stripper gave better results for them, though I could not benchmark this myself.

i’m actually working on an interesting project right now that might make for a good benchmark. i’m stripping about 200,000 small very similar meshes. you can find a recent visual at:

http://arcadia.angeltowns.com//share//genesis-mosaics-lores.jpg

assuming that url is correct. to me, the resulting strips don't look terribly cache friendly, especially given the simplicity of the meshes. looks like there is a lot of room for improvement. those strips were done with nvidia's utility library btw.


NvTriStrip is not slow because of a lack of connectivity information (Tri Stripper also has to compute it), but IIRC it uses some algorithms that have a complexity of O(n^2).

yeah, i imagined as much… just wishful thinking on my part that connectivity might be the issue. just out of curiosity, what percentage of the time goes to connectivity versus cache constraints/strip fitting?

i regularly strip fairly large meshes for general projects, but the project in the screen above is basically composed of a whole lot of relatively small meshes… so in that case the quadratic performance is not a major concern. i have had to walk away from my machine in the past for about 10 minutes while nvidia's stripper spits out a fair-sized mesh… but stripping these 200k little meshes generally runs about a day. and i suspect in the future the size of the meshes will probably go up significantly for different optional resolution databases. i'm a little bit frightened about how that might scale… 2 or 3 days, maybe more… so basically i am looking for faster alternatives.

especially if and when i provide a conversion tool for end users to use on their own machines. users would have the option either to download the database for their hardware and needs, or to generate it locally. the download for a database that takes about a day to generate is around 20MB. but downloading also adds traffic to the download server, and for larger databases and users with lower-end modems, it might be in these users' interest to generate the data locally. so ideally, to reach a happy medium, i would just like to get the database generation tool as streamlined as possible, and only host common, smaller databases… because the range of options could be too wide to satisfy everyone's hardware and application needs.


I suggest you look at the Tri Stripper webpage for more info on the algorithms that are used; again, it's a bit outdated and my English is not so great, but it'll give you a good overview.

i will give it a look. just out of curiosity… are you fully comfortable with the nvidia algorithm? or are the poorer quality strips just a natural result of the obvious performance boost? in the end i could just compile yours and nvidia's code (and maybe others) into the utility, and it would just be up to the user to choose between time and performance when choosing options for compiling their database.

if you are interested in what i'm doing with all of this stripping… there is another very informative thread in this forum (advanced) titled something like, "unique ROAM VBO issues and a clincher", if my memory serves me.

sincerely,

michael

PS: in defense of nvidia's stripper, my meshes are really not very strip friendly on the face… much more fan-like in organization. so that could explain to a fair degree why the strips might not look too cache friendly. what would really be nice, i figure, would be a stripper that can decide, against some user-defined visual bias, to flip the inside edges of quads here and there to promote better stripping. how much trouble do you think it would be to integrate that kind of feature into your system?

PPS: counter to nvidia's defense, it seems they could've stuck a degenerate triangle here and there to get a much better cache layout. there appear to be some degenerates in the highlighted mesh in the screen referenced above, even with primitive_restart enabled… meaning nvidia's stripper does at least have some capacity to negotiate between the appropriateness of a degenerate versus a restart.

can you or anyone make an architecture argument for this
Yes. It’s just like any other cache: the big cost is latency, so when you take the time to read your vertex (which might be 40 bytes), you read some more (say, 64 or 128 bytes) in an aligned burst, which takes very little extra time compared to reading only the 40 bytes. With luck, the next amount of data you read is close to what you read previously, so there’s no need to go to the bus at all – a net savings of your read latency (which is usually much bigger than your burst time).
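To make that concrete, here's a toy burst counter (the 32-byte vertex, 128-byte line, and single-line cache are illustrative assumptions, not real hardware figures):

```python
# Count how many aligned bursts (cache lines) a stream of vertex
# fetches triggers, with a worst-case cache that only remembers the
# most recently fetched line.

VERTEX_SIZE = 32    # bytes per vertex (assumed)
LINE_SIZE = 128     # bytes per burst/cache line (assumed): 4 vertices/line

def lines_touched(index_stream):
    """Count bus transactions: every time we leave the current line we
    pay a new burst (and its latency)."""
    bursts = 0
    current = None
    for i in index_stream:
        line = (i * VERTEX_SIZE) // LINE_SIZE
        if line != current:
            bursts += 1
            current = line
    return bursts

sequential = [0, 1, 2, 1, 2, 3, 2, 3, 4, 3, 4, 5]
scattered = [1, 456456, 2547, 84125, 2, 456457,
             2548, 84126, 3, 456458, 2549, 84127]
print(lines_touched(sequential))  # 4 bursts for 12 fetches
print(lines_touched(scattered))   # 12 bursts for 12 fetches
```

Same number of vertex fetches, three times the read latency paid in the scattered case, and a real memory system with a few buffered lines would widen the gap further.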

Anyway, if you’re spending significant time on your tri stripper, FIRST make sure you actually need it. There ARE cases where a tristrip will draw faster than a triangle list, but in most cases, a triangle list will be just as fast as a tristrip, because of the post-TnL cache. Don’t do complicated work if you don’t have to. The difference is the amount of index data you have to push across the bus, but again, sequential bursting is very fast.
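The index-data difference is easy to put numbers on (a sketch assuming 16-bit indices and one ideal strip, ignoring restarts and degenerates):

```python
# Index buffer size: a triangle list needs 3 indices per triangle;
# a single ideal strip needs n + 2 indices for n triangles.

def list_bytes(n_tris, index_size=2):
    """Index bytes for a triangle list (16-bit indices assumed)."""
    return 3 * n_tris * index_size

def strip_bytes(n_tris, index_size=2):
    """Index bytes for one ideal strip, no restarts/degenerates."""
    return (n_tris + 2) * index_size

n = 10000
print(list_bytes(n))   # 60000 bytes
print(strip_bytes(n))  # 20004 bytes
```

So strips cut index traffic roughly 3x, but since sequential index bursting is cheap anyway, that saving often doesn't show up as frame time.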

thanks for the explanation… and i would’ve replied sooner, but i didn’t realize there was a new post in this thread.

for what it's worth, that was my best guess, like i said above… but it's good to hear it for sure.

is that how the cache works, period? because i always imagined the cache working by recognizing when an already-used vertex is being referenced again. or does it just rely on reading big chunks?

if it doesn't do both, i get the impression that this is a bad move for hardware, because it wouldn't make much sense for the cache model to work best for triangle lists while most people use strips for optimized rendering.

however, for the app i'm interested in here, i'm pretty much out in the cold on the burst read, because my vertices are all over the place due to a subdivision pattern. there is no reason why i can't use an index for a level of indirection, though, other than the extra cpu-side offset addressing… or technically, since the vertices are hardly ever accessed directly, remapping the triangle vertex indices would do, without the extra indirection. strip-dependent vertex buffers would be out of the question for my app, but lists would just take up extra memory, even if they were comparably efficient. i figure strippers do the best they can anyhow to account for the vertex buffer order, if that isn't the only thing the pre-t&l cache cares about.

is that how it works… is there a pre-tnl cache that reads video memory in bursts, and a post-tnl cache that works like a traditional cache?

i really don't understand this completely. do hardware caches hold addressable copies of memory, or just adjacent copies of memory, or one or the other?

so that's another optimization to think about.

just to reemphasize this question:

does the cache not use any model that matches already-used vertices by index? and is the cache always linear? or can it be filled from vertices throughout the bound buffer?
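one common mental model for the post-t&l cache (an assumption on my part, not something confirmed by the posters here) is a small fifo keyed by vertex index: a vertex whose index is still in the fifo skips the transform stage entirely. a quick simulation of that model (the 16-entry size is just a typical assumed value):

```python
# FIFO post-T&L cache model: lookups are by vertex *index*, the oldest
# entry is evicted when the FIFO is full.

from collections import deque

def transformed_count(indices, cache_size=16):
    """Count vertices that actually go through T&L under a FIFO model."""
    fifo = deque(maxlen=cache_size)  # oldest entries fall off automatically
    misses = 0
    for i in indices:
        if i not in fifo:
            misses += 1
            fifo.append(i)
    return misses

# A strip-ordered triangle list re-uses recent indices, so most
# lookups hit the cache: 12 index fetches, only 6 transforms.
tris = [0, 1, 2, 1, 2, 3, 2, 3, 4, 3, 4, 5]
print(transformed_count(tris))  # 6
```

under this model the answer would be yes: re-used vertices are matched by index, independently of where they sit in the bound buffer; the "linear bursting" part belongs to the memory-side (pre-t&l) cache instead.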


as long as i'm here: there is definitely a major bug in the nvidia triangle stripper which flips the winding of strips. it is dependent on the triangle indices fed into the generate function, which means you can potentially avoid the bug by just rotating your triangle indices per triangle.

i just upped my triangle count 3-fold, and the bug showed up all over the place. fortunately i was able to rotate my triangle indices to get rid of the bug's effect.

just to be totally clear: when i say 'rotate', the winding remains the same. x becomes y, y becomes z, z becomes x, etc…
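in code, that rotation is just a cyclic shift of the three indices, which preserves the winding order (a throwaway sketch, not the actual tool's code; swapping any two indices, by contrast, would flip the winding):

```python
# Cyclically rotate a triangle's indices: (x, y, z) -> (y, z, x).
# The triangle and its winding are unchanged; only the starting
# vertex differs, which can be enough to dodge an order-dependent bug.

def rotate_triangle(tri):
    x, y, z = tri
    return (y, z, x)

tri = (7, 2, 9)
print(rotate_triangle(tri))                                    # (2, 9, 7)
print(rotate_triangle(rotate_triangle(tri)))                   # (9, 7, 2)
print(rotate_triangle(rotate_triangle(rotate_triangle(tri))))  # (7, 2, 9)
```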