VBO layout of attributes

You mean “swap” operations in triangle strips? Sure, the number of swaps should be as low as possible… but, um, it’s hard to believe that you got a better result with simple triangle lists. (Although I don’t want to argue about that, since you measured it.)

Maybe I should do some tests on current hardware.

CatDog

You got a benefit (increase of performance) by dropping triangle strips? I don’t understand how this is possible, could you explain it a little bit further? What degenerates?
knackered would have to say for sure, but his tristrips-to-trilists transition may also have been an explicit-vertex to indexed-vertex transition (e.g. gl{Multi}DrawArrays -> glDraw{Range}Elements). With the former, you have no potential to get < 1 vertex shader run per triangle (the average cache miss ratio, or ACMR). With indexed vertices, you do.

So in addition to eliminating degenerates, perhaps there was enough locality in his data that he got some vertex cache-induced speed-up even before sorting triangles to get closer to 0.5 ACMR. Also, the driver/hardware is probably more efficient in handling indexed primitives (beyond vert cache benefit), which would make perfect sense as that’s where the real performance is.
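Roughly, the two call styles look like this (a minimal sketch - the helper names are made up, and both assume the relevant buffers are already bound):

```cpp
#include <GL/gl.h>   // plus glext.h / a loader for glDrawRangeElements where needed

// Non-indexed strip: every entry is shaded once, so even a perfect strip
// can't get below ~1 vertex shader run per triangle.
void drawStripNonIndexed(GLsizei stripVertexCount)
{
    glDrawArrays(GL_TRIANGLE_STRIP, 0, stripVertexCount);
}

// Indexed triangles: repeated indices can hit the post-transform cache,
// so a cache-aware ordering can approach ~0.5 shader runs per triangle.
void drawTrianglesIndexed(GLuint maxVertexIndex, GLsizei indexCount)
{
    glDrawRangeElements(GL_TRIANGLES, 0, maxVertexIndex,
                        indexCount, GL_UNSIGNED_INT, 0);
}
```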

So yeah, strips are dead. The only thing they may still get you is saving a few bytes by not sending duplicate indices to the GPU, and I’ve never seen that be a bottleneck. Besides, there’s no cross-vendor equivalent of NV_primitive_restart, the conditions under which you can use even that are extremely limited, and you sure don’t want to break a batch just to start a new strip. And degenerate triangles are out as an option because they’re not recognized until after the vertex shader, so they eat triangle setup time and vertex index bandwidth.

Of course, indexed primitives are for sure better than explicit vertex arrays. If knackered had moved from vertex arrays to indexed vertices, he would have said so! (Err, knackered?)

I’m currently using glMultiDrawElements(GL_TRIANGLE_STRIP). The strips are somewhat optimized for cache performance, especially for good “postcache” usage (as V-man called it).

I was very happy with the results, until now… would you say it’s worth a try to change to glMultiDrawElements(GL_TRIANGLES)? (I’m really curious whether it is worth it.)

<s>Ah, and what about vertex attributes like normals or shader attributes? Using triangle lists would mean duplicating these much more often than with strips. Don’t you see this as a drawback?</s> Edit: Ouch, forget this last one, please. :)

CatDog

If you’re curious, I’d say yes - but switch to glDrawElements instead. If you were only using the multi call to join strips, you no longer need it once you switch to triangles.

Past posts have stated that glMulti*Draw APIs don’t really help you at all unless you’re CPU limited, suggesting it may just be a for loop down in the GL driver, maybe saving a stack frame or two of API calls and a little setup/validation. That may have changed. Try it and see.
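If it really is just a loop, it would amount to something like this (pure speculation about driver internals; the real glMultiDrawElements also takes a type parameter, which I’ve fixed to unsigned ints here):

```cpp
#include <GL/gl.h>

// What the multi call may boil down to inside the driver (speculation):
void multiDrawAsLoop(GLenum mode, const GLsizei* count,
                     const GLvoid* const* indices, GLsizei primcount)
{
    for (GLsizei i = 0; i < primcount; ++i)
        glDrawElements(mode, count[i], GL_UNSIGNED_INT, indices[i]);
}
```

With GL_TRIANGLES there are no per-strip boundaries left to respect, so the concatenated index lists can go out in a single glDrawElements call.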

I can second knackered’s results, I’ve noticed a small benefit from using indexed triangle lists as opposed to indexed strips. This is with simply using the same indices as an optimized triangle strip, minus the degenerate stitches.
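The conversion is mechanical - roughly this (a quick sketch, not my exact code):

```cpp
#include <vector>

// One strip's indices in, a triangle list out; degenerate triangles
// (two equal indices, i.e. the stitches) are dropped on the way.
std::vector<unsigned> stripToList(const std::vector<unsigned>& strip)
{
    std::vector<unsigned> list;
    for (size_t i = 2; i < strip.size(); ++i) {
        unsigned a = strip[i - 2], b = strip[i - 1], c = strip[i];
        if (a == b || b == c || a == c)
            continue;                      // degenerate - skip it
        if (i % 2 == 0) {                  // even triangle: keep the order
            list.push_back(a); list.push_back(b); list.push_back(c);
        } else {                           // odd triangle: swap to preserve winding
            list.push_back(b); list.push_back(a); list.push_back(c);
        }
    }
    return list;
}
```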

Well of course I’m using indexed primitives - I’m not an idiot.
Catdog, you’re using glMultiDrawElements(GL_TRIANGLE_STRIP)???
And you’re getting spec performance? I always read that the multi functions just call the single functions in a loop in the driver, which would kill performance. Unless you’re compiling it all into a display list? In which case take my word for it, you’ll get much better performance using VBO with some decent manager code to reduce buffer binds (on nvidia anyway).
Oh, and always use glDrawRangeElements - otherwise something on the vendor side (driver or GPU) has to work out that information every time you draw a batch (makes about 0.01 Mtris/sec difference on my setup, but every little helps).
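To make that concrete, the only difference is the extra range hint (values made up):

```cpp
#include <GL/gl.h>

GLuint  first = 0, last = 9999;   // made-up bounds on the indices in the batch
GLsizei indexCount = 36000;       // made-up index count

void drawBatchBothWays()
{
    // Driver/GPU has to discover the referenced vertex range itself:
    glDrawElements(GL_TRIANGLES, indexCount, GL_UNSIGNED_INT, 0);

    // Here [first, last] must bound every index used, so the vertex range
    // can be fetched without inspecting the index buffer:
    glDrawRangeElements(GL_TRIANGLES, first, last,
                        indexCount, GL_UNSIGNED_INT, 0);
}
```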
Thanks, I forgot about the ACMR, Dark Photon. There are some articles scattered around t’internet that describe the improvement in ACMR you get when you use indexed tri lists as opposed to indexed strips.

Well, now I am really very curious.

knackered, what is so bad about letting the driver loop within glMultiDrawElements(GL_TRIANGLE_STRIP)? I just thought there had to be a good reason for introducing this routine in GL 1.4 - so I used it. Maybe it gives the driver the opportunity to build something like a display list internally by itself!? Whatever. Oh my… do I always have to know driver internals to get the best from OpenGL? Obviously, the answer is: yes.

So, that’s what I’m taking home today:

  • Strips are dead. (Still, I have to see this with my own eyes.)
  • Use good old glDrawRangeElements()

CatDog

If the driver’s calling glDrawElements under the hood, then the GPU is going to be starved of things to do while it waits for the next batch. This can’t be happening, otherwise I’m sure you’d have noticed very bad performance.
Just do some benchmarking - measure what million-tris-per-second you’re getting and compare it with the figure on the box your graphics card came in.
I agree that all this stuff should be hardware-abstracted, but the consensus of developers seems to be against it, for reasons best left to the imagination… I suppose some people get a kick out of tweaking mesh data based on a vendor string. The fact that at least 4 papers have been published about how best to format your mesh data for an unknown cache size speaks volumes to me. The objections can’t be about load times, because d3dx OptimizeFaces makes a negligible difference to load times in my system.

I just tried my favorite STL file. It’s a very good 3D scan of a statue that my stripifier can turn into a single mesh of tristrips. So this mesh is rendered by a single glMultiDrawElements call.

In total, it contains 1.8M triangles. On my Asus 7950GX2 it renders at around 80 FPS (fixed lighting with one light source), so that’s around 144M tris/second. Does anyone know if this is a good value? I can’t find any spec to compare against, only marketing buzz everywhere, damn.

CatDog

Ok, I did some benchmarking.

Data is that STL file with 1.8M triangles.
As AxelN proposed, I am converting the indexed strips to indexed triangle lists - maintaining the cache friendliness.

glMultiDrawElements(GL_TRIANGLE_STRIP) -> 80 fps
glMultiDrawElements(GL_TRIANGLES) -> 86 fps

Cool! But now I’m rendering this data 10 times (= 18M tris), simply by calling glMultiDrawElements 10 times.

glMultiDrawElements(GL_TRIANGLE_STRIP) -> 9.0 fps
glMultiDrawElements(GL_TRIANGLES) -> 8.6 fps

Interesting, huh?

I also tried some “real world” engineering data (= high batch count). Using triangle lists is never significantly slower, and sometimes slightly faster. Unfortunately, the advantage seems to vanish as the data size increases, just as seen above. So in the real world, the improvements are not noticeable.

Now I’m using a somewhat complicated fragment-based point-light shader instead of fixed lighting:

glMultiDrawElements(GL_TRIANGLE_STRIP) -> 7.2 fps
glMultiDrawElements(GL_TRIANGLES) -> 5.1 fps

So unless I did something wrong, I cannot confirm that using triangle lists is always faster (on my machine, of course).

(Testing with glDrawRangeElements will take some time, since it needs rearranging my VBO layout.)

Any suggestions?

CatDog

Well, first, make sure you are vertex bound. All of this discussion is geared toward improving vertex throughput (and, to a small degree, reducing CPU load with multi vs. non-multi draws). It does nothing for fill. So first, shrink your window to 1x1 pixel (or as small as you can get it) to hopefully ensure you are not fill bound. If you were fill bound in any of the tests above, the times don’t prove anything; if you shrink the window and the times improve, you were.

Now, you could still be either vertex or CPU bound here (we’d like to be vertex bound for the test). In the case of your high-batch-count data, you may be CPU bound, which means GPU pipeline bubbles just due to that, and your vertex cache efficiency is only mildly relevant. Also, even if you’re not CPU bound, strips-to-tris alone is unlikely to net you a big win unless you re-optimize your triangle order. Don’t just convert strips to triangles and dump them in the buffer.

Hope this gives you some ideas.

Just a small point: AxelN proposed swapping your glMultiDrawElements call for a glDrawElements call for the GL_TRIANGLES case, which with a 32-bit index buffer is a single draw call (well, I say that - it’s more about vertices than triangles, but meh, you get the idea).

catdog, either use d3dxOptimizeFaces/Vertices or download this:-
http://www.deep-shadows.com/hax/3DRipperDX.htm#Download
Install it, and browse into the installation directory - and nick a file called VCache.h. It contains an implementation of Tom Forsyth’s optimiser, which you can read about here:-
http://home.comcast.net/~tom_forsyth/papers/fast_vert_cache_opt.html
Be warned though, this particular implementation is quite slow at processing the data - but you can refactor it to get the speed up.

Dark Photon, I’m not CPU bound. The third test may be fill bound, since that was an expensive fragment shader, but the first two were definitely vertex bound.

Next, I’m going to try to optimize the meshes using Forsyth’s method. Thanks for the links, knackered, this will speed things up for me!

I’ll come back here soon.

CatDog

Small update.

I tried out the VCache.h implementation. It produces good triangle lists, but they are not really faster than my old strips. It depends heavily on the scene. Sometimes it’s 5% faster, sometimes not. But the problem is that this particular implementation only works with small meshes. I tried to feed it my STL files with 2-5 million tris… and no way! It would run for days! (OK, you warned about it, knackered, but… looking at the code confirms quadratic runtime behaviour! Ugh.)

I’m now thinking about trying d3dxOptimizeFaces(). Since I never used it, could someone please tell me something about its performance?

CatDog

Oh, I was a little too hasty with my judgement. I just loaded a small file (500,000 tris). It took half an hour to optimize, but it renders at 140% of the speed of my tri strips!!

Wow, I’ll take a closer look at it ASAP!

CatDog

You’ll only see the maximum speed-up on scenes where you are vertex bound. For others, you may see no benefit at all from vertex cache optimization if vertex transform overhead is not your bottleneck. That’s the breaks of optimizing a pipelined system.

But the problem is that this particular implementation only works with small meshes. I tried to feed it my STL files with 2-5 million tris… and no way! It would run for days!

Code up the Forsyth algorithm from his write-up. It runs in linear time if you implement his performance tweaks (it sounds like VCache.h doesn’t have those optimizations, which is basically what knackered said), and it shouldn’t take long. Then you can move on to more interesting problems.
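The heart of it is just a per-vertex scoring function; a rough sketch from his write-up (the constants are the ones he suggests; the FIFO size of 32 is an assumption):

```cpp
#include <cmath>

// cachePos: position in a simulated post-transform FIFO (-1 if not cached);
// activeTris: triangles not yet emitted that still use this vertex.
float vertexScore(int cachePos, int activeTris)
{
    if (activeTris == 0)
        return -1.0f;                       // nothing needs this vertex anymore
    float score = 0.0f;
    if (cachePos >= 0) {
        if (cachePos < 3)
            score = 0.75f;                  // in the last triangle: fixed score
        else                                // otherwise decay with cache depth
            score = std::pow(1.0f - (cachePos - 3) / (32.0f - 3.0f), 1.5f);
    }
    // Valence boost: favour vertices with few triangles left, so lone
    // corners get finished off instead of being evicted and re-fetched.
    score += 2.0f * std::pow(float(activeTris), -0.5f);
    return score;
}
```

A triangle’s score is the sum of its three vertex scores. You greedily emit the best-scoring triangle, update the simulated FIFO, and rescore only the vertices and triangles that were touched - that’s what keeps it roughly linear.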

I’m now thinking about trying d3dxOptimizeFaces(). Since I never used it, could someone please tell me something about its performance?

No clue. Thankfully, my day job is 100% OpenGL. Do you really want Direct3D in your tools pipe?

No! (Well, as long as I use OpenGL. In fact, the existence of such a routine in D3D is one more thing on my contra list. But that’s another topic.)

I just wanted a quick way to test how much I can get from this kind of optimization.

You’re so right, there are more interesting problems - I have to find the time somehow in between.

CatDog

d3dxOptimizeFaces uses the Hoppe method and is quite fast at sorting the faces, but not as fast as the Forsyth method (when optimised, not like the VCache.h implementation). I couldn’t measure any difference in render performance between the two algorithms, so if I were you I’d use the Forsyth method (but not the VCache.h version in production code, just use it for inspiration).

Last weekend I spent some time on the topic again.

I’ve got my own implementation of the Forsyth method now. Very good - I’m seeing performance increases of 5-30%!

Since it’s a greedy algorithm, its success depends on the order of the input triangles. So I rewrote my old tristrip generator and passed its output to the Forsyth optimizer. That gave me another 5-25% increase, especially with huge meshes! I’m guessing that using stripified triangle lists as input decreases the chance of cache misses. It depends on the original data, but with unsorted triangle soups, pre-sorting seems to work well (although this needs some further testing).

Then I replaced glMultiDrawElements() with a loop over glDrawRangeElements() - of course while respecting the limits given by GL_MAX_ELEMENTS_INDICES and GL_MAX_ELEMENTS_VERTICES. That resulted in no performance gain. But I noticed a blatant drop when exceeding those limits - not only with glDrawRangeElements(), but also with glDrawElements()! Obviously, these limits should be respected for all kinds of element arrays, no matter which draw command is used.
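The loop is essentially this (the Chunk struct and the pre-splitting are a simplified sketch of what I mean, not my actual code):

```cpp
#include <cassert>
#include <cstddef>
#include <GL/gl.h>   // plus a loader for glDrawRangeElements where needed

// A piece of the mesh small enough to respect the implementation limits.
struct Chunk {
    GLuint      minVert, maxVert;   // vertex range its indices reference
    GLsizei     indexCount;         // number of indices in this piece
    std::size_t byteOffset;         // offset into the bound index buffer
};

void drawChunks(const Chunk* chunks, std::size_t n)
{
    // In real code, query these once at startup and split the mesh there.
    GLint maxIndices = 0, maxVertices = 0;
    glGetIntegerv(GL_MAX_ELEMENTS_INDICES,  &maxIndices);
    glGetIntegerv(GL_MAX_ELEMENTS_VERTICES, &maxVertices);

    for (std::size_t i = 0; i < n; ++i) {
        // Each chunk was built so both limits hold:
        assert(chunks[i].indexCount <= maxIndices);
        assert(chunks[i].maxVert - chunks[i].minVert + 1
               <= static_cast<GLuint>(maxVertices));
        glDrawRangeElements(GL_TRIANGLES,
                            chunks[i].minVert, chunks[i].maxVert,
                            chunks[i].indexCount, GL_UNSIGNED_INT,
                            reinterpret_cast<const GLvoid*>(chunks[i].byteOffset));
    }
}
```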

Finally - and this really made my day - all these changes seem to solve a completely different problem that had been bugging me for two years!

Complete success!! :)

Thanks - especially to Dark Photon and knackered!

CatDog