efficiency of drawing with and without VBO

dovkruger · October 20, 2006, 5:06am

When drawing vertex by vertex, it is more efficient to use a triangle strip where possible vs. a series of triangles because there are fewer calls to the API, i.e.:

glBegin(GL_TRIANGLE_STRIP);
glVertex3f(x1,y1,z1);
glVertex3f(x2,y2,z2);
glVertex3f(x3,y3,z3);
glVertex3f(x4,y4,z4);
…
glEnd();

vs.

glBegin(GL_TRIANGLES);
glVertex3f(x1,y1,z1);
glVertex3f(x2,y2,z2);
glVertex3f(x3,y3,z3);
glVertex3f(x2,y2,z2);
glVertex3f(x3,y3,z3);
glVertex3f(x4,y4,z4);
…
glEnd();

When the array of points is preloaded in a VBO, is this relevant? Is there any difference between the two forms?

When drawing a grid, is it better to draw a list of triangles, or a triangle strip where, at the end of each row, a bogus triangle must be drawn with alpha=0 to avoid connecting the end of the first row with the beginning of the second?

Nychold · October 20, 2006, 6:02am

My guess would be yes, although I haven’t tested it. Here’s why:

Internally to the GPU, there are a lot of register operations, including loading. When it goes to render a triangle (under GL_TRIANGLES), it would most likely simply load all three vertices, and begin drawing. That’s nine register loads per triangle (x,y,z), or 9t loads.

If you’re rendering with GL_TRIANGLE_STRIP, the program only gets one different vertex per triangle, so it might be able to keep the other two anchoring vertices in the appropriate registers, without having to reload data that won’t change. This gives you 6 + 3t loads, which is a significant savings for t≥4 (at t=4, GL_TRIANGLES yields 36 register loads and GL_TRIANGLE_STRIP yields only 18).
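
The load counts in this post can be checked with a few lines of C. This is only a sketch of the 9t and 6 + 3t formulas under the simple register model guessed at above; the function names are made up for illustration, and real GPUs do not necessarily fetch vertices this way.

```c
#include <assert.h>

/* Hypothetical helpers: scalar (x, y, z) loads needed to draw t triangles,
   under the simple register model described in the post above. */

/* GL_TRIANGLES: every triangle loads three fresh vertices, 3 * 3 = 9 loads. */
unsigned long loads_for_triangle_list(unsigned long t)
{
    return 9UL * t;
}

/* GL_TRIANGLE_STRIP: two anchoring vertices up front (6 loads), then one
   new vertex (3 loads) per triangle. */
unsigned long loads_for_triangle_strip(unsigned long t)
{
    assert(t > 0); /* the formula only makes sense for at least one triangle */
    return 6UL + 3UL * t;
}
```

For t = 4 these return 36 and 18, the figures quoted above. For a single triangle both return 9, and the ratio approaches 3:1 as t grows.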

AlexN · October 20, 2006, 8:23am

When drawing indexed primitives (glDrawElements or glDrawRangeElementsEXT) I don’t think there is much difference between triangle lists or triangle strips. Drawing in immediate mode (your example) would be faster using triangle strips. If you use glDrawArrays to draw (unindexed primitives), then triangle strips would probably be faster because it can reuse vertex transformation results despite not having indices.

As for drawing a grid, you could try setting alpha to zero on the connecting triangle and using alpha test with alpha > 0, but I would recommend using degenerate triangles to stitch the rows together, i.e. triangle A,B,C and triangle D,E,F, which are unconnected, could both be drawn with a single triangle strip that contains this set of indices: A,B,C,C,D,D,E,F (four zero-area triangles in the middle).
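
For the grid case, the degenerate-triangle stitching described above could look roughly like the following C sketch. It assumes the grid vertices are stored row-major (index = y * w + x) and that 16-bit indices are enough; the function names are hypothetical. Because each row contributes an even number of indices and each stitch adds two, the winding of the real triangles stays consistent from row to row.

```c
#include <assert.h>
#include <stddef.h>

/* Number of indices for one stitched strip over a w x h vertex grid:
   2*w per row of quads, plus 2 stitching indices between consecutive rows. */
size_t grid_strip_index_count(size_t w, size_t h)
{
    assert(w >= 2 && h >= 2);
    return 2 * w * (h - 1) + 2 * (h - 2);
}

/* Fills out[] (capacity >= grid_strip_index_count(w, h)) with strip indices.
   Vertices are assumed row-major: index = y * w + x. Returns indices written. */
size_t build_grid_strip(size_t w, size_t h, unsigned short *out)
{
    size_t n = 0;
    assert(w >= 2 && h >= 2 && w * h <= 65536);
    for (size_t y = 0; y + 1 < h; ++y) {
        if (y > 0) {
            /* Stitch: repeat the last index of the previous row and the
               first index of this row, giving four zero-area triangles. */
            out[n] = out[n - 1];
            ++n;
            out[n++] = (unsigned short)(y * w);
        }
        for (size_t x = 0; x < w; ++x) {
            out[n++] = (unsigned short)(y * w + x);       /* upper vertex */
            out[n++] = (unsigned short)((y + 1) * w + x); /* lower vertex */
        }
    }
    return n;
}

/* Counts strip triangles that have a repeated index (zero area). */
size_t count_degenerate(const unsigned short *idx, size_t n)
{
    size_t d = 0;
    for (size_t i = 0; i + 2 < n; ++i)
        if (idx[i] == idx[i + 1] || idx[i + 1] == idx[i + 2] || idx[i] == idx[i + 2])
            ++d;
    return d;
}
```

For a 3 x 3 vertex grid this produces the 14 indices 0,3,1,4,2,5,5,3,3,6,4,7,5,8, which decode to 8 real triangles and 4 degenerate ones.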

RigidBody · October 20, 2006, 9:41am

the main advantage of a triangle or quad strip is that fewer vertex transformations, which are matrix multiplications, are needed.

knackered · October 20, 2006, 12:37pm

i thought it didn’t matter if you use strips or lists, so long as the indices were cache coherent. For example, if the second triangle in a list uses 2 of the vertices from the first, it’s effectively as fast as a tristrip because those transformed vertices will still be in the cache.
Anyway, what’s this got to do with VBO?

[edit] oh, he’s not using indices - big mistake man.
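
The cache-coherency argument in this post can be tried out on paper with a tiny simulation. The sketch below assumes a FIFO post-transform cache keyed by vertex index; actual hardware may use a different size or replacement policy, and the function name and the 64-slot limit are invented for this example.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CACHE_SLOTS 64 /* arbitrary upper bound for this sketch */

/* Simulates a FIFO post-transform vertex cache and returns how many
   indices miss it, i.e. how many vertices would need transforming. */
size_t count_cache_misses(const unsigned *indices, size_t n, size_t cache_size)
{
    unsigned slots[MAX_CACHE_SLOTS];
    size_t filled = 0, next = 0, misses = 0;

    assert(cache_size >= 1 && cache_size <= MAX_CACHE_SLOTS);
    for (size_t i = 0; i < n; ++i) {
        int hit = 0;
        for (size_t s = 0; s < filled; ++s) {
            if (slots[s] == indices[i]) { hit = 1; break; }
        }
        if (hit)
            continue; /* FIFO: a hit does not refresh the entry */
        ++misses;
        slots[next] = indices[i];       /* overwrite the oldest slot */
        next = (next + 1) % cache_size;
        if (filled < cache_size)
            ++filled;
    }
    return misses;
}
```

With a 16-entry cache, the indexed list 0,1,2, 2,1,3 (two triangles sharing an edge) misses 4 times, the same number of transforms a four-vertex strip needs, while six unshared vertices miss 6 times.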

RigidBody · October 20, 2006, 1:02pm

hmm…i know you love them display lists, knackered, but are you sure that a display list stores transformed vertices?

dorbie · October 20, 2006, 9:36pm

Look at the context, this is not about display lists. He’s talking about vertex cache on the GPU as indexed vertices are transformed in immediate mode in both cases. He’s probably right to a degree (although I wouldn’t bet on it without testing). Of course 3X the indices have to be processed and you’ll take a hit there depending on other factors. Framebuffer and texture cache coherency also come into play so YMMV even if your test works.

knackered · October 21, 2006, 2:08pm

Originally posted by RigidBody:
hmm…i know you love them display lists, knackered, but are you sure that a display list stores transformed vertices?
Who mentioned display lists? I surely didn’t.

And of course a display list doesn’t store transformed vertices, or else how the hell would you move the viewpoint, for a start?

RigidBody · October 22, 2006, 10:20am

Originally posted by knackered:
Who mentioned display lists? I surely didn’t.
hmmm…my fault. you said something about lists, but now i guess you meant lists of triangles, not display lists.

Originally posted by knackered:
And of course a display list doesn’t store transformed vertices, or else how the hell would you move the viewpoint, for a start?
that’s exactly the question which came into my mind before i asked:

Originally posted by RigidBody:
are you sure that a display list stores transformed vertices?
never mind- i don’t know enough about drivers and hardware to know what exactly happens between glNewList and glEndList.

zeoverlord · October 22, 2006, 1:00pm

Originally posted by knackered:
And of course a display list doesn’t store transformed vertices, or else how the hell would you move the viewpoint, for a start?
Actually they do, or rather can if the display list is compiled and nothing changes; those transformed vertices are still transformed by the view matrix at render time.
Though on modern hardware this is a moot point, since the default is to transform them on the GPU, and it gives no real speed increase since everything is still fill-rate limited.

Humus · October 22, 2006, 1:09pm

Originally posted by dorbie:
Of course 3X the indices have to be processed and you’ll take a hit there depending on other factors.
Of course there’s more bandwidth, but on the other hand, the indices are much smaller than the vertices anyway in the vast majority of the cases, so this normally has very little effect on performance if any.
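
To put rough numbers on the index-versus-vertex bandwidth point, here is a small C sketch for a regular grid. The index-count formulas assume a w x h vertex grid, with the strip rows stitched by two repeated indices per join; the function names, and the element sizes used in the note below, are assumptions for illustration only.

```c
#include <assert.h>
#include <stddef.h>

/* Indices for a w x h vertex grid drawn as an indexed triangle list. */
size_t list_index_count(size_t w, size_t h)
{
    assert(w >= 2 && h >= 2);
    return 6 * (w - 1) * (h - 1);
}

/* Same grid as one strip, rows stitched with two repeated indices each. */
size_t strip_index_count(size_t w, size_t h)
{
    assert(w >= 2 && h >= 2);
    return 2 * w * (h - 1) + 2 * (h - 2);
}

/* Total bytes for count elements of elem_size bytes each. */
size_t buffer_bytes(size_t count, size_t elem_size)
{
    return count * elem_size;
}
```

For a 100 x 100 grid with 16-bit indices, the list needs 58806 indices (117612 bytes) and the strip 19996 indices (39992 bytes). Assuming 32-byte vertices (say position, normal and texture coordinates as floats), the vertex data is 320000 bytes, so the list's roughly 3X index count is real but still smaller than the vertex data itself.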

Brolingstanz · October 22, 2006, 6:01pm

Of course there’s more bandwidth, but on the other hand, the indices are much smaller than the vertices anyway in the vast majority of the cases, so this normally has very little effect on performance if any.
I think people want to pretend it makes no difference in an effort to simplify their lives

This clearly depends on the target hardware, as indicated by the lengths Nvidia has gone to with their tristrip library, coupled with numerous recommendations in several performance white papers.

In the worst case scenario you’ll gain nothing with strips, irrespective of hardware, so you’ve really got nothing to lose, and possibly much to gain.

If you’re targeting only the very latest hardware, then it may not be such a big deal, but there’s only one surefire way to find out.

dovkruger · October 23, 2006, 9:13am

I did not detect any hint of a consensus here.
I have a bunch of x,y,z points in a grid, where only the z and color change. I wish this to be as fast as possible on a variety of architectures. It still should work on implementations without VBO, so I would appreciate seeing opinions on:

What is the most efficient way to draw the grid if it has to be sent from the CPU each time? Perhaps this is some indexed scheme to reduce the bandwidth?
What is the most efficient way to draw the grid using VBO? One interleaved array of points? Indexed? Not? Since I’m new to this, I’d appreciate the sequence of the calls you recommend.
If I’m going to do this using Shaders, does the way I store the data change in any way?

knackered · October 23, 2006, 10:22am

if you’re new to this (and basically you’re asking what’s the best way to draw a grid, so I’d agree you’re new to this), then you’re asking in the wrong forum. Go to the beginners forum, not two clicks away.
(you may also try being a little grateful for any contributions people make, regardless of whether you think they’re of help or not)

Zengar · October 23, 2006, 1:38pm

Strips are faster because fewer vertices have to be processed via the vertex shader (1 per triangle vs. 3 per triangle).
Indexed strips are even faster because transformed vertices get cached. On NVIDIA hardware, there is a 16- to 24-element cache (no idea what the correct number is for current hardware). This is what the NVIDIA tristrip library does: it optimizes the mesh for optimal cache hits.

However, as applications are usually more fill-limited than vertex-limited…

(I know you already said most of this, but I wanted to make it clear to the thread starter ^^)

Humus · October 23, 2006, 3:05pm

Originally posted by Leghorn:
In the worst case scenario you’ll gain nothing with strips, irrespective of hardware, so you’ve really got nothing to lose, and possibly much to gain.
Not true. Triangles are more flexible than strips since each triangle is independent. If the set of triangles you’re rendering isn’t easily connected into strips, using triangles may very well result in a smaller index buffer and/or better caching behavior. Plus, triangles being independent means you don’t have to fuss around with degenerate triangles to stitch strips together, which means you’re processing fewer triangles, which could matter if you’re setup limited.

With all that considered, in the vast majority of cases the triangles vs. strips question is a moot point since you’re unlikely to be limited by that anyway. Instead of a strict vertex cache optimization it’s usually more useful to do a balanced vertex cache and HiZ optimization, such as done with the Tootle library.

jide · October 24, 2006, 10:50am

And I’d like to add that strips only work under one condition: every vertex shared by neighboring triangles in the strip must use the same normal (and, of course, the same texture coordinates) for all of them.

Brolingstanz · October 27, 2006, 9:49pm

Triangles are more flexible than strips since each triangle is independent.
That’s not right; it’s not even wrong

I can see the simplicity argument, perhaps, but not necessarily flexibility.

It’s conceivable that the setup costs could potentially outweigh any bandwidth/cache savings in the pathological case, say for triangle soup, where the number of triangles could roughly double due to degenerate insertion. Even if the hardware is able to quickly detect and reject degenerates, there’s a small setup cost, so this is a valid point and certainly worth testing.

But of course strips make a lot more sense for terrain tiles and other strongly connected meshes, not for triangle soup. For large collections of weakly connected triangles, I’d be inclined to go with lists, rather than stubbornly render all world geometry with a single primitive type (as tempting as it may be).

Anyway, the sorely beaten point I was trying to convey to the OP and apparently bungled is that the triangle strip was created with efficiency in mind! Strips offer nothing new: a strip is just a list of triangles in a more compact form. With that in mind, it seems reasonable to me to use them where they make sense, where judicious analysis and testing reveal a benefit. Now, when IHVs unanimously proclaim “The strip is dead! Don’t use strips!”, then I will happily give them up; but I sincerely doubt such a proclamation would apply equally to all present, past and future hardware across the board.

P.S. Tootle? Wasn’t that a candy of some kind?
Tootle Sweet, Tootle Sweet, the wonderful candy that’s fun to eat?

zeoverlord · October 28, 2006, 9:47am

Well, the strip is sort of dead on today’s hardware, and even more so on next-generation hardware, where polygons have to pass through a geometry shader.
To put it simply, the hardware finds it easier to do all the optimizations itself rather than relying on the user. Sure, tri-strips give potentially higher performance than regular triangles, but then why isn’t the GL_SUN_mesh_array extension (with a few modifications) put into the core, since it would definitely increase performance on strongly connected uniform meshes? The answer is of course “because the increase is still too insignificant for IHVs to even bother”.

Though with the addition of the PBO extension I think this should be reconsidered.

Dark_Photon · October 31, 2006, 4:20am

dovkruger, ultimately you need to do your own performance testing to prove to yourself which method is better. But if you don’t, just use well-optimized indexed TRIANGLES, with VBOs when possible. Talking about performance when using immediate mode (glBegin…glEnd) is probably pointless.

Look at the diagram on page 21 here:

OpenGL ES Hardware (Bitboys)

Then google on post-T&L vertex cache.

With tri-optimization, you’re addressing vertex transform throughput and vertex bandwidth, in that order. If you’re fill bound now, this is all academic and you won’t see any difference.

Re NvTriStrip, great training wheels library, but “very” slow, and does not handle degenerate geometry (e.g. edge with > 2 face neighbors, sometimes used in lower-LOD models).

Now read this:

Fast Vertex Cache Optimization (Tom Forsyth)

Just spend a few hours to read it, understand it, and implement it. You’ll be glad you did. It’s much faster and doesn’t fall flat on degenerate geometry (in fact, it doesn’t care about mesh topology).

Now, having optimized for vertex transform throughput with optimized indexed TRIANGLE primitives, what about using indexed TRIANGLE_STRIPs to perhaps save some index bandwidth? You don’t want to bust up your draw calls (one glDraw*Elements call per strip) for best performance. So that leaves NV_primitive_restart or degenerate triangles. The problems with the former are:

Only works on NVidia hardware
All vertex attributes must be in VBOs for this to be faster
Must use accelerated vertex formats, strides, etc. for this to be faster

That leaves using degenerate triangles to connect strips. That may net you a little bit of bandwidth shuffling indices around, depending on the tri-optimizer output, but that’s it. Check it and see.

So just use well-optimized indexed TRIANGLES, with VBOs when possible, and benchmark everything else (including your immediate mode version and indexed TRIANGLE_STRIPs with degenerate tris) against that. If you find any method that’s faster, let us know!
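
As a starting point for the indexed TRIANGLES baseline recommended above, the index list for a regular grid could be built once like this. It is a sketch, not a tuned implementation: it assumes row-major vertices (index = y * w + x) and 16-bit indices, the function name is made up, and the simple row-by-row order has not been run through any vertex cache optimizer. Since only z and color change in the original question, the index data itself can stay static.

```c
#include <assert.h>
#include <stddef.h>

/* Fills out[] with 6 * (w-1) * (h-1) indices describing a w x h vertex
   grid as independent triangles. Vertices are assumed row-major
   (index = y * w + x). Returns the number of indices written. */
size_t build_grid_triangle_list(size_t w, size_t h, unsigned short *out)
{
    size_t n = 0;
    assert(w >= 2 && h >= 2 && w * h <= 65536);
    for (size_t y = 0; y + 1 < h; ++y) {
        for (size_t x = 0; x + 1 < w; ++x) {
            unsigned short v00 = (unsigned short)(y * w + x);
            unsigned short v10 = (unsigned short)(v00 + 1);
            unsigned short v01 = (unsigned short)(v00 + w);
            unsigned short v11 = (unsigned short)(v01 + 1);
            /* Two triangles per quad, both with the same winding. */
            out[n++] = v00; out[n++] = v01; out[n++] = v10;
            out[n++] = v10; out[n++] = v01; out[n++] = v11;
        }
    }
    return n;
}

/* With a GL context and the OpenGL 1.5 buffer object entry points available,
   the indices would typically be uploaded once and drawn each frame, roughly:
     glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, ibo);
     glBufferData(GL_ELEMENT_ARRAY_BUFFER, n * sizeof(unsigned short), out,
                  GL_STATIC_DRAW);
     glDrawElements(GL_TRIANGLES, (GLsizei)n, GL_UNSIGNED_SHORT, 0);
   (ibo is a hypothetical buffer name obtained from glGenBuffers.) */
```

On implementations without VBO support, the same index array can be passed to glDrawElements as a client-side pointer, so the grid-building code does not need to change between the two paths.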