VBO much slower than display list?

I’m trying to migrate the drawing of a digital elevation map from a single massive display list (containing numerous triangle strips) to a number of smaller VBOs which can be drawn as needed. The goal is to cut down the number of vertices which need to be processed per render, based on visibility.

The first step, obviously, was to get the data into vertex arrays that could be drawn as one massive triangle strip, and to put that in a VBO.

So in my code that previously got compiled into a display list, I replaced all my glEnds with a duplicate of the last vertex, and all my glBegins with a duplicate of the next one, thus linking separate triangle strips with degenerate triangles. Assume that’s all working fine.
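In other words, something like the following sketch, where float4 matches the vertex layout used later in this thread and stitch_strips is just an illustrative helper name:


typedef struct { float x, y, z, w; } float4;

/* Append strip B after strip A, repeating A's last vertex and B's first
 * so the extra triangles are degenerate (zero area). If A has an odd
 * vertex count, one more duplicate is needed to keep B's winding correct.
 * Returns the combined vertex count; out must hold na + nb + 2 entries. */
int stitch_strips(const float4 *a, int na, const float4 *b, int nb, float4 *out)
{
    int n = 0;
    for (int i = 0; i < na; ++i) out[n++] = a[i];  /* copy strip A */
    out[n++] = a[na - 1];                          /* repeat A's last vertex */
    out[n++] = b[0];                               /* repeat B's first vertex */
    for (int i = 0; i < nb; ++i) out[n++] = b[i];  /* copy strip B */
    return n;
}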

Anyway, I uploaded this list of about 250,000 vertices into a VBO and replaced my glCallList with a glDrawArrays plus setup.

And now the code runs about four times slower than it used to.

There has always been a strong link between color and position in this code; in the display list case, each vertex was drawn as:


glColor4d(j, i, zdata[i][j], 1);
glVertex3d(j*20, i*20, zdata[i][j]);

Now I just set the glColorPointer to the same data as the glVertexPointer, and the fact that the color values are now scaled by 20 in x and y is handled later.

Any idea what’s going on?

If you have doubles in your VBO, that might be the problem. The hardware only supports floats, so it would have to convert all the data in the VBO on every draw call, whereas in the display list case it only has to convert once.

[ www.trenki.net | vector_math (3d math library) | software renderer ]

Everything is floats to start with.

I’ve determined that a display list containing the same data as the VBO is just as slow.

This means that a display list containing a single, massive triangle strip is a lot slower than a display list containing many smaller triangle strips. Why would that be?

One quick note: you were compiling your previous case into a display list, which the driver will transform into the optimal form for static rendering.

You might want to give a little more info on how you are setting up your VBO (what is the data format, how is it interleaved, is it static, are you mapping your VBO and writing to it, or copying to it, etc). This info will be needed to try and figure out the factor of 4 speed difference you are seeing.


glBindBuffer(GL_ARRAY_BUFFER, vbo);
glBufferData(GL_ARRAY_BUFFER, nverts*sizeof(float4), pointlist, GL_STATIC_DRAW);
glBindBuffer(GL_ARRAY_BUFFER, 0);

At initialization time, and


glEnableClientState(GL_VERTEX_ARRAY);
glEnableClientState(GL_COLOR_ARRAY);
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glVertexPointer(3,GL_FLOAT,sizeof(float4),BUFFER_OFFSET(0));
glColorPointer(4,GL_FLOAT,sizeof(float4),BUFFER_OFFSET(0));
glDrawArrays(GL_TRIANGLE_STRIP, 0, nverts);
glBindBuffer(GL_ARRAY_BUFFER, 0);
glDisableClientState(GL_VERTEX_ARRAY);
glDisableClientState(GL_COLOR_ARRAY);

at each iteration.

I’ve now tried compiling a VBO call into a display list, compiling a VA call into a display list, and calling the pointlist verts directly in a display list… as far as I can tell, the slowdown is occurring because I’m rendering one large triangle strip rather than many small ones.

Is it possible that display lists do some sort of visibility test on the individual primitives that compose them? Each render can only see a very small area of the whole geometry, so some kind of automatic culling is a possibility.

How bad would it be to compile display lists on the fly, rather than at initialization time? My end goal here is to be able to handle terrain which is too big to fit in graphics memory all at once. Originally I had thought to use a quadtree which would store VBOs of some number of verts at the leaves, but perhaps storing display lists there would be faster? (That’s the reason I haven’t used indices for duplicated vertices… it’s much harder to subdivide such a triangle strip.)
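A minimal sketch of that quadtree idea, with VBOs at the leaves; patch_visible stands in for whatever frustum test gets used, and all names here are illustrative rather than from any actual code in this thread:


/* Assumes <GL/gl.h>, the float4 type, and the BUFFER_OFFSET macro
 * already used in this thread. */
typedef struct QuadNode {
    float min_x, min_y, max_x, max_y;  /* patch bounds, used for culling */
    struct QuadNode *child[4];         /* all NULL at a leaf */
    GLuint vbo;                        /* leaf only: stitched strip for this patch */
    GLsizei nverts;
} QuadNode;

int patch_visible(const QuadNode *n);  /* your frustum test; not shown */

void draw_visible(const QuadNode *n)
{
    if (!patch_visible(n))
        return;                        /* cull the whole subtree */
    if (n->child[0]) {
        for (int i = 0; i < 4; ++i)    /* interior node: recurse */
            draw_visible(n->child[i]);
    } else {                           /* leaf: draw its strip */
        glBindBuffer(GL_ARRAY_BUFFER, n->vbo);
        glVertexPointer(3, GL_FLOAT, sizeof(float4), BUFFER_OFFSET(0));
        glColorPointer(4, GL_FLOAT, sizeof(float4), BUFFER_OFFSET(0));
        glDrawArrays(GL_TRIANGLE_STRIP, 0, n->nverts);
    }
}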

Display lists might do culling. NVIDIA’s drivers are known to heavily optimize display lists. On ATI hardware you usually don’t benefit much from them.

VBOs are definitely the way to go. I have written a city renderer with over 5.5 million vertices (over 2 million triangles) that was able to run at very high framerates, even when EVERYTHING was visible, using VBOs.

Of course, to get more performance you need to use some sort of culling. Quad-trees are a very good solution for terrains.

To get the maximum throughput out of VBOs you need to partition your data into chunks of at most 2^16 vertices and then use 16-bit indices. The indices themselves need to be stored in a VBO too, so that you can render a whole batch with one draw call without having to transfer (almost) anything over the bus.

For optimal speed you should also not draw the WHOLE buffer with one draw call. Instead, partition it so that you draw more than 300 triangles per draw call but fewer than, say, 20,000 triangles per call (more won’t be possible with only up to 2^16 vertices anyway). That allows even hardware that seems to ignore the range parameter in glDrawRangeElements (hint: ATI) to swap out buffers that are not currently in use, and thus stream buffers from RAM much more efficiently.
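As a rough sketch, assuming each chunk keeps its 16-bit indices in its own element-array VBO (the Chunk struct and its field names are illustrative):


typedef struct {
    GLuint vbo, ibo;    /* vertex buffer and 16-bit index buffer */
    GLsizei nverts;     /* at most 2^16 */
    GLsizei nindices;
} Chunk;

/* Once, at load time: upload the chunk's indices into their own VBO. */
void upload_chunk_indices(const Chunk *c, const GLushort *indices)
{
    glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, c->ibo);
    glBufferData(GL_ELEMENT_ARRAY_BUFFER, c->nindices * sizeof(GLushort),
                 indices, GL_STATIC_DRAW);
}

/* Per frame, for each visible chunk: one indexed draw call. */
void draw_chunk(const Chunk *c)
{
    glBindBuffer(GL_ARRAY_BUFFER, c->vbo);
    glVertexPointer(3, GL_FLOAT, sizeof(float4), BUFFER_OFFSET(0));
    glColorPointer(4, GL_FLOAT, sizeof(float4), BUFFER_OFFSET(0));
    glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, c->ibo);
    glDrawRangeElements(GL_TRIANGLE_STRIP,
                        0, c->nverts - 1,   /* range hint for the driver */
                        c->nindices, GL_UNSIGNED_SHORT, BUFFER_OFFSET(0));
}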

Implementing such a renderer is still quite a lot of work, and you need to take care of many details. However, it will pay off, since it will work very well on all hardware, and it is THE way to do it, so you can reuse it in future projects. Also, in OpenGL 3.0 there will be no more display lists as you know them, and what remains will focus on small pieces of geometry.

Hope that helps,
Jan.

PS: Try using unsigned bytes for colors instead of floats. If your slowdown is bandwidth-related, it will improve things.
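For instance, with an interleaved layout along these lines (TerrainVertex is an illustrative name; offsetof comes from <stddef.h>):


typedef struct {
    GLfloat x, y, z;     /* position: 12 bytes */
    GLubyte r, g, b, a;  /* color: 4 bytes instead of 16 as floats */
} TerrainVertex;

glVertexPointer(3, GL_FLOAT, sizeof(TerrainVertex), BUFFER_OFFSET(0));
glColorPointer(4, GL_UNSIGNED_BYTE, sizeof(TerrainVertex),  /* bytes are normalized to 0..1 */
               BUFFER_OFFSET(offsetof(TerrainVertex, r)));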

Is it possible that display lists do some sort of visibility test on the individual primitives that compose them?

Yes.
I once had 500k polygon truck model stored in single display list. My observations:

  1. Rendered far away - small on screen, average framerate
  2. Zooming in - entire truck is still on the screen, but much larger - worse framerate because more pixels need to be drawn
  3. Zooming in more - half of the truck is outside screen - framerate comparable to #1
  4. Zooming in even more - we look at only a fragment of truck and framerate is much better than in #1.

I wasn’t using strips, only triangles, but I think the driver split my object into smaller parts and performed frustum culling on them.
Of course, the performance gain when zoomed in could come from other factors. For example, if you process the vertices and it turns out an entire polygon is off-screen, then you don’t pay for triangle setup. When zoomed out a triangle can cover 0 pixels, but you still pay the setup cost for rasterization.
However, the performance difference between case #1 and case #4 was just too big to be only that. Or maybe triangle setup really does take noticeably more time than fetching and processing a vertex? It’s possible, since it’s not trivial either.

I can confirm that NVIDIA’s display list compiler is incredibly good. I’ve had a hard time getting comparable performance with VBOs. The engineering data I read in is already tri-stripped, but not very well, so we have lots of small strips (4 or fewer triangles). Even if I stitch them together with degenerates and make VBOs, I still can’t get near the display list performance. The only way I got similar performance was to convert them into triangles, run them through a tristripping library, and throw the result into a VBO. All that effort to achieve the same performance as NVIDIA display lists.
And yes, they do of course do frustum culling.
It’s why I believe in keeping a form of display lists in GL3. The IHV is in the best position to decide the optimal buffer layout the geometry should use. They can drop all the other state stuff from display lists and just keep the “optimise this static poly soup for your hardware” part.

Just out of curiosity, what kind of hardware is that on, Quadro by chance? Do those boards actually fold in some culling for you, in addition to everything else? Never had the privilege…

P.S. Never mind, I see you just answered my question. Back to shuffling my melba toast…

I’m working on a Quadro 4500, at least. (Don’t know about anyone else.)

I’m curious about the reference to a tristripping library. I did briefly look at NvTriStrip, but quickly decided it was too much trouble to figure out. (This is a frequent problem I have with using 3rd-party libraries that I’m not specifically designing to.) Is a thing of that sort actually useful?

Can’t speak for anyone else, and of course YMMV, but I throw caution to the winds and render nothing but triangles, and get satisfactory results.

I used to spend a good chunk of time reordering vertices (stripping and fanning and such), but not anymore. You might gain something considerable from it, but for me the generality of triangles is irresistible in my particular scheme of things…

Of course this doesn’t mean you can’t reorder verts; it just means you might not want to be unduly worried about it if your measurements aren’t overly compelling.

I use Quadros pretty much all the time. I think it’s down to the drivers, though, not the hardware.
Lindley, I use gpsnoopy’s tristripper.
http://users.telenet.be/tfautre/softdev/

My particular application doesn’t draw to the screen, so performance is a real concern beyond the limits imposed by vsync.

I only mention this because the “100+ FPS should be good enough for anyone” line of thought usually comes up sooner or later in speed-related threads.

I will first echo what folks are saying about NVIDIA optimizing their display lists. My display lists render faster than the exact same data set rendered as a vertex buffer object on NVIDIA, but much slower on ATI. If you are going to be using DrawArrays on an NVIDIA card you won’t see much difference; in fact, in some cases I actually saw a decrease in performance. But the benefit of VBOs is that you don’t have to draw the entire set of triangle strips, only what is visible to the current view frustum.

A couple of things to verify: if you do set up frustum-based calls to DrawElements, make sure your index array is stored in a VBO as well for optimum speed. And if you will never be rendering a limited set and will always be rendering static arrays, you still have the option of retaining your display lists and drawing the primitives inside your lists with DrawElements or DrawArrays. This gets you around using glBegin/glEnd, and it seemed to make certain parts of my code run faster on ATI cards.
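A minimal sketch of that last suggestion, compiling a DrawElements call into a display list (pointlist, indices, and nindices are illustrative names; client-side array state executes immediately rather than being compiled, and the draw dereferences the arrays at compile time, so the vertex data gets baked into the list):


GLuint terrain_list = glGenLists(1);

glEnableClientState(GL_VERTEX_ARRAY);
glVertexPointer(3, GL_FLOAT, sizeof(float4), pointlist);  /* client-side array */
glNewList(terrain_list, GL_COMPILE);
glDrawElements(GL_TRIANGLE_STRIP, nindices, GL_UNSIGNED_SHORT, indices);
glEndList();
glDisableClientState(GL_VERTEX_ARRAY);

/* Per frame: no glBegin/glEnd anywhere. */
glCallList(terrain_list);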

You have this code:
glEnableClientState(GL_VERTEX_ARRAY);
glEnableClientState(GL_COLOR_ARRAY);
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glVertexPointer(3,GL_FLOAT,sizeof(float4),BUFFER_OFFSET(0));
glColorPointer(4,GL_FLOAT,sizeof(float4),BUFFER_OFFSET(0));
glDrawArrays(GL_TRIANGLE_STRIP, 0, nverts);
glBindBuffer(GL_ARRAY_BUFFER, 0);
glDisableClientState(GL_VERTEX_ARRAY);
glDisableClientState(GL_COLOR_ARRAY);

But isn’t this slow?
Why enable and disable the arrays at each iteration? Why not do it once instead of every frame?
Also, you call glVertexPointer every frame; shouldn’t that only be called once, when you are building your VBOs? OpenGL will remember your values when you bind the buffer, like it does with textures.
And why call “glBindBuffer(GL_ARRAY_BUFFER, 0);”?

That might go some way toward explaining the speed advantage of the display list. I’d guess that NVIDIA is doing some form of reordering of the triangles and merging of identical vertices. That way it makes the best use of the pre-transform attribute cache and the post-transform vertex cache.

There is a paper called Linear-Speed Vertex Cache Optimisation by Tom Forsyth which gives a great outline on this stuff.

I’ve heard people argue that VBOs are faster than DLs, but logically lists are going to be as fast or even the fastest method, i.e. what’s to stop the driver from doing:

glNewList(…);
(in driver) create a VBO of the data :)
glEndList();

I believe the IHVs are against DLs since they add complications to the drivers; less complicated drivers benefit all.

Geometry-only display lists should add negligible complexity to a GL3 implementation. At the very least, an implementation could just do them under the hood in the way zed describes. But having the mechanism in the API gives the IHV a great opportunity to outperform their rivals: they get the whole problem from which to find a solution. The argument’s analogous to the benefits GLSL/HLSL have over asm.
Granted, you probably want some assurance that it will pick the optimum layout, but you’ve no guarantee at the moment that a particular VBO/IBO combination is going to be optimal for all hardware. You go off experience and benchmarking.

Really, in GL 3.0 terms, a geometry-only display list is merely an alternative mechanism for building a Vertex Array Object (or, rather, an object that can be bound to the context in the same location as a VAO). The interface for them should look like:

“Given a VAO and the function call I was going to use to render with it, build me a display list that will do the same thing.”

Now, it might need a specialized entry point for executing the display list, since neither of the two current glDraw* methods is suitable. But I think that’ll work itself out.
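Purely as an illustration of that interface, a hypothetical sketch; neither entry point below exists in any version of GL:


/* Hypothetical: "given this VAO and the draw call I was going to make,
 * build a display list that does the same thing." */
GLuint list = glBuildGeometryList(vao, GL_TRIANGLE_STRIP, 0, nverts);  /* not a real function */

/* Hypothetical specialized entry point for executing it. */
glCallGeometryList(list);  /* not a real function */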

BTW, I think one of the advantages of GL 2.1 display lists will go away with 3.0 as a basic necessity of the API. Because it is shader-only, with no fixed function, culling can’t work anymore. After all, culling is based on the expectation that the current matrix stacks will be used for rendering, and those matrix stacks will be gone in 3.0.

You’ll still get a theoretically-more-optimal storage format.

you’ve no guarantee at the moment that a particular VBO/IBO combination is going to be optimal for all hardware.

But I’m fairly certain we will have that guarantee in 3.0. Just like with Format objects, VAOs can fail to be created. Since VAOs store the base offset, stride, and data type (float, unsigned byte, short, normalized vs. unnormalized, etc.), if a certain combination doesn’t work, you’ll know in time to switch your data format where necessary.

I suppose I could leave them enabled with no harm resulting. However, I have to call glVertexPointer and glColorPointer because there’s another portion of the render loop which uses a different VBO, with different offsets.

And opengl will remember your values when you bind the buffer, like it does with textures.

I don’t think so. If that were the case, drawing colors from a different VBO than the vertices would be impossible, since you’d have to bind a different buffer for each; and that was recently said on this board to be quite possible. I’m pretty sure the vertex and color pointers are client state, while the bound buffer is server state.
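In other words, each gl*Pointer call captures whichever buffer is bound at the moment it is made, so something like this sketch works (buffer names are illustrative):


glBindBuffer(GL_ARRAY_BUFFER, position_vbo);
glVertexPointer(3, GL_FLOAT, 0, BUFFER_OFFSET(0));  /* positions come from VBO #1 */

glBindBuffer(GL_ARRAY_BUFFER, color_vbo);
glColorPointer(4, GL_FLOAT, 0, BUFFER_OFFSET(0));   /* colors come from VBO #2 */

glBindBuffer(GL_ARRAY_BUFFER, 0);                   /* binding not needed for the draw itself */
glDrawArrays(GL_TRIANGLE_STRIP, 0, nverts);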

And why call “glBindBuffer(GL_ARRAY_BUFFER, 0);”

Because certain functions are said to behave differently when a buffer is bound, and I don’t like surprises. Certainly glReadPixels does something different when the PIXEL_PACK_BUFFER is bound; since I don’t know precisely what effect leaving the ARRAY_BUFFER bound would have, I simply don’t leave it bound.