VBO vs. Vertex Arrays on a Quadro FX 1500

  
for (int i=0;i<LOOP_ITERATIONS;i++)
    glDrawArrays( GL_TRIANGLES, 0, g_pMesh->m_nVertexCount );

vs.

for (int i=0;i<LOOP_ITERATIONS;i++)
    glCallList(g_DisplayList);

FWIW, the display list was created by recording the glDrawArrays call with the VBO data.
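
For reference, a minimal sketch of how such a display list might be recorded; the buffer and list names below are placeholders, not the actual NeHe Lesson #45 code:

// Hypothetical sketch: record the same glDrawArrays call into a display list
// while the VBO is bound, so the driver captures the vertex data at compile time.
glBindBuffer( GL_ARRAY_BUFFER, g_VBO );                 // placeholder VBO handle
glEnableClientState( GL_VERTEX_ARRAY );
glVertexPointer( 3, GL_FLOAT, 0, (const GLvoid*)0 );    // positions at offset 0

g_DisplayList = glGenLists( 1 );
glNewList( g_DisplayList, GL_COMPILE );
glDrawArrays( GL_TRIANGLES, 0, g_pMesh->m_nVertexCount );
glEndList();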

Unless you can identify the incorrect usage of VBOs in NeHe Lesson #45, there actually is a measurable difference between VBOs and DLs (at least on a Quadro FX graphics card). I would really like to know if anyone can make this data render faster using a VBO instead of a DL.

Ah, sorry, I meant that on common GeForce hardware (not Quadro) the difference between VBO and DL is unnoticeable. I can’t say anything about Quadro, sorry again.

Further investigation shows that VBOs and DLs display at about the same speed if the object has ~128K triangles in a single draw call. Anything less and the DL is faster. Anything more and the VBO is faster. This would indicate an internal blocking/batching factor difference. On a QuadroFX on XP, VAs are faster in all cases that I have tested - go figure.

Considering that most of my objects are static and have less than 100K triangles, I see no benefit to using VBOs. Longs Peak needs to understand these metrics and either retain geometry-only DL technology or improve VBOs so that they display more efficiently for smaller objects. If there is a difference between Quadros and GeForce cards, one would think that the professional card would simply do the fastest thing regardless of which path was taken.

Originally posted by tranders:
If there is a difference between Quadros and GeForce cards, one would think that the professional card would simply do the fastest thing regardless of which path was taken.
The professional cards and their drivers are optimized for use in professional modeling programs. If such programs use VAs and DLs to draw huge amounts of geometry most of the time, the driver behavior (e.g. memory allocation strategy) will be optimized for that kind of operation. The gaming cards, on the other hand, will have their drivers optimized for the methods used by popular games, so the VBO path might be the more optimized one.

I’m not at all familiar with how games display their data. I assume that it could be built into one giant (or several large) VBO and subsets could be managed as components change. That would also lend some credibility to the support of instancing. However I still think that (for an identical data set) a professional driver should take additional steps to optimize the DL for the maximum performance (e.g., allocate a large VBO behind the curtain if that will improve performance). Application users pay a premium for these cards so they should expect them to be fast regardless.

It would be interesting to see if there is a similar break-even point on the GeForce cards.

It used to be that the “professional” stuff was geared more towards geometry, and “gaming” more towards pixel speed.

But nowadays, when games throw hundreds of thousands of triangles, thousands of state changes and hundreds of both vertex and fragment programs at the card every frame and still get interactive speeds, and have on-board memory sizes at half-a-gigabyte and sometimes more, I wonder what the measurable benefit really is using “professional” cards.

Is it that their drivers are less buggy, or simply more precise (both internally and on the card)? I’m thinking of something like going float->double in your program, and possibly switching the x87 FPU to 80-bit precision, i.e. getting more precision at the cost of speed.

Related, but a bit o/t: just the other day I tried 3DS Max (8) using OpenGL mode on my 7600, and OMG was it buggy! :slight_smile: Perhaps this is an area where a Quadro and matching drivers would have been better, perhaps it’s a bug in Windows when using (even if completely hidden) layered windows (the OS-provided kind that alpha-blends), or perhaps it’s simply a Max bug. Either way, both software and D3D mode worked as expected.

Originally posted by tamlin:
But nowadays, when games throw hundreds of thousands of triangles, thousands of state changes and hundreds of both vertex and fragment programs at the card every frame and still get interactive speeds, and have on-board memory sizes at half-a-gigabyte and sometimes more, I wonder what the measurable benefit really is using “professional” cards.

The professional cards have drivers certified for various 3D modeling applications, so there may be better support from the application vendor if a problem occurs while using a driver version that was certified with that program.

Some applications (e.g. 3DS Max, AutoCAD) also support special application drivers which can be used instead of the OGL backend (e.g. the MAXtreme drivers from Nvidia). From what I read, these drivers can significantly increase the performance of modeling-related tasks.

Additionally, the Quadro series of cards has some features (although some of them are likely to be driver-only) that are useful for professional applications, such as overlay planes (for more efficient visualization of selections in high-polygon geometries), a unified back buffer (allowing more efficient usage of video memory in applications utilizing multiple OGL windows), and support for synchronized swapping of multi-monitor output and OGL stereo. I think that some old Quadros also supported OGL logical operations. They probably also have better support for antialiased lines and wireframe rendering.

Originally posted by tamlin:
Related, but a bit o/t: just the other day I tried 3DS Max (8) using OpenGL mode on my 7600, and OMG was it buggy! :slight_smile: Perhaps this is an area where a Quadro and matching drivers would have been better, perhaps it’s a bug in Windows when using (even if completely hidden) layered windows (the OS-provided kind that alpha-blends), or perhaps it’s simply a Max bug. Either way, both software and D3D mode worked as expected.
I’m surprised; nVidia tended to do everything perfectly.
Even ATI works well. I remember that in rare cases on a Radeon 9500 it would crash when line rendering was used. At times you’d get random lines all over the screen.

RigidBody had some decent numbers there, showing VBO is better. Perhaps the NeHe code is not good.

Originally posted by V-man:
RigidBody had some decent numbers there, showing VBO is better. Perhaps the NeHe code is not good.
Well, as far as I can tell, you need at least ~2000 vertices in a draw call for VBO to be as fast as or faster than general vertex arrays.
This has been tested on ATI, so it may be a bit different on nVidia hardware. The size of the VBO doesn’t matter, but it shouldn’t exceed a few MB.

This might be the case because the calls to the gl*Pointer functions are really expensive for whatever reason.

I noticed this when I wanted to skip conventional vertex arrays and go for VBOs all the time in my current engine. After some experimenting, I was pretty disappointed with the performance of VBOs (most of the time you’ll end up rendering fewer than 2000 vertices at once). They are only fast if you can batch a lot of geometry into a single draw call; otherwise they are even slower than VAs!
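
To make the point concrete, here is a rough sketch (all names are placeholders, not code from my engine) of packing several small meshes into one shared VBO so they can be drawn with a single call instead of one call per mesh:

// Hypothetical sketch: copy many small meshes into one shared VBO up front...
GLuint batchVBO;
glGenBuffers( 1, &batchVBO );
glBindBuffer( GL_ARRAY_BUFFER, batchVBO );
glBufferData( GL_ARRAY_BUFFER, totalBytes, NULL, GL_STATIC_DRAW );   // allocate only

GLintptr offset = 0;
for ( int i = 0; i < meshCount; i++ )                                // placeholder mesh array
{
    glBufferSubData( GL_ARRAY_BUFFER, offset, meshes[i].bytes, meshes[i].vertices );
    offset += meshes[i].bytes;
}

// ...then a single draw call covers all of them
glEnableClientState( GL_VERTEX_ARRAY );
glVertexPointer( 3, GL_FLOAT, 0, (const GLvoid*)0 );
glDrawArrays( GL_TRIANGLES, 0, totalVertexCount );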

How fast are D3D vertex buffers compared to such cases? Does it behave equally or is it even faster for small buffers? (I bet it’s the latter:( )

IIRC, NVIDIA recommended batch sizes of 10k-20k several years ago (!). Today 2k should be considered so small that the overhead of setting up the buffers may dwarf the actual transactions.

AFAIK, mapping the buffers has much larger overhead than simply uploading manually (possibly that’s where the sometimes-suggested “upload instead of mapping” comes from?). But whether you map or upload manually, once both the vertex and index data are on the “server” side, drawing it should be much faster (only sending commands) than using VAs (sending both commands and data).
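
A minimal sketch of the two update paths being compared; vbo, cpuVertices and bytes are placeholder names:

// (a) manual upload - hand the data to the driver and let it schedule the copy
glBindBuffer( GL_ARRAY_BUFFER, vbo );
glBufferSubData( GL_ARRAY_BUFFER, 0, bytes, cpuVertices );

// (b) mapping - may stall until the GPU is finished with the buffer
void* dst = glMapBuffer( GL_ARRAY_BUFFER, GL_WRITE_ONLY );
if ( dst )
{
    memcpy( dst, cpuVertices, bytes );
    glUnmapBuffer( GL_ARRAY_BUFFER );    // contents are only usable after unmapping
}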

The only idea/advice I have is: collate, collate, collate. Put as much data into the buffers as you can. If you have more than 64K indices (so the ushort limit kicks in), or you have packed many index buffers into one index VBO but all of them start at vertex zero (re-basing them on the CPU is possibly faster, but then we may again hit the ushort limit), you can re-base, per batch, what the server considers index zero (using e.g. glVertexPointer) so that whatever sits at offset 47911 in the buffer is treated as vertex[0]. (Note: 47911 is obviously a bad choice of offset to start a vertex at :slight_smile: Try to keep it at least 8-byte, but preferably 32-byte, aligned, especially with a 256-bit memory bus where 256/8 = 32.)
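
A rough sketch of that re-basing trick; all names (sharedVBO, sharedIBO, Vertex, the batch offsets and counts) are placeholders for whatever your engine uses:

// Hypothetical sketch: the current batch's vertices start at byte offset
// batchVertexOffset inside the shared VBO; point the vertex array there so
// the 16-bit indices for this batch can keep counting from zero.
glBindBuffer( GL_ARRAY_BUFFER, sharedVBO );
glBindBuffer( GL_ELEMENT_ARRAY_BUFFER, sharedIBO );

glVertexPointer( 3, GL_FLOAT, sizeof(Vertex),
                 (const GLvoid*)batchVertexOffset );        // re-base "vertex[0]"

glDrawElements( GL_TRIANGLES, batchIndexCount, GL_UNSIGNED_SHORT,
                (const GLvoid*)batchIndexByteOffset );      // indices stay 0..65535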

But indeed, you need “larger” amounts of data for VBO to be efficient. Even Begin/Vertex*N/End can probably be (is?) faster than VBO for small batches.

Rounding off: in case you weren’t aware of it, always set the vertex “pointer” last, just as if you were doing immediate-mode drawing. The majority of the (required buffer) work is done when setting the vertex “pointer”, which is why no other attribute “pointers” should be modified after it. This implies that if you use multiple batches of vertex attributes in a single VBO, you should always end the current batch by unmapping it, so the driver knows it no longer has to track the other attribute “pointers” for this batch. Otherwise every following call to e.g. glNormalPointer could trigger a lot of work that was really intended for the next batch of vertices.
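
A tiny sketch of that ordering, assuming an interleaved placeholder Vertex struct with the client states already enabled and placeholder offsets into the bound VBO:

// Hypothetical sketch: other attribute "pointers" first, vertex "pointer" last.
glBindBuffer( GL_ARRAY_BUFFER, vbo );
glBindBuffer( GL_ELEMENT_ARRAY_BUFFER, ibo );

glNormalPointer( GL_FLOAT, sizeof(Vertex), (const GLvoid*)normalOffset );
glTexCoordPointer( 2, GL_FLOAT, sizeof(Vertex), (const GLvoid*)texCoordOffset );
glVertexPointer( 3, GL_FLOAT, sizeof(Vertex), (const GLvoid*)positionOffset );   // last

glDrawElements( GL_TRIANGLES, indexCount, GL_UNSIGNED_SHORT, (const GLvoid*)0 );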

++luck;

Originally posted by tamlin:
This implies that if you use multiple batches of vertex attributes in a single VBO, you should always end the current batch by unmapping it, so the driver knows it no longer has to track the other attribute “pointers” for this batch. Otherwise every following call to e.g. glNormalPointer could trigger a lot of work that was really intended for the next batch of vertices.
Thanks for your detailed reply. But:
What exactly do you mean by “unmapping” the current batch?

How fast are D3D vertex buffers compared to such cases? Does it behave equally or is it even faster for small buffers?
Due to the design of the D3D driver model, almost certainly not.

Every D3D DrawPrimitive call makes a call into the driver, which will provoke a CPU switch from user mode to kernel mode. This switch takes a long time (relatively speaking). An nVidia paper a while back suggested that you get approximately 100,000 such calls per second with a 1GHz CPU (since it’s CPU-time limited).

By contrast, calling glDrawElements does not always require a kernel mode switch. The OpenGL implementation can marshal such calls so that they happen when the GPU is running out of stuff to do, thus provoking fewer kernel mode switches.