VBO enhanced performance

Drawing from one VBO and uploading new data to it can occur in parallel. Keep in mind that a VBO is an abstraction. Under the hood the driver can do buffer renaming and keep multiple buffers in flight as long as you’re replacing the entire buffer. The performance advantage of mapping is that you potentially save a copy. Other than that there’s nothing preventing glBufferData() from being equally fast.

The expectation is that an application might map a buffer and start filling it in a different thread, but continue to render in its main thread (using a different buffer or no buffer at all)…

Yes. But notice that nowhere in there does it say that you should not have a context bound in the other thread.

Buffer renaming is a server side optimization and applies equally to mapping after nulling the buffer with glBufferData or using glBufferData() directly to fill the VBO. From page 13 of the NVIDIA VBO white paper:

The pointer returned by glMapBuffer() refers to the actual location of the data. It is possible that the GPU could be working with these data, so requesting it for an update will force the driver to wait for the GPU to finish its task.

To solve this conflict you just need to call glBufferDataARB() with a NULL pointer. Then calling glMapBuffer() tells the driver that the previous data aren’t valid. As a consequence, if the GPU is still working on them, there won’t be a conflict because we invalidated these data. The function glMapBuffer() returns a new pointer that we can use while the GPU is working on the previous set of data.
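
The discard-then-map sequence described in the white paper can be pictured with a small toy model. This is plain C and not driver code: ToyBuffer and the toy_* functions are hypothetical names invented for illustration, and a real driver may track many storage blocks where this sketch tracks only one. toy_orphan() stands in for glBufferData() with a NULL pointer, toy_map() for glMapBuffer(), and toy_gpu_done() for the moment the GPU finishes reading the old data.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy model: one buffer "name" whose backing storage can be swapped. */
typedef struct {
    unsigned char *storage;   /* storage currently attached to the name */
    unsigned char *in_flight; /* old storage the "GPU" still reads, or NULL */
    size_t size;
} ToyBuffer;

/* Allocate the initial storage. Returns 0 on success, -1 on failure. */
int toy_init(ToyBuffer *b, size_t size)
{
    b->storage = malloc(size);
    b->in_flight = NULL;
    b->size = size;
    return b->storage != NULL ? 0 : -1;
}

/* Stand-in for glBufferData(target, size, NULL, usage): the old storage is
   detached but kept alive for the GPU, and fresh storage takes its place. */
int toy_orphan(ToyBuffer *b)
{
    unsigned char *fresh;
    assert(b->in_flight == NULL); /* toy limit: one old block at a time */
    fresh = malloc(b->size);
    if (fresh == NULL)
        return -1;
    b->in_flight = b->storage;
    b->storage = fresh;
    return 0;
}

/* Stand-in for glMapBuffer(): hand out the currently attached storage. */
unsigned char *toy_map(ToyBuffer *b)
{
    return b->storage;
}

/* Stand-in for the GPU finishing with the old data: release it. */
void toy_gpu_done(ToyBuffer *b)
{
    free(b->in_flight);
    b->in_flight = NULL;
}

/* Release everything the toy buffer still owns. */
void toy_destroy(ToyBuffer *b)
{
    free(b->in_flight);
    free(b->storage);
    b->in_flight = NULL;
    b->storage = NULL;
}
```

After toy_orphan(), toy_map() returns a pointer different from the one the "GPU" is reading, so the writer never has to wait. That is the point of the NULL glBufferData() call in the quoted passage.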

What I am referring to is a client-side technique. Only one glBufferData() can be active at one time on a GL context. The data transfer must complete before the function will return. While this may be the fastest way to transfer the data, it is also likely that it will reduce the frame rate of the drawing thread if the buffer is large. If instead a mapped pointer is supplied to a helper thread, the transfer can occur in parallel with the drawing thread on multi-CPU systems. In fact several threads could be filling VBOs in parallel if processors are available. It is limited by the number of CPUs, by write combiners (see “Hyper-Threading Technology and Write Combining Store Buffers – Understanding, Detecting and Correcting Performance Issues”), or by saturation of the bus.
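
A minimal sketch of the helper-thread fill, assuming POSIX threads. FillJob, start_fill() and finish_fill() are hypothetical names, and memset() stands in for the real geometry generator. In an actual application dst would be the pointer returned by glMapBuffer() on the thread that owns the GL context, and glUnmapBuffer() would be called on that same thread only after finish_fill() returns. The helper thread touches only the pointer and makes no GL calls.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <string.h>

/* Work handed to the helper thread. In a real application dst would be the
   pointer returned by glMapBuffer(); here it is any writable memory. */
typedef struct {
    unsigned char *dst;
    size_t size;
    int value; /* byte value written by the stand-in generator */
} FillJob;

/* Runs on the helper thread. memset() stands in for procedural generation. */
static void *fill_worker(void *arg)
{
    FillJob *job = (FillJob *)arg;
    memset(job->dst, job->value, job->size);
    return NULL;
}

/* Start filling in the background. The job must stay alive until
   finish_fill() returns. Returns 0 on success, like pthread_create(). */
int start_fill(pthread_t *thread, FillJob *job)
{
    assert(job->dst != NULL && job->size > 0);
    return pthread_create(thread, NULL, fill_worker, job);
}

/* Wait for the helper thread. Only after this returns would the drawing
   thread unmap the buffer and draw from it. Returns 0 on success. */
int finish_fill(pthread_t thread)
{
    return pthread_join(thread, NULL);
}
```

Between start_fill() and finish_fill() the drawing thread is free to keep rendering from a different buffer, which is where the parallelism comes from.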

Issuing glBufferData() calls from helper threads is possible. It would require a GL context per thread and shared VBOs. Its performance advantage would likely be defeated by resource conflicts in the driver. It would be an interesting experiment.

I think procedural content creation gets really interesting in the context of multiple CPU threads, even CPU LOD might find its way back into the sun, given enough “spare” cores. Makes me wonder a bit…

Here are some performance testing statistics I gathered today.

Dell Inspiron 8400, 3.2 GHz Pentium 4 hyperthreaded CPU, Windows XP SP2, NVIDIA 7800 GTX, 84.21 video driver

3.3 - 3.8 GB/sec 16 MB mapped buffer filled with memset() on a helper thread
1.3 - 1.4 GB/sec 1 MB buffer filled using glBufferSubData()

Supermicro with two dual core Opteron 275 2.2 GHz CPUs, Windows XP SP2, Radeon X1950XT, 7.10 video driver

3.0 - 6.0 GB/sec 16 MB mapped buffer filled with memset() on a helper thread
0.6 - 1.0 GB/sec 1 MB buffer filled using glBufferSubData()

For the mapped buffer, the higher number is the average with a lightly loaded drawing thread (one full screen poly). The lower number is with a 125,000 vertex load in the drawing thread.

Our normal procedural geometry generation averages about 300 MB per second VBO fill rate using a helper thread. At this fill rate I measured a 3% improvement in drawing thread throughput. At the 3+ GB/sec transfer rate there was a 5% to 15% drop in drawing thread performance. I expect this drop indicates PCI bus contention between the threads. I have no explanation for the 3% performance gain with the lower transfer rate.

Any particular reason you didn’t compare with equally large buffers (16 MB vs 1 MB), or is that a typo?

This is not a typo. I obtained this benchmark by small modifications to our application. The mapped buffer was available for a full overwrite without disturbing any other part of the application. The glBufferSubData call was limited to the last megabyte of the VBO which is rarely used by the drawing thread.

There is an overhead to the map/unmap call which is hidden by the size of the transfer. Our 16 MB double buffered VBO map swap consumes about 130 CPU microseconds. The glBufferSubData() call has a very low overhead. If procedural geometry were written to local memory and then transferred via a glBufferSubData() call, the crossover point for the fastest transfer would depend on the buffer size. Buffers smaller than 0.3 to 0.5 MB would transfer faster with glBufferSubData(). However, for our application, the absolute transfer rate is much less important than filling the VBO without a loss of frame rate.

There was a measurement error in the figure above. A 16 MB VBO mapping swap requires 12-13 microseconds. A glBufferSubData() transfer of 17 KB took 13 microseconds. Therefore transferring more than 17 KB in the drawing thread with glBufferSubData() causes a greater performance loss than mapping a 16 MB buffer.
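
As a sanity check on those two numbers, the implied glBufferSubData() throughput can be worked out directly. The helper below is a hypothetical illustration, assuming 1 KB = 1024 bytes and 1 GB = 10^9 bytes; the post does not say which convention was used.

```c
#include <assert.h>

/* Throughput in GB/sec for a transfer of `kilobytes` that took
   `microseconds`. Assumes 1 KB = 1024 bytes and 1 GB = 1e9 bytes. */
double transfer_rate_gb_per_s(double kilobytes, double microseconds)
{
    double bytes;
    double seconds;
    assert(microseconds > 0.0);
    bytes = kilobytes * 1024.0;
    seconds = microseconds * 1e-6;
    return bytes / seconds / 1e9;
}
```

With the figures from the post, transfer_rate_gb_per_s(17.0, 13.0) comes out at roughly 1.34 GB/sec, in line with the 1.3 - 1.4 GB/sec reported earlier in the thread for glBufferSubData() on the NVIDIA system. Whether the 13 microsecond measurement was taken on that same machine is not stated. The 12-13 microsecond map swap, by contrast, is a fixed bookkeeping cost and does not scale with the 16 MB size.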

First the good news :D.

The NVIDIA 169.21 driver fixes the VBO mapping performance problem. On a GeForce 8800 there are no issues. On a GeForce 7800/7900 the driver requires either a glFinish() or a buffer discard by a NULL glBufferData() call prior to remapping the VBO to get consistently good performance. The completion of a query object associated with the VBO is sufficient on the 8800 but fails intermittently on the 7800/7900. Of course the buffer discard is the recommended method.

Now the bad news :sorrow:.

The ATI OpenGL driver rewrite seems to have broken VBO mappings. On the Radeon 2900 when attempting to draw from a VBO that was mapped then unmapped I get GL_INVALID_OPERATION. If the VBO is mapped as a pixel pack or pixel unpack buffer there is no error but performance is very poor. With the Catalyst 7.12 release the problem has spread to the Radeon X1950.

glBufferData() with NULL is a really good idea before re-writing buffers, no matter what.

Regarding multi-threaded buffer filling, I don’t see the performance gain, because if it’s a copy, then it’s a memory bus limited process, and having two threads fight for the memory bus might actually be slower than serializing the two fills (because of DRAM page open and streaming issues).

If you are generating the data (say, through skinning, or some other algorithm that is CPU bound), then multi-threaded filling might make more sense.

You would think so. But we are seeing a timing jitter that may be related to the glBufferData() with NULL occasionally creating a new buffer. Since it is a large buffer it may cause the driver to reorganize memory or restart new usage heuristics. So far we get more stable timings with a glFinish() and no NULL glBufferData() call.

Yes. We are generating our VBO data with a CPU bound algorithm.

:smiley:
I found a workaround for our VBO bug with the Catalyst 7.12 driver. It appears our usage pattern confuses the driver’s VBO state management. The driver thinks we are attempting to draw from a mapped buffer. However, querying the GL state shows no pointers are bound to the mapped VBO. If I make sure there are no pointers bound to the buffer object when the buffer is mapped the driver’s GL state doesn’t get confused. A bug report has been submitted.

That’s not a bug.

Please provide the reason for your conclusion. I find only the following restriction.

From the OpenGL version 2.1 specification

2.9.1 Vertex Arrays in Buffer Objects
…Attempts to source data from a currently mapped buffer object will generate an INVALID_OPERATION error.

Check out the example code for mapped buffer objects at the end of the ARB_vertex_buffer_object extension. You will find the VBO is mapped with bound pointers.

Yep, as I said, that’s a bug. :wink:
Throwing an invalid_operation error is incorrect behaviour.
Submit a bug report.