VBOs strangely slow?

Rob_Barris · March 2, 2010, 10:47am

ViolentHamster:

Rob Barris:

The key idea is really that careful fencing and sync efforts are only needed in the absence of orphaning and in cases where you are going back to rewrite some storage that may be pending drawing, like trying to do some sort of sub-section update to a mesh, say.

Rob, I’m not sure I understand. You still need a sync before you go to draw though, don’t you? Should the application keep track of active and inactive VBOs (the GPU may be drawing with the active while the inactive VBOs are ready to be recycled)?

Rob_Barris:

The flip side of that is that you can do high performance dynamically generated batches of mixed sizes with no fences at all, and with low driver overhead, if you constrain your access patterns to only write/use any given segment of a buffer exactly once before the buffer is orphaned. This is a familiar pattern from the D3D playbook.

I’m curious about your “dynamically generated batches”. Do you generate the batches a frame or two in advance to ensure you have time to upload them to the GPU before you need to draw with them? I’d really like to be able to put all my vertex data directly into VBOs. Unfortunately, I have multiple LODs and I don’t know which LODs I need until I’m finished culling. If I put all my LODs into VBOs, I’d have hundreds of MBs of VBO data. I’m struggling with how to fill VBOs with the correct subset of vertex data while giving the GPU enough time between the upload and the draw call…

I’ll try to boil this down a bit. First let’s define a workload, then we look at how you can feed it to GL. If your app doesn’t match the workload, then this may not apply to you.

Workload: say the CPU wants to draw a series of batches where each one is based on data generated or unpacked right before the draw request is issued. Once written, the data is not going to be modified or read back by the CPU. The goal is to efficiently give the GPU access to the newly written data, and to avoid bogging down in excessive allocation or synchronization on a per-draw basis.

(As a hypothetical example, say we’re using the CPU to deform and draw hundreds of falling leaves, where the leaf-shape algorithm runs on the CPU and can generate a new batch of verts for each leaf at will.)

So, you can do this with one VBO and no fences, and it can run really well. The magic is hiding in the buffer-orphaning step.

So make a VBO (glGenBuffers, then glBindBuffer), and set its size with glBufferData. A few megabytes is good.

Init a “cursor / offset” to zero.

for each batch:

figure out how many bytes it will be.
round it up to a multiple of some nice power of two; 64 is good.

orphan current VBO if this batch won’t fit (see below).

map the buffer with glMapBufferRange using the UNSYNCHRONIZED flag, at the current cursor offset, asking for the padded number of bytes to be visible. (With Apple’s flush-buffer-range extension, APPLE_flush_buffer_range, you can map in unsynchronized fashion, you just can’t pick the range, so you always get the base address back - just add the offset to it.)
write the data at the beginning of the mapped range.
unmap.
increment the cursor by the padded size used.
issue the draw call after setting vertex attrib pointers appropriately into the VBO, keeping the offset in mind.
repeat.
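
As a rough illustration of the loop above, here is a minimal C sketch of just the cursor bookkeeping. The names StreamCursor, StreamSlot, pad_up and stream_reserve are made up for this example (they are not from the post or from GL), and the GL calls appear only as comments so the arithmetic can be compiled on its own. GL_STREAM_DRAW as the usage hint is an assumption; the post does not name one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bookkeeping for one streaming VBO (illustrative names). */
typedef struct {
    size_t capacity;  /* fixed byte size given to glBufferData */
    size_t cursor;    /* next unused byte offset in the current storage */
} StreamCursor;

/* Where one batch goes, and whether the buffer must be orphaned first. */
typedef struct {
    size_t offset;    /* byte offset to map, write and draw from */
    size_t padded;    /* batch size rounded up to a multiple of 64 */
    bool   orphaned;  /* true: orphan the storage before mapping */
} StreamSlot;

/* Round n up to a multiple of align; align must be a power of two. */
static size_t pad_up(size_t n, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (n + align - 1) & ~(align - 1);
}

/* Reserve room for one batch and advance the cursor past it. */
static StreamSlot stream_reserve(StreamCursor *sc, size_t batch_bytes)
{
    StreamSlot slot;
    slot.padded = pad_up(batch_bytes, 64);
    assert(slot.padded <= sc->capacity);  /* a batch must fit in an empty buffer */

    slot.orphaned = (sc->cursor + slot.padded > sc->capacity);
    if (slot.orphaned) {
        /* Real code would orphan here (the usage hint is an assumption):
         *   glBufferData(GL_ARRAY_BUFFER, capacity, NULL, GL_STREAM_DRAW); */
        sc->cursor = 0;
    }
    slot.offset = sc->cursor;
    sc->cursor += slot.padded;

    /* Real code would then map only this range without synchronization:
     *   glMapBufferRange(GL_ARRAY_BUFFER, offset, padded,
     *                    GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT);
     * write the vertices, call glUnmapBuffer(GL_ARRAY_BUFFER), point the
     * vertex attributes at `offset`, and issue the draw call. */
    return slot;
}
```

With a 256-byte capacity, two 100-byte batches land at offsets 0 and 128 (each padded to 128), and a third batch of any size triggers the orphan and lands back at offset 0.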

Note, if you are using an asynchronous or multithreaded driver, you might well get 40, 50, 100 batches written into that VBO (and draw commands enqueued) before the GPU even looks at that first byte. That’s OK. You just want the client thread to get in and out of that VBO as fast as possible so it can stay busy doing work.

At some point the cursor will have moved far enough such that the next batch of data will not fit - i.e. offset + padded size exceeds the total size of the VBO. That is the orphan step in the loop above.

When this eventuality happens, and it will vary depending on the size of the batches you’ve been dropping into the VBO, the response is very simple.

orphan current storage by doing a new glBufferData using the fixed size chosen for the VBO, and a NULL pointer.
rewind cursor to offset 0.
continue.

The subsequent map result will look at new storage, a clean sheet. The old storage belongs to the driver; it’s no longer associated with the VBO ID that you have in your code. So from one point of view there are now two buffers of storage running around, but the one you orphaned can no longer be accessed by the client code. At some point all the draw calls that are consuming data from that storage will complete - and that storage will be freed or possibly recycled automatically.

In this model, the number of VBOs known to the client is “one”. The number of floating (orphaned) blocks of storage could be much higher, depending on how long the GPU is taking to chew through each job and how fast the CPU can drop them off.

So you don’t have to juggle “multiple VBOs”, you just need to keep blasting away at the one VBO while letting the driver swap in new chunks of storage as needed.
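
To make the "one VBO name, several blocks of storage" picture concrete, here is a toy reference-counting model in C. It is only a mental model under my own assumptions, not a description of how any real driver is implemented; ToyVbo, Storage and all the toy_* functions are invented for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* One block of buffer storage in a toy "driver". */
typedef struct {
    size_t size;
    int    refs;  /* 1 while attached to the VBO name, +1 per pending draw */
} Storage;

typedef struct {
    Storage *current;      /* storage the client's single VBO name sees */
    size_t   size;         /* fixed size used for every (re)allocation */
    int      live_blocks;  /* attached block plus orphaned blocks in flight */
} ToyVbo;

/* Drop one reference; free the block once nothing needs it any more. */
static void toy_release(ToyVbo *vbo, Storage *s)
{
    assert(s->refs > 0);
    if (--s->refs == 0) {
        free(s);
        vbo->live_blocks--;
    }
}

/* Give the VBO name a brand-new block of storage. */
static void toy_attach_fresh(ToyVbo *vbo)
{
    Storage *s = malloc(sizeof *s);
    if (s == NULL) abort();
    s->size = vbo->size;
    s->refs = 1;  /* held by the VBO name */
    vbo->current = s;
    vbo->live_blocks++;
}

static void toy_init(ToyVbo *vbo, size_t size)
{
    vbo->size = size;
    vbo->live_blocks = 0;
    toy_attach_fresh(vbo);
}

/* Enqueue a draw: it pins whatever storage is attached right now. */
static Storage *toy_draw(ToyVbo *vbo)
{
    vbo->current->refs++;
    return vbo->current;
}

/* The "GPU" has finished a draw that was reading block s. */
static void toy_draw_done(ToyVbo *vbo, Storage *s)
{
    toy_release(vbo, s);
}

/* Orphan: the name gets new storage; the old block lives on
 * until every draw that reads from it has retired. */
static void toy_orphan(ToyVbo *vbo)
{
    Storage *old = vbo->current;
    toy_attach_fresh(vbo);
    toy_release(vbo, old);
}

/* Delete the VBO name; its block goes away once unreferenced. */
static void toy_destroy(ToyVbo *vbo)
{
    toy_release(vbo, vbo->current);
    vbo->current = NULL;
}
```

In this model the client only ever touches vbo.current. After toy_orphan, live_blocks goes up to 2 while the old draws are pending, then drops back to 1 once toy_draw_done has been called for each of them, with no action from the client.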

Client never needs to fence, or check GPU progress, or block on map.

Write&draw, write&draw, repeat til VBO full, orphan and rewind cursor, repeat. CPU gets to drop off all of its data and draw requests and potentially go on to do other tasks without a care as to how many orphaned buffers (storage blocks) wind up in flight or how fast the GPU is retiring them.

So in the hypothetical example, you might completely fill one buffer with leaf shapes (and have a draw pending on each one), orphan it, start pumping leaves into the VBO again starting at zero offset, process repeats. Are you getting ahead of the GPU by one or more blocks of storage? Maybe. Do you care? No. Let the GPU and driver catch up on their own time (ideally on an alternate CPU core). Keep that client drawing thread unblocked.

Driver only sees fixed-size VBO blocks coming and going. Its job of recycling those chunks of storage is greatly simplified. Draw events should outnumber orphan events by some healthy multiple - only you know the likely spectrum of batch sizes. Orphaning 128MB VBOs is probably too big. Orphaning 2-4MB VBOs, no big deal.
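
A quick way to sanity-check that ratio is to estimate how many draws fit in one buffer before it has to be orphaned. The helper below is a hypothetical sketch (draws_per_orphan is not a GL call), and it assumes every batch is the same size and is padded to a multiple of 64 bytes, which real workloads will only approximate.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of equally sized batches that fit in one
 * VBO of vbo_bytes before the next batch would force an orphan.
 * Assumes each batch is padded up to a multiple of 64 bytes. */
static size_t draws_per_orphan(size_t vbo_bytes, size_t batch_bytes)
{
    size_t padded = (batch_bytes + 63) & ~(size_t)63;
    assert(padded != 0 && padded <= vbo_bytes);
    return vbo_bytes / padded;
}
```

For instance, a 4MB VBO with 1000-byte batches (padded to 1024) gives 4096 draws per orphan, while a 2MB VBO with 48000-byte batches gives 43.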

Going back to your questions:

“You still need a sync before you go to draw though, don’t you?”

Not in this style. Once you map, write the data, and unmap, you can issue a draw call on that data right away. (An async driver is just stacking up these draw requests to process in order.) The key is that you get control back into your code as soon as possible so you can crank up the next batch’s data. You stay disconnected from any idea of how much work the GPU has done or is about to do.

There is a subtlety that the “next batch” will usually be mapping the same buffer/storage, but you are not going to alter or step on any data previously emplaced - the ascending cursor sees to that. The world doesn’t end if the GPU reads from address A while you write to address B and they are different.

Again if your workload doesn’t fit this model, you would need to do more explicit sync effort possibly using fences to know “when” it is safe to touch any given region of storage. But if all you do is fill, fill, fill and then orphan and start over - you never need to check or sync. The juggling of multiple blocks of storage is all in the driver and not your problem. All you need to do is be careful about only writing each section of the larger VBO once and then moving on, and you’re fine.

“Do you generate the batches a frame or two in advance to ensure you have time to upload them to the GPU before you need to draw with them?”

Not really. My thinking is usually along the lines of “what steps can I take such that the CPU can maximize its rate of work delivery, and get control back without having to wait for that work to complete?”

If you are trying to manage the contents of a VBO such that some portions of it stay constant while other portions are changing, that’s a workload where you would probably have to start using fences or other heuristics to schedule overwrites of pieces of it. (One heuristic is “has this chunk been used to draw anything in the last five frames?” If not, and you know the driver has, say, a three-frame queuing limit, then you can infer that it’s safe to overwrite that region without any sync effort, i.e. with a non-blocking map. But you need to track each segment carefully, recording in your own data structure when it was last used for a draw.)
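
The frame-counting heuristic could be tracked with something like the C sketch below. Segment, segment_mark_drawn and segment_idle_long_enough are hypothetical names for this example, and the whole idea rests on the assumption that you really do know an upper bound on how many frames the driver queues; GL itself does not promise one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-segment record kept by the application. */
typedef struct {
    size_t   offset;           /* byte offset of the segment in the VBO */
    size_t   size;             /* byte size of the segment */
    bool     ever_drawn;       /* false until the first draw uses it */
    uint64_t last_draw_frame;  /* frame number of the most recent draw */
} Segment;

/* Call whenever a draw that reads this segment is issued. */
static void segment_mark_drawn(Segment *seg, uint64_t frame)
{
    seg->ever_drawn = true;
    seg->last_draw_frame = frame;
}

/* True if no draw has used the segment within the last idle_frames frames.
 * Pick idle_frames larger than the driver's assumed queuing limit
 * (e.g. 5 for an assumed limit of 3) to leave some margin. */
static bool segment_idle_long_enough(const Segment *seg,
                                     uint64_t current_frame,
                                     uint64_t idle_frames)
{
    if (!seg->ever_drawn)
        return true;
    assert(current_frame >= seg->last_draw_frame);
    return current_frame - seg->last_draw_frame >= idle_frames;
}
```

Only when segment_idle_long_enough returns true would you map that region unsynchronized and overwrite it; otherwise fall back to a fence, a different segment, or glBufferSubData.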

OTOH glBufferSubData will always be orderly and safe for a partial VBO replacement, no matter what has happened recently, but you have to have the source data in copyable form, whereas with mapping you can combine decompression and delivery into the buffer.

IMO the application usually knows more about its operational history than the driver does, and is in a better position to make clever decisions about when sync is needed, which is why MapBufferRange has the unsynchronized option.

whew.