VBOs strangely slow?

Have you tried explicit synchronization with NV_fence/ARB_sync and using GL_MAP_UNSYNCHRONIZED_BIT with glMapBufferRange?

No, sure hadn’t. What do you envision here?

Thought the whole purpose of MapBuffer NULL / UNSYNCHRONIZED is so the GPU can have multiple buffers in flight for the same buffer handle, and thus pipeline the buffer uploads, avoiding stalls. So I’m not seeing where fences come in.

I also didn’t test a technique that has been touted here for buffer upload speed-up (since this is such a trivial test app): mapping the buffer in a foreground thread, taking the potentially multi-ms hit of the memcpy in a background thread, and then unmapping in the foreground thread, with ring-buffer work queues between the threads. But that’s typically only useful if you’ve got other (typically GL) work to do in the foreground thread. This little test app is just going to wait on the memcpy to unmap anyway, because it has nothing better to do.

“Thought the whole purpose of MapBuffer NULL / UNSYNCHRONIZED is so the GPU can have multiple buffers in flight for the same buffer handle, and thus pipeline the buffer uploads, avoiding stalls. So I’m not seeing where fences come in.”

GL_MAP_UNSYNCHRONIZED_BIT is not the same as GL_MAP_INVALIDATE_BUFFER_BIT.

GL_INVALIDATE tells the implementation, “I don’t care what was in the buffer before; I just want some memory!”

GL_UNSYNCHRONIZED says, “I don’t care that you may currently be using the buffer, and that my attempt to modify it while in use can have horrible consequences. I will take responsibility for making sure the buffer is not in use when I modify it, so give me a pointer already!”

They’re both solutions to the same basic problem (I rendered with a buffer last frame, and I want to change it and use it this frame), but with different needs. GL_INVALIDATE/glBufferData(NULL) is ultimately giving you two buffer objects: the one that’s currently in use and the one you’re writing to. GL_UNSYNCHRONIZED is all about using only one piece of memory to avoid the synchronization.

The idea is that you fill up a buffer object, do something with it, and then set a fence. If you want to change the buffer, you check your fence. If the fence has not passed yet, you go do something else (and therefore this only works when you have “something else” that you could be doing). When the fence has passed, you can now fill the buffer.
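The fence idiom just described can be sketched as a poll loop. What follows is a hedged, GL-free simulation: `fence_passed()` stands in for `glClientWaitSync(fence, 0, 0)` returning something other than GL_TIMEOUT_EXPIRED, and the real GL calls are indicated in comments (the names and the fixed poll count are mine, for illustration only).

```c
#include <stdbool.h>

/* Stand-in for polling the fence: here it "passes" after three polls,
   purely so the control flow below is runnable without a GL context. */
static int polls_remaining = 3;
static bool fence_passed(void) { return --polls_remaining < 0; }

static int other_work_done = 0;
static int buffer_fills = 0;

static void do_something_else(void) { other_work_done++; }

static void fill_buffer(void) {
    /* Safe to map and write now - the GPU has passed the fence:
       void *p = glMapBufferRange(...); memcpy(p, data, size);
       glUnmapBuffer(GL_ARRAY_BUFFER); */
    buffer_fills++;
}

/* After the draw you would set the fence:
   GLsync fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
   Before refilling, poll it - and do useful work while it hasn't passed. */
static void update_when_safe(void) {
    while (!fence_passed())
        do_something_else();  /* only works if there IS something else */
    fill_buffer();
}
```

The point of the zero-timeout poll is that the client thread never blocks: it keeps doing other work until the fence reports the GPU is done with the buffer.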

GL_UNSYNCHRONIZED can allow for idioms where the client is generating a large number of small batches dynamically; it makes it much more efficient to stack them up one after another within a smaller number of larger sized VBO’s. For example you could have a 4MB VBO, and be able to map/write/unmap/draw several hundred times using that storage, before ever having to orphan or fence, if you are processing kilobyte-ish batches of data.

In this regard it’s closer to the D3D NO_OVERWRITE hint. “Yes, I know I just wrote 512 bytes of stuff at offset 0, and maybe it hasn’t been processed yet - I would like to go back in and write 1280 bytes of new stuff starting at offset 512 now in the same buffer… and I’d rather not have to wait.” And so you repeat until you hit the end of the buffer - no hazards, no risks.

Concurrency goes up, especially on a multi-threaded driver, when you can use the cheap operation more frequently than the expensive one (unsynchronized map = cheap … orphaning = less cheap).

When this style makes sense (depends on your app), you can cut way down on the driver memory management work, since it just sees one particular size of buffer being orphaned / recycled, and those events are much less frequent than maps and unmaps.

Ideally, you reach a steady state where the driver is round-robining between a few physical buffers of that one large size, allocations stop happening, and the driver need not care if you are blasting rand()-sized batches in various numbers into that storage.

The key idea is really that careful fencing and sync efforts are only needed in the absence of orphaning and in cases where you are going back to rewrite some storage that may be pending drawing, like trying to do some sort of sub-section update to a mesh, say.

The flip side of that is that you can do high performance dynamically generated batches of mixed sizes with no fences at all, and with low driver overhead, if you constrain your access patterns to only write/use any given segment of a buffer exactly once before the buffer is orphaned. This is a familiar pattern from the D3D playbook.

I’m afraid all these details are too much for this poor systems programmer. I’ll play the Ouija board, figure out code that works well on my own system, and not worry too much about other systems. Still…

What I came up with in the end for the actual application (code here, but there’s way too much of it) is to use two VBOs, for the variable data, which I switch between once per frame (using glMapBufferRange to invalidate if available, glBufferData otherwise), and a static_draw VBO for the quite static vertex grid. This works well enough; it’s as fast as the ncurses output mode, which means about twice the speed of any other mode even counting immutable overhead.
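For what it’s worth, the two-VBO rotation described here reduces to a per-frame toggle. This is a hypothetical sketch with my own names (not the actual swap_pbos code); the GL calls are left in comments so the bookkeeping runs on its own:

```c
/* Hypothetical once-per-frame VBO swap; not the real swap_pbos code. */
static unsigned vbo[2];    /* two buffer names from glGenBuffers */
static int write_idx = 0;  /* the one the CPU writes this frame */

static unsigned begin_frame(void) {
    write_idx ^= 1;        /* alternate buffers each frame */
    /* glBindBuffer(GL_ARRAY_BUFFER, vbo[write_idx]);
       Invalidate before mapping so the driver can hand back fresh
       storage instead of stalling on last frame's draw:
         glMapBufferRange(GL_ARRAY_BUFFER, 0, size,
             GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_BUFFER_BIT);
       or, where MapBufferRange is unavailable:
         glBufferData(GL_ARRAY_BUFFER, size, NULL, GL_STREAM_DRAW);
         glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY); */
    return vbo[write_idx];
}
```

The invalidate means the alternation is arguably belt-and-braces - the driver can orphan either buffer on its own - but it keeps the CPU a full frame away from storage the GPU might still be reading.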

If you really want to see the actual code… uh, the important functions would be swap_pbos in graphics.cpp, and render_shader/init_gl (shader branch, latter) in enabler_sdl.cpp, but I would suggest you stay away. For one thing, the code’s embarrassing and impenetrable.

I’ve also got ARB_sync in there, on the theory that blocking in SDL_GL_SwapBuffers is a very bad thing and I can’t figure out a better way to limit framerates to what my (8600M) gpu can handle.

But now you’re saying display lists are likely to be faster? And the drivers will use multiple VBOs as appropriate if I just invalidate before mapping? Are those also true for ATI cards?

Also, is there an ATI equivalent of bindless graphics?

“But now you’re saying display lists are likely to be faster?”
-> Internally, the driver will convert the display list to a VBO. The main issue with display lists is that it is hard to predict when an implementation can optimize them, because there are many corner cases in OpenGL…

“And the drivers will use multiple VBOs as appropriate if I just invalidate before mapping? Are those also true for ATI cards?”
-> Yes. The implementation will reallocate the buffer and avoid any unnecessary synchronization overhead.

“Also, is there an ATI equivalent of bindless graphics?”
-> You can use vertex_array_object.

“… Thought the whole purpose of MapBuffer NULL / UNSYNCHRONIZED is so the GPU can have multiple buffers in flight for the same buffer handle”
My apologies. I tested/meant INVALIDATE, but Alfonse said UNSYNCHRONIZED, and I merely copied and missed the distinction.

And thanks Rob and Alfonse for the detailed responses! I learned a few things, and I’m sure I’m not alone.

But on NVidia, avoid using VAOs on top of bindless. Yes, it works, but in my experience, you’ll pay a little perf for doing that (but test on your setup to be sure).

Presumably bindless gives you the VAO speed-up, and without (I assume) a bazillion little VAOs floating around in the GL driver.

Naturally, trying to use display lists ran into the problem that my vertex shader uses gl_VertexID, which appears not to be set when executing display lists.

Is there a reasonable alternative? Some way of setting a per-vertex or per-primitive counter?
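One possible workaround - an assumption on my part, not something confirmed in this thread - is to supply the counter yourself as a generic vertex attribute in a small GL_STATIC_DRAW VBO and read that in the shader instead of gl_VertexID:

```c
#include <stdlib.h>

/* Build a per-vertex counter attribute: index i gets value i.
   Upload once with
     glBufferData(GL_ARRAY_BUFFER, n * sizeof(float), ids, GL_STATIC_DRAW);
   bind it with glVertexAttribPointer as a single float, and in the
   shader `attribute float a_vertex_id;` takes the place of gl_VertexID. */
static float *make_vertex_ids(size_t n) {
    float *ids = malloc(n * sizeof *ids);
    if (!ids) return NULL;
    for (size_t i = 0; i < n; i++)
        ids[i] = (float)i;  /* exact as a float for indices below 2^24 */
    return ids;
}
```

The float representation is exact up to 2^24, which should cover any sane vertex count per draw; beyond that you would need to split the attribute or use a different encoding.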

It’s time for some new whitepapers from ATI/nVidia on how to deal with updating VBOs/UBOs/PBOs quickly. Clean up some myths and get straight on the facts. I’m tired of guessing.

Rob, I’m not sure I understand. You still need a sync before you go to draw though, don’t you? Should the application keep track of active and inactive VBOs (the GPU may be drawing with the active while the inactive VBOs are ready to be recycled)?

I’m curious about your “dynamically generated batches”. Do you generate the batches a frame or two in advance to ensure you have time to upload them to the GPU before you need to draw with them? I’d really like to be able to put all my vertex data directly into VBOs. Unfortunately, I have multiple LODs and I don’t know which LODs I need until I’m finished culling. If I put all my LODs into VBOs, I’d have hundreds of MBs of VBO data. I’m struggling with how to fill VBOs with the correct subset of vertex data while giving the GPU enough time between the upload and the draw call…

Thanks.

Agreed! Death to the Ouija Board! :sorrow:

Also, interesting blog post from Sunday on this very topic: One More On VBOs - glBufferSubData

I’ll try to boil this down a bit. First let’s define a workload, then we look at how you can feed it to GL. If your app doesn’t match the workload, then this may not apply to you.

workload: say the CPU wants to draw a series of batches, where each one is based on data generated or unpacked right before issuing the draw request. Once written, the data is not going to be modified or read back by the CPU. The goal is to efficiently let the GPU have access to the newly written data, and to avoid bogging down in excessive allocation or synchronization on a per-draw basis.

( As a hypothetical example, say we’re using the CPU to deform and draw hundreds of falling leaves, where the leaf-shape algorithm runs on the CPU, and can be used to generate new batches of verts for each leaf at will )

So, you can do this with one VBO and no fences, and it can run really well. The magic is hiding in the buffer-orphaning step.

So make a VBO with glBindBuffer, and set its size with glBufferData. A few megabytes is good.

Init a “cursor / offset” to zero.

for each batch:

  • figure out how many bytes it will be.
  • round it up to some nice power of two multiple, 64 is good.
  • orphan current VBO if this batch won’t fit (see below).
  • map the buffer using UNSYNCHRONIZED, at the current cursor offset, asking for the padded number of bytes to be visible. (With APPLE_flush_buffer_range you can map in unsynchronized fashion, you just can’t pick the range, so you always get the base address back - just add the offset to it.)
  • write the data at the beginning of the mapped range.
  • unmap.
  • increment the cursor by the padded size used.
  • issue the draw call after setting vertex attrib pointers appropriately into the VBO, keeping the offset in mind.
  • repeat.

Note, if you are using an asynchronous or multithreaded driver, you might well get 40, 50, 100 batches written into that VBO (and draw commands enqueued) before the GPU even looks at that first byte. That’s OK. You just want the client thread to get in and out of that VBO as fast as possible so it can stay busy doing work.

At some point the cursor will have moved far enough such that the next batch of data will not fit - i.e. offset + padded size exceeds the total size of the VBO. Note the starred step above.

When this eventuality happens, and it will vary depending on the size of the batches you’ve been dropping into the VBO, the response is very simple.

  • orphan current storage by doing a new glBufferData using the fixed size chosen for the VBO, and a NULL pointer.
  • rewind cursor to offset 0.
  • continue.
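Minus the GL calls (left as comments), the recipe above is mostly cursor arithmetic. A minimal sketch, with my own names and the suggested 4 MB buffer and 64-byte padding:

```c
#include <stddef.h>

#define VBO_SIZE (4 * 1024 * 1024)  /* fixed VBO size, a few MB as suggested */
#define PAD      64                  /* round batches up to 64-byte multiples */

static size_t cursor = 0;            /* next free offset in the VBO */
static int orphan_count = 0;         /* how many times we've orphaned */

/* Round up to the next multiple of PAD (PAD must be a power of two). */
static size_t pad_size(size_t n) { return (n + PAD - 1) & ~(size_t)(PAD - 1); }

/* Returns the offset at which this batch should be written,
   orphaning the storage first if the batch won't fit. */
static size_t place_batch(size_t bytes) {
    size_t padded = pad_size(bytes);
    if (cursor + padded > VBO_SIZE) {
        /* orphan: glBufferData(GL_ARRAY_BUFFER, VBO_SIZE, NULL,
                                GL_STREAM_DRAW);  then rewind */
        orphan_count++;
        cursor = 0;
    }
    size_t offset = cursor;
    /* void *p = glMapBufferRange(GL_ARRAY_BUFFER, offset, padded,
           GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT);
       memcpy(p, data, bytes);
       glUnmapBuffer(GL_ARRAY_BUFFER);
       ... set vertex attrib pointers at `offset`, issue the draw ... */
    cursor += padded;
    return offset;
}
```

With this bookkeeping, place_batch(512) lands at offset 0 and a following place_batch(1280) at offset 512, matching the NO_OVERWRITE example earlier; the orphan happens automatically when the next padded batch would run past the end of the buffer.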

The subsequent map result will look at new storage, a clean sheet. The old storage belongs to the driver, it’s no longer associated with the VBO ID that you have in your code. So from one point of view there are now two buffers of storage running around, but the one you orphaned can no longer be accessed by the client code. At some point all the draw calls that are consuming data from that storage will complete - and that storage will be freed or possibly recycled automatically.

In this model, the number of VBO’s known to the client is “one”. The number of floating (orphaned) blocks of storage could be much higher, depending on how long the GPU is taking to chew through each job and how fast the CPU can drop them off.

So you don’t have to juggle “multiple VBO’s”, you just need to keep blasting away at the one VBO while letting the driver swap in new chunks of storage as needed.

Client never needs to fence, or check GPU progress, or block on map.

Write&draw, write&draw, repeat til VBO full, orphan and rewind cursor, repeat. CPU gets to drop off all of its data and draw requests and potentially go on to do other tasks without a care as to how many orphaned buffers (storage blocks) wind up in flight or how fast the GPU is retiring them.

So in the hypothetical example, you might completely fill one buffer with leaf shapes (and have a draw pending on each one), orphan it, start pumping leaves into the VBO again starting at zero offset, process repeats. Are you getting ahead of the GPU by one or more blocks of storage? Maybe. Do you care? No. Let the GPU and driver catch up on their own time (ideally on an alternate CPU core). Keep that client drawing thread unblocked.

The driver only sees fixed-size VBO blocks coming and going. Its job of recycling those chunks of storage is greatly simplified. Draw events should outnumber orphan events by some healthy multiple - only you know the likely spectrum of batch sizes. Orphaning 128MB VBO’s is probably too big. Orphaning 2-4MB VBO’s, no big deal.

Going back to your questions

You still need a sync before you go to draw though, don’t you?

Not in this style. You map, write the data, and unmap; you can issue a draw call on that data right away. (An async driver is just stacking up these draw requests to process in order.) The key is that you get control back into your code as soon as possible so you can crank up the next batch’s data. You stay disconnected from any idea of how much work the GPU has done or is about to do.

There is a subtlety here: the “next batch” will usually be mapping the same buffer/storage, but you are not going to alter or step on any data previously emplaced - the ascending cursor sees to that. The world doesn’t end if the GPU reads from address A while you write to address B and they are different.

Again if your workload doesn’t fit this model, you would need to do more explicit sync effort possibly using fences to know “when” it is safe to touch any given region of storage. But if all you do is fill, fill, fill and then orphan and start over - you never need to check or sync. The juggling of multiple blocks of storage is all in the driver and not your problem. All you need to do is be careful about only writing each section of the larger VBO once and then moving on, and you’re fine.

Do you generate the batches a frame or two in advance to ensure you have time to upload them to the GPU before you need to draw with them?

Not really. My thinking is usually along the lines of “what steps can I take such that the CPU can maximize its rate of work delivery, and get control back without having to wait for that work to complete?”

If you are trying to manage the contents of a VBO such that some portions of it stay constant while other portions are changing, that’s a workload where you would probably have to start using fences or other heuristics to schedule overwrites of pieces of it. (One heuristic is “has this chunk been used to draw anything in the last five frames” - if no, and you know the driver has a three frame queuing limit say, then you can actually infer when it’s safe to overwrite that region without any sync effort, i.e. non blocking map, but you need to make sure you track carefully each segment and mark them in your own data structure when they were last used for draw).
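That heuristic is plain bookkeeping on the application side. A sketch, where the three-frame queue depth is the hypothetical driver limit from the example above and the names are mine:

```c
#include <stdbool.h>

#define DRIVER_QUEUE_DEPTH 3   /* hypothetical: driver queues <= 3 frames */

typedef struct {
    long last_draw_frame;      /* frame number when last used to draw */
} segment;

/* Record a draw call that sources data from this segment. */
static void mark_drawn(segment *s, long frame) { s->last_draw_frame = frame; }

/* Safe to overwrite without any sync (i.e. a non-blocking map) once
   enough frames have gone by that no queued command can still be
   referencing the segment. */
static bool safe_to_overwrite(const segment *s, long current_frame) {
    return current_frame - s->last_draw_frame > DRIVER_QUEUE_DEPTH;
}
```

The tracking cost is one counter per segment; the risk is that the assumed queue depth must genuinely bound the driver’s buffering, so this is an inference, not a guarantee the way a fence is.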

OTOH glBufferSubData will always be orderly and safe for a partial VBO replacement, no matter what has happened recently, but you have to have the source data in copyable form, whereas with mapping you can combine decompression and delivery into the buffer.

IMO the application usually knows more about its operational history than the driver does, and is in a better position to make clever decisions about when sync is needed, which is why MapBufferRange has the unsynchronized option.

whew.

Thanks for your response. Let me read through that… When do you sleep?

I’m just wakin’ up :slight_smile:

Is this approach best suited for highly dynamic objects that are rendered a few frames behind their CPU positions? With orphaning, you don’t draw the same position twice. You always have a fill/draw/fill/draw?

What if you didn’t draw leaves? What if you drew static objects like terrain, or objects that needed collision detection? Would you have to use another approach?

What if you didn’t draw leaves? What if you drew static objects like terrain, or objects that needed collision detection?

If you’re drawing static terrain, you use static buffer objects. Upload once, draw many. GL_STATIC_DRAW. There isn’t really an approach for that :wink:

This approach is for objects that you need to constantly generate data for.

If I’m creating static buffer objects at runtime, how do I ensure that they have been uploaded before I go to draw with them? I don’t want the draw calls to block until the GPU receives all the data.

I’d like to be able to say, “Hey GPU, upload this high resolution LOD. Let me know when you’re done. In the meantime, can you draw with the low resolution LOD? Thanks for not blocking and causing terrible frame breaks, GPU. You’re super.”

“I’d like to be able to say, ‘Hey GPU, upload this high resolution LOD. Let me know when you’re done. In the meantime, can you draw with the low resolution LOD? Thanks for not blocking and causing terrible frame breaks, GPU. You’re super.’”

If it’s a static buffer, doesn’t that mean you’re uploading it at “initialization” time? And how would you know that the low resolution LOD is uploaded yet if you’re not sure about the high LOD?

No. Imagine you have more data than will fit on a GPU and you can’t display a “Loading” screen as the character moves–very quickly.

For the terrain or model in question, you’d have to display nothing at first, then the low res. I think I can figure out when nothing is ready. :wink: