Switching uniform block buffers

congard · July 25, 2020, 9:40pm

Hi, I have a question about how to use UBO efficiently. I have models with skeletal animations, so I have to load the bones into the shader. Now each model with bones has its own uniform buffer. In it, I write all the bones 1 time per frame.

Next, rendering begins in the “color framebuffer”. If the model has bones, glBindBufferBase(GL_UNIFORM_BUFFER, m_bindingPoint, m_uniformBuffer->m_id) is called. Likewise with rendering in depth maps.

The question is: how correct is this? It just seemed to me that fps dropped by ≈1.5 frames in comparison with glUniformMatrix4fv. Or this function cannot be the reason for the decrease in fps, and should I look for the problem elsewhere? Bones count (glm::mat4) ≈15.

I configure the buffer like this: glBufferData(GL_UNIFORM_BUFFER, size, nullptr, GL_DYNAMIC_DRAW)

I update data like this: glBufferSubData(GL_UNIFORM_BUFFER, offset, sizeof(mat4) * bones.size(), value_ptr(bones[0]))

I will be grateful for any advice

GClements · July 25, 2020, 9:59pm

If you’re doing that every frame, it may be causing a pipeline stall. Consider the advice given in the Buffer Object Streaming wiki page.

congard · July 26, 2020, 9:18am

Thank you for your reply! As far as I understand, to implement explicit multiple buffering I need something like what is described here on page 12 (401), right?
If yes, can I call glFenceSync at the end of my render() function, not after glDrawElements?

GClements · July 26, 2020, 10:03am

That’s one possible approach, but it isn’t the only one.

The main advantage of using GL_MAP_UNSYNCHRONIZED is that you can be sure that glMapBufferRange won’t block. You can poll the sync object and if it isn’t signalled you can utilise the CPU for something else rather than just blocking. The other approaches generally end up blocking the client if it tries to submit data faster than the GPU consumes it.

If you don’t need that functionality, it may be simpler to just use orphaning or alternating buffers/regions. That will block the CPU if it tries to submit data faster than the GPU can consume it, but should avoid the situation where neither the CPU nor GPU run at full capacity because each spends some time idle waiting on the other.

You can call it at any point, but the logical place to call it is immediately following the last operation which depends upon the data. In case it isn’t clear, glFenceSync doesn’t block the client, it just inserts a fence into the command stream so that you can subsequently check whether the prior commands have completed (meaning that the data can safely be overwritten). The glClientWaitSync call checks whether the sync object has been signalled and optionally waits until it has.

Moving the fence farther down the command queue means you might end up waiting longer than necessary to upload new data, potentially reducing GPU utilisation.

congard · July 26, 2020, 11:04am

I just measured average fps using glBufferSubData (1) and using this (2):

glBufferData(GL_UNIFORM_BUFFER, size, nullptr, GL_DYNAMIC_DRAW);
glBufferSubData(GL_UNIFORM_BUFFER, offset, sizeof(mat4) * bones.size(), value_ptr(bones[0]));

and this (3):

glBufferData(GL_UNIFORM_BUFFER, size, nullptr, GL_STREAM_DRAW);
glBufferSubData(GL_UNIFORM_BUFFER, offset, sizeof(mat4) * bones.size(), value_ptr(bones[0]));

And I got the following results:

41.8417 avg FPS in 120 sec
41.4797 avg FPS in 120 sec
41.5000 avg FPS in 120 sec

The second method turned out to be slowest on my GPU (Nvidia GeForce 920mx)

While using glUniformMatrix4fv I got about 42-43 avg fps

Is explicit multiple buffering can help me solve this fps problem?

Edit: If I create one big buffer and write the bones of all the models into it, and link it for each model using glBindBufferRange, do I get even a small performance gain (now each model has own uniform buffer)? After all, only one glBindBuffer per frame for write bones data, instead of calling it for each model

Dark_Photon · July 27, 2020, 1:32pm

I just went through this recently too (optimizing UBO subload and rendering performance on NVIDIA GL drivers). So I have some tips that might be useful to you. But first, let’s talk about capturing performance metrics.

First thing’s first. Make sure your timings are right.
You can spend/waste a ton of time chasing ghosts with bad timings…
Assuming you’re using CPU timers around your frame, turn off VSync, and make sure to call glFinish() at the end of your frames (only for collecting these timings).
Measuring the time from immediately after glFinish() in frame N to immediately after the glFinish() in frame N+1.
Capture that time as your frame time.

Second, measure your performance in time/frame (msec), not frames/time (FPS). Frame time is linear with performance, and frame time deltas are meaningful. FPS is non-linear, and FPS deltas are meaningless. That is, the same performance optimization yield’s “different” FPS deltas depending on the starting FPS, which is not helpful. You can read more about this in many places, including here: Performance

You’re not averaging FPS values are you? That’s nonsense. Assuming you’re not, this is really:

23.9 ms (per frame)
24.1 ms
24.1 ms

Over the course of a whole frame, how accurate (in msec/frame) do you think your frame times are “really”? Be careful you don’t make decision based on noise in the signal.

Once you’ve got good timings, and you have a good feel for what he accuracy of those timings are, then…

Try different tests to determine what you’re largest bottleneck is. For instance, what happens if you stop updating your bone matrix UBO buffer at all and just render with those values? What frame time change do you see? What happens if you just bind your UBO once at the start of your frame and then just render with the same UBO binding in every draw call? What happens if you hard-code bone matrices in the shader? What happens if you cut your vertex count in half? How about cutting your screen res in half? This being skeletal animation on low-end laptop from 4 years ago, you might be vertex bound.

More on UBO tips in the next post…

Dark_Photon · July 27, 2020, 1:56pm

As for updating and rendering with UBOs efficiently on NVIDIA GL drivers, NVIDIA has some good tips in their tech docs and past presentations, and they have special fast paths in their driver specifically for UBOs. Which ones are applicable to you is going to determine on how much you can optimize your UBO updates.

A few UBO-related quotes from NVIDIA docs.

“NVIDIA optimizes glBufferSubData for buffers that are only used as UBO”
“group uniforms by frequency of change into dedicated uniform blocks to maximize re-use. By using larger buffers and glBindBufferRange with offsets you can improve performance switching between them.”
“NVIDIA’s OpenGL driver actually also optimizes uniform buffer binds where just the range changes for a binding unit.”
“Low binding units should reflect data that is shared for many operations and shaders.”
To emulate Vulkan’s push contants (fast-changing uniform data), “Uniform Arrays could be useful for small data changes” or “alternatively use a single small UBO and glBufferSubData its content.”
NV_uniform_buffer_unified_memory - Allows you to access UBO data in the shader without explicit buffer base or range bindings.

Also, definitely read OpenGL Scene Rendering Techniques slides 19-27.

These pretty well speak for themselves. There’s at least two different usage models you can go with here. Update+draw+update+draw, or big-update+draw+draw+draw. If you can, definitely try the latter. That’s when you’ll be able to make best use of glBindBufferRange() (to point a draw call to its own bone matrices in a subset of the UBO) or NV_uniform_buffer_unified_memory (to do the same, but without invoking any bind calls. However, given the fast-paths in the driver for glBindBufferRange(), you may not need this.

If you’re using the update+draw+update+draw model, then that isn’t going to help you much (and binding is less likely to be your bottleneck; just don’t rebind if the binding hasn’t change). Just make sure you’re updating with SubData. Make sure you’re doing the minimal number of updates required. And if using bindless in your engine, definitely test not making that buffer object GPU resident, as all those updates are more expensive in this case. Let the driver handle UBO buffer object residency internally based on its own default behavior. If you find that you are still update limited in this “lots of updates” case, I’d definitely bench this against using an ordinary (non-buffer-object) uniform array, particularly given NVIDIA’s recommendation above for a GL push constant work-alike.

If you find that you’re draw call limited, batch up your draws using instancing and/or multi-draw indirect. Rendering multiple instances of the same skin mesh is perfect for this, as you’re passing the same vertices down-the-pipe but just applying different uniform state to them in the shader.

If you find that you’re update limited, consider minimizing your updates by doing one at the beginning of the frame and sharing it across your draw calls. You could also consider getting rid of them entirely by pushing the joint transform (bone matrix) computation onto the GPU.