Performance of opengles glBindBufferBase

I am now doing some optimization for rendering process. I notice that in my game some objects is totally static and never changed.

In my old way, I will create a uniform buffer every frame, and put variables in it, such as objectToWorldMatrix, LightData etc. Then use glBindBufferBase to bind buffer to right slot. The glBindBufferBase cost 0.03ms ~ 0.05ms for 376 objects.

In my new way, I will cache the uniform buffer and just use glBindBufferBase to bind my cached buffer every frame. The glBindBufferBase will cost 0.13ms ~ 0.21ms for 376 objects.

The results is so weird. Why are costs of glBindBufferBase different? Did I miss some special optimization for new created buffer?

Furthermore, Do I need to call glBindBufferBase in every frame if the buffer is constant? Some people say OpenGL is a state machine and setting will not be changed unless you change it. However, I found if I just set it once, my objects would not be rendered.

I found that if I do Map/Unmap before glBindBufferBase, I can get some perfomance boost. It is so weird.

What GPU and GPU driver are you testing this on?

Unless you have specific guidance from the OpenGL ES driver writer to the contrary, you shouldn’t expect creation or allocation of resources from scratch to be fast. It could be triggering implicit sync (which may allow subsequent commands queue faster). I would suggest never putting this kind of thing in your “draw loop”.

Also, assuming you’re reporting CPU draw thread timings here, keep in mind that this is not the cost for performing the glBindBufferBase() operation. It’s how long it takes the OpenGL ES driver to queue the request (to perform this operation later). There’s a big difference. Especially on mobile GPUs (where unless you goofed up, most operations are queued until later).

The buffer contents are static (non-changing), correct? Content updates to reused buffer objects can be slow on mobile GPU drivers (often triggering implicit sync), unless you’re very careful. So I just want to verify that updates are not involved here.

Also, you’re talking like this is a single buffer shared across multiple batches, but you’re also saying glBindBufferBase() instead of glBindBufferRange(). Is there 1 bind per frame and you’re indexing into the buffer object in the shader?

It’s just whatever the driver needs to do to queue this command in the command queue. Unless the elapsed time to queue it is long, I wouldn’t obsess about fractional milliseconds of cost difference to queue one command. I’d place more value in the total time needed to queue your frame.

No, assuming the binding hasn’t been changed.

Keep in mind however that glBindBufferBase() and glBindBufferRange also update the non-indexed buffer bind point for that bind target (i.e. act as if glBindBuffer() had also been called). Your code may not be written to expect this.

I think you’re just shifting around where queuing work is done in the driver.

I am using an i7-10700k and [an NVIDIA RTX] 2070 super with driver 457.51, but my target platform is Android. I’ve test it on Android and it have the same issue like PC.

Yes, the buffer contents are static in my new way. It is not 1 bind per frame. It is depends on the batch size of that frame. Each static object has its unique buffer. It is like:

for each obj in the batch:
    bind the obj's buffer

Ok, sounds good.

Since the data is static, you can reduce the total UBO bind cost by:

  1. storing the data for many (or all) of your objects in the same uniform buffer object (UBO),
  2. doing 1 shared bind for all, and then just
  3. indexing the UBO in the shader to pull the data for the objects you want to draw.

If you need variable-length data, you can use a shader storage buffer object (SSBO) instead, or just overallocate the UBO.

Similarly, you can reduce the total bind cost for other buffer objects (VBOs, IBOs, SSBOs, Draw Indirect, etc.) by:

  1. merging batches into a smaller set of buffer objects,
  2. doing 1 set of binds at top-of-frame, and
  3. dispatching draw calls using offsets into those buffer objects rather than binds + offsets.

Similarly, you can reduce your total draw call cost by merging batches (dispatching multiple per draw).

Any of the above should reduce your total buffer object bind cost (including UBOs) on both setups (NVIDIA RTX 2070 Super PC and Android).

On NVIDIA, you additionally have the option to ditch the UBO binding and use bindless (GPU addresses) to access the UBO. It’s pretty much just:

  1. Make the buffer object GPU resident (glMakeBufferResidentNV()),
  2. glEnableClientState( GL_UNIFORM_BUFFER_UNIFIED_NV ), and
  3. Use glBufferAddressRangeNV( GL_UNIFORM_BUFFER_ADDRESS_NV ) to provide the GPU address of your UBO to the OpenGL driver.

Of course, you can also use the other NVIDIA Bindless Graphics extensions to bypass binding for other buffer objects, effectively doing the same thing for the vertex buffer objects, index buffer objects (if any), and other buffer objects used by draw calls (except SSBOs). See below for NVIDIA bindless-related extensions.

With NVIDIA Bindless Graphics, you can get the fastest draw call submission performance (comparable to using their display lists) on NVIDIA GPUs, for a given buffer object arrangement and set of draw calls. The more you merge buffer objects and batches, the less total speed-up you get from bindless. And that works on any platform.

NVIDIA OpenGL Extensions related to Bindless Graphics:

Bindless VBOs / IBOs (and shader reads via GPU pointers):

Bindless UBOs / DrawIndirect:

Shader writes via GPU pointers:

  1. storing the data for many (or all) of your objects in the same uniform buffer object (UBO)

Does it mean I need a very big buffer for all static objects? If I have a lot of objects in the scene, the buffer will be very big. I think may it has a limitation on size.

  1. indexing the UBO in the shader to pull the data for the objects you want to draw.

I don’t understand this part. Do I need to use glBindBufferRange with offsets? I have a big buffer now, I need glBindBufferRange to bind right?

Yes. It doesn’t necessarily require all static objects in one buffer object. However, if you think max UBO size may be an issue, keep in mind that GL_MAX_UNIFORM_BLOCK_SIZE is not the maximum size for a buffer object storing UBO data, but rather the maximum subsection of that buffer object (or any buffer object) that you can map into a single binding index in the shader at any given time. You can have a massive buffer object storing multiple chunks of UBO data. Or you can split this across multiple buffer objects if you want.

Alternatively, you could use one big SSBO instead. SSBOs have the advantage that the actual size of the block known to the shader can be dynamic (i.e. not compiled into the shader but determined later by what buffer object you bind), as well as having a GL_MAX_SHADER_STORAGE_BLOCK_SIZE that’s typically very large (compared to GL_MAX_UNIFORM_BLOCK_SIZE).

I’m just trying to figure out what you were doing.

  1. glBindBufferBase() binds the whole buffer object to an indexed bind target.
  2. glBindBufferRange() binds a part of a buffer object to an indexed bind target (could be whole)

If you store the uniform data for multiple batches (draws) in a single, shared buffer object, you have to do something so that the shader reads in the correct part of the buffer object for that draw. There are a lot of options here.

One is you could use glBindBufferRange() per draw call so that only the correct part of the buffer object is “attached” to the shader for that one draw call.

Another is you could use glBindBufferBase() (or glBindBufferRange()) to bind a portion of the buffer with the uniforms for many draw calls. Then inside of your shader, offset your lookups so they pull the correct block of uniforms for this draw. With GLSL struct arrays, this is super-easy, and the compiler does the offset math for you. For instance:

layout( std140 ) uniform;
layout( std430 ) buffer;

struct BatchData
{
    mat4    modelview;
    vec4    color;
};

layout( binding = 0 ) buffer BatchDataArray
{
    BatchData    drawData[];

} mySSBO;  

Then just index this with:

int index = ...;
mat4 mat = mySSBO.drawData[ index ].modelview;
...

Same thing applies whether using an SSBO or a UBO. However, with a UBO you must specify a fixed-length array for drawData[] and not leave it open-ended. And you need to follow std140 alignment rules instead of std430.

Thank you very much. In fact, I am trying to do some optimization for Unity. Now I am still suffering from doing binding operation every frame. I found that Unity clears all bindings in the beginning of each frame.

I have no idea why does Unity clear all bindings in the beginning. Does it assume most of objects in my scene are dynamic(objs needed to be updated their UBOs very frame)? I am guessing if I have a lot of dynamic objects, clearing all bindings in the beginning and assign bindings later may boost rendering performance?

You’ll have to ask the Unity guys to be sure, but after seeing this, my guess is that they do state tracking (aka lazy state changes), one method of state change reduction.

This basically involves wrapping sometimes-expensive graphics API state changes with little “if this value == last-set value, skip it” wrappers.

The pro is it can save time avoiding needless calls into the graphics API. One con though is that if some code doesn’t make use of these wrappers to change state (or if there’s a bug in them), this can result in “state leaks”. That is, cases where the app thinks the state is X but in OpenGL it’s really Y, leading to misbehavior caused by rendering or updating with the wrong GL state active.

To guard against this, sometimes state tracking engines will use “state reset” sequences at the top of each frame like this, to force the actual state and tracked state to be identical again. This doesn’t always “clean up the mess” left by the state leak, but it’s often better than just leaving the state mismatch out there until the end of the run to cause more havok.

I’ll take your word for that, but so far I haven’t seen compelling evidence that you’re really suffering from buffer binding cost. You’re reporting total time across many BindBuffer calls of fractions of a msec.

Even if there was a slowdown here, my guess is that the bind isn’t what’s expensive, but the work the driver decides it needs to catch up on to queue the bind operation.

Question:

You still haven’t answered this. What’s the scoop?

You have said the “the buffer is constant”. But given the above, I’m starting to suspect that the buffer “binding” is constant, but the buffer “contents” are changing every frame. And that could very well affect the cost you see associated with binding a buffer object.

The buffer contents are static (non-changing), correct?

Yes, in my test demo, all buffer contents are static.

I digged into the source code today and I felt I understand why does Unity do bindings very frame.

Unity has a feature called SRP Batcher, which batches objects with same shader.

In the beginning of the batch, it uses glProgram to set the shader. This shader is shared for all objects in the batch.

It is like:
glUseProgram(Shader A);
for object in the batch{
bind object’s buffer
draw obj
}

If I only do binding in the first frame, in the next frame, the Shader A will just keep the binding of the last object of the previous batch.

I am not pro on OpenGL, all thinkings above is my guess.

You’re reporting total time across many BindBuffer calls of fractions of a msec.

Yes, the cost is tiny on PC. However, on the mobile platform, it is a problem. In the old way, 1486 calls of binding operation is 2.35ms.

In the new way, 1486 calls of binding operation is 3.63ms.

It is obvious that my rendering thread takes longer time after using new way, although I’ve cached the buffer and don’t need to create the buffer every frame.

You’re talking times of 1-2 microseconds ( μsec) per call. Seems pretty reasonable to me.

I think you’re doing the right thing with trying to eliminate needlessly creating and changing of buffer objects every frame.

Going forward, I think I’d first minimize your test setup to avoid interactions with other pieces. Namely, disable all content and effects except that which makes use of these buffer objects. Re-bench. Make sure your assumptions still hold. Then I’d look at combining all of those many buffer objects into 1 or a smaller set. That should allow you to get rid of most or all of those binds. Then you can recheck perf and see if you’ve really saved that cost you’re ascribing to the bind calls.