UBO poor performance [GL 3.1]

I'm trying to use a UBO, but I'm getting poor performance with it.

Code w/o UBO:

mat4 matLocal = ...;
mat4 matMVP = ...;
vec2 uvBase = ...;
vec2 perlinMovement = ...;
vec3 localEye = ...;
glUniformMatrix4fv(uniform_matLocal, 1, GL_FALSE, matLocal);
glUniformMatrix4fv(uniform_matMVP, 1, GL_FALSE, matMVP);
glUniform2fv(uniform_uvBase, 1, uvBase);
glUniform2fv(uniform_perlinMovement, 1, perlinMovement);
glUniform3fv(uniform_localEye, 1, localEye);

Code w/ UBO:

struct BlockPerBatch
{
	mat4 matLocal;
	mat4 matMVP;
	vec2 uvBase;
	vec2 perlinMovement;
	vec3 localEye;
};

BlockPerBatch blockPerBatch;

glBindBuffer(GL_UNIFORM_BUFFER, ubo_BlockPerBatch); // once for all batches


blockPerBatch.matLocal = ...;
blockPerBatch.matMVP = ...;
blockPerBatch.uvBase = ...;
blockPerBatch.perlinMovement = ...;
blockPerBatch.localEye = ...;
glBufferData(GL_UNIFORM_BUFFER, sizeof(blockPerBatch), &blockPerBatch, GL_DYNAMIC_DRAW);


#version 140


uniform BlockPerBatch
{
	mat4 matLocal;
	mat4 matMVP;
	vec2 uvBase;
	vec2 perlinMovement;
	vec3 localEye;
};


w/o UBO - ~250 FPS
w/ UBO - ~225 FPS

GeForce 9600GT
Win7 Driver 190.89
OpenGL 3.1

What am I doing wrong?

Do you reallocate your buffer at each frame?

UBO goes well together with MapBufferRange or MapBuffer, and apparently even glBufferSubData would be faster.

Have a look at the MapBufferRange API, that's THE way to go!

… Also be sure to group by frequency of update. E.g. Per-frame, per-sector, per-object, per-culator, per-fume, …
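As a concrete sketch of the MapBufferRange suggestion (reusing the OP's ubo_BlockPerBatch and BlockPerBatch names; allocating storage once up front is an assumption, not something shown in the original code):

```cpp
// Allocate once at init time; NULL means "reserve storage, no data yet".
glBindBuffer(GL_UNIFORM_BUFFER, ubo_BlockPerBatch);
glBufferData(GL_UNIFORM_BUFFER, sizeof(BlockPerBatch), NULL, GL_DYNAMIC_DRAW);

// Per batch: map only the range we rewrite and invalidate it, so the
// driver does not have to stall on draws still reading the old contents.
void* ptr = glMapBufferRange(GL_UNIFORM_BUFFER, 0, sizeof(BlockPerBatch),
                             GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_RANGE_BIT);
memcpy(ptr, &blockPerBatch, sizeof(BlockPerBatch));
glUnmapBuffer(GL_UNIFORM_BUFFER);
```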

The example in the spec uses glBufferData:

    void render()
    {
        glClearColor(0.0, 0.0, 0.0, 0.0);

        glTranslatef(0, 0, -4);
        glColor3f(1.0, 1.0, 1.0);
        glBindBuffer(GL_UNIFORM_BUFFER, buffer_id);
        //We can use BufferData to upload our data to the shader,
        //since we know it's in the std140 layout
        glBufferData(GL_UNIFORM_BUFFER, 80, colors, GL_DYNAMIC_DRAW);
        //With a non-standard layout, we'd use BufferSubData for each uniform.
        glBufferSubData(GL_UNIFORM_BUFFER_EXT, offset, singleSize, &colors[8]);
        //the teapot winds backwards
        ...
    }

SubData is only for updating one uniform in the block.
I tried glBufferSubData for everything; the FPS is equal to glBufferData's.

I'm sure…

Really, glBufferSubData and glBufferData are not good solutions.
This sample works, but it's not meant to be used in real applications. Calling glBufferSubData for a single data update would be worse than using glUniform*, which is still possible to do alongside a uniform buffer.
Calling glBufferData is like a "C++ new" in OpenGL; you don't want to do that just to upload your data!

Create and allocate the buffer once with glBufferData, but update it with the MapBufferRange API. Parallel, async, and fine-grained control.

You can actually use a single buffer to pack all your "block per" kinds of data, as long as you keep the uniform groups together.

Example of a single uniform buffer:
128 bytes Per-frame
64 bytes Per-object
16 bytes Per-batch

Don’t forget that GPUs have a memory burst size with a minimum of usually 64 bytes; there is a balance to find to reach good granularity, and that’s why I like the single grouped uniform buffer approach.

And then you can pick up and update just the right amount of bytes with MapBufferRange, even in parallel, as long as the ranges don’t overlap.

Even if you have a huge amount of uniforms, you could use several CPU threads to update the buffer data per block and send that data as you go, in parallel.
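A rough sketch of that single packed buffer, using the 128/64/16-byte groups from the example above (the binding point numbers and the glBindBufferRange layout are my own illustration; real sub-range offsets must be rounded up to GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT, which is often as large as 256 bytes):

```cpp
// One buffer holding all uniform groups back to back.
const GLsizeiptr perFrameSize  = 128;
const GLsizeiptr perObjectSize = 64;
const GLsizeiptr perBatchSize  = 16;

glBindBuffer(GL_UNIFORM_BUFFER, packedUBO);
glBufferData(GL_UNIFORM_BUFFER,
             perFrameSize + perObjectSize + perBatchSize,
             NULL, GL_DYNAMIC_DRAW);

// Expose each group to the shaders through its own binding point.
// NOTE: in real code the offsets below must be padded up to the
// implementation's GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT.
glBindBufferRange(GL_UNIFORM_BUFFER, 0, packedUBO, 0,             perFrameSize);
glBindBufferRange(GL_UNIFORM_BUFFER, 1, packedUBO, perFrameSize,  perObjectSize);
glBindBufferRange(GL_UNIFORM_BUFFER, 2, packedUBO,
                  perFrameSize + perObjectSize,                   perBatchSize);
```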

I'll try MapBufferRange later, thanks…

I have updated the drivers to 191.07 WHQL:

w/o UBO - ~250 FPS
w/ UBO - ~240 FPS

The result is better…


What’s your result with glUniform*?

Do you have data to back up your claim that MapBufferRange is faster than glBufferData for small buffers?

I’m using two UBOs to store per-View and per-Object matrices. These two buffers are no bigger than 240 bytes each. Whenever I need to change one of them, I upload the whole contents via glBufferData. This gives the driver a hint that “the old data is no longer needed”, and if the old contents are still in use, it might use a double-buffering scheme internally to avoid stalling the pipeline.

You can achieve the same effect by calling glBufferData with a NULL data pointer and then glMapBuffer.
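That orphan-then-map pattern, sketched with the buffer and struct names used earlier in this thread:

```cpp
glBindBuffer(GL_UNIFORM_BUFFER, ubo_BlockPerBatch);
// Re-specify the store with a NULL pointer: the driver can hand back fresh
// memory while in-flight draws keep reading the old ("orphaned") store.
glBufferData(GL_UNIFORM_BUFFER, sizeof(BlockPerBatch), NULL, GL_DYNAMIC_DRAW);

void* ptr = glMapBuffer(GL_UNIFORM_BUFFER, GL_WRITE_ONLY);
memcpy(ptr, &blockPerBatch, sizeof(BlockPerBatch));
glUnmapBuffer(GL_UNIFORM_BUFFER);
```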

I’ve tested on a few different drivers, and there’s no clear winner between glBufferData and glMapBuffer. The only significant difference occurs when streaming data, where MapBuffer pulls ahead (i.e. it allows you to write directly to the mapped region and avoid allocating a temporary client-side buffer).

I never actually considered that glBufferData could avoid stalling. How does glBufferSubData affect your performance?

I have seen MapBufferRange used with quite large buffers; that’s why I pack everything in a single buffer, to keep it large enough.
I assume the MapBufferRange “access” parameter gives the hints to the drivers.

For small buffers… when you put all your uniforms in a single uniform buffer, it’s not that small…

(PS: I’m going to dig a bit more into this topic, I’ll let you know with numbers!)

For small buffers… when you put all your uniforms in a single uniform buffer, it’s not that small…

I did not state it clearly enough. I do not have one buffer per object. Instead, I have one buffer that stores the model-view-projection matrix. This UBO has to be changed per object (or better: per draw call that uses a different matrix). The same UBO is shared by all shaders, though. This is my way to compensate for the loss of the gl_ModelViewProjectionMatrix and ftransform() built-ins when I switched over to GL 3.0+.
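That shared-matrix-UBO setup can be sketched as follows (the program list, the block name "Matrices", and binding point 0 are my own placeholders, not from the post):

```cpp
const GLuint MATRIX_BINDING = 0;

// Once, at init: route each program's "Matrices" block to binding point 0.
for (GLuint prog : allPrograms) {
    GLuint idx = glGetUniformBlockIndex(prog, "Matrices");
    if (idx != GL_INVALID_INDEX)
        glUniformBlockBinding(prog, idx, MATRIX_BINDING);
}

// Attach the shared buffer once; every program now reads the same data,
// so updating the matrix UBO per draw call updates it for all shaders.
glBindBufferBase(GL_UNIFORM_BUFFER, MATRIX_BINDING, matrixUBO);
```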

w/o UBO (using glUniform*) - ~250 FPS

First thing: why recompute sizeof(blockPerBatch) with each glBufferData(…sizeof(blockPerBatch)…) call? I would make a single assignment, GLuint sizeof_blockPerBatch = sizeof(blockPerBatch), and then call glBufferData(…sizeof_blockPerBatch…).

Second, do you get a speed/FPS improvement if you use layout(std140) in your shader like

layout(std140) uniform BlockPerBatch
{
	mat4 matLocal;
	mat4 matMVP;
	vec2 uvBase;
	vec2 perlinMovement;
	vec3 localEye;
};

Code w/ UBO:

GLuint uniformBlock_blockPerBatch_id;
GLfloat blockPerBatch[] =
{                    //layout(std140) uniform BlockPerBatch
  1.0,0.0,0.0,0.0,   //mat4 matLocal (identity)
  0.0,1.0,0.0,0.0,
  0.0,0.0,1.0,0.0,
  0.0,0.0,0.0,1.0,
  1.0,0.0,0.0,0.0,   //mat4 matMVP (identity)
  0.0,1.0,0.0,0.0,
  0.0,0.0,1.0,0.0,
  0.0,0.0,0.0,1.0,
  0.0,0.0, 1.0,1.0,  //vec2 uvBase (last 1,1 is filler)
  0.0,0.0, 1.0,1.0,  //vec2 perlinMovement (last 1,1 is filler)
  0.0,0.0,0.0, 1.0,  //vec3 localEye (last 1 is filler)
};
GLuint sizeof_blockPerBatch = sizeof(blockPerBatch);

//convenience map into blockPerBatch
mat4 &matLocal = (mat4&)blockPerBatch[0];
mat4 &matMVP = (mat4&)blockPerBatch[16];
vec2 &uvBase = (vec2&)blockPerBatch[32];
vec2 &perlinMovement = (vec2&)blockPerBatch[36];
vec3 &localEye = (vec3&)blockPerBatch[40];

defineUniformBlockObject(0,"BlockPerBatch",uniformBlock_blockPerBatch_id); // once for all batches

matLocal = ...;
matMVP = ...;
uvBase = ...;
perlinMovement = ...;
localEye = ...;

glBindBuffer(GL_UNIFORM_BUFFER, uniformBlock_blockPerBatch_id);

glBufferData(GL_UNIFORM_BUFFER, sizeof_blockPerBatch, &blockPerBatch, GL_DYNAMIC_DRAW); // don't recompute sizeof() every call!

where the helper defineUniformBlockObject function is

void defineUniformBlockObject(GLuint binding_point, const char *GLSL_block_string, GLuint &uniformBlock_id)
{
 glGenBuffers(1, &uniformBlock_id);

 //"layout(std140) uniform GLSL_block_string"
 GLuint uniformBlockIndex = glGetUniformBlockIndex(shader_id, GLSL_block_string);

 //And associate the uniform block with the binding point
 glUniformBlockBinding(shader_id, uniformBlockIndex, binding_point);

 //Now we attach the buffer to the UBO binding_point...
 glBindBufferBase(GL_UNIFORM_BUFFER, binding_point, uniformBlock_id);

 //We need to get the uniform block's size in order to back it with the
 //appropriate buffer
 GLint uniformBlockSize;
 glGetActiveUniformBlockiv(shader_id, uniformBlockIndex,
                           GL_UNIFORM_BLOCK_DATA_SIZE, &uniformBlockSize);

 //Create the UBO.
 glBindBuffer(GL_UNIFORM_BUFFER, uniformBlock_id);
 glBufferData(GL_UNIFORM_BUFFER, uniformBlockSize, NULL, GL_DYNAMIC_DRAW);
}

I see speed improvement using this over a bunch of separate glUniform* calls. But I haven’t tested it extensively. I would be curious if in your case using “layout(std140) uniform” has any effect.

w/o UBO - ~250 FPS
w/ UBO - ~225 FPS

I know lots of you love FPS as a speed measurement, but you really should look at how much time it takes to render rather than how many renders per second:

w/o UBO: 0.004 seconds
w/ UBO: ~0.0044 seconds

so the difference in render time is, ahem, ~0.4 ms; um, is that even really a difference?

Additionally, the data you have tied to the UBO is not that much: 2 mat4's, 2 vec2's, and 1 vec3 --> 39 floats, not exactly a lot of data.

How big are the meshes being rendered? How many? As the number of draw calls goes up and the number of different shaders goes up, you will find that UBOs will beat glUniform calls, but right now the difference in time is 0.4 ms, which, once you get into the realm of say 60/120 FPS, is not even noticeable in the FPS speed rating.

Good comment on the small block size and the 0.4 ms difference! That difference is probably timer precision error.

Note I use FPS as a measure based on the post on performance measurements.

My overly simple way to measure FPS is to just track the time after each buffer swap (i.e. glXSwapBuffers, or whatever). It might not be perfect for a given frame, but it gives a good overall picture of how much time a typical frame is using.

Edit: after reading that link:
Looking at the link, that is exactly what they say to do. Silly me.

Bad advice…
sizeof() is evaluated at compile time…

Second, do you get a speed/FPS improvement if you use layout(std140) in your shader like

No difference…

~30 batches per frame
max ~130 batches per frame

By the way, I am a bit confused about the std140 layout. It seems MUCH more practical than the regular one, not requiring tons of queries etc. But is it safe to use std140 all the time for UBOs with constantly changing content? Or are there any disadvantages of std140 I should know about?

Tradeoffs with using std140 layout:

  • your code is simpler since layout is known at build time
  • layouts can be more readily known/shared amongst multiple programs
  • the runtime cannot optimize/pack/relocate “dead” uniform slots in cases where the program does not reference all of them. This can happen when “uber shaders” with lots of conditionally activated paths are in play.

  • there may be some data packing opportunities that std140 precludes the runtime from using, which could be vendor- or processor-specific.

So “it depends”.

Look at issues 47/48 in the extension spec for UBO:


I see. So, for instance, when I want to stream instancing data from somewhere, std140 would be beneficial, since I could simply copy the data into the UBO instead of making many small copies to the respective offsets. But for other scenarios, where I for example simply update a model matrix now and then inside the UBO, a packed layout would make more sense. Did I understand this correctly?