Uniform Buffer Objects vs. Classic Uniforms Performance


I’m currently experimenting with some of the more recent OpenGL features, hoping to speed up my engine’s rendering code a bit. One thing I’ve tried is replacing the classic glUniform* calls with uniform buffer objects (UBOs). I expected it to be faster to keep the uniform data in a fixed buffer and just bind that buffer for every object each frame. I’m using the following code for testing:

void DefaultRenderer::renderStaticMeshes(list<StaticRenderingMeshEntry>::iterator beg, list<StaticRenderingMeshEntry>::iterator end)
{
    GLuint currentVAO = 0;

#ifndef TEST_USE_UBO
    // Classic path: material reflection factors are set once per batch.
    glUniform4f(renderProgramLocations->matAmbientReflectionUniform, 1.0f, 1.0f, 1.0f, 1.0f);
    glUniform4f(renderProgramLocations->matDiffuseReflectionUniform, 1.0f, 1.0f, 1.0f, 1.0f);
    glUniform4f(renderProgramLocations->matSpecularReflectionUniform, 1.0f, 1.0f, 1.0f, 1.0f);
#endif

    for (list<StaticRenderingMeshEntry>::iterator it = beg ; it != end ; it++) {
        StaticRenderingMeshEntry& entry = *it;
        StaticRenderingMesh* smesh = entry.mesh;

        int vcOffs = smesh->getVertexColorOffset();

        GLuint vao = smesh->getVertexArrayObject();

        // Only rebind the VAO when it actually changes.
        if (vao != currentVAO) {
            glBindVertexArray(vao);
            currentVAO = vao;
        }

        // Per-object matrices: one slice of a shared UBO (used in both modes).
        glBindBufferRange(GL_UNIFORM_BUFFER, 0, matrixUBOs[currentMatrixUBOIdx], entry.id * matrixUBOObjectSize, 3*sizeof(Matrix4));

#ifdef TEST_USE_UBO
        // UBO path: per-mesh material data lives in the mesh's own UBO.
        glBindBufferBase(GL_UNIFORM_BUFFER, 1, smesh->getUniformBufferObject());
#endif

#ifndef TEST_USE_UBO
        glUniform1i(renderProgramLocations->vertexColorsUniform, (vcOffs != -1) ? 1 : 0);
#endif

        GLuint tex = smesh->getTexture();

        if (tex != 0) {
            glBindTexture(GL_TEXTURE_2D, tex);
#ifndef TEST_USE_UBO
            glUniform1i(renderProgramLocations->texturedUniform, 1);
            glUniform1i(renderProgramLocations->textureUniform, 0);
#endif
        } else {
#ifndef TEST_USE_UBO
            glUniform1i(renderProgramLocations->texturedUniform, 0);
#endif
        }

#ifndef TEST_USE_UBO
        uint8_t r, g, b, a;
        smesh->getMaterialColor(r, g, b, a);

        if (globalAmbientLightEnabled) {
            // The uniform location for this call is not shown in the post.
            glUniform4f(/* location not shown */,
                    r / 255.0f, g / 255.0f, b / 255.0f, a / 255.0f);
        }
#endif

        GLenum polyMode;

        switch (smesh->getPrimitiveFormat()) {
        case RenderingPrimitiveTriangles:
            polyMode = GL_TRIANGLES;
            break;
        case RenderingPrimitiveTriangleStrip:
            polyMode = GL_TRIANGLE_STRIP;
            break;
        case RenderingPrimitiveLines:
            polyMode = GL_LINES;
            break;
        case RenderingPrimitivePoints:
            polyMode = GL_POINTS;
            break;
        }

        glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, smesh->getIndexBuffer());

        glDrawRangeElements(polyMode, 0, smesh->getVertexCount()-1, smesh->getIndexCount(), GL_UNSIGNED_INT, (void*) 0);
    }
}
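
A side note on the glBindBufferRange call above: for GL_UNIFORM_BUFFER, the offset has to be a multiple of GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT, so a per-object stride such as matrixUBOObjectSize is normally the payload size rounded up to that alignment. The following is only a minimal sketch of that rounding. alignUp is a hypothetical helper, Matrix4 is assumed to be 16 floats, and the alignment of 256 is an assumed example value (real code would query it with glGetIntegerv).

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: round `size` up to the next multiple of `alignment`.
// `alignment` must be greater than zero.
constexpr std::size_t alignUp(std::size_t size, std::size_t alignment)
{
    return (size + alignment - 1) / alignment * alignment;
}

// Assumption: three 4x4 float matrices per object, as in 3*sizeof(Matrix4).
constexpr std::size_t kMatrixPayload = 3 * 16 * sizeof(float);

// Assumption: example alignment; query GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT
// at runtime instead of hard-coding it.
constexpr std::size_t kAssumedAlignment = 256;

// Stride between consecutive objects in the shared matrix UBO.
constexpr std::size_t kObjectStride = alignUp(kMatrixPayload, kAssumedAlignment);

// Byte offset of the object with the given id.
constexpr std::size_t objectOffset(std::size_t id)
{
    return id * kObjectStride;
}
```

With these assumed values, each object occupies a 256-byte slot even though only 192 bytes are used, and every offset passed to glBindBufferRange stays correctly aligned.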

The interesting thing here is not the UBO that is bound with glBindBufferRange, but the one that is bound with glBindBufferBase. When I run this code with TEST_USE_UBO defined, my program runs at about 460 FPS. Without TEST_USE_UBO defined, it runs at about 480 FPS. If I just remove the glBindBufferBase call in TEST_USE_UBO mode (which obviously produces wrong results, but should still allow for some measurements), it goes up to 500 FPS.
Is it normal for glBindBufferBase to be this slow? If so, should I just stick with the old glUniform* calls instead of UBOs for small amounts of uniform data like I have here? I suppose UBOs become more efficient as the amount of uniform data grows. Or am I doing something wrong, or is there something I could optimize?
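
Since FPS is not a linear scale, the figures above are easier to compare as frame times. This is just a small sketch of that arithmetic; frameTimeMs and frameCostMs are hypothetical helper names, and the only inputs are the FPS values quoted above.

```cpp
#include <cassert>
#include <cmath>

// Convert frames per second into milliseconds per frame.
constexpr double frameTimeMs(double fps)
{
    return 1000.0 / fps;
}

// Extra time per frame (in ms) of the slower configuration
// compared to the faster one.
constexpr double frameCostMs(double slowerFps, double fasterFps)
{
    return frameTimeMs(slowerFps) - frameTimeMs(fasterFps);
}

// 460 FPS (UBO path)                 -> about 2.17 ms per frame
// 480 FPS (glUniform* path)          -> about 2.08 ms per frame
// 500 FPS (UBO path without binding) -> 2.00 ms per frame
```

Read this way, the glBindBufferBase calls account for roughly 0.17 ms per frame in total (460 vs. 500 FPS), and the UBO path is about 0.09 ms per frame slower than the glUniform* path (460 vs. 480 FPS).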

I’ve tested this on x86_64 Linux with the Nvidia driver 319.49 on a GTX 460.

Thanks in advance!