glUniform is slow?

Hello. I’ve found that the glUniform calls I make every time I change a material in my engine cause the program to slow down. Currently I’m performing about 4000 material changes, each one with several glUniform calls, and this cuts my performance to about one fourth.

Although I can reduce the number of material changes, I would like to better understand why glUniform is so slow.

My graphics board is a 9800GTX. I’ve heard that NVIDIA drivers do ugly stuff when glUniform is called; is this true?

Can’t tell you exactly WHY it is slow, but I can confirm that it IS indeed very, very slow. I have about the same usage pattern that you do. Usually there are many redundant state changes, since many materials share common parameters. You should try to cache the values and prevent unnecessary glUniform calls; that will greatly improve performance.
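For example, something along these lines (a rough sketch; the names are made up, and a real cache should also be keyed per program, since uniform locations are per-program):


#include <cstring>
#include <map>

struct CachedVec4 { GLfloat v[4]; bool valid; };
static std::map<GLint, CachedVec4> uniformCache; // keyed by uniform location

void setUniform4fvCached(GLint location, const GLfloat value[4])
{
	CachedVec4& entry = uniformCache[location]; // value-initialized on first use, so valid == false
	if (entry.valid && std::memcmp(entry.v, value, sizeof(entry.v)) == 0)
		return; // same value as last time: skip the driver call entirely
	std::memcpy(entry.v, value, sizeof(entry.v));
	entry.valid = true;
	glUniform4fv(location, 1, value);
}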

Jan.

Thanks for the confirmation.
I’ve changed my engine so that it doesn’t change the material when the new material is the same as the last one (should have done that before, duh), and that solved the problem in this case. But this issue still worries me.

I’ve done a simple test on an ATI card and it doesn’t seem to suffer from the same problem. Could this be only an NVIDIA issue?

Could this be only an NVIDIA issue?

nVidia has been known to make “optimizations” where, if you change certain uniforms in certain ways, it recompiles your shader program into a more optimal form. Unfortunately, it’s not much of an optimization when uniform values are constantly changing.

Well, hardware & drivers from different manufacturers do behave differently, of course, so it is very possible that ATI doesn’t suffer as badly from this kind of state change as nVidia does. However, it IS a potentially very hazardous state change on any kind of hardware. You might want to consider caching and preventing state changes on a per-uniform level, not only at the level of whole materials. However, you say you sometimes set the same material that is already in use? Maybe you should first try to sort your data by material, if that is possible.
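Roughly like this, for instance (DrawItem, applyMaterial and drawItem are made-up names, just to show the idea):


#include <algorithm>
#include <cstddef>
#include <vector>

struct DrawItem { int materialId; /* mesh, transform, ... */ };

void applyMaterial(int materialId);   // sets the shader uniforms for this material (engine-specific)
void drawItem(const DrawItem& item);  // issues the actual draw call (engine-specific)

bool byMaterial(const DrawItem& a, const DrawItem& b)
{
	return a.materialId < b.materialId;
}

void renderSorted(std::vector<DrawItem>& items)
{
	// Sort once per frame so that items sharing a material end up next to each other.
	std::sort(items.begin(), items.end(), byMaterial);

	int lastMaterial = -1;
	for (std::size_t i = 0; i < items.size(); ++i)
	{
		if (items[i].materialId != lastMaterial)
		{
			applyMaterial(items[i].materialId); // the only place where glUniform gets called
			lastMaterial = items[i].materialId;
		}
		drawItem(items[i]);
	}
}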

Jan.

I knew that state changes should be kept to a minimum in OpenGL, but I never thought that it would be so important. I’ll definitely have to do the kind of sorting you’re talking about, Jan.
Thanks for all your replies.

That’s why they created GL_EXT_bindable_uniform, except I have to say it sucked. I couldn’t understand how to use it.

The workloads I’m familiar with don’t match up well with EXT_bindable_uniform either, and we have been working on something better for a while now.

That’s good news! EXT_bindable_uniform already looks really good to me.

V-man, I can give you a code sample, it’s not really a big deal … even if it took me a while to figure it out XD.

This is what I don’t get about this problem. Why would it ever be slow? Isn’t the exact purpose of ‘uniforms’ to keep you from binding unique programs with hard-coded constants?

I get the feeling that the only reason glUniform does anything at all is that nVidia wanted to be compliant with the spec, and that it just calls glCompileShader/glAttachObject/glLinkProgram under the hood. Are they just being cheap, or is there some (hardware-)technical reason?
I’m obviously no hardware programmer, but isn’t it just like overwriting a certain location in a memory block (reserved for the bound program) with the uniform value? The API surely seems to suggest so (glGetUniformLocation)…
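The basic pattern the API suggests, for reference ("DiffuseColor" is just an illustrative name):


// A location looks like nothing more than an index into the program's
// constant storage; writing a uniform should conceptually be a small copy.
GLint location = glGetUniformLocation(programName, "DiffuseColor");
glUniform4f(location, 1.0f, 0.5f, 0.0f, 1.0f);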

I agree with your statement 100%. It would be most helpful if anyone with more insight could actually tell us WHY nVidia does the recompiling (and maybe what other reasons there are for this to be slow). I can’t imagine that squeezing out a few instructions is worth the time spent on compiling the shaders (obviously it isn’t).

Jan.

I think that the main reason is programs using one big shader containing many uniform-based if statements (or similar constructs) to implement materials with various features. In that case the compiler might remove a significant amount of code or reduce the number of registers used by the shader, both of which can significantly improve performance.

As for why this is slow: one logical assumption is that if you have a good optimizing compiler for the GLSL language, the easiest way to optimize out part of the program is to change the uniform into a constant and let the optimizer do its job, just as if the constant had originally been written by the shader author.

Just one thought. I haven’t tried it myself, but would it be better if certain values were “packed” into vectors or matrices and sent to the shader with one call? The shader would be messy, sure, but what about performance?
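Something like this, assuming a "uniform vec4 MaterialParams[3];" array on the shader side (all names made up):


// Pack several material values into an array of vec4 and upload them
// with a single call instead of several scalar glUniform calls.
GLfloat packed[12] = {
	1.0f, 0.5f, 0.2f, 32.0f, // diffuse.rgb, shininess
	1.0f, 1.0f, 1.0f,  1.0f, // specular.rgb, opacity
	0.0f, 0.0f, 0.0f,  0.0f  // emissive.rgb, unused
};
GLint location = glGetUniformLocation(programName, "MaterialParams");
glUniform4fv(location, 3, packed);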

That’s good news! EXT_bindable_uniform already looks really good to me.

V-man, I can give you a code sample, it’s not really a big deal … even if it took me a while to figure it out XD.

Go ahead. Where is it?

I doubt that. It can easily be detected whether branching is used or not; if it isn’t, then glUniform should not cause a recompile.

Even in the case of conditional branching, this could possibly be optimized by compiling all conditional combinations and, under the hood, binding different programs based on the uniform values.
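At the application level the same idea looks roughly like this (buildProgram is a hypothetical helper that compiles the shader with the matching #defines instead of runtime uniform branches):


#include <map>

// Cache of specialized programs, keyed by a bitmask of boolean "feature" flags
// that would otherwise be uniform-based if statements in one big shader.
static std::map<unsigned int, GLuint> programVariants;

GLuint buildProgram(unsigned int featureMask); // hypothetical: compiles with matching #defines

GLuint getProgramVariant(unsigned int featureMask)
{
	std::map<unsigned int, GLuint>::iterator it = programVariants.find(featureMask);
	if (it != programVariants.end())
		return it->second; // already compiled: just bind this one

	GLuint program = buildProgram(featureMask); // compile once per combination actually used
	programVariants[featureMask] = program;
	return program;
}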

As for why this is slow: one logical assumption is that if you have a good optimizing compiler for the GLSL language, the easiest way to optimize out part of the program is to change the uniform into a constant and let the optimizer do its job, just as if the constant had originally been written by the shader author.

Would that really make such a big difference? Most of the variables used in shaders are rather dynamic anyway, so why would accessing ‘uniform’ variables be slower than ‘varying’ variables? Constants may always be faster, but that’s irrelevant. The problem is that uniforms are ‘artificially’ slower than varyings (due to the recompile), for reasons only nVidia knows.

From the NV GLSL release notes…

And on linking…

GLSL provides for multiple shader objects to be created, assigned GLSL source text, compiled, be attached to a program object, and then link the program object.
NVIDIA’s current driver doesn’t fully compile shader objects until the program object link. At this time all the source for a single target is concatenated and then compiled.
This means (currently) there is no efficiency from compiling shader objects once and linking them in multiple program objects. Unlike earlier drivers, the code will be parsed and syntax checked during the compile phase to allow the immediate reporting of errors. Some errors may still be deferred until link, but most should be available at compile time.

You might be able to detect a few simple cases where this is a clear yes or no. But there are situations when some calculation is eliminated not because of a simple if statement, but because it is always multiplied by a constant zero (for the current combination of uniform values) resulting from a longer calculation utilizing those uniforms. Plus you cannot take advantage of free operations some hw might have (e.g. multiply by 2), nor optimize calculations which depend only on constants and values of uniforms.

This has the problem that you might end up compiling many combinations which might never be used. That can take a very long time, which is not an option. Recompiling when a new situation is encountered defers that cost until the specific combination is actually needed, at the cost of a stall at that time.

I was talking about why the recompilation is slow, not about access to the uniforms.

That’s good news! EXT_bindable_uniform already looks really good to me.

V-man, I can give you a code sample, it’s not really a big deal … even if it took me a while to figure it out XD.

Go ahead. Where is it?

Here it is: just the sample code next and, at the end, the full thing to build and run on Windows with VC8 (through CMake), but first some comments on it:

  • This code’s purpose is just to show the feature.
  • In real software, the idea of sharing the uniform buffer between a vertex and a fragment shader … maybe not.
  • The uniform buffer data should be considered as raw data, like any buffer. In the sample I use an array of vec4 just to keep it simple.

The vertex shader:


#version 120
#extension GL_EXT_bindable_uniform : enable

struct common
{
	mat4 MVP;
	vec4 Color;
};
bindable uniform common Common;

attribute vec2 Position;

void main()
{	
	gl_Position = Common.MVP * vec4(Position, 0.0, 1.0);
}

The fragment shader:


#version 120
#extension GL_EXT_bindable_uniform : enable

struct common
{
	mat4 MVP;
	vec4 Color;
};
bindable uniform common Common;

void main()
{
	gl_FragColor = Common.Color;
}

Uniform buffer init:


bool CMain::initBindableBuffer()
{
	// Ask the driver how much storage the bindable uniform block needs
	std::size_t const BindableBufferSize = glGetUniformBufferSizeEXT(programName, uniformLocation);

	// bindableData is an array of vec4: MVP occupies elements 0..3, Color is element 4
	bindableData[4] = glm::vec4(1.0f, 0.5f, 0.0f, 1.0f);

	// Create the buffer object and upload the initial data
	glGenBuffers(1, &bindableBufferName);
	glBindBuffer(GL_UNIFORM_BUFFER_EXT, bindableBufferName);
	glBufferData(GL_UNIFORM_BUFFER_EXT, BindableBufferSize, &bindableData[0][0], GL_STATIC_DRAW);
	glBindBuffer(GL_UNIFORM_BUFFER_EXT, 0);

	return true;
}

Get uniform buffer structure location:


uniformLocation = glGetUniformLocation(programName, "Common");

Uniform buffer use:

	
// Update just the MVP part of the buffer, then attach the buffer to the program's uniform
glBindBuffer(GL_UNIFORM_BUFFER_EXT, bindableBufferName);
glBufferSubData(GL_UNIFORM_BUFFER_EXT, 0, sizeof(glm::mat4), &bindableData[0][0]);
glUniformBufferEXT(programName, uniformLocation, bindableBufferName);
glBindBuffer(GL_UNIFORM_BUFFER_EXT, 0);

The working code is here:
http://groove.g-truc.net/g-tut-pack-ogl-dev.7z

Look at the sample called “ogl2x-buffer-bindable”

It uses CMake but has been tested only with VC8 … There is a lot more than just this bindable uniform sample, but first it’s still under development and second it’s not really easy to extract a single sample the way it’s designed.

Interesting points Komat.

While that is true, it is still questionable whether compiling and linking is faster than a few dummy instructions, as Jan said.

And it should be the job of the application/shader programmer to optimize for these situations.

This might not be just a few dummy instructions. For example, if the optimization decides that some calculation will not influence the output, it might remove all texture sampling, varyings or even hw uploads of uniforms which are only used by that calculation. For some hw it might also free registers which would otherwise be reserved, and gain improved parallelism as a result.

If you expect that the combination will be used over many pixels and many frames, then the gain might be significant and worth some compilation time. The problem is that the programmer does not have control over when it happens, so it usually happens at a bad time and for uniforms which have a wide range of possible values.

And it should be the job of the application/shader programmer to optimize for these situations.

I agree. At the very least there should be an opt-out mechanism from recompilations which are not caused by hw limitations, for those who attempt to prepare specialized shaders for the most common situations.