glUniform is slow?

Certainly there are situations where compiling with a known uniform value can optimize a lot of code away. However, such situations should be easy to detect. Just count the number of ALU and TEX instructions inside a branch, decide whether it is actually worth optimizing away, and finally detect which uniforms (if any) the branch depends on and tag THOSE uniforms. This should allow the driver to recompile a shader only when a tagged uniform CHANGES (the driver should detect which values are already known and cache the results). This might result in a few slower frames when the shader is first used, but shortly thereafter it should run at full speed.
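
To make the caching idea above concrete, here is a minimal sketch in C of a cache of specialized shader variants keyed by the values of the tagged uniforms. This is not how any real driver is known to work; ShaderCache, lookup_variant and compile_variant are hypothetical names, and the "compile" is just a counter standing in for the expensive step.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_TAGGED   4   /* tagged uniforms per shader (assumed limit) */
#define MAX_VARIANTS 8   /* cached specialized variants per shader (assumed limit) */

typedef struct {
    float key[MAX_TAGGED];   /* values of the tagged uniforms */
    int   binary_id;         /* stand-in for a compiled shader binary */
} Variant;

typedef struct {
    size_t  tagged_count;    /* how many uniforms were tagged for this shader */
    Variant variants[MAX_VARIANTS];
    size_t  variant_count;
    int     compile_count;   /* number of "recompiles" performed so far */
} ShaderCache;

/* Hypothetical stand-in for the expensive specializing compile. */
static int compile_variant(ShaderCache *c)
{
    return ++c->compile_count;
}

/* Returns the variant to use for the given tagged-uniform values and
   compiles only when this combination has not been seen before.
   Keys are compared bitwise, so 0.0 and -0.0 count as different keys. */
static int lookup_variant(ShaderCache *c, const float *tagged_values)
{
    assert(c->tagged_count <= MAX_TAGGED);
    size_t bytes = c->tagged_count * sizeof(float);

    for (size_t i = 0; i < c->variant_count; ++i) {
        if (memcmp(c->variants[i].key, tagged_values, bytes) == 0)
            return c->variants[i].binary_id;   /* cache hit, no recompile */
    }

    int id = compile_variant(c);               /* cache miss */
    if (c->variant_count < MAX_VARIANTS) {     /* remember it if there is room */
        Variant *v = &c->variants[c->variant_count++];
        memcpy(v->key, tagged_values, bytes);
        v->binary_id = id;
    }
    return id;
}
```

With a scheme like this, changing an untagged uniform never touches the cache, and a tagged uniform that flips between a handful of values costs one compile per distinct value.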

Right now it seems that NVIDIA’s driver simply recompiles the shaders every time ANY uniform is changed. I might be wrong, but at least it looks that way.

Jan.

And what if there is no branch?


float base = SomeComplexCalculation() ;
float exponent = SomeCalculationInvolvingUniforms() ;
float value = pow( base, exponent ) ;
gl_FragColor = vec4( value ) ;

If some uniforms cause the exponent to become a constant zero, then the pow can return a constant 1.0 (the 0^0 case is undefined in GLSL, so 1.0 is a valid result) and the entire calculation collapses to


gl_FragColor = vec4( 1.0 ) ;

“Right now it seems that NVIDIA’s driver simply recompiles the shaders every time ANY uniform is changed. I might be wrong, but at least it looks that way.”

It seems like it recompiles when it encounters a new combination of uniform values in a program object, although there might be some cases where it does nothing.

Hey Jan, are you seeing this on a GeForce 8+ card?

Just checking. We’ve definitely been bitten hard by this thing on GeForce 7 cards, but not 8 yet. In a previous thread in May '07, Jackis said this problem only afflicts GeForce 6 & 7 (NV40-class cards), not NV30 or G80+. Just want to make sure everyone’s collective evidence is still supporting this assertion.

I am working on a GeForce 9600 (Windows) and sometimes on a Quadro 8800 (or whatever that one is called) (Linux); both are slowed down tremendously when glUniform is called too often. We are only using about 30 different shaders; some do contain ifs, but not many (one or two at most). If the driver were to cache already known configurations, I think it would have all of them after a few frames. But instead I got a 30% speed decrease with my first (naive) implementation. Now I cache uniform changes myself to prevent this decrease.

Jan.

Ouch! Thanks for the info. Question is whether it’s the CPU overhead of setting uniforms, or whether there’s actually shader recompiles going on there.

In our most recent case on 7950GTs, it was definitely a first-time-render slowdown of up to 124 ms, incurred in the first draw call that used a shader after new uniform values had been set. We had already prerendered with the referenced textures and with those shaders before, albeit with different non-sampler uniform values. The number of uniforms involved here is tiny (< 10). Forcibly rendering with those shaders after each major uniform update (a really nasty kludge), at a point where we could afford the hideous resulting freeze, did move the time out of user interaction/flight mode.

We have not seen this behavior on GeForce 8+ cards…yet. But sounds like we should start actively looking for this.

I find it hard to believe just pushing < 10 uniforms down into constant register spots in the hardware could cost anywhere near 124ms, unless the shader was being recompiled inside GL. But that’s a guess.

It might be a lame question:

I have to switch between several shaders during the rendering of every frame (all of them compiled and linked; the switch involves only glUseProgram).

Do I need to reinitialize the values of the uniform variables every time or do they ‘hold’ their previously set values?

Thanks.

Each shader program has its own memory to store the value of its uniform variables. Once a uniform value is set for a shader program, it doesn’t change until you modify it again.

Wow, thanks. I am glad that I asked, because I thought I had to reinitialize my uniform variables every time I select a shader. I hadn’t noticed any difference in speed, so I thought that all was well.

I think that I probably should ‘mirror’ the uniform variables in PC RAM and only change the real ones in shader memory if there is a change.

Thanks once again.
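
The "mirror" idea from the post above can be sketched roughly as follows, in C. It assumes one mirror per program object, float uniforms only, and an upload callback that would wrap glUniform1f in a real application; UniformMirror, mirror_set1f and count_upload are made-up names, and count_upload is only a counting stub so the sketch runs without a GL context.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_LOCATIONS 64   /* assumed upper bound on uniform locations */

/* In a real app this would wrap glUniform1f(location, value). */
typedef void (*UploadFn)(int location, float value);

typedef struct {
    float    value[MAX_LOCATIONS];   /* last value sent for each location */
    bool     valid[MAX_LOCATIONS];   /* false until the first upload */
    UploadFn upload;
} UniformMirror;                     /* one instance per program object */

/* Counting stub standing in for the real GL call. */
static int g_upload_calls = 0;
static void count_upload(int location, float value)
{
    (void)location;
    (void)value;
    ++g_upload_calls;
}

/* Sets a float uniform through the mirror.
   Returns true if the value was actually uploaded. */
static bool mirror_set1f(UniformMirror *m, int location, float value)
{
    if (location < 0)                /* -1: the uniform is not active */
        return false;
    assert(location < MAX_LOCATIONS);

    if (m->valid[location] && m->value[location] == value)
        return false;                /* unchanged, skip the upload */

    m->value[location] = value;
    m->valid[location] = true;
    m->upload(location, value);
    return true;
}
```

Because each program object keeps its own uniform values, the mirror has to be per program as well, and it should be cleared whenever the program is relinked, since linking resets the uniforms.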

In my opinion it MUST be slow. Why? Because uniform variables are sent from the CPU to the GPU for each primitive. If you don’t use uniforms, the VS and FS do not have to wait for external variables and they work with the graphics card’s own memory. PCI Express has improved this a lot compared with AGP, but it should still be inefficient. If you can avoid it, don’t use uniform variables, or use built-in variables instead.

I am almost sure about this; I will check my documentation later to confirm it…

“Because uniform variables are sent from the CPU to the GPU for each primitive.”

That is entirely wrong. Uniforms are sent to the GPU once when you update them. “built-in variables” are mostly uniforms under the hood, as well, except for the per-vertex attributes (which are attributes…). The performance is the same, whether you read out some user defined uniform or for instance gl_ModelViewMatrix. Uniform UPDATES (not their usage!) are slow for several reasons, one being that they can significantly stall the pipeline. Fortunately, your personal opinion has no influence on their speed.

Jan.

The reality is that it varies with hardware generation. GL3 level hardware can directly fetch uniform values from GPU memory, older generations cannot (they have a register file which must be injected with data per batch, if the uniforms have changed from batch to batch). This is my best understanding of it at present.

You may be right, but it was not opinion but experience. In a shader program I changed three uniform variables in the fragment shader for three uniform variables in FS and three varying variables to VS and the performance doubled with AGP and it was better (not too much) with PCI Express. The values of the variables vary for each primitive. From this I concluded what I said.

Then could I say:

“Because a uniform variable is sent from the CPU to the GPU for each update.”

Would that be correct?

I don’t understand the following:

“The performance is the same, whether you read out some user defined uniform or for instance gl_ModelViewMatrix. Uniform UPDATES (not their usage!)”

You are saying the important thing is the update and not the read. If this is true, GLSL performance should be much worse than GC since GLSL have a lot of built-in uniform variables and they are loaded independently if they are read or not… do you follow me? or maybe only are updated the built-in uniforms that will be used?

This is really important, so it would be good to understand it well.

" In a shader program I changed three uniform variables in the fragment shader for three uniform variables in FS and three varying variables to VS" - I honestly don’t know what you mean by that; please try to explain it more clearly.

It is quite possible that performance on a graphics card with PCI-Express is better than on a card with AGP. However, you cannot conclude that uniforms are the only thing that gets faster with PCI-Express.

When the uniform values vary for each PRIMITIVE (e.g. triangle), then uniforms are definitely the wrong thing to use anyway. Use vertex attributes for such cases; they are MEANT to vary for each primitive, whereas uniforms are meant to stay constant for a BIG batch of geometry (like 100 triangles and more).

If your uniforms vary for each primitive, that means you need to send each primitive with its own draw call (or even with immediate mode). If so, you are using the slowest possible way to render things anyway (and in such cases PCI-Express might indeed give you a lot more speed, but it will still be slow).
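
As a rough illustration of moving per-primitive values from uniforms into vertex attributes, the sketch below (plain C, no GL calls) copies one parameter per triangle onto each of that triangle's three vertices and produces an interleaved array. The layout (x, y, z, param), the restriction to triangles and the name expand_per_primitive are assumptions made for the example; in a real application the resulting array would be handed to the vertex array or VBO setup.

```c
#include <assert.h>
#include <stddef.h>

#define VERTS_PER_TRI     3
#define FLOATS_PER_VERTEX 4   /* x, y, z, plus one per-primitive parameter */

/* positions: tri_count * 3 vertices * 3 floats (x, y, z)
   params:    one float per triangle (the former per-primitive uniform)
   out:       room for tri_count * 3 * FLOATS_PER_VERTEX floats
   Returns the number of floats written to out. */
static size_t expand_per_primitive(const float *positions, const float *params,
                                   size_t tri_count, float *out)
{
    assert(positions && params && out);
    size_t n = 0;

    for (size_t t = 0; t < tri_count; ++t) {
        for (size_t v = 0; v < VERTS_PER_TRI; ++v) {
            const float *p = positions + (t * VERTS_PER_TRI + v) * 3;
            out[n++] = p[0];
            out[n++] = p[1];
            out[n++] = p[2];
            out[n++] = params[t];   /* same value on all three vertices */
        }
    }
    return n;
}
```

All triangles can then be drawn in one batch with no glUniform calls in between, and the shader reads the parameter as an attribute instead of a uniform.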

“Because a uniform variable is sent from the CPU to the GPU for each update.”

Yes, that’s true, at least on “modern” GPUs, as Rob mentioned above.

“If this is true, GLSL performance should be much worse than GC since GLSL have a lot of built-in uniform variables and they are loaded independently if they are read or not… do you follow me? or maybe only are updated the built-in uniforms that will be used?”

By GC you mean NVIDIA’s Cg, I guess? GLSL compilers (just like the Cg compiler) analyze the code for uniform usage. Uniforms that are never used (or are used but can be optimized away) will not be provided by the driver (the compilers are pretty smart). You can try that out for yourself: put a uniform into a shader, but don’t access it at all. In your app, query the driver for the uniform’s location (glGetUniformLocation, I think). It will return -1 (i.e. “not existing”) because the compiler optimized the uniform away.

Of course it also matters how many uniforms you use: a shader that accesses 10 uniforms will be faster than one that accesses 50. But this is only a minor problem compared with the time it takes to update those uniforms. That is, a shader that reads 50 uniforms that NEVER change might actually perform much faster than a shader that accesses only 10 uniforms that change all the time.

Hope that helps you,
Jan.

I meant that in the beginning I had a program where my three uniform variables were sent to the FS where I used them for my operations.

Later I built a program that produces the same final result but performs better. I send my three uniform variables to the VS. In the VS I copy the values of the uniform variables into varying variables. I use the varying variables in the FS for my operations.

The values of these uniform variables vary per primitive but not per vertex (by primitive I mean each figure that is between glBegin and glEnd). I can’t put more than one figure in a begin/end block.

With that information, do you still think I am using the slowest possible mode?

You are surely right. The card with AGP is GL 2 and the card with PCI-E is GL 3. Then…

I don’t understand the part about “they have a register file”… Where is it? In the CPU? The GPU? Why is it slower to fetch from a register file? If you know a reference that explains it, I would appreciate it.

Thanks for your time.

Well, as you mentioned, you use “glBegin / glEnd” to render polygons. That’s “immediate mode”. It is the easiest, but by far the slowest, way to render things. You should take a look at “vertex arrays”, especially using the “vertex buffer object” extension (VBO). It will give you a huge speedup.

Jan.