While reading posts here as well as on gamedev, I often find two kinds of posts regarding shader conditionals:
one recommends combining all shaders into one megashader and then configuring it, similar to the fixed pipeline;
the other reports that adding a single conditional statement (i.e. an if or a switch) to a shader spoiled performance considerably. This comes up often when people are discussing mobile platforms.
Obviously both approaches are correct, but they result in different performance. Are there rules of thumb for how many conditionals are too many in a shader? Or how can I find out when the cost of switching a shader (which, I suppose, is fixed) outweighs the cost of a certain number of conditionals (whose individual cost is probably not fixed)?
Shader instances are usually run in lockstep on groups of shader cores (16, 32, 64 wide). If the condition does not evaluate to the same value on all these instances, both branches are executed: the stores of the "if" branch are masked out for those invocations that evaluated the condition as "false", and the stores of the "else" branch (if one exists) are masked out for those invocations that evaluated the condition as "true".
So the main point is that conditionals that are coherent across shader invocations are faster.
What interesting insights! Mobile GPUs probably don't have that many cores, hence the poor performance of conditionals.
Suppose one writes a mega-shader. Is it useful to avoid unnecessary changes to uniform variables? So far, I haven’t done that, as I consider uniform changes to be cheap and it would complicate batching. I simply update all the uniforms after switching.
As was already mentioned, you've misunderstood me.
Anyway, conditionals have poor performance on mobile GPUs because they use an architecture much more similar to early Shader Model 2.0 desktop GPUs, which also suffered from poor performance when using conditionals because the condition evaluation stalled the cores until the results were available.
This is not really an issue on modern desktop GPUs, because one core can have several shader instances active. If one of these instances is waiting for, e.g., the result of a texture fetch or the evaluation of a conditional, another instance that has nothing to wait for can take its place. This is the so-called "latency hiding" mechanism implemented in modern GPUs.
Uniform variable changes are not that lightweight, especially when they come in big numbers (that’s the reason we have uniform buffers now), but they are not the only reason to use a mega-shader. Also, with Shader Model 5.0 hardware we have shader subroutines which work like function pointers that allow splitting the execution path of shaders without conditionals.
Anyway, when and how to use conditionals heavily depends on the target GPU generation, whether we are talking about desktop GL or GL ES, and many other things. On mobile GPUs I would rather not use conditionals as of now, but if you target, e.g., GL3+ capable hardware, I would not worry that much about when to use conditionals, as sometimes they can be advantageous if they let you skip a few expensive operations (e.g. in the case of skeletal animation, skipping the calculations for bone matrices that would have zero weight anyway).
Thank you both for the clarification. I knew about the problem, but I think it is not related to the hardware architecture, but rather to the driver implementation. I have to admit that SM 3.0 cards have a smaller number of registers, and that probably led to some kind of optimization.
The current state of NVIDIA drivers concerning optimization is quite bright. Last night I tested uniform usage, and here are the conclusions:
drivers eliminate uniform changes if there are no draw calls for the current program (shader),
drivers also eliminate superfluous uniform setups (setting a uniform to the same value).
Everything is tested on 8600M GT graphics card (SM 4.0) with R266 drivers.
Several months ago I took some time to optimize a shader of mine with lots of trigonometric function calls, by caching calculated values and removing repeated calculations for the same inputs. Can you guess what performance boost I achieved? None! The GLSL compiler is highly optimized. I wonder what would happen if we could disable optimizations in both the GLSL compiler and the drivers. It would be much harder for us programmers, but maybe we could squeeze out a little more performance and avoid potential bugs in the drivers' optimizations. But currently I have no objections to NVIDIA's driver optimizations.
I am fairly certain it is a hardware limitation in that generation of NVIDIA hardware, as I have worked a lot on the PS3, which uses the same chip.
Basically, there are no uniform registers on that generation of pixel shader hardware, only inline constants that have to be updated in the shader code itself.