Is it possible to automatically combine several shaders together?

rongguodong · May 27, 2015, 9:46am

I see that the traditional fixed-function pipeline is deprecated in the new OpenGL core profile, and everything is done by shaders. However, it seems quite difficult to deal with different situations. For example, I may want to render an object with or without texture, with or without lighting effects. In the old days, we could simply enable/disable the corresponding features. With the new OpenGL, it seems we have to either write many shaders (one for each possible combination), or write a so-called ubershader with lots of switches or #ifdef.

An ubershader seems quite difficult to maintain and is not efficient. But the many-small-shaders approach seems impractical if I have many features: if I have n features to be turned on or off, there would be 2^n possible combinations, and I would need to create 2^n shaders for them.

Is there a better way to do that? Is it possible to automatically combine several shaders together? For example, I may write a simple shader to have texture only, and another simple shader to have lighting effect only. If I want to have both texture and lighting effect, I can combine them together and do not need to write a new shader for that. Is it possible?

If yes, how do I do it? If not, what is a better way?

GClements · May 27, 2015, 10:07am

No, you cannot automatically “merge” shaders.

Often, it’s possible to “disable” an operation simply by the choice of parameters. E.g. binding a solid white texture (which need only be 1x1) has the same effect as disabling texturing. Disabling all lights and setting the ambient intensity to 1 has the same effect as disabling lighting.

For operations which lack any kind of identity value, you can just conditionalise the operation on a boolean uniform variable (effectively implementing the equivalent of a glEnable/glDisable option).
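A rough way to see why both suggestions work, as a sketch rather than real shader code: if the shading is modelled as a product of a base colour, a texel and a light level, then a white texel and a light level of 1.0 are identity values, and anything without an identity value can sit behind a boolean toggle. The function names and the `invert` operation below are made up for illustration.

```python
# Toy model of fixed-function-style modulation: colour = base * texel * light.
# Multiplying by 1.0 is the identity, so a white texel or a light level of 1.0
# "disables" that stage without a separate shader.

WHITE_TEXEL = (1.0, 1.0, 1.0)  # stands in for a 1x1 solid white texture


def shade(base, texel=WHITE_TEXEL, light=1.0):
    """Component-wise modulate, roughly what a simple fragment shader does."""
    return tuple(b * t * light for b, t in zip(base, texel))


def apply_optional(colour, enabled, op):
    """Mirror of `if (u_enabled) colour = op(colour);` on a boolean uniform."""
    return op(colour) if enabled else colour


def invert(colour):
    """A hypothetical operation with no convenient identity parameter."""
    return tuple(1.0 - c for c in colour)


base = (0.25, 0.5, 0.75)
plain = shade(base)                               # texturing and lighting "off"
textured = shade(base, texel=(0.5, 0.5, 0.5))     # texturing "on"
untouched = apply_optional(base, False, invert)   # toggle off: colour passes through
```

In a real program the same idea would be a single shader plus a choice of bound texture and uniform values per draw call.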

rongguodong · May 27, 2015, 10:50am

Hi GClements,
First, thanks for your reply! What you described is just the ubershader. For simple cases with few features (e.g. texture and lighting as above), it may be fine. But if I have lots of features, the ubershader will either have lots of unnecessary computations (if using identity values) or be inefficient (if using condition switches).

I found a set of slides online that describes exactly what I want. The slides are from Eurographics 2007, eight years ago! Do we have something similar now, eight years later?
(It seems I cannot post URL in my reply. The online slides can be found by simply searching its title “Automated Combination of Real Time Shader Programs”)

Alfonse_Reinheart · May 27, 2015, 10:59am

Do you have proof of that?

I hear lots of people saying that uniform-based conditional branches are slow and so forth. Have you actually measured the difference?

If you have so much code in your shader, with thousands of opcodes executed along any particular codepath, what makes you think that having a couple of uniform conditional branches is going to be the decisive factor between being fast and being slow? Is this merely a supposition, or do you have the profiling data to back it up?

There is a difference between conditional branches based on uniforms and conditional branches based on input values. If a branch is based on input values, then it is possible for different invocations to follow different codepaths. Therefore, any such condition could cause wavefronts to be split in order to resolve the results.

Uniform conditions do not. Every invocation of that shader stage for that rendering command will execute the exact same codepath (with respect to uniform conditions). No such splitting need occur. And therefore, the cost of the branch will purely be the cost of a branch. Which will be relatively minimal.

GClements · May 27, 2015, 2:46pm

On the other hand, switching between many different shaders also has overhead.

The preferred approach for performance-sensitive applications seems to be to use a few generalised shaders (i.e. “ubershader”) rather than many specialised shaders. But such code also maximises the benefits of that approach by rendering large amounts of geometry with few draw calls. If you’re going to be splitting draw calls even when they use (or could use) the same shader, the balance may change.

[QUOTE=rongguodong;1266671]
I found a set of slides online that describes exactly what I want. The slides are from Eurographics 2007, eight years ago! Do we have something similar now, eight years later?
(It seems I cannot post URL in my reply. The online slides can be found by simply searching its title “Automated Combination of Real Time Shader Programs”)[/QUOTE]
That’s basically the ubershader approach, but with preprocessor conditionals (either the GLSL preprocessor or a custom preprocessor) rather than run-time conditionals.

Note that the implementation may do this (dynamic shader re-compilation) automatically for shaders which include conditions based upon uniform expressions. This was quite common on older hardware which didn’t provide branch instructions, but there isn’t really much need on modern hardware.

And that still has the problem that n features may generate up to 2^n distinct shaders. Having the implementation do this automatically wouldn’t change that (shader generation is sufficiently complex that you wouldn’t want to regenerate them every frame).
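To make the count concrete, here is a small sketch that enumerates every on/off combination of a feature list and builds the `#define` prelude that a preprocessor-based ubershader would be compiled with. The feature names are hypothetical, and in real GLSL the prelude would have to come after the `#version` line.

```python
from itertools import combinations

FEATURES = ["TEXTURE", "LIGHTING", "FOG"]  # hypothetical feature names


def all_variants(features):
    """Yield every subset of features; there are 2**len(features) of them."""
    for k in range(len(features) + 1):
        for subset in combinations(features, k):
            yield subset


def define_block(subset):
    """Preprocessor prelude to paste before a shared #ifdef-based shader body."""
    return "".join(f"#define {name} 1\n" for name in subset)


variants = list(all_variants(FEATURES))
print(len(variants), "variants for", len(FEATURES), "features")
```

With three features that is already eight distinct sources; each extra feature doubles the count.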

If there’s a case where the branches matter, it will be where you have a complex shader but with most of the features disabled. For complex shaders with most features enabled, computation will dominate; for simple shaders, memory bandwidth will dominate.

rongguodong · May 27, 2015, 2:56pm

Hi Alfonse,

Thanks for your comments! Your argument sounds quite reasonable, and I do not have experimental numbers. I have just read lots of people saying that branching, particularly dynamic branching where the condition is evaluated at run time, is generally not good for shaders. The other ubershader approach is to use #ifdef to achieve “static branches”. However, for both approaches, I have to write a huge shader with complicated logic covering all possible combinations of conditions. I think the scenario described in the Eurographics slides is ideal: we can write lots of simple shaders and have the compiler combine them. But it seems that is not possible with current OpenGL?

mhagain · May 27, 2015, 4:18pm

This is more of a theoretical problem than a practical one. In reality you’re going to find that with n features you’re not going to need every single combination, and that many combinations don’t actually make sense to be used together. One viable approach is run-time generation of shaders from shader fragments, coupled with a caching mechanism to check if a requested combination has been created before (and just return the previously created program object instead of creating it again). glShaderSource is essentially designed for this kind of usage, and I understand that this is what Unity (at least in older versions) does for its own fixed pipeline emulation. If you want to speed things up you can pre-generate the most common combinations during startup (you’ll probably find that there are 2-3 combinations that cover maybe 95% of what you need to draw), then lazily generate the least common ones as required.
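A minimal sketch of that fragment-plus-cache idea, assuming a fragment shader assembled from one-line GLSL snippets. The class, the snippet table and the `compile_fn` callback are all hypothetical; in a real program `compile_fn` would wrap glShaderSource, glCompileShader and the program link, and the cache would store program object names.

```python
FRAGMENTS = {  # hypothetical feature name -> GLSL statement
    "texture": "    colour *= texture(u_tex, v_uv);\n",
    "lighting": "    colour.rgb *= u_light;\n",
}

HEADER = (
    "#version 330 core\n"
    "uniform sampler2D u_tex;\n"
    "uniform vec3 u_light;\n"
    "in vec2 v_uv;\n"
    "out vec4 frag_colour;\n"
    "void main() {\n"
    "    vec4 colour = vec4(1.0);\n"
)
FOOTER = "    frag_colour = colour;\n}\n"


class ProgramCache:
    """Builds shader source from fragments on demand and memoises the result."""

    def __init__(self, fragments, compile_fn):
        self._fragments = fragments
        self._compile = compile_fn   # stand-in for the real GL compile/link step
        self._programs = {}          # frozenset of features -> program handle

    def source_for(self, features):
        # Sort so the same feature set always yields the same source text.
        body = "".join(self._fragments[f] for f in sorted(features))
        return HEADER + body + FOOTER

    def get(self, features):
        key = frozenset(features)
        if key not in self._programs:          # compile only on first request
            self._programs[key] = self._compile(self.source_for(key))
        return self._programs[key]

    def prewarm(self, combos):
        """Compile the most common combinations up front, e.g. at startup."""
        for combo in combos:
            self.get(combo)


compile_calls = []


def fake_compile(source):
    """Records the source and returns a fake handle instead of calling GL."""
    compile_calls.append(source)
    return len(compile_calls)


cache = ProgramCache(FRAGMENTS, fake_compile)
a = cache.get({"texture", "lighting"})
b = cache.get(["lighting", "texture"])   # same set, different order: cache hit
```

Keying on a `frozenset` means the order in which features are requested does not create duplicate programs.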

This too. Don’t underestimate the usefulness of 1x1 textures, or setting uniforms to 0.

A relevant point: on modern desktop GPUs the old fixed pipeline actually no longer exists in hardware, and it’s all emulated by shaders. Have a think about that, and it becomes obvious that drivers must do so by implementing these kinds of techniques.

Firadeoclus · May 28, 2015, 1:35am

The cost of the branch itself is tiny on all modern GPUs. However, there are other costs to having effectively dead code (for a given draw call) appear as live:

Register pressure. GPUs usually allocate registers statically, so the number of registers reserved per shader invocation depends on the worst case path through the shader.
Unused in/out variables that the linker can’t optimise away, potentially increasing bandwidth (internal and/or external, depending on the GPU architecture and shader stage) and cache requirements.
The effect of statically using some shader features on the rest of the pipeline, such as clip distances, discard, early fragment tests, or writing gl_FragDepth.

Alfonse_Reinheart · May 28, 2015, 8:39am

[QUOTE]Register pressure. GPUs usually allocate registers statically, so the number of registers reserved per shader invocation depends on the worst case path through the shader.[/QUOTE]

Well, that seems like a quality-of-implementation issue. After all, it’s not like the compiler can’t see a uniform branch in the code; it’s right there. So the compiler ought to be perfectly capable of realizing that if one branch is taken, the other will not be, for any instantiation in the rendering command. And that information ought to be factored into register assignment. Obviously registers are statically assigned, but there are ways to use the same registers in different, mutually exclusive, branches.

And with more developers using ubershaders, there is every reason for IHVs to take that information into account.

[QUOTE]Unused in/out variables that the linker can’t optimise away, potentially increasing bandwidth (internal and/or external, depending on the GPU architecture and shader stage) and cache requirements.[/QUOTE]

Errr… I’d want to see some evidence for that.

Remember: what defines the logic for what gets pulled from buffers is VAO state, not shader state. Yes, even on AMD hardware where vertex pulling happens via shader logic. What they have to do is modify the shader in-situ by adding some prefix code to handle vertex pulling logic. But that shader prefix is defined by the VAO state (since it has to respect the formatting). So, if an input isn’t being fed by the VAO, then there’s no reason for the vertex pulling logic to pull it.

[QUOTE]The effect of statically using some shader features on the rest of the pipeline, such as clip distances, discard, early fragment tests, or writing gl_FragDepth.[/QUOTE]

The goal of ubershaders is not to reduce the number of shaders to 1. It’s to reduce it to a fixed, preferably small, number of shaders, so as to minimize shader construction and state changes. You want to render lots of objects with an ubershader, but that doesn’t mean you don’t have specific ubershader variants.

So an engine might have 4 actual variations of ubershaders that can handle different kinds of things that take up resources. Clip distances and depth writing would be such variants, as only very specialized objects generally need such features. These are generally defined by the nature of the object itself.

discard is the one that is most likely to vary based on arbitrary elements of the object’s data, rather than being intrinsic to the object itself. You’re more likely to want to use discard for things like alpha-testing and the like, which is based on properties in the texture, not the object.

Firadeoclus · June 1, 2015, 2:17am

Note that I wrote “worst case path”. If one path through the shader peaks at 100 live registers while another uses only 4, the scheduler will still statically allocate 100 registers for each instance.

Now an implementation could do some runtime register allocation trickery based on uniform values, but I wouldn’t rely on that.
I would not necessarily want an implementation to do that, either.
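A back-of-the-envelope sketch of the worst-case allocation point, using the 100 and 4 register figures from the post. The register file size is a made-up number, and real GPUs have other occupancy limits as well, so this only shows the shape of the effect.

```python
def registers_reserved(path_peaks):
    """Static allocation: every invocation reserves the worst-case peak."""
    return max(path_peaks.values())


def invocations_in_flight(register_file, path_peaks):
    """How many invocations fit if registers were the only limit."""
    return register_file // registers_reserved(path_peaks)


REGISTER_FILE = 4096  # hypothetical size, not taken from any real GPU

ubershader_paths = {"all_features": 100, "plain": 4}   # peaks from the post
specialised_paths = {"plain": 4}

uber = invocations_in_flight(REGISTER_FILE, ubershader_paths)
plain = invocations_in_flight(REGISTER_FILE, specialised_paths)
print(uber, "in flight with the ubershader,", plain, "with a specialised shader")
```

Under these assumptions a draw call that only ever takes the cheap path still pays for the expensive one in reserved registers.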

[QUOTE]Errr… I’d want to see some evidence for that.

Remember: what defines the logic for what gets pulled from buffers is VAO state, not shader state. Yes, even on AMD hardware where vertex pulling happens via shader logic. What they have to do is modify the shader in-situ by adding some prefix code to handle vertex pulling logic. But that shader prefix is defined by the VAO state (since it has to respect the formatting). So, if an input isn’t being fed by the VAO, then there’s no reason for the vertex pulling logic to pull it.[/QUOTE]

Not vertex pulling, but interfaces between shader stages.

[QUOTE]The goal of ubershaders is not to reduce the number of shaders to 1.[/QUOTE]

Indeed. But to understand that you need to know that the cost of uniform branches (but not the branch instruction itself) is sometimes quite significant. Otherwise there would be no reason not to use a single ubershader.

Alfonse_Reinheart · June 1, 2015, 8:31am

That’s a fair point. But you also need to balance that against the cost of changing programs.

Usually, the interfaces between shader stages are the same, even for ubershaders. The variables for ubershaders don’t tend to require different amounts of data to pass between stages. Sure, there are things like normal mapping where you need a tangent-space basis. But generally speaking, the vertex processing outputs and the fragment shader inputs are more or less the same.

And if you just uniformly bumpmap everything, you don’t even need that variation ;)

Not necessarily.

If some meshes are skinned and some are not, you want two separate ubershaders for them. That’s not because of “the cost of uniform branches”; it’s because the vertex shader needs different kinds of data.

More often than not, it’s obvious when you should make an option an ubershader variant or a new shader. These would be things like:

Requires special resources (more per-vertex data, UBO/SSBOs/etc that aren’t shared among other things).
Rare and/or specialized cases (clip-distances/depth writing).
Presence of the option harms performance by its static presence (discard).

It’s usually not specifically because of “the cost of uniform branches”.
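One way to picture that split, as a sketch with made-up feature names: options that need special resources or change pipeline behaviour by their static presence select a separate shader variant, while everything else becomes a boolean uniform on whichever variant is chosen.

```python
# Hypothetical feature names; which ones count as "static" follows the list above.
STATIC_FEATURES = frozenset({"skinning", "clip_distance", "frag_depth", "discard"})
DYNAMIC_FEATURES = frozenset({"texture", "lighting", "fog"})


def plan_draw(requested):
    """Split a feature request into (shader variant key, uniform toggles)."""
    requested = frozenset(requested)
    unknown = requested - STATIC_FEATURES - DYNAMIC_FEATURES
    if unknown:
        raise ValueError(f"unknown features: {sorted(unknown)}")
    variant = requested & STATIC_FEATURES      # picks which program to bind
    toggles = {name: name in requested for name in sorted(DYNAMIC_FEATURES)}
    return variant, toggles


variant, toggles = plan_draw({"texture", "discard"})
```

Only the static set multiplies the number of programs; the dynamic set costs a uniform upload per draw.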

Firadeoclus · June 2, 2015, 8:20am

[QUOTE=Alfonse Reinheart;1266769]Usually, the interfaces between shader stages are the same, even for ubershaders. The variables for ubershaders don’t tend to require different amounts of data to pass between stages. Sure, there are things like normal mapping where you need a tangent-space basis. But generally speaking, the vertex processing outputs and the fragment shader inputs are more or less the same.

And if you just uniformly bumpmap everything, you don’t even need that variation ;)[/QUOTE]
If you just uniformly apply features everywhere, you’re leaving the context of this thread.

I find it hard to believe that ubershaders which use all their interface variables all of the time are as common as you claim. I wonder what data you base this on.

[QUOTE]Not necessarily.

If some meshes are skinned and some are not, you want two separate ubershaders for them. That’s not because of “the cost of uniform branches”; it’s because the vertex shader needs different kinds of data.[/QUOTE]

But that’s true for most features you’d want to enable/disable in an ubershader. Need a lightmap, an environment map, detail map, parallax map, etc.? Skinning is not at all special in that context. As long as the additional data you need is only pulled, not pushed, their static presence won’t increase bandwidth requirements. But you need to know when that is the case.

[QUOTE]More often than not, it’s obvious when you should make an option an ubershader variant or a new shader. These would be things like:

Requires special resources (more per-vertex data, UBO/SSBOs/etc that aren’t shared among other things).

Rare and/or specialized cases (clip-distances/depth writing).

Presence of the option harms performance by its static presence (discard).

It’s usually not specifically because of “the cost of uniform branches”.[/QUOTE]

None of those things are necessarily obvious to someone trying to decide how to implement a set of independently controllable features in shaders. The point of my first comment in this thread was precisely to point out those not-so-obvious cases which might impact performance even though a uniform branch itself is practically free on modern GPUs.

Alfonse_Reinheart · June 2, 2015, 9:40am

There is a difference between needing a texture to be bound to a binding point and needing additional per-vertex data. Especially since OpenGL has those “default attribute” values that are provided if no array is attached to a particular vertex attribute index. So you can’t say that there’s no cost to doing this.

Furthermore, there may not be a cost to having extra attributes lying around, but there is a cost to changing vertex formats. And on some hardware, that cost is rather substantial. And the AZDO presentation strongly suggests (at least on one piece of hardware) that this is upwards of half the cost of a full shader change.

Ubershaders, first and foremost, are an optimization. Oh sure, they’re kinda nice to use and all, but they primarily exist to make your code faster by avoiding the overhead of program changes. Therefore, you would only seriously use them if performance is a significant concern.

This means that you will only be successful at optimization if you actually know certain things about hardware. And the facts you cited aren’t that important for performance. Knowing about “register pressure” is far less useful for optimizing ubershaders than knowing that a static discard turns off early depth testing. That fact alone tells someone implementing ubershaders that, if they want discarding, they must make that a separate shader. The possibility of “in/out” variables that aren’t always used in every branch is less useful for optimizing than the fact that extraneous “clip-distances” aren’t free. That tells the ubershader user that objects which need clip-distances should use a separate shader. The same goes for things like the cost of vertex format changes.

And at the end of the day, anyone who is serious about optimizing will benchmark variations of some things being static and some being dynamic. They don’t go purely off of rules of thumb, but those are vital places to start.

The concerns you cited aren’t a priori useful. They’re good for explaining a result the user has already seen through benchmarking. Register pressure might explain why adding one more option causes a significant loss of performance. But you can’t use it to say, “I shouldn’t add more than X number of options” the way that you can say, “I must make discarding shaders separate”.

Firadeoclus · June 3, 2015, 2:52am

Incomplete textures have default values, too. Could you make it clearer what your point is?

[QUOTE]Furthermore, there may not be a cost to having extra attributes lying around, but there is a cost to changing vertex formats. And on some hardware, that cost is rather substantial. And the AZDO presentation strongly suggests (at least on one piece of hardware) that this is upwards of half the cost of a full shader change.[/QUOTE]

Yes, changing vertex formats can be costly. Is this relevant in the context of whether to use ubershaders?

[QUOTE]Ubershaders, first and foremost, are an optimization. Oh sure, they’re kinda nice to use and all, but they primarily exist to make your code faster by avoiding the overhead of program changes. Therefore, you would only seriously use them if performance is a significant concern.[/QUOTE]

Since they’re “kinda nice to use” it’s easy to see why someone would use ubershaders even if performance isn’t a significant concern (yet).

[QUOTE]This means that you will only be successful at optimization if you actually know certain things about hardware. And the facts you cited aren’t that important for performance. Knowing about “register pressure” is far less useful for optimizing ubershaders than knowing that a static discard turns off early depth testing. That fact alone tells someone implementing ubershaders that, if they want discarding, they must make that a separate shader. The possibility of “in/out” variables that aren’t always used in every branch is less useful for optimizing than the fact that extraneous “clip-distances” aren’t free. That tells the ubershader user that objects which need clip-distances should use a separate shader. The same goes for things like the cost of vertex format changes.[/QUOTE]

I mentioned discard and clip distances, so I’m not sure how you can say that “the facts you cited aren’t that important for performance”.

Knowing about register pressure and the possible cost of unused in/outs absolutely is important for optimising your shaders, a-priori choices or not. You can’t just make a-priori decisions for using discard or clip distances, either, since they could be enabled 90% of the time and the cost of switching shaders might be higher than just having them sit in inactive branches in the shader 10% of the time.
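The 90%/10% trade-off can be written down as a toy cost model. Every number below is invented purely to show the comparison; only a benchmark on the target hardware can supply real values.

```python
def cost_single_shader(draws, draw_cost, static_penalty):
    """One shader with the feature statically present: every draw pays the penalty."""
    return draws * (draw_cost + static_penalty)


def cost_split_shaders(draws_with, draws_without, draw_cost,
                       static_penalty, switches, switch_cost):
    """Two shaders: only draws that need the feature pay the penalty,
    but each program change has its own cost."""
    return (draws_with * (draw_cost + static_penalty)
            + draws_without * draw_cost
            + switches * switch_cost)


# Hypothetical frame: 1000 draws, 90% of which need the feature (e.g. discard).
single = cost_single_shader(draws=1000, draw_cost=10, static_penalty=2)
split = cost_split_shaders(draws_with=900, draws_without=100, draw_cost=10,
                           static_penalty=2, switches=20, switch_cost=50)
print("single:", single, "split:", split)
```

With these made-up costs the single shader wins; lowering the feature's usage or the switch cost tips the balance the other way.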