Multiple shaders vs dynamic branching

I have a fairly complex shader that handles any combination of per vertex and per fragment lights, reflections, shadows and bumps. Of course this means that there are a number of branches (if…) both in the vertex and fragment shaders that depend on the value of several uniforms, and their values do not change frequently (per object or in some cases only per frame).

My GPU handles dynamic branching and I do NOT have to support older and lower end cards.

But I could also make several simplified versions of the same shader (e.g. only vertex lighting, only fragment lighting, no shadows, no reflections, no bumps, etc.). That would remove several ifs from the shader and let the CPU evaluate them instead; the end result would be a call to glUseProgram (plus possibly a few extra uniforms to set because of switching shaders). In theory it would make sense, because most of the objects in my scene probably do not need all the advanced rendering features switched on, or at least not at the same time.

My question is, of course: Is it worth the effort to make a dozen simplified shaders or is dynamic branching so good that the gain in performance would be negligible?

Thanks.

I can’t give you an authoritative answer, but based on personal experience and observations I would certainly advocate removing as much branching / looping logic as you can from any shader.

One example: I had a single sphere for a planet surface, with really complex shaders to deal with everything from clouds to land to water, all on that one sphere. Switching to three spheres, each with a specialized shader for land, sea and clouds (and some logic on the client side), didn’t lose me any efficiency, but made the code far more readable and easier to develop further. Later I settled on a different split, still with 3 spheres: water and land combined on one, and the other 2 for atmosphere and clouds. That final solution was the most efficient overall.

Whilst the holy grail of stitching custom shaders together at run time is the perfect aim for any project IMO, the next best solution (again IMO) is small, modular, specialized shaders, rather than large “swiss army knife” style implementations.

That’s interesting. In our case, stitching custom shader bits at run-time has been a pain in the rear – when you commit to something like that, you give up being able to apply many optimizations or even review your shader efficiency easily. And when you want to rework your lighting and shading pipeline, …good luck!

For specific apps, I think there’s merit in the “swiss army knife” shaders (aka übershaders). However, the run-time ifs and compile-time ifs need to be easily selectable (without changing any shader code!) so that you can strike that run-time efficiency balance. Switching shaders ain’t free, but nor is evaluating run-time shader ifs (if you’re ever bottlenecked in that GPU stage).

Something like NVidia Cg’s CG_LITERAL shader parameters is probably ideal. From what I gather, you just use "if"s in your shaders. Before compiling your shader, you set the shader inputs you want to “compile into” your shader as CG_LITERAL, and magically they’re evaluated as constants, removing the appropriate ifs as needed.

No draw-time shader “reoptimization” as you change your uniform values nonsense… And you can precompile beforehand if desired, unlike with GLSL.
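In plain GLSL, one common way to get selectable compile-time ifs without touching the shader body is to prepend a generated block of #defines and guard the optional code with #ifdef. The following is only a minimal sketch under stated assumptions: the feature flags, macro names and build_preamble are hypothetical, and "#version 120" is assumed because the thread mentions GLSL 1.2.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical feature flags; the names are illustrative only. */
enum {
    FEAT_FRAGMENT_LIGHTS = 1 << 0,
    FEAT_SHADOWS         = 1 << 1,
    FEAT_REFLECTIONS     = 1 << 2,
    FEAT_BUMPS           = 1 << 3
};

/* Maps each flag to the macro the shader body would test with #ifdef. */
static const struct { unsigned bit; const char *macro; } kFeatures[] = {
    { FEAT_FRAGMENT_LIGHTS, "USE_FRAGMENT_LIGHTS" },
    { FEAT_SHADOWS,         "USE_SHADOWS" },
    { FEAT_REFLECTIONS,     "USE_REFLECTIONS" },
    { FEAT_BUMPS,           "USE_BUMPS" }
};

/* Writes "#version 120\n" followed by one "#define NAME 1\n" line per
   enabled feature. Returns the string length, or -1 if buf is too small. */
int build_preamble(unsigned features, char *buf, size_t size)
{
    assert(buf != NULL && size > 0);
    int len = snprintf(buf, size, "#version 120\n");
    if (len < 0 || (size_t)len >= size)
        return -1;
    for (size_t i = 0; i < sizeof kFeatures / sizeof kFeatures[0]; ++i) {
        if (!(features & kFeatures[i].bit))
            continue;
        size_t room = size - (size_t)len;
        int n = snprintf(buf + len, room, "#define %s 1\n", kFeatures[i].macro);
        if (n < 0 || (size_t)n >= room)
            return -1;
        len += n;
    }
    return len;
}
```

The preamble and the unchanged shader body can then be passed as two strings to glShaderSource (count = 2), provided the body does not contain its own #version line. Each distinct flag combination is compiled once, so the ifs cost nothing at draw time.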

Well, maybe I misunderstand something.

I do not need to optimize the shader(s) at run time at all. I would make several simplified versions of my übershader for specific cases that require some or most of the optional features (ifs) switched off. All of these would be compiled just once. At run time, nothing more would happen than a change of the current program (glUseProgram) and setting a few uniforms if needed for each ‘object’ (3D model, text, etc.) rendered.

Are you saying this is a time consuming method and I could end up with less performance compared to using just one shader with ifs in its vertex and fragment shaders?

Not at all. The compiled result is the same. The issue is source maintenance.

If the number of hand-cooked shader permutations you have is small, the difference between what you and I are suggesting is minimal (and if the permutations are “totally” different, no question yours is the better way).

However, if the number of potential shader permutations is large (our case), with a lot of logic shared between the permutations, then lots of developer time will be wasted with your approach maintaining duplicate logic in different shaders.

OK, now I understand. The reason I asked about the whole thing is that I hope I have finished the shader; at least, I am not planning to add features for a while. It is based on GLSL 1.2, and I think the next time I rewrite it will be when I move up to 1.4 (or later, if available).

I also don’t plan to make separate shaders for each permutation, which would probably mean dozens or even hundreds. I will think about which 6-8 stripped down versions would be used most frequently, and I would only use and maintain those.

Based on what you wrote I suppose it is worth the effort performance wise.

If you make use of ZCULL and EarlyZ, could you get away with looping shaders (where the number of iterations is given by a constant uniform value, definitely not per-pixel-dependent branches)? Meanwhile, keep the texture access asynchronous by not using its result right away.


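// Declarations of tex0, varCoord0 and computePLight are assumed; the post omitted them.
uniform sampler2D tex0;
uniform int NumPointLights;
uniform vec4 ptLights[100];
varying vec2 varCoord0;

float computePLight(vec4 light); // defined elsewhere (not shown)

void main(){
    vec4 color0 = texture2D(tex0,varCoord0); // do NOT access color0 any time soon, or you'll make a dependency!
    float diffuse=0.0;
    for(int i=0;i<NumPointLights;i++){
        diffuse+=computePLight(ptLights[i]);
    }
    gl_FragColor = color0*diffuse;
}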

Here’s a silly method that you should definitely avoid:


    for(int i=0;i<NumPointLights;i++){
        if(lightType[i]==0){
            diffuse+=computePointLight(ptLights[i]);
        }else if(lightType[i]==1){
            diffuse+=computeSpotLight(ptLights[i]);
        }else if(....
           ...
        }
    }

Minor tweaks to the code are possible; monitor the resulting nvASM or the performance on different cards to find the best way to do looping.

With ZCULL and EarlyZ, at 1280x720 you would actually compute only up to 1280x720x4 fragments.

There are several very good points here. I read about Z-Cull before but I don’t know too much about it. But after just having read an article on the nvidia site, I got rid of the only discard I employed in the fragment shader.

My loops only depend on uniform variables fortunately (the number of lights defined for the scene, the number of shadow maps, etc.).

I will also keep in mind not to use the result of a texture access sooner than absolutely necessary.

However, the for… if… sequence that you say I should definitely avoid is very difficult to avoid without switching shaders. I allow up to 8 lights like the fixed function pipeline, but any of these can be either per vertex or per fragment, and any of the three usual light types. So testing the values of these uniforms cannot be avoided.

Unless the tests are done by the CPU and the result is switching to a shader that is the least complex but has the necessary options.
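A rough sketch of that CPU-side test, assuming each compiled variant is described by a bitmask of the features it implements and that the number of enabled features is an acceptable stand-in for shader cost. ShaderVariant, pick_variant and the bit layout are hypothetical names for illustration.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    unsigned features; /* bitmask of features this variant implements */
    unsigned program;  /* GL program handle (a GLuint in real code) */
} ShaderVariant;

/* Counts the set bits in x. */
static int count_bits(unsigned x)
{
    int c = 0;
    while (x) {
        x &= x - 1; /* clears the lowest set bit */
        ++c;
    }
    return c;
}

/* Returns the index of the variant that implements every bit in 'needed'
   while enabling the fewest features overall, or -1 if none qualifies. */
int pick_variant(const ShaderVariant *v, size_t count, unsigned needed)
{
    assert(v != NULL || count == 0);
    int best = -1;
    int best_bits = 0;
    for (size_t i = 0; i < count; ++i) {
        if ((v[i].features & needed) != needed)
            continue; /* this variant lacks a required feature */
        int bits = count_bits(v[i].features);
        if (best < 0 || bits < best_bits) {
            best = (int)i;
            best_bits = bits;
        }
    }
    return best;
}
```

With only a handful of variants, the full featured shader should be in the table so that every request has a match; the returned entry's program is what would be passed to glUseProgram.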

I think it completely depends on the hardware. With SM4 hardware you are quite fine with large shaders, especially when there are a lot of math operations to cover texture fetch latencies.

A few big mistakes need to be avoided, however. Discard is really bad for GPU efficiency: the marketing guys may have said a GF8 is a “scalar processor”, but the registers are actually quite wide (512 bits, I think?). Alpha test needs to be handled by a dedicated shader in a dedicated “pass”.

The gain from large shaders is a reduction in the number of program bindings and of uniform and sampler updates, which makes a good sort of the objects before rendering even more important.
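As an illustration of sorting the objects before rendering, here is a small sketch that orders a draw list by program handle so that consecutive draws share a program. DrawItem and both functions are hypothetical, and a real renderer may have other ordering constraints (blending, for instance) that this ignores.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct {
    unsigned program; /* program the object is drawn with */
    int object_id;    /* whatever identifies the object to draw */
} DrawItem;

/* qsort comparator: ascending by program handle. */
static int by_program(const void *a, const void *b)
{
    unsigned pa = ((const DrawItem *)a)->program;
    unsigned pb = ((const DrawItem *)b)->program;
    return (pa > pb) - (pa < pb);
}

/* Groups items that use the same program next to each other. */
void sort_by_program(DrawItem *items, size_t count)
{
    assert(items != NULL || count == 0);
    if (count > 1)
        qsort(items, count, sizeof *items, by_program);
}

/* Number of program binds needed to draw the list in its current order. */
size_t count_switches(const DrawItem *items, size_t count)
{
    size_t switches = 0;
    for (size_t i = 0; i < count; ++i) {
        if (i == 0 || items[i].program != items[i - 1].program)
            ++switches;
    }
    return switches;
}
```

For a list whose programs alternate, sorting brings the bind count down to the number of distinct programs in use.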

Branching coherency is the big deal in SM4;
GF8 processes only scalars (so you’d better not do maths on vec4 components you won’t later use), and in a multiprocessor unit does the same operation on 8x2 fragments (four 2x2 quads) at once. If those 16 fragments stay together (branch at the same places), no more “threads” or “warps” are allocated. Branches on constants/uniforms thus keep calculations coherent. But if each fragment branches at a different place, the execution flow is no longer parallel, and becomes serialized. It’s like getting GF8400 performance out of a GF280GTX. Or worse, as there are limits to the number of warps that can be processed.
A branch is not slow, but cgc produces horrible nvasm code around it! It adds 3-5 instructions per loop, some of which are floor and conversion to integer (expensive ops), others just recompute an index again and again. Plus, during branches, the multiprocessor has to evaluate all 16 units for coherency.

Alpha-test is nothing special. It just marks the fragment as dead (this happens anyway to 2 or 3 of every 4 fragments around triangle edges). The bad thing about alpha-te… ahem “discard”, is that it disables ZCULL and increases the latency of EarlyZ (in a way that the GPU can’t hide that latency). ZCULL is probably using an internal low-res grid of screen-aligned voxels (represented by 2 z-buffers), and deals only with bounding-boxes of primitives; it culls whole primitives. So if you discard fragments and ZCULL is not disabled until the next glClear(depth), it’ll start culling prims that are actually visible.

Just my observations and interpretations of docs.

I tried my “übershader” with different scenes and settings. Without switching any advanced options on (no per fragment lights, no shadows, no reflections and only one per vertex light), my shader can sometimes perform a little better than the fixed pipeline, but with certain scenes it can be a lot slower. I did not make any accurate timings; I only looked at the performance tab of the task manager (this is Windows XP). By a lot slower I mean 20-25% more load on the CPU core handling the OpenGL context. For example, the same scene rendered by the fixed pipeline shows 8% load, but with my shader, 11-12%.

The scene is not even complicated. I find it strange that the CPU does quite a big part of the rendering even though all geometries reside on the GPU (in display lists).

Do you have a possible explanation for this and can you tell me what I should do?

Thanks.

I made four stripped down versions of my shader. They are all compiled and linked at initialization, and switching between them requires only glUseProgram and setting only the minimum necessary number of uniforms and even those only when their value has changed.

The simplest shader handles only one per vertex light, no shadows, bumps, etc. The others are somewhere in-between this and the full featured shader.

Although I tested it with only a handful of scenes, it seems that the benefit in terms of performance can be huge, and strangely - and I cannot explain this - the load on the CPU seems to be a lot lighter.

nVidia drivers recompile stuff. There’s no other explanation for that cpu load.
It’s one of the reasons I like to feed the drivers cgc-produced nvasm instead of GLSL.

Maybe the CPU load increased because the CPU is spending less time idle while waiting for the GPU to finish drawing? Are you using vertical sync? If not, the swap buffer call will block for less time with the faster shaders.

Does this mean that every time I set a shader as current (glUseProgram) the driver recompiles the shader? That sounds really bad.

I only use a window for preview purposes; otherwise rendering is done with FBOs, and the final result is read back to PC RAM and then transferred to a broadcast video card.

Swapbuffer cannot be the reason for the CPU load, because when the CPU load is high, playback stutters, so the CPU really does something. If I use the full featured shader and I try the same scene with 1 vertex light then the exact same scene with the same light but set as per fragment, the CPU load can increase substantially (~20% on the core that handles opengl).

Since I added simplified versions of my shader with the best one chosen by the application for the geometry currently rendered, this worrying difference between per vertex/per fragment more or less disappeared. It seems switching shaders is a better choice than using one so called ubershader.

Try the Nvidia instrumented driver. Their GLExpert tool will output information in the console when shaders are recompiled (and when a lot of other stuff happens too). It seems they are recompiled when a certain set of conditions changes (texture format, uniform values, etc.) between two uses of a shader. Unfortunately, I haven’t managed to decisively find out whether a compiled version for a given condition set is kept “forever” or if the driver flushes it after a while.