Ubershader and branching cost

Hello,

I have an ubershader for my whole rendering pipeline that supports all the materials my objects use.
As a result, certain properties can be toggled on or off, such as normal, flow, and reflective maps, or even certain preprogrammed effects (distortion, displacement, etc.).

These effects are applied by setting uniforms that are checked by an if/else in the shader code. Now, I know that ifs should generally be avoided in shaders, mostly because of branching.

So I have 2 options,

  • either perform the if/else in the shader (GPU)
  • or perform the if/else in the program (CPU)

For the first one, from what I’ve read, branching can have an impact on performance depending on the hardware. Is there anything else to consider?

For the second one, state changes can also hurt performance: I’ve read that switching shaders too often can have a huge rendering cost, and I can’t sort the objects by shader because they are already sorted by a different criterion.

Thus my question: is there a middle ground that solves these issues, or should I just try to create different shaders dynamically for each macro material type?

Here’s an example of what I’m talking about, with some of the branching:

	if(is_dithered(gl_FragCoord.xy, vTexI)){discard;return;}

	if(object.splatting){
		albedo     = splatMix(object, object.sp_alb_textures[0], object.sp_alb_textures[1], object.sp_alb_textures[2], object.sp_alb_textures[3], textureLevel, false);
		normal	   = splatMix(object, object.sp_nor_textures[0], object.sp_nor_textures[1], object.sp_nor_textures[2], object.sp_nor_textures[3], textureLevel, false);
		emission   = splatMix(object, object.sp_emi_textures[0], object.sp_emi_textures[1], object.sp_emi_textures[2], object.sp_emi_textures[3], textureLevel, false);
		specular   = splatMix(object, object.sp_spc_textures[0], object.sp_spc_textures[1], object.sp_spc_textures[2], object.sp_spc_textures[3], textureLevel, false);
	} else {
		albedo     = object.albedo   ? texture(object.textures[0], vTex1, textureLevel) : WHITE4;
		normal	   = object.normal   ? texture(object.textures[1], vTex1, textureLevel) : WHITE4;
		emission   = object.emission ? texture(object.textures[2], vTex1, textureLevel)*vec4(object.emissionColor, 1.0) : vec4(object.emissionColor, 1.0);
		specular   = object.specular ? texture(object.textures[3], vTex1, textureLevel)*vec4(object.specularColor, 1.0) : vec4(object.specularColor, 1.0);
	}

    ...

	if(object.rcv_fog){
		result.rgb = fogShade(color, vPos, vVert, environment.fog, object.additive, object.fog_intensity);
		if(emit > 0) emission = emission * fogValue(vPos, vVert, environment.fog, object.fog_intensity);
	} else result.rgb = color;

Depending on the complexity of your shaders, it can sometimes be more beneficial to set a 1x1 all-white/black/grey texture, uniform value of 0 or 1, or whatever, and always run the code that would otherwise be a branch. That way you get to avoid branching, avoid a shader permutation and change, but at the cost of some extra arithmetic (which can be very cheap) or a 1x1 texture lookup (likewise). It’s certainly worth benchmarking, and since you already have the shader code, should be easy and non-disruptive enough to test.
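As a minimal sketch of that idea, assume the application binds a 1x1 white texture whenever a material has no map for a slot, and sets the fog intensity to 0 for objects that should not receive fog. All names here (`uAlbedoMap`, `uEmissionMap`, `uFogIntensity`, `vViewDepth`, and so on) and the simple exponential fog are hypothetical stand-ins, not the shader from the question:

```glsl
#version 330 core

// Hypothetical inputs; names are illustrative, not from the original shader.
in vec2  vTex;
in float vViewDepth;             // view-space distance to the fragment

uniform sampler2D uAlbedoMap;    // app binds a 1x1 white texture if there is no albedo map
uniform sampler2D uEmissionMap;  // app binds a 1x1 white texture if there is no emission map
uniform vec3  uEmissionColor;    // vec3(0.0) for non-emissive materials
uniform vec3  uFogColor;
uniform float uFogDensity;
uniform float uFogIntensity;     // 0.0 disables fog for this object, 1.0 applies it fully

out vec4 fragColor;

void main() {
    // No branch: a white texel simply multiplies through as 1.0.
    vec4 albedo   = texture(uAlbedoMap, vTex);
    vec4 emission = texture(uEmissionMap, vTex) * vec4(uEmissionColor, 1.0);

    vec3 color = albedo.rgb + emission.rgb;

    // No branch: the fog factor collapses to 0.0 when uFogIntensity is 0.0.
    float fog = (1.0 - exp(-uFogDensity * vViewDepth)) * uFogIntensity;
    fragColor = vec4(mix(color, uFogColor, fog), albedo.a);
}
```

Every material pays for the extra fetch and arithmetic, so this only wins if that cost is lower than the branch or shader switch it replaces, which is why it needs benchmarking.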

+1 what mhagain said. The key being when it’s cheap/free or gains you perf (which sometimes happens).

Not necessarily. A runtime-evaluated GPU-side branch is not a-priori bad, whether conditional or unconditional. And in some cases, adding one can greatly improve performance. But you have to look at each one and think like the shader compiler would. That is, consider “what’s actually going to happen on the GPU when I do this”. (Related: See this fun and educational Twitter thread.)
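To make that concrete, here is a sketch contrasting two kinds of branch. The names (`uUseDetail`, `uDetailMap`, `vMask`) are hypothetical, and how a given compiler and GPU actually handle either case is something to measure rather than assume:

```glsl
#version 330 core

in vec2  vTex;
in float vMask;                // hypothetical per-vertex value that varies across the surface

uniform bool      uUseDetail;  // hypothetical toggle, constant for the whole draw call
uniform sampler2D uBaseMap;
uniform sampler2D uDetailMap;

out vec4 fragColor;

void main() {
    vec4 color = texture(uBaseMap, vTex);

    // Condition comes from a uniform: every fragment in the draw takes the
    // same path, so invocations do not diverge from each other.
    if (uUseDetail) {
        color *= texture(uDetailMap, vTex * 8.0);
    }

    // Condition varies per fragment: neighbouring invocations may disagree,
    // and the hardware may end up executing both sides for a group.
    if (vMask > 0.5) {
        color.rgb = vec3(1.0) - color.rgb;
    }

    fragColor = color;
}
```

The toggles in the original post are of the first kind, since they are uniforms set per object.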

Sadly, there’s no great middle ground that solves these issues yet (AFAIK). You just have to look at each one on a case-by-case basis. This IMO is one of the areas where GPU and GPU driver technology has really fallen behind. In any non-trivial real-time GPU rendering system, the amount of dev time this single issue consumes is absolutely ridiculous.

Rather than re-invent the wheel here, let me point you to some good “food-for-thought” reading on this issue:

Since you obviously care about maximizing performance, one aspect of this “shader permutation” issue to be very wary of is when the driver forcibly injects separate shader permutations on-the-fly behind-the-scenes into your nice, neat, carefully-thought-out shader permutation graph, to the detriment of your application’s performance. If you don’t know about “state-based recompiles” you should read up on them.

Beyond the above, here are a few “State-based Recompile Links”:

With OpenGL, you have to suffer with this kind of thing happening at draw-time, … if you don’t foresee it and predesign a fork in your shader permutation tree to avoid it.

Vulkan initially didn’t solve this problem; it just forced it all to happen on startup, with the full “recompile all the shader permutations” perf hit being taken whenever you update the graphics driver. That is, unless there was some already-populated net DB being queried to bulk-download precompiled shaders directly into your driver’s shader cache. However, recently Vulkan has made some great strides toward providing real software solutions to this “GPU shader permutation problem”:

Unfortunately, this is Vulkan only. But it does provide one more compelling reason to switch to Vulkan.

Thanks for the reply, although that seems fairly counterintuitive performance-wise, I’ll check that out to see if it works.

That’s what I thought.

I’ve seen a lot of these while browsing about this issue. And some are stating the same.

I will try to measure the timings and cost for some examples, and also try the preprocessor ifdefs and benchmark those.
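For the preprocessor route, a sketch of what one permutation source could look like, assuming the application inserts the `#define` lines right after `#version` before compiling each variant. The macro and uniform names (`USE_EMISSION_MAP`, `USE_FOG`, `uFogDensity`, and so on) are hypothetical:

```glsl
#version 330 core

// The application would insert lines such as
//   #define USE_EMISSION_MAP
//   #define USE_FOG
// here, right after #version, before compiling each permutation.

in vec2  vTex;
in float vViewDepth;             // hypothetical view-space distance to the fragment

uniform sampler2D uAlbedoMap;
#ifdef USE_EMISSION_MAP
uniform sampler2D uEmissionMap;
#endif
uniform vec3 uEmissionColor;
#ifdef USE_FOG
uniform vec3  uFogColor;
uniform float uFogDensity;
#endif

out vec4 fragColor;

void main() {
    vec4 albedo = texture(uAlbedoMap, vTex);

#ifdef USE_EMISSION_MAP
    vec3 emission = texture(uEmissionMap, vTex).rgb * uEmissionColor;
#else
    vec3 emission = uEmissionColor;
#endif

    vec3 color = albedo.rgb + emission;

#ifdef USE_FOG
    // Compiled in only for permutations that receive fog.
    float fog = 1.0 - exp(-uFogDensity * vViewDepth);
    color = mix(color, uFogColor, fog);
#endif

    fragColor = vec4(color, albedo.a);
}
```

Each distinct set of defines becomes a separate program object, so the number of variants (and shader switches) grows with the number of toggles that are actually combined.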