GLSL Branching and Clamp function

awhig · July 1, 2009, 7:50am

It is advisable to use branching sparingly in shader code.

But GLSL built-in functions such as clamp, min, or max inherently use an “if” to determine the final value.

My question: if we replace clamp(x, 0.0, 1.0) with

x = (x < 0.0) ? 0.0 : (x > 1.0) ? 1.0 : x;

will this be faster than clamp()?

Similarly, if min(), max(), or step() are expanded using “if”, will there be a slowdown? If so, why are such built-in functions fast?
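For reference, here is a plain C sketch (not GLSL) of the substitution being asked about. It follows the GLSL spec's definitions: min returns y if y < x, otherwise x; max returns y if x < y, otherwise x; clamp(x, minVal, maxVal) is min(max(x, minVal), maxVal); and step(edge, x) is 0.0 when x < edge, otherwise 1.0. The function names are hypothetical. It only shows that the hand-expanded ternary computes the same values as clamp() for ordinary inputs; it says nothing about speed on a GPU.

```c
#include <assert.h>

/* min/max written the way the GLSL spec describes them. */
static float min_glsl(float x, float y) { return (y < x) ? y : x; }
static float max_glsl(float x, float y) { return (x < y) ? y : x; }

/* clamp(x, minVal, maxVal) is specified as min(max(x, minVal), maxVal). */
static float clamp_glsl(float x, float minVal, float maxVal) {
    return min_glsl(max_glsl(x, minVal), maxVal);
}

/* The hand-expanded ternary from the question, specialized to [0, 1]. */
static float clamp01_ternary(float x) {
    return (x < 0.0f) ? 0.0f : (x > 1.0f) ? 1.0f : x;
}

/* step(edge, x): 0.0 if x < edge, otherwise 1.0. */
static float step_glsl(float edge, float x) {
    return (x < edge) ? 0.0f : 1.0f;
}
```

Both forms reduce to compare-and-select; whether that becomes a real branch is up to the shader compiler and hardware, which is what the replies below discuss.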

_NK47 · July 1, 2009, 8:13am

Your own functions will never be faster, since built-in functions are optimized. You could compute dot3 yourself as well, but as a built-in function running on “dedicated” hardware it takes far fewer cycles to execute. Always avoid writing your own code for these (reflect falls into the same category, as do many others, such as ftransform). Besides, on a different GPU the speed of built-in functions might increase.
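To make the point concrete, this is roughly what a hand-rolled dot3 and reflect look like, sketched in plain C. The vec3 struct and the function names are hypothetical stand-ins for GLSL's vec3, dot(), and reflect(); reflect follows the formula the GLSL spec gives, I - 2.0 * dot(N, I) * N, with N assumed normalized. The hand-written version produces the same numbers; the argument above is only that it cannot beat the built-in.

```c
#include <assert.h>

/* Hypothetical stand-in for GLSL's vec3. */
typedef struct { float x, y, z; } vec3;

/* Hand-rolled 3-component dot product (what the built-in dot() computes). */
static float dot3(vec3 a, vec3 b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

/* Hand-rolled reflect(I, N) = I - 2.0 * dot(N, I) * N.
   As with the built-in, N is assumed to be normalized. */
static vec3 reflect3(vec3 I, vec3 N) {
    float k = 2.0f * dot3(N, I);
    vec3 r = { I.x - k * N.x, I.y - k * N.y, I.z - k * N.z };
    return r;
}
```

Reflecting I = (1, -1, 0) about N = (0, 1, 0) gives (1, 1, 0).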

awhig · July 1, 2009, 8:34am

I see. So there is some sort of hardware logic associated with built-in functions?

For example, clamp() might be implemented directly with logic gates, right?

def · July 1, 2009, 8:43am

Exactly. Clamping has been available since day one (or two…) in programmable shading hardware, long, long before branching was introduced and supported in hardware.

awhig · July 1, 2009, 8:46am

Thank you, def and _NK47, now I understand.

Ilian_Dinev · July 1, 2009, 9:54am

Not only are clamp/min/max implemented in hardware without any branching, but even if you have simple code like this:
if(var1>1.33){
var2 = 7;
var3 = var4-5.0;
}

there will be no branching either, thanks to conditional execution (a flag specifying whether the instruction should actually take effect).
x86 CPUs have CMOVxx instructions that do the same (but only for moves), and ARM CPUs have exactly such condition flags on nearly every instruction.
Also, if all GPU cores take the same path at a real branch instruction (coherent branching), it costs only about 2 GPU cycles. Coherent branching is obviously guaranteed if you loop uniform_N times. The slowness of uniform-bounded loops comes mostly from the extra loop-setup instructions that compilers still don't optimize well enough.
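The conditional-execution idea can be sketched in C by "if-converting" the snippet above: evaluate the condition once, compute the new values unconditionally, and let the condition select which value is kept, so no jump is needed. This is an illustration of the data flow only; the function names are hypothetical, and a C compiler is free to emit either a branch or a CMOV-style select for either version.

```c
#include <assert.h>

/* Branchy form, mirroring the GLSL snippet above. */
static void branchy(float var1, float var4, float *var2, float *var3) {
    if (var1 > 1.33f) {
        *var2 = 7.0f;
        *var3 = var4 - 5.0f;
    }
}

/* If-converted form: the subtraction always runs, and the condition only
   decides which value is written back, like a predicated instruction. */
static void predicated(float var1, float var4, float *var2, float *var3) {
    int cond = var1 > 1.33f;
    float new3 = var4 - 5.0f;      /* computed whether or not cond holds */
    *var2 = cond ? 7.0f : *var2;   /* keep old value when cond is false */
    *var3 = cond ? new3 : *var3;
}
```

Both versions leave the outputs in the same state for any input; the predicated one simply trades a jump for a few always-executed instructions, which is cheap when the guarded block is this small.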

awhig · July 1, 2009, 10:34am

@Ilian_Dinev:

This is interesting information; I never knew that.
Does such a conditional flag exist on AMD and NVIDIA GPUs?

OK, and if I use an if-else pair rather than only an “if”, will conditional execution still take place for the former?
Thank you again.

Brolingstanz · July 1, 2009, 11:22am

Cg's command-line compiler, cgc, can generate an assembly listing for your inspection. Otherwise, I think you're pretty much at the mercy of vendor performance documents and good old-fashioned testing.

djmj10 · September 20, 2009, 2:06pm

Do they really use if clauses? This is basic math we learn in college; shouldn't it be faster than an if clause and a compare operation?

max(x,y) = 1/2 * (x + y + |x - y|)
min(x,y) = 1/2 * (x + y - |x - y|)

Someone compared the original C++ max with these functions and got double the performance with the equations above.
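As a sketch, the two identities translate directly into C using fabsf for |x - y|. The comparison-based versions follow the GLSL definitions of min and max; all function names are hypothetical. For small, well-scaled inputs like the ones below, both approaches give identical results. Note that the arithmetic form still needs an add, a subtract, an absolute value, another add or subtract, and a multiply, so it replaces a single compare-and-select with several operations.

```c
#include <assert.h>
#include <math.h>

/* max(x, y) = 1/2 * (x + y + |x - y|) */
static float max_arith(float x, float y) {
    return 0.5f * (x + y + fabsf(x - y));
}

/* min(x, y) = 1/2 * (x + y - |x - y|) */
static float min_arith(float x, float y) {
    return 0.5f * (x + y - fabsf(x - y));
}

/* Comparison-based versions, following the GLSL definitions. */
static float max_cmp(float x, float y) { return (x < y) ? y : x; }
static float min_cmp(float x, float y) { return (y < x) ? y : x; }
```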

Ilian_Dinev · September 20, 2009, 2:13pm

Ouch. No need for arithmetic like that.
GPU hardware is not as ridiculous as a 386 CPU. The silicon logic for min/max/clamp is really simple; it was just missing from Intel CPUs until SSE arrived.

The_Fiddler · September 20, 2009, 9:59pm

Not to mention that this implementation of min/max is prone to overflow and/or precision issues…
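Both failure modes are easy to reproduce with 32-bit floats. In this C sketch (hypothetical function names, intermediates stored in float variables so each step is rounded to single precision), x + y overflows to infinity when both inputs are near FLT_MAX, and when |x| is much larger than |y| the smaller value is lost to rounding before the subtraction cancels everything.

```c
#include <assert.h>
#include <float.h>
#include <math.h>

static float max_arith(float x, float y) {
    float sum  = x + y;          /* overflows to +inf for large inputs */
    float diff = fabsf(x - y);   /* rounded to the precision of the larger operand */
    return 0.5f * (sum + diff);
}

static float min_arith(float x, float y) {
    float sum  = x + y;
    float diff = fabsf(x - y);
    return 0.5f * (sum - diff);  /* cancels to 0 when |x| >> |y| */
}
```

With IEEE single precision, max_arith(FLT_MAX, FLT_MAX) returns infinity instead of FLT_MAX, and min_arith(1.0e8f, 1.0f) returns 0.0 instead of 1.0: near 1.0e8 floats are spaced 8 apart, so both 1.0e8 + 1 and 1.0e8 - 1 round back to 1.0e8 and the difference becomes exactly zero.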

system · October 19, 2021, 7:30pm

This topic was automatically closed 183 days after the last reply. New replies are no longer allowed.