Your own functions will never be faster, since the built-in functions are optimized. You could write dot3 yourself as well, but as a built-in function on “dedicated” hardware it takes far fewer cycles to execute. Avoid reimplementing anything a built-in already provides (reflect falls into the same category, as do many others, e.g. ftransform). Besides, on a different GPU the speed of the built-in functions might increase.
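For reference, this is roughly the arithmetic that dot and reflect stand for, written out in plain C rather than shader code. It is only a sketch: the Vec3 type and the function names are made up for illustration, and the reflect formula assumes N is already normalized. In a shader you would call the built-ins instead of spelling this out.

```c
/* Hypothetical 3-component vector type, standing in for GLSL's vec3. */
typedef struct { float x, y, z; } Vec3;

/* Hand-written dot product: three multiplies and two adds. */
float dot3(Vec3 a, Vec3 b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

/* Hand-written reflection: I - 2 * dot(N, I) * N.
 * Assumes n is normalized, as the built-in reflect does. */
Vec3 reflect3(Vec3 i, Vec3 n)
{
    float k = 2.0f * dot3(n, i);
    Vec3 r = { i.x - k * n.x, i.y - k * n.y, i.z - k * n.z };
    return r;
}
```

Every line of that is work the compiler has to schedule as generic instructions, whereas the built-in can map to whatever the hardware does best.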
Exactly. Clamping has been available since day one (or two…) in programmable shading hardware, long before branching was introduced and supported in hardware.
Not only are clamp/min/max implemented in hardware without any branching, but if you have simple code like this:
if (var1 > 1.33) {
    var2 = 7;
    var3 = var4 - 5.0;
}
There will be no branching either, thanks to conditional execution (a flag specifying whether/when the instruction should be executed).
x86 CPUs have CMOVxx instructions that do the same (but are limited to “mov”), and ARM CPUs (in the classic 32-bit instruction set) have the same kind of condition flags on almost every instruction.
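To make the conditional-execution idea a bit more concrete, here is a rough sketch in C (not shader code) of how the `if` above can be evaluated without a jump: a predicate is computed once and each assignment either takes the new value or keeps the old one. The Vars struct and the function name are made up for illustration, and whether a given compiler really emits CMOV or a predicated instruction for this is up to the compiler and the target.

```c
/* Hypothetical container for the two variables the `if` body writes. */
typedef struct {
    float var2;
    float var3;
} Vars;

/* Jump-free equivalent of:
 *   if (var1 > 1.33) { var2 = 7; var3 = var4 - 5.0; }
 * The predicate plays the role of the hardware flag; each write is a
 * select between the new value and the previous one. */
Vars predicated_update(float var1, float var4, Vars in)
{
    int pred = var1 > 1.33f;                   /* the "flag" */
    Vars out;
    out.var2 = pred ? 7.0f : in.var2;          /* candidate for a select/cmov */
    out.var3 = pred ? (var4 - 5.0f) : in.var3; /* same predicate, second write */
    return out;
}
```

When the condition is false the inputs simply pass through unchanged, which is what a predicated instruction that does not fire amounts to.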
Also, if a real branch is taken by all GPU cores at the same instruction (coherent branching), it only takes 2 GPU cycles. Coherent branching is obviously guaranteed if you loop uniform_N times. The slowness of uniform looping comes mostly from the extra loop-preparation instructions, which compilers still don’t optimize well enough.
Cg’s command-line compiler, cgc, will generate an assembly listing for your inspection. Otherwise I think you’re pretty much at the mercy of vendor perf documents and good old-fashioned testing.
Ouch. No need for arithmetic like that.
GPU hardware is not as ridiculous as a 386 CPU. The silicon logic for min/max/clamp is really simple; it was just missing from Intel CPUs until SSE came along.
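As a small illustration of how little there is to it, clamp is just a max followed by a min. The sketch below is plain C with made-up helper names (min_f, max_f, clamp_f) and assumes lo <= hi. The ternaries here are ordinary C and may or may not compile to branches on a CPU; the point is only that each step is a single compare-and-select, which is what the hardware does in one instruction.

```c
/* Hypothetical float helpers mirroring the shader built-ins min and max. */
float min_f(float a, float b) { return a < b ? a : b; }
float max_f(float a, float b) { return a > b ? a : b; }

/* clamp(x, lo, hi) = min(max(x, lo), hi); assumes lo <= hi. */
float clamp_f(float x, float lo, float hi)
{
    return min_f(max_f(x, lo), hi);
}
```

So clamp_f(1.5f, 0.0f, 1.0f) pins the value to the upper bound, and anything already inside the range passes through untouched.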