# Fast math on a vector's components

I have a vector, say:
vec4 a = vec4(1.0, 2.0, 3.0, 4.0);

What's the best way of finding the product/sum of its components, i.e.

float answer = a.x * a.y * a.z * a.w;
float answer = a.x + a.y + a.z + a.w;

a.x + a.y + a.z + a.w is the same as a dot product between a and vec4(1.0).
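In GLSL that identity can be sketched as follows (component values taken from the example above; "one DP4" is the typical cost, not a guarantee):

```glsl
// Sum of all four components with a single dot product:
// dot(a, vec4(1.0)) = a.x*1.0 + a.y*1.0 + a.z*1.0 + a.w*1.0
vec4 a = vec4(1.0, 2.0, 3.0, 4.0);
float sum = dot(a, vec4(1.0)); // 10.0, typically one DP4 instruction
```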

a.x*a.y*a.z*a.w is harder; you probably need to swizzle and multiply twice. I.e., a.xy = a.xz*a.yw; a.x *= a.y. It's unclear whether this is any faster than just writing out the expression.

Doh, yeah, I should have gotten the dot-product one.

Cheers, this is quicker (5 instructions less from a quick check):
a.xy = a.xz*a.yw; a.x *= a.y

Hmm, I thought 5 was a bit much at the time; it seems I forgot to multiply by another result, which, since I wasn't using it, was getting optimized away.
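For reference, the full two-step product looks like this; the comments trace the swizzled intermediate values:

```glsl
// Product of all four components in two multiplies.
vec4 a = vec4(1.0, 2.0, 3.0, 4.0);
a.xy = a.xz * a.yw; // a.x = a.x*a.y, a.y = a.z*a.w
a.x *= a.y;         // a.x = (a.x*a.y) * (a.z*a.w) = 24.0
```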

This reminds me of C compiler technology from 20 years ago, when (a*b)*(c*d) could compile to significantly faster code than a*b*c*d on some platforms.

The GLSL compiler will probably never be very good at optimising expressions, I guess, because it has to be simple and quick enough to execute entirely at application runtime.

There could be a need here for a code optimiser to transform human-authored GLSL code into more optimal GLSL code to hand-feed the compiler. Assembly should be a thing of the past now that GLSL is here, but we still end up exchanging ideas on how to hand-feed the compiler to trim down the number of assembly-level instructions for specific targets, so there is definitely a need for better optimisation tools here.

I thought I'd never say this, but in this particular respect, the precompilation of HLSL does seem like a better platform for more complicated expression optimisations.

Originally posted by zed:
[b]Doh, yeah, I should have gotten the dot-product one.

Cheers, this is quicker (5 instructions less from a quick check):
a.xy = a.xz*a.yw; a.x *= a.y[/b]
I’d recommend this instead to make the swizzles more friendly with ATI cards:

a.xy *= a.wz;
a.x *= a.y;
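Written out with the example vector, this variant computes the same component product, just grouped differently by the swizzle:

```glsl
// Same component product, using a reversed .wz swizzle read.
vec4 a = vec4(1.0, 2.0, 3.0, 4.0);
a.xy *= a.wz; // a.x = a.x*a.w, a.y = a.y*a.z
a.x *= a.y;   // a.x = (a.x*a.w) * (a.y*a.z) = 24.0
```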

Originally posted by StefanG:
This reminds me of C compiler technology from 20 years ago, when (a*b)*(c*d) could compile to significantly faster code than a*b*c*d on some platforms.

a*b*c*d is essentially a*(b*(c*d)). There's no parallelism possible there without breaking the C standard (I guess some compiler flag could allow that, though). (a*b)*(c*d), on the other hand, allows a*b and c*d to be computed in parallel on superscalar FPUs, which could be up to 50% faster.
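The regrouping can be written out directly (a, b, c, d here are assumed scalar floats; note that reassociating floating-point multiplies can change the result in the last bits, which is exactly why a conforming C compiler cannot do it on its own):

```glsl
// Serial chain: each multiply waits on the previous result.
float serial = ((a * b) * c) * d;
// Regrouped: a*b and c*d are independent and can issue in parallel.
float grouped = (a * b) * (c * d);
```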

Originally posted by StefanG:
The GLSL compiler will probably never be very good at optimising expressions, I guess, because it has to be simple and quick enough to execute entirely at application runtime.

Don’t know about that. It’s pretty good already. Yes, there are some corner cases where you need to tweak the code a bit for the compiler to see optimization opportunities, but most of the time the GLSL compiler does a very good job already.

Originally posted by StefanG:
I thought I'd never say this, but in this particular respect, the precompilation of HLSL does seem like a better platform for more complicated expression optimisations.

Actually, HLSL precompilation is a problem. If HLSL just dumped raw unoptimized code many shaders would actually run faster as that would leave that work to the driver’s optimizer, which knows more of what’s optimal for the underlying hardware. When HLSL is trying to optimize, it often means the real intent of the original shader is hidden to the driver.

a*b*c*d is essentially a*(b*(c*d)).
Sorry for nitpicking. It's ((a * b) * c) * d because "*" has left-to-right associativity.

A good example for GLSL user optimizations is this:
vector = Matrix * Matrix * vector; // slow
vector = Matrix * (Matrix * vector); // fast

See the difference in instruction count?
The first needs 20 instructions, the second only 8!
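A sketch of why the counts differ, using the names from the example above (the per-operation costs are the usual ones for a vec4 pipeline, so treat them as an estimate):

```glsl
// mat4 * mat4 is a full matrix product: 16 dot products.
// mat4 * vec4 is only 4 dot products.
vec4 slow = Matrix * Matrix * vector;   // (Matrix * Matrix) * vector: 16 + 4 = 20
vec4 fast = Matrix * (Matrix * vector); // two mat4*vec4 products:      4 + 4 = 8
```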

Originally posted by Relic:
Sorry for nitpicking . It’s ((a * b) * c) * d because “*” has left-to-right associativity.

Actually, HLSL precompilation is a problem. If HLSL just dumped raw unoptimized code many shaders would actually run faster as that would leave that work to the driver’s optimizer, which knows more of what’s optimal for the underlying hardware. When HLSL is trying to optimize, it often means the real intent of the original shader is hidden to the driver.
It probably detects the hardware and does its best to optimize, which should be enough.
I don't really know, but D3D may even flag the shader to the driver as being already optimized.

vector = Matrix * Matrix * vector; // slow
vector = Matrix * (Matrix * vector); // fast

The thought had crossed my mind. I assume the driver is, or will be, smart enough to reduce the instruction count.

Originally posted by V-man: