Greetings!
The problem in your code is rather trivial
(once you see it)! You do a lot of really
UNNECESSARY copy operations. In general,
parallel addition is faster than sequential
addition. Well, that's not news. What I
mean is, the problem is not within ‘_mm_add_ps’.
It is the overhead you create every time you
call your ‘add’ function.
The reason why your ‘add_sisd’ function is
faster is that it does fewer copy operations.
Since you have written the call to ‘set(…)’
directly in the return statement, the compiler
can do a so-called ‘return value optimization’,
which means that the result of ‘set(…)’ is
constructed directly in the return value
without any copying. Even further, no
temporaries are created inside ‘add_sisd’ at
all, if ‘set(…)’ really does only an assignment
(as you have said) and is also inlined (to avoid
touching the stack). So everything in ‘add_sisd’
happens ‘in place’, because of the way you have
laid out the code. Well, two things do get copied
(when you call ‘add_sisd’) and these are the
arguments of the function. But this is the same
overhead as you have in ‘add’. However, this
overhead is also UNNECESSARY and can be removed.
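To make that concrete, here is roughly the shape I assume your ‘add_sisd’ has (a reconstruction from your description, not your actual code); with ‘set(…)’ directly in the return statement, the result is constructed in place:

inline vector4d add_sisd(vector4d a, vector4d b) // a, b are copied in
{
    // set(…) in the return statement: RVO builds the result in place
    return set(a.elements[0] + b.elements[0],
               a.elements[1] + b.elements[1],
               a.elements[2] + b.elements[2],
               a.elements[3] + b.elements[3]);
}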
As Cyranose said, memory alignment can be an
issue, but not really in your part of the code.
Well, VC++ 6 usually aligns to a good amount.
If you have doubts about alignment, use
‘__declspec(align(16))’ in your variable
declaration, or check the assembler code to look
for it. However, copying perfectly aligned data
unnecessarily gains you nothing; a wasted copy
is a wasted copy, whether the data is aligned
or not. But don't get me wrong, alignment
is very important.
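For example, forcing 16-byte alignment in VC++ looks like this (the variable name is just for illustration):

__declspec(align(16)) float my_floats[4]; // now safe for 'movaps'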
So what happens in your ‘add’ function?
Every time you call this function you create the
temp variable ‘c’. Then you copy the result to
‘c.data’, and thereafter ‘c’ is copied to the
place in memory where the return value lives. All
of this can be removed.
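For reference, I assume your ‘add’ looks roughly like this (again a reconstruction from your description, not your actual code); the comments mark the copies in question:

inline vector4d add(vector4d a, vector4d b)  // the arguments are copied
{
    vector4d c;                              // the temp variable
    c.data = _mm_add_ps(a.data, b.data);     // the result is copied to c.data
    return c;                                // c is copied to the return value
}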
As a side note: you not only add in parallel with
the intrinsic ‘_mm_add_ps’, you also implicitly
move the result in parallel, unlike in ‘add_sisd’.
If you do a memory transfer within “SSE bounds”,
then the compiler will use the appropriate
copy/move instructions. In this case it is ‘movaps’.
So the naked power, in this example, (for the
parallel version) comes from the following two
assembler instructions - ‘addps’ and ‘movaps’.
The deal is now to give ‘addps’ the raw source data
and to let ‘movaps’ copy the result (located in a
register) directly to the location in memory where
your variable lives.
If this did not speed up the code, the entire
SSE unit would be useless!
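By the way, if your raw source data sits in plain float arrays rather than in your ‘vector4d’ struct, you can spell out exactly those two instructions with the load/store intrinsics (‘add4’ is just a name I made up, and all three pointers must be 16-byte aligned):

inline void add4(const float *a, const float *b, float *c)
{
    // _mm_load_ps/_mm_store_ps become 'movaps' on aligned pointers
    _mm_store_ps(c, _mm_add_ps(_mm_load_ps(a), _mm_load_ps(b)));
}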
What does an improved version look like? Here is a
trivial solution to the problem (for the parallel
version first).
inline void ADD_SIMD(const vector4d &a, const vector4d &b, vector4d &c)
{
    c.data = _mm_add_ps(a.data, b.data); // load, add, store - nothing else
}
Puhh, rather simple! This function does what I
have explained before. What? You want proof?
OK, let us look at the assembler instructions
{
movaps -72(%ebp), %xmm1
addps -88(%ebp), %xmm1
movaps %xmm1, -104(%ebp)
}
Wow!
This is the entire function body of ‘ADD_SIMD’!
Using ‘ADD_SIMD’ is simple:
vector4d a,b,c;
ADD_SIMD(a, b, c);
‘c’ holds the result.
Of course, this is somewhat different from
c = add(a, b)
but it is also possible to code a version
this way, if you want. The small “problem”
in doing so comes from having different types
involved, since the return value should be
‘vector4d’ whereas ‘_mm_add_ps’ returns
‘__m128’.
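One way to get that syntax back (assuming you are free to extend ‘vector4d’; the union layout and the constructor are my guesses, not your code) is a small converting constructor, so the return value optimization kicks in again:

union vector4d
{
    __m128 data;
    float  elements[4];
    vector4d() {}
    vector4d(__m128 m) : data(m) {} // hypothetical converting constructor
};

inline vector4d add(const vector4d &a, const vector4d &b)
{
    return vector4d(_mm_add_ps(a.data, b.data)); // built in place, no temp
}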
Wanna see your ‘add’ function in assembler?
No? Sorry… 
{
movl -56(%ebp), %esi
movl -52(%ebp), %ecx
movl -48(%ebp), %edx
movl %esi, -120(%ebp)
movl %ecx, -116(%ebp)
movl %edx, -112(%ebp)
movl -44(%ebp), %esi
movl -72(%ebp), %ecx
movl -68(%ebp), %edx
movl %esi, -108(%ebp)
movl %ecx, -136(%ebp)
movl %edx, -132(%ebp)
movl -64(%ebp), %esi
movl -60(%ebp), %ecx
movaps -120(%ebp), %xmm1
movl %esi, -128(%ebp)
movl %ecx, -124(%ebp)
addps -136(%ebp), %xmm1
movaps %xmm1, -104(%ebp)
movl -104(%ebp), %edx
movl -100(%ebp), %esi
movl -96(%ebp), %ecx
movl %edx, -88(%ebp)
movl %esi, -84(%ebp)
movl %ecx, -80(%ebp)
movl -92(%ebp), %edx
movl %edx, -76(%ebp)
}
Puhhh! Can you see all da moves? Now look
back at the assembly code from ‘ADD_SIMD’.
Well, the ‘movl’s above the first ‘movaps’
are the penalty for copying the parameters
of the function. The ‘movl’s below the second
‘movaps’ are the penalty for copying the temp
vector ‘c’ to the memory location where the
result should be stored.
Let's look at your ‘add_sisd’ function. This
function can also trivially be improved to
inline void ADD_SISD(const vector4d &a, const vector4d &b, vector4d &c)
{
    c.elements[0] = a.elements[0] + b.elements[0];
    c.elements[1] = a.elements[1] + b.elements[1];
    c.elements[2] = a.elements[2] + b.elements[2];
    c.elements[3] = a.elements[3] + b.elements[3];
}
Of course, listing the elements like this is
rather trivial, but also note that we have
changed the arguments (note the ‘&’). We now
call by reference and not by value, just as
we did in ‘ADD_SIMD’.
Assembler code? For sure!
{
flds -88(%ebp)
flds -84(%ebp)
flds -80(%ebp)
flds -76(%ebp)
fxch %st(3)
fadds -72(%ebp)
fxch %st(2)
fadds -68(%ebp)
fxch %st(1)
fadds -64(%ebp)
fxch %st(3)
fadds -60(%ebp)
fxch %st(2)
fstps -104(%ebp)
fstps -100(%ebp)
fxch %st(1)
fstps -96(%ebp)
fstps -92(%ebp)
}
Well, sweet. Here you can see how the code is
processed in a sequential manner: sequential
loads, sequential adds, and sequential stores.
What about the speed?
I have clocked both functions, ADD_SIMD and
ADD_SISD, down to cycles.
Result:
‘ADD_SIMD’ takes ~ 16 cycles.
‘ADD_SISD’ takes ~ 28 cycles.
(Cycles can vary a little due to the
OS, but these are the smallest I was
able to measure.)
Speed up of ‘ADD_SIMD’: ~ 42% faster
Now let's do it with 2^20 = 1048576 calls in
a row!
2^20 ‘ADD_SIMD’ ~ 2097316 cycles
2^20 ‘ADD_SISD’ ~ 13720816 cycles
Hence, in this case, ‘ADD_SIMD’ is at least
6 TIMES FASTER than ‘ADD_SISD’! That's a lot!
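In case you want to reproduce such numbers, here is a minimal sketch of how the cycle counter can be read on x86 with gcc, using the ‘rdtsc’ instruction (an assumption about the method, not my exact harness; serializing with ‘cpuid’ is left out for brevity):

static inline unsigned long long rdtsc(void)
{
    unsigned int lo, hi;
    __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
    return ((unsigned long long)hi << 32) | lo;
}

// usage: t0 = rdtsc(); ADD_SIMD(a, b, c); t1 = rdtsc();
// t1 - t0 is the (approximate) cycle count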
For fun, let's measure the cycles of your
‘add’ function.
‘add’ takes ~ 164 cycles
2^20 ‘add’ ~ 105118568 cycles
The ‘add’ function is ABOUT 50 TIMES
SLOWER than ‘ADD_SIMD’ over 2^20 calls
in a row, and ABOUT 10 TIMES SLOWER within
a single call! And both are using the SIMD
‘_mm_add_ps’ intrinsic function.
Well, using ‘ADD_SIMD’ will give you at least
a 10-fold speed-up!
Make your choice…
I think this should help.
Notes:
Stay away from memory copying! Memory transfer
is a slow operation (like searching) because of the
large gap between CPU performance and memory
bandwidth. Shuffling memory around in an efficient
way will be one of the main difficulties in the years
to come. Intel has already published some material on
that issue. It is possible to use SSE (esp. SSE2) for
fast memory transfers. Well, small temps can also
improve your code! Small temps in connection with
pointers can remove what is known as ‘pointer
aliasing’. This helps the compiler with optimization.
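As a small illustration of the aliasing point (the function is made up for this example): without the local temp, the compiler would have to reload and re-store ‘*sum’ on every iteration, because ‘sum’ might point into ‘data’:

void accumulate(float *sum, const float *data, int n)
{
    float tmp = 0.0f;            // small temp lives in a register
    for (int i = 0; i < n; ++i)
        tmp += data[i];          // no store to memory inside the loop
    *sum = tmp;                  // a single store at the end
}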
Some comments on the system I used:
Hardware: PIV 2.4GHz 1GB PC2100 DDR-SDRAM
OS : Linux (kernel 2.4.*)
Compiler: gcc 3.3 [-march=pentium4 -msse2 -s -O3 -fomit-frame-pointer]
cu,
m i s s i l e