my SSE code is slower than normal. why?


I would like to ask somebody to tell me, why is my SSE code for vector and matrix operations slower. I must do something wrong, but dont know what. My vector addition for example:
(im using intrinsics, xmmintirn.h for vc++6 sp5)

data type:

typedef union
__m128 data;
float elements[4];
} vector4d;

with SSE (using intrinsics):

inline vector4d add(vector4d a, vector4d b)
vector4d c; = _mm_add_ps(,;
return c;


inline vector4d add_sisd(vector4d a, vector4d b)
return set(a.elements[0]+b.elements[0],a.elements[1]+b.elements[1],a.elements[2]+b.elements[2],0);

(set() simply returns a vector4d with the parameter

Originally posted by orbano:

I would like to ask somebody to tell me, why is my SSE code for vector and matrix operations slower. [/b]

a) How did you do your timing? How much slower would give a clue as to what might be wrong.

b) the SSE functions are slow when using unaligned data. the _mm128 data type should implicitly align your union, but I’d double-check that. You may need a declspec align token to help the compiler realize what you’re doing.

c) In your SSE function, you create a temporary vector (plus the input vectors aren’t const references), which involves memory allocation, if only on the stack and a lot of reading/writing using non SSE instructions. Optimized math will generally be of the form { a=b; a+=c; } rather than { a = b + c; }

More imporantly, to get the benefits of SSE, it’s best to write the code so that the compiler can leave as much of your data in SSE registers for as long as possible to avoid any copies to main memory.

The compiler’s optimization pass should be able group multiple SSE instructions that wind up near each other this way (check optimization level), but it’s not guaranteed.

The only guarantted way, IMO, is to use the inline assembly SSE where you manage the registers yourself. That’s more useful for more complex math operations, but that’s where you’ll see the biggest benefit.

Hope that helps


[This message has been edited by Cyranose (edited 01-15-2004).]


The problem in you code is rather trivial
(if one understands)! You do a lot of really
UNNECESSARY copy operations. In general,
parallel addition is faster then sequential
addition. Well, thats not the news. What I
mean is, the problem is not within ‘_mm_add_ps’.
It is the overhead you create everytime you
call your ‘add’ function.

The reason why your ‘add_sisd’ function is
faster is because of fewer copy operations.
Since you have written the function ‘set(…)’
directly in the return statement the compiler
can do a so-called ‘return value optimization’,
which means that the restult of ‘set(…)’
function is directly created in the return
value without any copying. Well, even further,
no temporaries are created in the function of
‘add_sisd’, if the function ‘set(…)’ does
really only an assignment (as you have told)
and is also inlined (to prevent from touching
the stack). So everything in ‘add_sisd’ happends
‘in-place’, because of the way you have outline
the code. Well, two things get copied (when you
call ‘add_sisd’) and these are the arguments for
the function. But this is the same overhead as
you have in ‘add’. However, this overhead is
also UNNECESSARY and can be removed.

As Cyranose said, memory alignment can be an
issue, but not really in you part of code.
Well, vc++6 does usually align to a good amount.
If you have doubts about alignment use
‘__declspec(align(16))’ in you variable
declaration, or check the assembler code to look
for it. However, copying 100% perfect unnecessary
aligned data does not give you anything, whether
aligned or not. But don’t get me wrong, alignment
is very important.

So what happends in your ‘add’ function?
Everytime you call this function you create the
temp variable ‘c’. Then you copy the result to
‘’ and thereafter ‘c’ is copied to the
place where the return address is in memory. All
of this can be removed.

As a side note; You not only add in parallel with
the intrinsic ‘_mm_add_ps’, you implicitly move
the result also in parallel, unlike in ‘add_sisd’.
If you do a memory transfer within “SSE bounds”,
then the compiler will use the appropriate
copy/move instructions. In this case it is ‘movaps’.

So, the naked power, in this example, (for the
parallel version) comes from the following two
assembler instructions - ‘addps’ and ‘moveps’.
The deal is now to give ‘addps’ the raw source date
and to let ‘moveps’ copy the result (located in a
register) directly to the location in memory, where
your variable is living.

If this would not speed up the code, the entire
SSE unit would be useless!

What does an improved version look like? Here is a
trivial solution to the problem (for the parallel
version first).

inline void ADD_SIMD(const vector4d &a, const vector4d &b, vector4d &c)
{ = _mm_add_ps(,;

Puhh, rather simple! This function does what I
have explaind before. What? You want a prove?
Ok, let us look on the assembler instruction

movaps -72(%ebp), %xmm1
addps -88(%ebp), %xmm1
movaps %xmm1, -104(%ebp)

This is all of the function body of ‘ADD_SIMD’!

Using ‘ADD_SIMD’ is simple:

vector4d a,b,c;
ADD_SIMD(a, b, c);

‘c’ holds the result.

Of couse, this is somewhat different from

c = add(a, b)

but it is also possible to code a version
this way, if you want. The small “problem”
doing so comes from having different types
involved, since the return value should be
‘vector4d’ whereas ‘_mm_add_ps’ returns

Wanna see your ‘add’ function in assembler?
No? Sorry…

movl -56(%ebp), %esi
movl -52(%ebp), %ecx
movl -48(%ebp), %edx
movl %esi, -120(%ebp)
movl %ecx, -116(%ebp)
movl %edx, -112(%ebp)
movl -44(%ebp), %esi
movl -72(%ebp), %ecx
movl -68(%ebp), %edx
movl %esi, -108(%ebp)
movl %ecx, -136(%ebp)
movl %edx, -132(%ebp)
movl -64(%ebp), %esi
movl -60(%ebp), %ecx
movaps -120(%ebp), %xmm1
movl %esi, -128(%ebp)
movl %ecx, -124(%ebp)
addps -136(%ebp), %xmm1
movaps %xmm1, -104(%ebp)
movl -104(%ebp), %edx
movl -100(%ebp), %esi
movl -96(%ebp), %ecx
movl %edx, -88(%ebp)
movl %esi, -84(%ebp)
movl %ecx, -80(%ebp)
movl -92(%ebp), %edx
movl %edx, -76(%ebp)

Puhhh! Can you see all da moves? Now look
back at the assembly code from ‘ADD_SIMD’.

Well, the 'movl’s above the first ‘movaps’
are the penalty for copying the paramters
of the function. The 'movl’s below the second
‘movaps’ are the penalty for copying the temp
vector ‘c’ to the memory location where the
result should be stored.

Lets look at your ‘add_sisd’ function. This
function could also trivialy be improved to

inline void ADD_SISD(const vector4d &a, const vector4d &b, vector4d &c)
c.elements[0] = a.elements[0] + b.elements[0];
c.elements[1] = a.elements[1] + b.elements[1];
c.elements[2] = a.elements[2] + b.elements[2];
c.elements[3] = a.elements[3] + b.elements[3];

Of couse, listing the elements is really rather
trivial, but also note that we have changed the
arguments (note the ‘&’). We now call by reference
and not by value, like we have done in ‘ADD_SIMD’.

Assembler code? For sure!

flds -88(%ebp)
flds -84(%ebp)
flds -80(%ebp)
flds -76(%ebp)
fxch %st(3)
fadds -72(%ebp)
fxch %st(2)
fadds -68(%ebp)
fxch %st(1)
fadds -64(%ebp)
fxch %st(3)
fadds -60(%ebp)
fxch %st(2)
fstps -104(%ebp)
fstps -100(%ebp)
fxch %st(1)
fstps -96(%ebp)
fstps -92(%ebp)

Well, sweet. Here you can see how the code is
processed in a sequentially manner. We have
sequentially loads, sequentially adds and
sequentially stores.

What about the speed?

I have clocked both functions, ADD_SIMD and
ADD_SISD, down to cycles.

‘ADD_SIMD’ takes ~ 16 cycles.
‘ADD_SISD’ takes ~ 28 cycles.

(cycles can vary a little due to the
OS. But these are the smallest I was
able measure.)

Speed up of ‘ADD_SIMD’: ~ 42% faster

Now lets do it with 2^20 = 1048576 calls in
a row!

2^20 ‘ADD_SIMD’ ~ 2097316 cycles
2^20 ‘ADD_SISD’ ~ 13720816 cycles

Hence, in this case, ‘ADD_SIMD’ is at least
6 TIMES FASTER as ‘ADD_SISD’! That’s alot!

For fun, lets measure the cycles from your
‘add’ function.

‘add’ takes ~ 164 cycles

2^20 ‘add’ ~ 105118568 cycles

The ‘add’ function is ABOUT 50 TIMES
SLOWER than ‘ADD_SIMD’, if you make 2^20
calls to ‘ADD_SIMD’ in a row, and ABOUT 10
TIMES SLOWER within a single call! Both are
using the simd ‘_mm_add_ps’ intrinsic

Well, using ‘ADD_SIMD’ will give you at least
a 10-fold speed-up!

Make your choice…

I think, this should help.

Stay away from memory copying! Memory transfer
is a slow operation (like searching) because of the
large gap between cpu performance and memory
bandwidth. Suffling in an efficient way memory around
will be one of the main difficulties in the years to
come. Intel has allready put some stuff on that issue.
It is possible to use SSE (esp. SSE2) to make fast
memory transfer. Well, but small temps can also improve
your code! Small temps in connection with pointers can
remove what is known as ‘pointer aliasing’. This helps
the compiler in optimization.

Some comments on the system I used:
Hardware: PIV 2.4GHz 1GB PC2100 DDR-SDRAM
OS : Linux (kernel 2.4.*)
Compiler: gcc 3.3 [-march=pentium4 -msse2 -s -O3 -fomit-frame-pointer]

m i s s i l e

[This message has been edited by m i s s i l e (edited 01-25-2004).]

THANK YOU! This is the best answer i have ever got! If i could, i would thank you it in a verse…

The only thing i dont yet know is the vector dot product. it is also slower thas the sisd counterpart. the code is the following:

inline float Dot_SIMD(const TVector4d &a, const TVector4d &b)
TVector4d c;,;
return (c.v[0]+c.v[1]+c.v[2]+c.v[3]);

Do you have any ideas?

Originally posted by orbano:
THANK YOU! This is the best answer i have ever got! If i could, i would thank you it in a verse…

Well, for a 10-fold speed-up people
would kill.

m i s s i l e

Originally posted by orbano:
[b]The only thing i dont yet know is the vector dot product. it is also slower thas the sisd counterpart. the code is the following:

inline float Dot_SIMD(const TVector4d &a, const TVector4d &b)
TVector4d c;,;
return (c.v[0]+c.v[1]+c.v[2]+c.v[3]);

Do you have any ideas?[/b]

First off all, you missed the definition of the
scalar-product (dot product). But I guess it is a
typo in your post, since we multiply the components
of the two vectors and then add the result together
to get the dot. Hence, using _mm_mul_ps instead of

inline float Dot_SIMD(const TVector4d &a, const TVector4d &b)
TVector4d c;,;
return (c.v[0]+c.v[1]+c.v[2]+c.v[3]);

Speeding up?

Well, lets analyse a bit. The _mm_mul_ps intrinsic
is ok. But what about this sequential addition? Of
cause, this is where a lot of people get stucked,
since there is no (what is known as a) “horizontal”
addition (for __m128 data types) within the SSE2 unit.
This is one of the drawback of the SSE2 unit whereas
3DNOW has such an instruction, called ‘pfacc’. Well,
the problem will be solved with the upcoming next Intel
chip, called Prescott. This chip will utilizes
horizontal instructions.

However, there is minor versions of such horizontal
instructions in SSE2. You can multiply two __m128
together and then adding the low and higher part using
some annoying (its up to your knowledge) shuffling

Wanna have a faster version? Lets look at this

inline void DOT_SIMD(const vector4d &a, const vector4d &b, float &dot)
__m128 r;

r = _mm_mul_ps(,;
r = _mm_add_ps(_mm_movehl_ps(r, r), r);
r = _mm_add_ss(_mm_shuffle_ps(r, r, 1), r) ;

_mm_store_ss(&dot, r);

Read the manuals about ‘_mm_shuffle_ps’ and ‘_mm_movehl_ps’!
As allways, read as much as you can and play around with it.
This is the only way you can improve you skills, esp. on such
a topic. Using intrinsic’s is very close to the metal. On
this stage you must already unterstand some assembler
coding, otherwise you will blow your head off. Intrinsic’s
are a thin layer above the raw assembly instructions. Most
usefull to those who already have experience with SSE2 at
the native level. In this way using intrinsic’s is easier
and also more portable (to newer Intel microprocessors).
So, the intrinsic’s should NOT be unterstood as an easy way
for programers who are not into assembler and things alike.

If you really care for speed, then you have to go down
learn some concepts on the assembly level and come back
up again. It will definitely payoff! Even further, if new
assembly instructions appear you can trivially use them
to your advance where most compilers get stucked! Well, if
you don’t want effort the time, then it is better to use
a good compiler on such a topic. In this case, the Intel
icc C++ compiler is one such compilers (if not the best).
This compiler has an automatic vectorizer implemented.
Using this compiler would automaticly transform the sequence
(c.v[0]+c.v[1]+c.v[2]+c.v[3]) in a vectorized form. However,
a very nice combination is in having knowledge on the lower
level and a powerful compiler aside. In this way, you can
give the compiler the right hints and he can optimize the
hell out of your code! For example, you can give the gcc
compiler a hint to generate intructions for prefetching
(in advance!) a block of data into the cache! Placed at the
right position and your code must step on the breaks to not
cross the lightspeed barrier. Placed at the wrong position
and you will trash your cache and lose speed. Its for you
to decide.

Ok, now back to the speed test. I will measure
‘Dot_SIMD’ (your function), DOT_SISD and ‘DOT_SIMD’.

For completeness, here is the code for ‘DOT_SISD’:

inline float DOT_SISD(const vector4d &a, const vector4d &b)
return a.elements[0] * b.elements[0] + a.elements[1] * b.elements[1] +
a.elements[2] * b.elements[2] + a.elements[3] * b.elements[3];


1 ‘DOT_SISD’ ~ 84 cycles
1 ‘Dot_SIMD’ ~ 88 cycles
1 ‘DOT_SIMD’ ~ 52 cycles

Well, the speed-up of ‘DOT_SIMD’ against ‘Dot_SIMD’
is about 70% (88 / 52 ~= 1.70)

These timings are from my machine using gcc 3.3 with
options -march=pentium4 -msse2 -s -O3 -fomit-frame-pointer.

The performance of DOT_SIMD also depends on how your vectors
flow around in the system. It is very easy to out-perform
‘DOT_SIMD’ with ‘DOT_SISD’, for example, if your data is not
aligned. So, take care and also watch your compiler settings.

m i s s i l e

[This message has been edited by m i s s i l e (edited 01-25-2004).]