This I did not know. I tried beating my pascal compiler (don’t really like C/C++; it’s good enough for my work but for my private projects I’d rather use a language I like) by creating the SSE code myself and failed miserably.[/QUOTE]
Alignment plays a part, although I have not seen particularly large gains from using aligned over unaligned x86 instructions.
There are also times when standard (non-packed) instructions work out quicker. gcc will compile four successive float adds as just that, and even inline asm blocks doing vector adds on aligned data will not beat it on my Core Duo. HADDPS is another instruction that can be beaten, in this case by four instructions of unpacked shifts and adds.
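For anyone who wants to try the HADDPS comparison, here is a rough sketch of a horizontal sum built from shuffles and adds. It is my reading of the "shift and adds" idea, not the exact instruction sequence described above. The name `hsum4` and the scalar fallback for non-x86 builds are assumptions added for illustration.

```cpp
#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>  // SSE1 intrinsics
#define HSUM4_USE_SSE 1
#endif

// Hypothetical helper (the name is mine): sum four floats without HADDPS.
float hsum4(const float* p) {
#ifdef HSUM4_USE_SSE
    __m128 v    = _mm_loadu_ps(p);            // [p0, p1, p2, p3]
    __m128 hi   = _mm_movehl_ps(v, v);        // [p2, p3, p2, p3]
    __m128 sum2 = _mm_add_ps(v, hi);          // lane 0 = p0+p2, lane 1 = p1+p3
    __m128 odd  = _mm_shuffle_ps(sum2, sum2, _MM_SHUFFLE(1, 1, 1, 1)); // broadcast lane 1
    __m128 sum1 = _mm_add_ss(sum2, odd);      // lane 0 = (p0+p2) + (p1+p3)
    return _mm_cvtss_f32(sum1);
#else
    // Scalar fallback with the same summation order (assumption: non-x86 target).
    return (p[0] + p[2]) + (p[1] + p[3]);
#endif
}
```

After the load this is two shuffles and two adds. Whether it actually beats HADDPS depends on the CPU, so it needs benchmarking like everything else here.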
But even with function call overhead, I have found that vector floors, vector cubics and so on can beat an inlined C function by a factor of 3.
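As an illustration of what a vector floor can look like, here is a minimal SSE2 sketch. It is not the routine referred to above, just one common way of doing it: truncate toward zero, then subtract 1 wherever truncation rounded up. The name `floor4` is hypothetical, and the SSE2 path assumes every input fits in a 32-bit integer.

```cpp
#include <cmath>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>  // SSE2 intrinsics
#define FLOOR4_USE_SSE2 1
#endif

// Hypothetical helper (the name is mine): floor of four floats at once.
// Assumption: inputs are within int32 range; larger values need extra handling.
void floor4(const float* in, float* out) {
#ifdef FLOOR4_USE_SSE2
    __m128 x       = _mm_loadu_ps(in);
    __m128 t       = _mm_cvtepi32_ps(_mm_cvttps_epi32(x));  // truncate toward zero
    __m128 too_big = _mm_cmpgt_ps(t, x);                    // set for negative non-integers
    __m128 one     = _mm_and_ps(too_big, _mm_set1_ps(1.0f)); // 1.0 in those lanes, else 0.0
    _mm_storeu_ps(out, _mm_sub_ps(t, one));
#else
    // Scalar fallback (assumption: non-x86 target).
    for (int i = 0; i < 4; ++i) out[i] = std::floor(in[i]);
#endif
}
```

There is no branch anywhere in the SSE2 path, which is a large part of why this kind of routine can do well against a scalar floor call.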
You do need to benchmark, and yes offsets to data, page switches etc. can all affect it. So it’s something to do once a library or project is nearing completion so you can splice things together neatly.
The golden rule is to get data into the CPU as quickly and efficiently as possible, and then do lots of internal instructions, then get it back out as quickly and efficiently as possible. Although even then there are strange pitfalls. I have found that moving 4 floats on occasion can be quicker than moving one float out of a 4 wide register! Go figure!
Then when you think you have it all under control you learn about register and memory clobbering!
Any routine I write I benchmark against my best C++ over billions of iterations, and try both inline and function versions, and also change its location and context within the benchmark app. But for a 50% increase in the speed of your math library it is certainly worth it.
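A bare-bones version of that kind of comparison might look like the sketch below. The harness `time_ns` and the two cubic stand-ins are hypothetical names made up for the example. A real test would swap in the asm routine and the best C++ version, and use far more iterations.

```cpp
#include <chrono>
#include <cstddef>

// Hypothetical harness: calls fn(i) `iters` times and returns elapsed nanoseconds.
template <typename Fn>
long long time_ns(Fn fn, std::size_t iters) {
    volatile float sink = 0.0f;  // volatile so the optimizer cannot drop the loop
    auto t0 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < iters; ++i) {
        sink = sink + fn(static_cast<float>(i));
    }
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
}

// Two stand-in candidates evaluating 2x^3 - 3x^2 + 4x - 5 (illustrative only).
inline float cubic_naive(float x) {
    return 2.0f * x * x * x - 3.0f * x * x + 4.0f * x - 5.0f;
}
inline float cubic_horner(float x) {
    return ((2.0f * x - 3.0f) * x + 4.0f) * x - 5.0f;
}
```

Typical use would be `time_ns(cubic_naive, n)` against `time_ns(cubic_horner, n)`, run in both orders and from different places in the app, since location and context can shift the numbers.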
Bears out what I was saying above.
Again, I will look at asm for the GPU once my shaders are complete. I would not want to prototype in asm there. C is much easier for that.