Fast Sin/Cos operations

I need really fast sin/cos functions for use in my 3D engine. The builtin functions are nice, but they aren’t fast enough. Intel’s library is geared towards their own processors and compiler.

What I need is a darn fast table lookup, preferably in a class. If need be, I’ll write one myself, but I’d like to avoid that if at all possible (because I’m apt to make it thread safe and feature laden). Is there a lib/class out there that has fully optimized sin/cos functions?
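For reference, the kind of table lookup being asked about could look roughly like the sketch below. The class name `SinCosTable`, the default of 360 entries and the nearest-entry rounding are all assumptions made for illustration, not an existing library. The table is never modified after construction, so concurrent reads need no locking.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical sine/cosine lookup table covering one full turn.
class SinCosTable {
public:
    explicit SinCosTable(std::size_t entries = 360)
        : sin_(entries), cos_(entries) {
        assert(entries > 0);
        const double two_pi = 6.283185307179586;
        for (std::size_t i = 0; i < entries; ++i) {
            const double angle =
                two_pi * static_cast<double>(i) / static_cast<double>(entries);
            sin_[i] = static_cast<float>(std::sin(angle));
            cos_[i] = static_cast<float>(std::cos(angle));
        }
    }

    // Nearest-entry lookup; degrees may be negative or larger than 360.
    float sin_deg(float degrees) const { return sin_[index(degrees)]; }
    float cos_deg(float degrees) const { return cos_[index(degrees)]; }

private:
    std::size_t index(float degrees) const {
        const float n = static_cast<float>(sin_.size());
        float turns = degrees / 360.0f;
        turns -= std::floor(turns);  // wrap into [0, 1]
        const std::size_t i = static_cast<std::size_t>(turns * n + 0.5f);
        return i % sin_.size();      // rounding up can land on n, so wrap again
    }

    std::vector<float> sin_;
    std::vector<float> cos_;
};
```

With the default size, `SinCosTable t; t.sin_deg(90.0f)` returns the stored value for 90 degrees. Accuracy is limited to the table step, since this sketch does not interpolate between entries.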


A lookup table is only fast if it's always resident in the 1st level cache. If there is a cache miss you lose lots of clock cycles.
Additionally, I think it's a waste of valuable cache memory to store precalculated tables (except for REALLY expensive operations).

I prefer the good old “__asm fsincos”; it takes only around 120 clock cycles to calculate both the sin and the cos for any given angle.
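Inline assembly syntax differs between compilers, so here is only a portable stand-in for the idea of getting both values from one call. The names `SinCos` and `sin_cos` are made up for this sketch. It simply calls the standard library twice; an optimizing compiler may or may not merge the two calls into one combined routine, so treat that part as an assumption to verify on your own toolchain.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical pair type holding both results for one angle.
struct SinCos {
    double sine;
    double cosine;
};

// Returns the sine and cosine of the same angle (in radians) together.
// This uses the standard library, not the x87 fsincos instruction directly.
inline SinCos sin_cos(double radians) {
    assert(std::isfinite(radians));
    return SinCos{std::sin(radians), std::cos(radians)};
}
```

Keeping both results behind one function also means the implementation can later be swapped for an intrinsic or assembly version without touching the callers.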

I once tested how I could optimize sin/cos calculations.

I precalculated a table for 0 to 359 degrees (both float and double). Then I tested how fast I could get the data (I made a loop which accessed one value for 10 seconds and counted how often I could do this).
Then I did the same, but instead of accessing the array, I just used the sine/cosine functions of math.h.
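A test along these lines might be set up like the sketch below. It is not the original test code: the names `BenchResult`, `run_for`, `make_sine_table`, `bench_table` and `bench_libm` are invented here, the time budget is a parameter instead of a fixed 10 seconds, and the loop sweeps all 360 degrees instead of reading a single value. The checksum exists only so the compiler cannot throw the work away.

```cpp
#include <cassert>
#include <chrono>
#include <cmath>
#include <cstdint>
#include <vector>

// Result of one timed run: number of evaluations that fit in the budget,
// plus a checksum that keeps the optimizer from discarding the loop.
struct BenchResult {
    std::uint64_t evaluations;
    double checksum;
};

// Calls eval(degree) for degrees 0..359 over and over until the budget expires.
template <typename Eval>
BenchResult run_for(std::chrono::milliseconds budget, Eval eval) {
    assert(budget.count() > 0);
    using clock = std::chrono::steady_clock;
    const auto deadline = clock::now() + budget;
    BenchResult result{0, 0.0};
    do {
        // Read the clock once per sweep so timing overhead stays small.
        for (int deg = 0; deg < 360; ++deg) {
            result.checksum += eval(deg);
        }
        result.evaluations += 360;
    } while (clock::now() < deadline);
    return result;
}

// Builds a 360-entry float table, one entry per whole degree.
std::vector<float> make_sine_table() {
    const double deg_to_rad = 3.14159265358979323846 / 180.0;
    std::vector<float> table(360);
    for (int deg = 0; deg < 360; ++deg) {
        table[deg] = static_cast<float>(std::sin(deg * deg_to_rad));
    }
    return table;
}

// Variant 1: read the precalculated value.
BenchResult bench_table(const std::vector<float>& table,
                        std::chrono::milliseconds budget) {
    return run_for(budget, [&table](int deg) { return table[deg]; });
}

// Variant 2: call the math library every time.
BenchResult bench_libm(std::chrono::milliseconds budget) {
    const float deg_to_rad = 3.14159265f / 180.0f;
    return run_for(budget,
                   [deg_to_rad](int deg) { return std::sin(deg * deg_to_rad); });
}
```

To compare the two, call `bench_table(make_sine_table(), std::chrono::seconds(10))` and `bench_libm(std::chrono::seconds(10))` in an optimized build and compare the `evaluations` counts. The numbers depend heavily on CPU, compiler and flags.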

Well, the result was astonishing. The math.h functions were up to 5% faster than the table lookup.

I couldn't believe this and checked whether anything could be distorting the test, but I always got the same results.

So now I use the functions, because they are faster and I can use floating point values as degrees.

I don't know, maybe one can speed this up a bit with assembler, but you won't be able to speed it up with a lookup table.

And the reason I tested it was that I read an article by someone who knows more about it than I do, and he claimed that function calls are faster than memory access. I tried to prove him wrong, but I couldn't.

I have an Athlon 1.3 GHz, 512 MB DDR RAM with a good Asus motherboard. So if one has this new RAM (which the Pentium 4 requires) (I forgot the name), then memory access might be faster.

Hope that helps you.

Benchmarking a lookup table in a tight for loop doesn't tell you much. When you access it constantly like that, the table is very likely to always be in the cache. In a real-life situation that is not the case, and you will get lots of expensive cache misses. So in real life, a lookup table should be even slower than in your for-loop test.

Also, did you run the program with full optimization and no debug information? With debug information, a benchmark test means nothing.

Yes, I know that the table should be even faster in a for loop (which shows that a lookup table is even worse in real life than in my test).

And yes, I ran it in release mode, with no debug stuff.