I’m wondering if it is still worth it to off load this from the CPU. Sure, FPU’s can do sine and cosine today faster than a FPU from ten years ago could add two numbers, but still, would there be any performance gain by pre-computing sine values? Would the table look up outweigh the computation?
FSIN on a Core 2 processor takes between 35 and 85 clock cycles, and FCOS is even slower (70 to 100).
A table lookup has a latency of about 3 clock cycles and a single core can do 3 of these at the same time, so it is still way faster to do a table lookup.
Its only worth doing this if you are doing a large number of sine calculations, and preferably when you dont need a lot of accuracy (so the table can be as small as possible).
If you only use the table occasionally and it gets evicted from the processor cache’s, then the cache miss can take longer than the FSIN would have.
I have done a few tests about 5 years ago and i read from others who tested it too, and the conclusion was, that lookup-tables are actually slower.
For every lookup you need to add some overhead for rounding the number to the nearest value, that you know, and if you don’t want to waste too much memory, you don’t store the full 360 degree, but reconstruct the exact value, so that adds a few if’s and instructions, too. That means 3 cycles is definitely not possible, but you certainly will have much less than 80.
Only that memory lookups can be very costly. If you do A LOT of lookups in a tight loop, it might be faster, but if you do it only “once in a while” it’s quite probable, that the CPU first has to wait for the memory to be loaded into the cache. THAT can take thousands of cycles.
I have abondened the idea of lookup tables for such things. The missing precision will byte you at some time later anyway.
I was able to get a substantial speed improvement for the following specific conditions:
1/ Precision was not a problem because i was doing a simple X = Sin(A)*X, Y = Cos(A)*Y requiring low accuracy, and errors did not accumulate.
Hence i just used 16-bit fixed-point values in the table and a simple integer multiply instead of floating point math.
2/ All angles were in units of 1/256 of a rotation so i could simply add angles and byte arithmatic would wraparound without having to worry about under/overflow.
This also allowed the same 512 byte table to be used for cosine & sine simply by adding 64 to the angle.
3/ The sine table was prefetched to the L1 cache before it was needed, and was aligned to a 512 byte boundary to ensure its packed into the minimum number of cache lines and did not cross a 4k page boundary.
It should also be noted that FSIN can only be executed on one of the execution units on modern processors with multiple execution units per core, and it uses a microcode routine so it cant pipeline multiple instructions into that execution unit.
This means that two FSIN instructions need to be separated by about 100 simple instructions, otherwise all of the other execution units in that core will stall until the FSIN completes, reducing the total instruction throughput to about 1/4 of what it should be.
I need to calculate sine and cosine several times in each frame for rotations and camera movements. I haven’t counted how many times, but I’m not creating a regular polygon once in the code and that’s it.
Resolution is not much of an issue. 1/16 of a degree would be more than sufficient.
if it’s not dozens of thousands of times, then why worry? Besides, in the camera calculations you should use the highest precision possible.
Just as Simon wrote, with proper prefetch and memory alignment you can sequence instructions that get nicely pipelined. OTOH you can also sequence fsincos/fsin-using C++ code to mask the latency.
360 degrees * 16 * 4 bytes for a float is 23kB, thats a very big table.
A Core 2 Duo E6300 has an L1 data cache of only 32kB, so the table would have to be in L2 which inceases latency.
You definately dont want to be reading 23kB from main memory too often, so it would have to sit in the L2 cache all the time.
But this means 2% of a 1MB cache is not available to the rest of the program, creating more cache misses which slows down the rest of the program.
Hence it would only be faster if you were doing enough sine’s so that the saving in execution clock cycles was greater than the cycles lost in cache misses.
The only way to be sure is to try both ways and measure which method is faster for your particular application, but you would probably need to be doing hundreds of sine’s, preferably all together in one inner loop that executes once per frame.
If you use FSIN then DONT do them all at once, try to spread them out by doing as much other work as possible between each FSIN call.
Actually, it would be 90 * 16 * 4, so a little over 5kb, but I see your point.
Thanks for the infomation. I wont be doing anywhere near enough sines or cosines to justify this.
How about a resonant filter?
For camera rotations why not use quaternions they give great precision and speedup from the small number of evaluations