HW implemetation of trigonometry

Can anyone give a reference to some official document about trigonometric function implementation in modern GPUs?

There are some unofficial claims on the net (forums at first place), but I need some firm document to cite.

I have already carried out experiments and the implementations differ from vendor to vendor.
Currently, Intel has the most precise trigonometry, then NVIDIA, and AMD is definitely the worst.
Probably the density of lookup tables differ (if the implementations use lookup tables at all).

For my purposes I had to reimplement all trigonometric functions in GLSL. That solution is certainly slower (since both sine and cosine are executed in a single cycle of the clock), but the overall performances are almost the same (since shaders are not trivial) and the precision is raised incredibly.

Thank you for the help!

Hi,

I have make a test about linear interpolation + renormalization of cos/sin with a table of 360 steps and this seem to always give an approximed value very near to the true

``````
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

float Cosinus_table[361];
float Sinus_table[361];

#define M_PI 3.14159265358979323846

#define STEPS 1000

void InitTables()
{
int i;

for( i = 0 ; i < 361 ; i++ )
{
radians = ( (float)(i) * M_PI ) / 180.0f;
printf("SinCos[%d]=(%f, %f)
",
i, Sinus_table[i], Cosinus_table[i] );
}
}

void Renormalize( float *x, float *y)
{
float nx, ny, dist;

nx = *x;
ny = *y;

dist = sqrt(nx*nx + ny*ny);

nx /= dist;
ny /= dist;

*x = nx;
*y = ny;
}

{
float degrees, interp, c0, c1;

degrees = fmod(radians * (180.0f / M_PI), 360.0f);

c0 = Cosinus_table[(int)degrees];
c1 = Cosinus_table[(int)(degrees+1)];

interp = degrees - floor(degrees);

return c1 * interp + c0 * (1.0f - interp);
}

{
float degrees, interp, s0, s1;

degrees = fmod(radians * (180.0f / M_PI), 360.0f);

s0 = Sinus_table[(int)degrees];
s1 = Sinus_table[(int)(degrees+1)];

interp = degrees - floor(degrees);

return s1 * interp + s0 * (1.0f - interp);
}

int main( int argc , char **argv)
{
int i;
float radians, sreal, creal, sfast, cfast;
float sdiff, cdiff,  mindiff, maxdiff;

maxdiff = 0.0f;
mindiff = 1.0f;

InitTables();

for( i = 0 ; i < 360 * STEPS ; i++)
{
radians = ((float)(i) * M_PI) / (180.0f * STEPS);

Renormalize(&cfast, &sfast);

sdiff = sreal - sfast;
cdiff = creal - cfast;

//printf("step %d (%f degrees, %f radians) : (sin,cos)=(%f,%f) (fsin,fcos)=(%f,%f) diff=(%f,%f)
",
//    i, (float)(i)/(float)STEPS, radians, sreal, creal, sfast, cfast, sdiff, cdiff);

if ( cdiff > maxdiff) maxdiff = cdiff;
if ( cdiff < mindiff) mindiff = cdiff;
if ( sdiff > maxdiff) maxdiff = sdiff;
if ( sdiff < mindiff) mindiff = sdiff;
}

printf("mindiff=%f maxdiff=%f
", mindiff, maxdiff);
}

``````

With 360 000 steps (something like an angular step of 0,00001746 radians), this give only mindiff=-0.000001 maxdiff=0.000001
=> the interpolation + renormalization seem to make a relatively precise sin/cos reconstruction for intermediates values
(without the Renormalize() step, this give mindiff=-0.000038 maxdiff=0.000038 => in this case, the approximation is 38x more bad :()
[but the computation is certainly relatively more speed on the other side :)]

Note that I have only test angle between 0 and 360 degrees but I think that this work too with biggers or négatives values because the use of fmod() function
(internaly, this use degrees for to be “a little more table friendly” but externaly we have always to use radians values for angles)

PS : I don’t know is my fastSinus() and fastCosinus() are really as fast as this because I have only make them for to test the idea

PS2 : I have too find this http://lab.polygonal.de/?p=205 and this http://devmaster.net/forums/topic/4648-fast-and-accurate-sinecosine/ but, as for my example, it’s only a software approximation of sinus/cosinus, not a faster GPU hardware implementation scheme of sin/cos as you seem to want => you want to find infos about how is implemented FSINCOS for example, it’s good this ?

I have found at http://blog.makezine.com/2008/12/04/quakes-fast-inverse-square-roo/ how to compute rapidly the Inverse Square Root

``````
float InvSqrt(float x)
{
float xhalf = 0.5f*x;
int i = *(int*)&x; // get bits for floating value
i = 0x5f3759df - (i>>1); // gives initial guess y0
x = *(float*)&i; // convert bits back to float
x = x*(1.5f-xhalf*x*x); // Newton step, repeating increases accuracy

return x;
}

``````

This can to be used into my Renormalize() func for to accellerate it for exemple
(this can eliminate the sqrt() func ands replace the two divisions by a multiplication of the InvSqrt() return value)

On the other side, I think about using the GPU texture hardware for to handle the interpolation part and using a paletted texture
for to remplace the Sinus_table[] and Cosinus_table[] association array, limited to 256 entry instead 360 because paletted texture can only handle 256 colors/entries
(but this can perhaps really to be less fast than to simply use actuals RCP, RSQ or others EXP shader’s intructions ?)

In all cases, using Taylor/MacLaurin, Lauren/Newton or likes series/steps expansions seem to be generally a good solution for a lot of fast trigonometrics approximations

But it seem than hardware implementations use differents algorithms and I have found too links that speak about CORDIC and BKM here http://en.wikipedia.org/wiki/Cordic and here http://en.wikipedia.org/wiki/BKM_algorithm
=> they seem specially adapted for FPGAs because they don’t use multipliers

Here is a PDF link that explain how to use CORDIC into FPGAs http://www.andraka.com/files/crdcsrvy.pdf
(but what is explained into this PDF seem to be relatively complex to understand => after one or two days of reading, I could perhaps begin to understand what it is explained in it )
[this link http://web.archive.org/web/20000823135114/http://www.ezcomm.com/~cyliax/Articles/RobNav/sidebar.html seem to be very more simple to understand => I think to use it for to begin to understand what is really the CORDIC algorithm :)]

I have found one CORDIC software implementation at this link http://www.dcs.gla.ac.uk/~jhw/cordic/

I have used the 32 bits version of the header (that I have named corbic32.h)

``````
//Cordic in 32 bit signed fixed point math
//Function is valid for arguments in range -pi/2 -- pi/2
//for values pi/2--pi: value = half_pi-(theta-half_pi) and similarly for values -pi---pi/2
//
// 1.0 = 1073741824
// 1/k = 0.6072529350088812561694
// pi = 3.1415926536897932384626
//Constants
#define cordic_1K 0x26DD3B6A
#define half_pi 0x6487ED51
#define MUL 1073741824.000000
#define CORDIC_NTAB 32
int cordic_ctab [] = {0x3243F6A8, 0x1DAC6705, 0x0FADBAFC, 0x07F56EA6, 0x03FEAB76, 0x01FFD55B,
0x00FFFAAA, 0x007FFF55, 0x003FFFEA, 0x001FFFFD, 0x000FFFFF, 0x0007FFFF, 0x0003FFFF,
0x0001FFFF, 0x0000FFFF, 0x00007FFF, 0x00003FFF, 0x00001FFF, 0x00000FFF, 0x000007FF,
0x000003FF, 0x000001FF, 0x000000FF, 0x0000007F, 0x0000003F, 0x0000001F, 0x0000000F,
0x00000008, 0x00000004, 0x00000002, 0x00000001, 0x00000000, };

void cordic(int theta, int *s, int *c, int n)
{
int k, d, tx, ty, tz;
int x=cordic_1K,y=0,z=theta;
n = (n>CORDIC_NTAB) ? CORDIC_NTAB : n;
for (k=0; k<n; ++k)
{
d = z>>31;
//get sign. for other architectures, you might want to use the more portable version
//d = z>=0 ? 0 : -1;
tx = x - (((y>>k) ^ d) - d);
ty = y + (((x>>k) ^ d) - d);
tz = z - ((cordic_ctab[k] ^ d) - d);
x = tx; y = ty; z = tz;
}
*c = x; *s = y;
}

``````

(warning : the values theta, c and s values are used in a fixed point MUL format)

And I have only make some minors modifications to the .c part for to see how the approximation is big

``````
#include "cordic32.h"
#include <stdio.h>
#include <stdlib.h>
#include <math.h> // for testing only!

#define STEPS 360
#define ITERS 32

//Print out sin(x) vs fp CORDIC sin(x)
int main(int argc, char **argv)
{
int s, c;
float sinus,cosinus;
int i;

for(i=0;i<STEPS;i++)
{
deg = ((float)(i)/(float)(STEPS)) * 90.0f;
rad = ((float)(i)/(float)(STEPS)) * M_PI / 2.0f;

sinus = s / MUL;
cosinus = c / MUL;

//these values should be nearly equal
printf("Step %i (%f degrees, %f radians) : cordic=(%f,%f) vs (sin,cos)=(%f,%f) => diff=(%f,%f))
",
}
}

``````

This give only cos/sin values between 0 to PI/2 radians (cf. 0 to 90 degrees in my french country :)), so we have to use quadrants for to handle all the 0 to 2 PI range of the circle

And the approximation seem near perfect because this give always the sames values, cf. diff is always (0,0)

I think to benchmark it the next week because I have a big doubt about the fact that this cordic function can have a chance to be to really more speed than the sin() or cos() implementation on my Linux box …
(I can’t make/test a true hardware implementation of it because I’m not an hardware engineer so don’t know how, and don’t have the material too, for to use/test/devellop something on a FPGA)

(this is in french language)

But me too, I don’t find any documentation about “trigonometric function implementation in modern GPUs” on the Net

Thank you, Little Body, for so many examples!

I don’t need an implementation. I already have very precise one.
It is a combination of Taylor series, built-in functions and trigonometric identities.
For my application it is not important to be extremely fast, but extremely precise.

What I’m looking for is HW implementation that I can use for a citation.

CORDIC is a very interesting approach, but I really doubt it is used in modern GPU since it is an iterative method. That way they cannot achieve one cycle execution, or I missed something.

Thus far, I have discovered that NV implementation has 1.07288363581976e-5 step in sine calculation. It would require 8M-entry table for such precision, which I really doubt can be afforded. On the other hand, it is a step-function.

Did you try to implement your method in GLSL? What precision do you get?
Don’t compare with float versions of sin/cos from the math lib. They are far less accurate than its double counterparts.
What do you get for sin(0.5e-5), or smaller arguments? 0.5e-5 degrees (not radians)! Or 1e-6? HW versions retrieve always 0, which is a 100% wrong value. That’s what I’m talking about.

I have only begin to study about how is computed trigonometrics functions after your post, this is only my firsts tests

I have effectively read that they are problems with very little values
=> perhaps that using something like a pseudo-logarithmic scale (or hyperbolic ?) for to index the association array, where two consecutives entries was used for to make the interpolation (linear for the instant, but this can extended with an interpolation sheme that use more than two entries for to be more precise) + renormalisation, with [a lot ?] more values at the beginning that at the end can resolve a part of this problem ?
(but I have doubt about the fact that this can be enough precise for to work with doubles :()
[are double version of sin/cos/atan and other trigonometric functions, on Linux and C/C++, enough precises for to serve as a reference ?]

Or replace the renormalisation step by a refinement step ???
(or add a refinenement step instead to only replace the renormalization step ?)

I think to implement/test this in GLSL after to have find something that is enough precise (and speed if possible) with a standard C code
(and I have test nothing with doubles, only with float … but this is something that I have planned to make the next week)

Me too, I think that only a very precise implementation can really to be interesting to improve in GLSL, due to the fact that they seem already very fast on somes GLSL implementations (make better than 1 cycle seem me very hard :)) BUT that they don’t seem to be very precise
=> improve the precision seem to me the principal thing that can now to be added to them in GLSL

I think too test/improve the code from this link http://lab.polygonal.de/?p=205

I have make a little benchmark between standard C/C++ float and double versions of sin()/cos() and an adapted version of the code from http://lab.polygonal.de/?p=205

``````
#include <chrono>
#include <iostream>
#include <unistd.h>
#include <time.h>
#include <math.h>

#define PRECISION double

// Chronometer functions
using Clock = std::chrono::high_resolution_clock;
using std::chrono::milliseconds;
using std::chrono::nanoseconds;
using std::chrono::duration_cast;

auto start = Clock::now();
auto end = Clock::now();
auto delai = duration_cast<milliseconds>(end-start).count();

inline void BeginTimer()
{
start = Clock::now();
}

inline void EndTimer()
{
end = Clock::now();
}

void PrintTimer(char *title)
{
// std::cout << duration_cast<nanoseconds>(end-start).count() << " ns
";
std::cout << title << " : "<< duration_cast<milliseconds>(end-start).count() << " ms
";
}

// Fast approximation functions
inline PRECISION fast_sin(PRECISION x)
{
PRECISION sin;

//always wrap input angle to -PI..PI
if ( x < -3.14159265 )
while ( x < -3.14159265)
x += 6.28318531 ;
else
if ( x >  3.14159265)
while ( x >  3.14159265)
x -= 6.28318531;

//compute sine
if (x < 0)
sin = 1.27323954 * x + .405284735 * x * x;
else
sin = 1.27323954 * x - 0.405284735 * x * x;

return sin;
}

inline PRECISION fast_cos( PRECISION x)
{
PRECISION cos;

//always wrap input angle to -PI..PI
if (x < -3.14159265)
while ( x < -3.14159265)
x += 6.28318531;
else
if (x >  3.14159265)
while ( x >  3.14159265)
x -= 6.28318531;

// cos(x) = sin(x + PI/2)
x += 1.57079632;

if (x >  3.14159265)
x -= 6.28318531;

if (x < 0)
cos = 1.27323954 * x + 0.405284735 * x * x;
else
cos = 1.27323954 * x - 0.405284735 * x * x;

return cos;
}

// Fast/ and precises approximation functions
inline PRECISION fast_precise_sin(PRECISION x)
{
PRECISION sin;

//always wrap input angle to -PI..PI
if ( x < -3.14159265 )
while ( x < -3.14159265)
x += 6.28318531 ;
else
if ( x >  3.14159265)
while ( x >  3.14159265)
x -= 6.28318531;

//compute sine
if (x < 0)
{
sin = 1.27323954 * x + .405284735 * x * x;

if (sin < 0)
sin = .225 * (sin *-sin - sin) + sin;
else
sin = .225 * (sin * sin - sin) + sin;
}
else
{
sin = 1.27323954 * x - 0.405284735 * x * x;

if (sin < 0)
sin = .225 * (sin *-sin - sin) + sin;
else
sin = .225 * (sin * sin - sin) + sin;
}

return sin;
}

inline PRECISION fast_precise_cos( PRECISION x)
{
PRECISION cos;

//always wrap input angle to -PI..PI
if (x < -3.14159265)
while ( x < -3.14159265)
x += 6.28318531;
else
if (x >  3.14159265)
while ( x >  3.14159265)
x -= 6.28318531;

// cos(x) = sin(x + PI/2)
x += 1.57079632;

if (x >  3.14159265)
x -= 6.28318531;

if (x < 0)
{
cos = 1.27323954 * x + 0.405284735 * x * x;

if (cos < 0)
cos = .225 * (cos *-cos - cos) + cos;
else
cos = .225 * (cos * cos - cos) + cos;
}
else
{
cos = 1.27323954 * x - 0.405284735 * x * x;

if (cos < 0)
cos = .225 * (cos *-cos - cos) + cos;
else
cos = .225 * (cos * cos - cos) + cos;
}

return cos;
}

// Tests standards functions
void TestStandardSinus(char *title)
{
int i;
PRECISION val, sinus;

BeginTimer();

for( i = 0; i < 1000000 ; i++ , val += 0.0000001f)
{
sinus    = sin(val);
}

EndTimer();

PrintTimer(title);
}

void TestStandardCosinus(char *title)
{
int i;
PRECISION val, cosinus;

BeginTimer();

for( i = 0; i < 1000000 ; i++ , val += 0.0000001f)
{
cosinus    = cos(val);
}

EndTimer();

PrintTimer(title);
}

// Tests fast functions
void TestFastSinus(char *title)
{
int i;
PRECISION val, sinus;

BeginTimer();

for( i = 0; i < 1000000 ; i++ , val += 0.0000001f)
{
sinus    = fast_sin(val);
}

EndTimer();

PrintTimer(title);
}

void TestFastCosinus(char *title)
{
int i;
PRECISION val, cosinus;

BeginTimer();

for( i = 0; i < 1000000 ; i++ , val += 0.0000001f)
{
cosinus    = fast_cos(val);
}

EndTimer();

PrintTimer(title);
}

// Tests fast and precise functions
void TestFastPreciseSinus(char *title)
{
int i;
PRECISION val, sinus;

BeginTimer();

for( i = 0; i < 1000000 ; i++ , val += 0.0000001f)
{
sinus    = fast_precise_sin(val);
}

EndTimer();

PrintTimer(title);
}

void TestFastPreciseCosinus(char *title)
{
int i;
PRECISION val, cosinus;

BeginTimer();

for( i = 0; i < 1000000 ; i++ , val += 0.0000001f)
{
cosinus    = fast_precise_cos(val);
}

EndTimer();

PrintTimer(title);
}

// Tests ultra fast versions
void TestUltraFastSinus(char *title)
{
int i;
PRECISION val, sinus, x;

BeginTimer();

for( i = 0; i < 1000000 ; i++ , val += 0.0000001f)
{
x = val;

if ( x < -3.14159265 )
while ( x < -3.14159265)
x += 6.28318531 ;
else
if ( x >  3.14159265)
while ( x >  3.14159265)
x -= 6.28318531;

//compute sine
if (x < 0)
sinus = 1.27323954 * x + .405284735 * x * x;
else
sinus = 1.27323954 * x - 0.405284735 * x * x;
}

EndTimer();

PrintTimer(title);
}

void TestUltraFastCosinus(char *title)
{
int i;
PRECISION val, cosinus, x;

BeginTimer();

for( i = 0; i < 1000000 ; i++ , val += 0.0000001f)
{
// cos(x) = sin(x + PI/2)
x = val + 1.57079632;

//always wrap input angle to -PI..PI
if (x < -3.14159265)
while ( x < -3.14159265)
x += 6.28318531;
else
if (x >  3.14159265)
while ( x >  3.14159265)
x -= 6.28318531;

if (x < 0)
cosinus = 1.27323954 * x + 0.405284735 * x * x;
else
cosinus = 1.27323954 * x - 0.405284735 * x * x;
}

EndTimer();

PrintTimer(title);
}

// main
int main()
{
TestStandardSinus(     "Standard    C/C++ sin() func  ");
TestStandardCosinus(   "Standard    C/C++ cos() func  ");

TestFastPreciseSinus(  "FastPrecise C/C++ sin() func");
TestFastPreciseCosinus("FastPrecise C/C++ cos() func");

TestFastSinus(         "Fast        C/C++ sin() func");
TestFastCosinus(       "Fast        C/C++ cos() func");

TestUltraFastSinus(    "UltraFast   C/C++ sin() func");
TestUltraFastCosinus(  "UltraFast   C/C++ cos() func");

}

``````

=> the integrated/optimised versions of sin/cos presents in TestUltraFastSinus() and TestUltraFastCosinus() can to be more than 4x more speed than the standards C/C++ sin()/cos() function

With the float version :

``````
Standard    C/C++ sin() func   : 90 ms
Standard    C/C++ cos() func   : 96 ms
FastPrecise C/C++ sin() func : 38 ms
FastPrecise C/C++ cos() func : 57 ms
Fast        C/C++ sin() func : 27 ms
Fast        C/C++ cos() func : 36 ms
UltraFast   C/C++ sin() func : 15 ms
UltraFast   C/C++ cos() func : 21 ms

``````

(the code make 1 000 000 sinus (or cosinus) computations at each benchmark)

With the double version :

``````
Standard    C/C++ sin() func   : 87 ms
Standard    C/C++ cos() func   : 95 ms
FastPrecise C/C++ sin() func : 65 ms
FastPrecise C/C++ cos() func : 73 ms
Fast        C/C++ sin() func : 51 ms
Fast        C/C++ cos() func : 62 ms
UltraFast   C/C++ sin() func : 17 ms
UltraFast   C/C++ cos() func : 20 ms

``````

Is normal that the standard and ultrafast versions seem equivalents with float and double ?
(my chronometer is certainly not the best in the world but the double versions of sin/cos seem however more speed than the float versions …)

I think than the the Fast and FastPrecise versions are less speed in the double version because they have a bigger size of arguments to handle in input and output
(cf. this is the function push/pop of values that make the différence)

I think add speedly the check of differences between the reference value returned by sin/cos() and the various optimised computation of them using fast_cos/sin() , fastprecise_sin/cos() and the integrated computation of fast_sin/cos() included into TestUltraFastSinus/Cosinus()
(I can now work without any problem with doubles instead floats since computations with doubles seem as fast as computations with float on my Linux box => is it a normal thing ???)

I have make/debugged a new version that compute difference between goods values given by sin/cos and approximed/optimised versions of them

``````
#include <chrono>
#include <iostream>
#include <unistd.h>
#include <time.h>
#include <math.h>

#define PRECISION float
#define BENCHSIZE 10000000
#define WRAPPING  1

// #define BENCHINC  0.000001f
PRECISION BENCHINC   = (6.28318531 / BENCHSIZE) * 2.0f;
PRECISION BENCHSTART = -BENCHSIZE * BENCHINC;

///////////////////////////
// Chronometer functions //
///////////////////////////
using Clock = std::chrono::high_resolution_clock;
using std::chrono::milliseconds;
using std::chrono::nanoseconds;
using std::chrono::duration_cast;

auto start = Clock::now();
auto end = Clock::now();
auto delai = duration_cast<milliseconds>(end-start).count();

inline void BeginTimer()
{
start = Clock::now();
}

inline void EndTimer()
{
end = Clock::now();
}

void PrintTimer(char *title)
{
// std::cout << duration_cast<nanoseconds>(end-start).count() << " ns
";
std::cout << title << " : "<< duration_cast<milliseconds>(end-start).count() << " ms ";
}

///////////////////////////////
// Tests standards functions //
///////////////////////////////
void TestStandardSinus(char *title)
{
int i;
PRECISION val, sinus;

BeginTimer();

for( i = 0, val = -BENCHSTART; i < BENCHSIZE ; i++ , val += BENCHINC)
{
sinus    = sin(val);
}

EndTimer();

PrintTimer(title);
}

void TestStandardCosinus(char *title)
{
int i;
PRECISION val, cosinus;

BeginTimer();

for( i = 0, val = -BENCHSTART; i < BENCHSIZE ; i++ , val += BENCHINC)
{
cosinus    = cos(val);
}

EndTimer();

PrintTimer(title);
}

//////////////////////////////////
// Fast approximation functions //
//////////////////////////////////
inline PRECISION fast_sin(PRECISION x)
{
PRECISION sin;

//always wrap input angle to -PI..PI
#ifdef WRAPPING
if ( x < -3.14159265 )
while ( x < -3.14159265)
x += 6.28318531 ;
// else
if ( x >  3.14159265)
while ( x >  3.14159265)
x -= 6.28318531;
#endif

//compute sine
if (x < 0)
sin = 1.27323954 * x + 0.405284735 * x * x;
else
sin = 1.27323954 * x - 0.405284735 * x * x;

return sin;
}

inline PRECISION fast_cos( PRECISION x)
{
PRECISION cos;

// cos(x) = sin(x + PI/2)
x += 1.57079632;

//always wrap input angle to -PI..PI
#ifdef WRAPPING
if (x < -3.14159265)
while ( x < -3.14159265)
x += 6.28318531;
// else
if (x >  3.14159265)
while ( x >  3.14159265)
x -= 6.28318531;
#endif

if (x < 0)
cos = 1.27323954 * x + 0.405284735 * x * x;
else
cos = 1.27323954 * x - 0.405284735 * x * x;

return cos;
}

////////////////////////////////////////
// Tests fast approximation functions //
////////////////////////////////////////
void TestFastSinus(char *title)
{
int i;
PRECISION val, sinus;
PRECISION  ref, diff, moy = 0, variance = 0, mini = 1.0f , maxi = -1.0f;

BeginTimer();

for( i = 0, val = -BENCHSTART; i < BENCHSIZE ; i++ , val += BENCHINC)
{
sinus    = fast_sin(val);
}

EndTimer();

PrintTimer(title);

for( i = 0, val = -BENCHSTART; i < BENCHSIZE ; i++ , val += BENCHINC)
{
ref = sin(val);
sinus    = fast_sin(val);

diff = sinus - ref;
moy  += diff;
variance += diff * diff;

if ( diff < mini )
mini = diff;

if ( diff > maxi )
maxi = diff;
}

moy /= i;
variance = sqrt(variance / i);

// std::cout << "(min/moy/max=" << mini << "/" << moy << "/" << maxi << ")";
printf("(min/moy/variance/max)=(%1.9f/%1.9f/%1.9f/%1.9f)", mini, moy, variance, maxi);
}

void TestFastCosinus(char *title)
{
int i;
PRECISION val, cosinus;
PRECISION  ref, diff, moy = 0, variance = 0, mini = 1.0f , maxi = -1.0f;

BeginTimer();

for( i = 0, val = -BENCHSTART; i < BENCHSIZE ; i++ , val += BENCHINC)
{
cosinus    = fast_cos(val);
}

EndTimer();

PrintTimer(title);

for( i = 0, val = -BENCHSTART; i < BENCHSIZE ; i++ , val += BENCHINC)
{
ref = cos(val);
cosinus    = fast_cos(val);

diff = cosinus - ref;
moy += diff;
variance += diff*diff;

if ( diff < mini )
mini = diff;

if ( diff > maxi )
maxi = diff;
}

moy /= i;
variance = sqrt(variance / i);

//std::cout << "(min/moy/max=" << mini << "/" << moy << "/" << maxi << ")";
printf("(min/moy/variance/max)=(%1.9f/%1.9f/%1.9f/%1.9f)", mini, moy, variance, maxi);
}

///////////////////////////////
// Tests ultra fast versions //
///////////////////////////////
void CheckUltraFastSinus()
{
int i;
PRECISION val, sinus, x;
PRECISION  ref, diff, moy = 0, variance=0, mini = 1.0f , maxi = -1.0f;

for( i = 0, val = -BENCHSTART; i < BENCHSIZE ; i++ , val += BENCHINC)
{
x = val;
ref = sin(val);

#ifdef WRAPPING
// wrap input angle to -PI..PI
if ( x < -3.14159265 )
while ( x < -3.14159265)
x += 6.28318531 ;
// else
if ( x >  3.14159265)
while ( x >  3.14159265)
x -= 6.28318531;
#endif

//compute sine
if (x < 0)
sinus = 1.27323954 * x + .405284735 * x * x;
else
sinus = 1.27323954 * x - 0.405284735 * x * x;

// check diff
diff = sinus - ref;
moy += diff;
variance += diff * diff;

if ( diff < mini )
mini = diff;

if ( diff > maxi )
maxi = diff;
}

moy /= i;
variance = sqrt(variance / i);

//std::cout << "(min/moy/max=" << mini << "/" << moy << "/" << maxi << ")";
printf("(min/moy/variance/max=(%1.9f/%1.9f/%1.9f/%1.9f)", mini, moy, variance, maxi);
}

void TestUltraFastSinus(char *title)
{
int i;
PRECISION val, sinus, x;

BeginTimer();

for( i = 0, val = -BENCHSTART; i < BENCHSIZE ; i++ , val += BENCHINC)
{
x = val;

#ifdef WRAPPING
// wrap input angle to -PI..PI
if ( x < -3.14159265 )
while ( x < -3.14159265)
x += 6.28318531 ;
// else
if ( x >  3.14159265)
while ( x >  3.14159265)
x -= 6.28318531 ;
#endif

//compute sine
if (x < 0)
sinus = 1.27323954 * x + 0.405284735 * x * x;
else
sinus = 1.27323954 * x - 0.405284735 * x * x;
}

EndTimer();

PrintTimer(title);

CheckUltraFastSinus();
}

void CheckUltraFastCosinus()
{
int i;
PRECISION val, cosinus, x;
PRECISION  ref, diff, moy = 0, variance = 0, mini = 1.0f, maxi = -1.0f;

for( i = 0, val = -BENCHSTART; i < BENCHSIZE ; i++ , val += BENCHINC)
{
// cos(x) = sin(x + PI/2)
x = val + 1.57079632;
ref = cos(val);

#ifdef WRAPPING
// wrap input angle to -PI..PI
if ( x < -3.14159265 )
while ( x < -3.14159265)
x += 6.28318531 ;
// else
if ( x >  3.14159265)
while ( x >  3.14159265)
x -= 6.28318531;
#endif

//compute cosine
if (x < 0)
cosinus = 1.27323954 * x + 0.405284735 * x * x;
else
cosinus = 1.27323954 * x - 0.405284735 * x * x;

// check diff
diff = cosinus - ref;
moy += diff;
variance += diff * diff;

if ( diff < mini )
mini = diff;

if ( diff > maxi )
maxi = diff;
}

moy /= i;
variance = sqrt(variance / i);

//std::cout << "(min/moy/max=" << mini << "/" << moy << "/" << maxi << ")";
printf("(min/moy/variance/max)=(%1.9f/%1.9f/%1.9f/%1.9f)", mini, moy, variance, maxi);
}

void TestUltraFastCosinus(char *title)
{
int i;
PRECISION val, cosinus, x;

BeginTimer();

for( i = 0, val = -BENCHSTART; i < BENCHSIZE ; i++ , val += BENCHINC)
{
// cos(x) = sin(x + PI/2)
x = val + 1.57079632;

#ifdef WRAPPING
// wrap input angle to -PI..PI
if (x < -3.14159265)
while ( x < -3.14159265)
x += 6.28318531;
// else
if (x >  3.14159265)
while ( x >  3.14159265)
x -= 6.28318531;
#endif

if (x < 0)
cosinus = 1.27323954 * x + 0.405284735 * x * x;
else
cosinus = 1.27323954 * x - 0.405284735 * x * x;
}

EndTimer();

PrintTimer(title);

CheckUltraFastCosinus();

}

///////////////////////////////////////////////
// Fast but precises approximation functions //
///////////////////////////////////////////////
inline PRECISION fast_precise_sin(PRECISION x)
{
PRECISION sin;

//always wrap input angle to -PI..PI
#ifdef WRAPPING
if ( x < -3.14159265 )
while ( x < -3.14159265)
x += 6.28318531 ;
// else
if ( x >  3.14159265)
while ( x >  3.14159265)
x -= 6.28318531;
#endif

//compute sine
if (x < 0)
{
sin = 1.27323954 * x + .405284735 * x * x;

if (sin < 0)
sin = .225 * (sin *-sin - sin) + sin;
else
sin = .225 * (sin * sin - sin) + sin;
}
else
{
sin = 1.27323954 * x - 0.405284735 * x * x;

if (sin < 0)
sin = .225 * (sin *-sin - sin) + sin;
else
sin = .225 * (sin * sin - sin) + sin;
}

return sin;
}

inline PRECISION fast_precise_cos( PRECISION x)
{
PRECISION cos;

// cos(x) = sin(x + PI/2)
x += 1.57079632;

// always wrap input angle to -PI..PI
#ifdef WRAPPING
if (x < -3.14159265)
while ( x < -3.14159265)
x += 6.28318531;
// else
if (x >  3.14159265)
while ( x >  3.14159265)
x -= 6.28318531;
#endif

if (x < 0)
{
cos = 1.27323954 * x + 0.405284735 * x * x;

if (cos < 0)
cos = .225 * (cos *-cos - cos) + cos;
else
cos = .225 * (cos * cos - cos) + cos;
}
else
{
cos = 1.27323954 * x - 0.405284735 * x * x;

if (cos < 0)
cos = .225 * (cos *-cos - cos) + cos;
else
cos = .225 * (cos * cos - cos) + cos;
}

return cos;
}

////////////////////////////////////////////////////
// Tests fast but precise approximation functions //
////////////////////////////////////////////////////
void TestFastPreciseSinus(char *title)
{
int i;
PRECISION val, sinus;
PRECISION  ref, diff, moy = 0, variance = 0, mini = 1.0f , maxi = -1.0f;

BeginTimer();

for( i = 0, val = -BENCHSTART; i < BENCHSIZE ; i++ , val += BENCHINC)
{
sinus    = fast_precise_sin(val);
}

EndTimer();

PrintTimer(title);

for( i = 0, val = -BENCHSTART; i < BENCHSIZE ; i++ , val += BENCHINC)
{
ref = sin(val);
sinus    = fast_precise_sin(val);

diff = sinus - ref;
moy += diff;
variance += diff*diff;

if ( diff < mini )
mini = diff;

if ( diff > maxi )
maxi = diff;
}

moy /= i;
variance = sqrt(variance / i);

//std::cout << "(min/moy/max=" << mini << "/" << moy << "/" << maxi << ")";
printf("(min/moy/variance/max)=(%1.9f/%1.9f/%1.9f/%1.9f)", mini, moy, variance, maxi);
}

void TestFastPreciseCosinus(char *title)
{
int i;
PRECISION val, cosinus;
PRECISION  ref, diff, moy = 0, variance=0, mini = 1.0f , maxi = -1.0f;

BeginTimer();

for( i = 0, val = -BENCHSTART; i < BENCHSIZE ; i++ , val += BENCHINC)
{
cosinus    = fast_precise_cos(val);
}

EndTimer();

PrintTimer(title);

for( i = 0, val = -BENCHSTART; i < BENCHSIZE ; i++ , val += BENCHINC)
{
ref = cos(val);
cosinus    = fast_precise_cos(val);

diff = cosinus - ref;
moy += diff;
variance += diff*diff;

if ( diff < mini )
mini = diff;

if ( diff > maxi )
maxi = diff;
}

moy /= i;
variance = sqrt(variance / i);

//std::cout << "(min/moy/max=" << mini << "/" << moy << "/" << maxi << ")";
printf("(min/moy/variance/max)=(%1.9f/%1.9f/%1.9f/%1.9f)", mini, moy, variance, maxi);
}

///////////////////////////////////////////
// Tests ultra fast and precise versions //
///////////////////////////////////////////
void CheckUltraFastPreciseSinus()
{
int i;
PRECISION val, sinus, x;
PRECISION  ref, diff, moy = 0, variance=0, mini = 1.0f , maxi = -1.0f;

for( i = 0, val = -BENCHSTART; i < BENCHSIZE ; i++ , val += BENCHINC)
{
x = val;
ref = sin(val);

#ifdef WRAPPING
// wrap input angle to -PI..PI
if ( x < -3.14159265 )
while ( x < -3.14159265)
x += 6.28318531 ;
// else
if ( x >  3.14159265)
while ( x >  3.14159265)
x -= 6.28318531;
#endif

//compute sine
if (x < 0)
{
sinus = 1.27323954 * x + .405284735 * x * x;

if (sinus < 0)
sinus = .225 * (sinus *-sinus - sinus) + sinus;
else
sinus = .225 * (sinus * sinus - sinus) + sinus;
}
else
{
sinus = 1.27323954 * x - 0.405284735 * x * x;

if (sinus < 0)
sinus = .225 * (sinus *-sinus - sinus) + sinus;
else
sinus = .225 * (sinus * sinus - sinus) + sinus;
}

// check diff
diff = sinus - ref;
moy += diff;
variance += diff * diff;

if ( diff < mini )
mini = diff;

if ( diff > maxi )
maxi = diff;
}

moy /= i;
variance = sqrt(variance / i);

//std::cout << "(min/moy/max=" << mini << "/" << moy << "/" << maxi << ")";
printf("(min/moy/variance/max=(%1.9f/%1.9f/%1.9f/%1.9f)", mini, moy, variance, maxi);
}

void TestUltraFastPreciseSinus(char *title)
{
int i;
PRECISION val, sinus, x;

BeginTimer();

for( i = 0, val = -BENCHSTART; i < BENCHSIZE ; i++ , val += BENCHINC)
{
x = val;

#ifdef WRAPPING
// wrap input angle to -PI..PI
if ( x < -3.14159265 )
while ( x < -3.14159265)
x += 6.28318531 ;
// else
if ( x >  3.14159265)
while ( x >  3.14159265)
x -= 6.28318531 ;
#endif

//compute sine
if (x < 0)
{
sinus = 1.27323954 * x + .405284735 * x * x;

if (sinus < 0)
sinus = .225 * (sinus *-sinus - sinus) + sinus;
else
sinus = .225 * (sinus * sinus - sinus) + sinus;
}
else
{
sinus = 1.27323954 * x - 0.405284735 * x * x;

if (sinus < 0)
sinus = .225 * (sinus *-sinus - sinus) + sinus;
else
sinus = .225 * (sinus * sinus - sinus) + sinus;
}
}

EndTimer();

PrintTimer(title);

CheckUltraFastPreciseSinus();
}

void CheckUltraFastPreciseCosinus()
{
int i;
PRECISION val, cosinus, x;
PRECISION  ref, diff, moy = 0, variance = 0, mini = 1.0f, maxi = -1.0f;

for( i = 0, val = -BENCHSTART; i < BENCHSIZE ; i++ , val += BENCHINC)
{
// cos(x) = sin(x + PI/2)
x = val + 1.57079632;
ref = cos(val);

#ifdef WRAPPING
// wrap input angle to -PI..PI
if ( x < -3.14159265 )
while ( x < -3.14159265)
x += 6.28318531 ;
// else
if ( x >  3.14159265)
while ( x >  3.14159265)
x -= 6.28318531;
#endif

if (x < 0)
{
cosinus = 1.27323954 * x + 0.405284735 * x * x;

if (cosinus < 0)
cosinus = .225 * (cosinus *-cosinus - cosinus) + cosinus;
else
cosinus = .225 * (cosinus * cosinus - cosinus) + cosinus;
}
else
{
cosinus = 1.27323954 * x - 0.405284735 * x * x;

if (cosinus < 0)
cosinus = .225 * (cosinus *-cosinus - cosinus) + cosinus;
else
cosinus = .225 * (cosinus * cosinus - cosinus) + cosinus;
}

// check diff
diff = cosinus - ref;
moy += diff;
variance += diff * diff;

if ( diff < mini )
mini = diff;

if ( diff > maxi )
maxi = diff;
}

moy /= i;
variance = sqrt(variance / i);

//std::cout << "(min/moy/max=" << mini << "/" << moy << "/" << maxi << ")";
printf("(min/moy/variance/max)=(%1.9f/%1.9f/%1.9f/%1.9f)", mini, moy, variance, maxi);
}

void TestUltraFastPreciseCosinus(char *title)
{
int i;
PRECISION val, cosinus, x;

BeginTimer();

for( i = 0, val = -BENCHSTART; i < BENCHSIZE ; i++ , val += BENCHINC)
{
// cos(x) = sin(x + PI/2)
x = val + 1.57079632;

#ifdef WRAPPING
// wrap input angle to -PI..PI
if (x < -3.14159265)
while ( x < -3.14159265)
x += 6.28318531;
// else
if (x >  3.14159265)
while ( x >  3.14159265)
x -= 6.28318531;
#endif

if (x < 0)
{
cosinus = 1.27323954 * x + 0.405284735 * x * x;

if (cosinus < 0)
cosinus = .225 * (cosinus *-cosinus - cosinus) + cosinus;
else
cosinus = .225 * (cosinus * cosinus - cosinus) + cosinus;
}
else
{
cosinus = 1.27323954 * x - 0.405284735 * x * x;

if (cosinus < 0)
cosinus = .225 * (cosinus *-cosinus - cosinus) + cosinus;
else
cosinus = .225 * (cosinus * cosinus - cosinus) + cosinus;
}
}

EndTimer();

PrintTimer(title);

CheckUltraFastPreciseCosinus();

}

//////////
// main //
//////////
int main()
{
std::cout << "
Benchmark cos()/sin() optimization with " << BENCHSIZE << " iterations" << "
";
std::cout << "(from " << BENCHSTART << " radians to " << -BENCHSTART << " radians with a stepsize of " << BENCHINC << ")

";

TestStandardSinus(             "Standard C/C++ sin()       "); std::cout << "
";
TestStandardCosinus(           "Standard C/C++ cos()       "); std::cout << "

";

TestFastPreciseSinus(          "inlined fast_precise_sin() "); std::cout << "
";
TestFastPreciseCosinus(        "inlined fast_precise_cos() "); std::cout << "

";

TestUltraFastPreciseSinus(        "UltraFastPreciseSinus()    "); std::cout << "
";
TestUltraFastPreciseCosinus(      "UltraFastPreciseCosinus()  "); std::cout << "

";

TestFastSinus(                 "inlined fast_sin()         "); std::cout << "
";
TestFastCosinus(               "inlined fast_cos()         "); std::cout << "

";

TestUltraFastSinus(           "UltraFastSinus()           "); std::cout << "
";
TestUltraFastCosinus(          "UltraFastCosinus()         "); std::cout << "

";
}

``````

This give this with the float version :

``````
Benchmark cos()/sin() optimization with 10000000 iterations

Standard C/C++ sin()        : 745 ms
Standard C/C++ cos()        : 774 ms

inlined fast_precise_sin()  : 598 ms (min/moy/variance/max)=(-0.001090974/0.000029710/0.000597704/0.001090437)
inlined fast_precise_cos()  : 695 ms (min/moy/variance/max)=(-0.001091599/-0.000001502/0.000588869/0.001091063)

UltraFastPreciseSinus()     : 520 ms (min/moy/variance/max=(-0.001090974/0.000029710/0.000597704/0.001090437)
UltraFastPreciseCosinus()   : 702 ms (min/moy/variance/max)=(-0.001091599/-0.000001502/0.000588869/0.001091063)

inlined fast_sin()          : 506 ms (min/moy/variance/max)=(-0.056010097/0.009407033/0.035580181/0.056009740)
inlined fast_cos()          : 589 ms (min/moy/variance/max)=(-0.056010574/-0.000039861/0.035341624/0.056010306)

UltraFastSinus()            : 382 ms (min/moy/variance/max=(-0.056010097/0.009407033/0.035580181/0.056009740)
UltraFastCosinus()          : 461 ms (min/moy/variance/max)=(-0.056010574/-0.000039861/0.035341624/0.056010306)

``````

And with the double version :
(only change the “#define PRECISION float” by “#define PRECISION double” into the source code)

``````
Benchmark cos()/sin() optimization with 10000000 iterations

Standard C/C++ sin()        : 737 ms
Standard C/C++ cos()        : 765 ms

inlined fast_precise_sin()  : 903 ms (min/moy/variance/max)=(-0.001090303/0.000000000/0.000596678/0.001090319)
inlined fast_precise_cos()  : 956 ms (min/moy/variance/max)=(-0.001090311/-0.000000000/0.000596679/0.001090326)

UltraFastPreciseSinus()     : 526 ms (min/moy/variance/max=(-0.001090303/0.000000000/0.000596678/0.001090319)
UltraFastPreciseCosinus()   : 571 ms (min/moy/variance/max)=(-0.001090311/-0.000000000/0.000596679/0.001090326)

inlined fast_sin()          : 839 ms (min/moy/variance/max)=(-0.056009604/0.000000000/0.035836168/0.056009589)
inlined fast_cos()          : 801 ms (min/moy/variance/max)=(-0.056009610/-0.000000000/0.035836167/0.056009594)

UltraFastSinus()            : 403 ms (min/moy/variance/max=(-0.056009604/0.000000000/0.035836168/0.056009589)
UltraFastCosinus()          : 462 ms (min/moy/variance/max)=(-0.056009610/-0.000000000/0.035836167/0.056009594)

``````

Only the UltraFast[Precise]Sinus/Cosinus() implementations are faster than standards cos/sin() functions in the double version
(the fast_sin/cos() and fast_precise_sin/cos() funcs are more slow in the double version)
[but they are alls more speed on the float version :)]

PS : I don’t see really an amelioration with the double version compared to the float version => this is normal ???
(the float version is more fast but on the other side the double version don’t seem to be more precise …)

Yes, they are. Especially for the single precision calculated counterparts.
Only if you could use more precise variables than doubles you should worry about precision of C-math function implementation. Assume 15 decimal positions are absolutely correct.

Don’t wait till then just to find out something that works on CPU does not work the same way on GPU. The precision is not the same (although vendors claim full IEEE754 compatibility) as well as HW function implementations.

My suggestion is not to try to find ultra-fast but ultra-precise implementation using single precision arithmetic. There are algorithms for double precision simulation with single precision, but they are not trivial and require more computation time that is necessary, unless we really need more that 10 decimal digits precision. If your FP implementations generate error above 1e-3 they are not useful. Built-in functions are several orders of magnitude more precise (except for very small arguments, or borders of the argument range).

PS : I don’t see really an amelioration with the double version compared to the float version => this is normal ???
(the float version is more fast but on the other side the double version don’t seem to be more precise …)

No, that is not normal. DP version should have more precise results. At least 9 orders of magnitude. Just be aware that coefficients for the interpolation polynomials are not the same for SP and DP version, as well as power. You need polynomials of higher order for DP.

The following is almost the best we can get for SP:

``````float GLRenderer::cos_52s(float x)
{
const float c1= 0.9999932946;
const float c2=-0.4999124376;
const float c3= 0.0414877472;
const float c4=-0.0012712095;
float x2; // The input argument squared
x2=x * x;
return (c1 + x2*(c2 + x2*(c3 + c4*x2)));
}
``````

This implementation has 5 exact decimal positions.

For DP, the following works OK:

``````double cos_121s(double x)
{
const double c1= 0.99999999999925182;
const double c2=-0.49999999997024012;
const double c3= 0.041666666473384543;
const double c4=-0.001388888418000423;
const double c5= 0.0000248010406484558;
const double c6=-0.0000002752469638432;
const double c7= 0.0000000019907856854;
double x2; // The input argument squared
x2=x * x;
return (c1 + x2*(c2 + x2*(c3 + x2*(c4 + x2*(c5 + x2*(c6 + c7*x2))))));
}
``````

The author claims 12 decimal position precision, but I would agree with just 11.
Error is above 4e-12.

Lets back to GPU HW implementation. Here is a short summary what I found this weekend:
[ol]
[li]sin/cos functions are HW based[/li][li]Unit that calculate those functions are called Special Function Unit (SFU) in NV terminology[/li][li]Based on High-Performance Area-Efficient Multifunction Interpolator (quadratic interpolation algorithm) - A table lookup is used to retrieve optimized coefficients for a quadratic polynomial approximation, followed by a high-speed evaluation of the polynomial.[/li][li]Accuracy – 22.47 (the number of good bits in the prenormalized results)[/li][li]Kepler uses the same logic as Fermi[/li][li]Each SFU executes one instruction per thread, per clock; a warp executes over eight clocks. The SFU pipeline is decoupled from the dispatch unit, allowing the dispatch unit to issue to other execution units while the SFU is occupied.[/li][/ol]

In short, they use quadratic interpolation with coefficient lookup tables.

No, we can continue to discuss more precise software implementations.

Thank, Alexandar

In short, they use quadratic interpolation with coefficient lookup tables.

Where the lookup table of coefficients is indexed by the angle for example ?

I have tested your first formula but my check find than this approximation seem to give bad result around PI (cf. 180 degrees)

I have using this :

``````
void CheckCosinus_52s()
{
int i;
PRECISION val, cosinus, x, smaller, bigger;
PRECISION  ref, diff, moy = 0, variance = 0, mini = 1.0f, maxi = -1.0f;

const float c1= 0.9999932946;
const float c2=-0.4999124376;
const float c3= 0.0414877472;
const float c4=-0.0012712095;
float x2; // The input argument squared

for( i = 0, val = -BENCHSTART; i < BENCHSIZE ; i++ , val += BENCHINC)
{
x = val;
ref = cos(val);

#ifdef WRAPPING
// wrap input angle to -PI..PI
if ( x < -3.14159265 )
while ( x < -3.14159265)
x += 6.28318531 ;
#ifdef WRAPELSE
else
#endif
if ( x >  3.14159265)
while ( x >  3.14159265)
x -= 6.28318531;
#endif

//compute cosine
x2 = x * x;
cosinus = c1 + x2*(c2 + x2*(c3 + c4*x2));

// check diff
diff = cosinus - ref;
moy += diff;
variance += diff * diff;

if ( diff < mini )
{
mini = diff;
smaller = val;
}

if ( diff > maxi )
{
maxi = diff;
bigger = val;
}
}

moy /= i;
variance = sqrt(variance / i);
smaller  = smaller - (6.28318531 * floor(smaller/6.28318531));
bigger   = bigger  - (6.28318531 * floor(bigger /6.28318531));

//std::cout << "(min/moy/max=" << mini << "/" << moy << "/" << maxi << ")";
printf("(min/moy/variance/max)=(%1.6f/%1.6f/%1.6f/%1.6f)", mini, moy, variance, maxi);
printf(" [smaller=%3.1f degrees, bigger=%3.1f degrees]", smaller * (360.0f/6.28318531),  bigger * (360.0f/6.28318531) );

}

void TestCosinus_52s(char *title)
{
int i;
PRECISION val, cosinus, x;

const float c1= 0.9999932946;
const float c2=-0.4999124376;
const float c3= 0.0414877472;
const float c4=-0.0012712095;
float x2; // The input argument squared

BeginTimer();

for( i = 0, val = -BENCHSTART; i < BENCHSIZE ; i++ , val += BENCHINC)
{
x = val;
#ifdef WRAPPING
// wrap input angle to -PI..PI
if (x < -3.14159265)
while ( x < -3.14159265)
x += 6.28318531;
#ifdef WRAPELSE
else
#endif
if (x >  3.14159265)
while ( x >  3.14159265)
x -= 6.28318531;
#endif

//compute cosine
x2  =x * x;
cosinus = c1 + x2*(c2 + x2*(c3 + c4*x2));
}

EndTimer();

PrintTimer(title);

CheckCosinus_52s();

}

``````

And Add this test of the end of the main()

``````
TestCosinus_52s(          "FastCosinus_52s()          "); std::cout << "

";

``````

And the result is this :

``````
Benchmark cos()/sin() optimization with 1000000 iterations
[with a stepsize of 0.00072 degrees]

Standard C/C++ sin()        : 85 ms
Standard C/C++ cos()        : 86 ms

inlined fast_precise_sin()  : 74 ms (min/moy/variance/max)=(-0.001091/-0.000025/0.000605/0.001090) [smaller=168.9 degrees, bigger=348.9 degrees]
inlined fast_precise_cos()  : 78 ms (min/moy/variance/max)=(-0.001092/0.000010/0.000588/0.001091) [smaller=78.9 degrees, bigger=258.9 degrees]

UltraFastPreciseSinus()     : 60 ms (min/moy/variance/max=(-0.001091/-0.000025/0.000605/0.001090) [smaller=168.9 degrees, bigger=348.9 degrees]
UltraFastPreciseCosinus()   : 69 ms (min/moy/variance/max)=(-0.001092/0.000010/0.000588/0.001091) [smaller=78.9 degrees, bigger=258.9 degrees]

inlined fast_sin()          : 55 ms (min/moy/variance/max)=(-0.056010/0.002068/0.036154/0.056010) [smaller=333.0 degrees, bigger=27.0 degrees]
inlined fast_cos()          : 63 ms (min/moy/variance/max)=(-0.056011/0.000306/0.035148/0.056010) [smaller=117.0 degrees, bigger=297.0 degrees]

UltraFastSinus()            : 43 ms (min/moy/variance/max=(-0.056010/0.002068/0.036154/0.056010) [smaller=333.0 degrees, bigger=27.0 degrees]
UltraFastCosinus()          : 53 ms (min/moy/variance/max)=(-0.056011/0.000306/0.035148/0.056010) [smaller=117.0 degrees, bigger=297.0 degrees]

FastCosinus_52s()           : 52 ms (min/moy/variance/max)=(-0.114787/-0.010663/0.025695/0.000007) [smaller=180.0 degrees, bigger=83.1 degrees]

``````

(this is the last line that use your first formula, no tested the second formula)
[cf. a very good precision around PI/2, cf 83.1 degrees, but very bad at PI cf. 180.0 degrees]
=> the formula isn’t usable between -PI to PI but between 0 to PI/2 or something like this ?

We can perhaps use NURBS or something like this instead an indexed lookup of quadratics curve’s coefficients ?
(or interpolate quadratics coefficients, instead “only” values computed by them, between two or more lookup entries ?)
[the current NV hardware implementation seem to use something like a piecewyse quadratic function where each piece, cf. quadratic’s coefficients, is given by an index/angle lookup]

=> is the quadratic’s coefficients table lookups used by NV strictly private or it is accessible somewhere ?
(in alls cases, I think that we can recompute ourselves the quadratic coefficients of each piece if we have values and their derivatives at the begin and the end of each “step/index lookup” [values can easily be computed using cos(x)/sin(x) and their derivative as easily by -sin(x)/cos(x) … But OK all the difficulty is to compute the goods quadratic’s coefficients from them ])

For a binary input operand X with n-bit significand, the significand is divided into two parts: Xu is the upper part containing m bits, and Xl is the lower part containing n − m bits. The upper m bits Xu are used to consult a set of three lookup tables to return three finiteword coefficients C0, C1, and C2. Each function to be approximated requires a unique set of tables. These coefficients are used to approximate f(X) in the range Xu ≤X <Xu + 2^−m by evaluating the expression.

First, that is not “my” formula. I have just picked it up from Jack G. Ganssle. Approximation should be used only on the interval [0…pi/2].
Why would you use any trigonometric approximation on any other interval? Everything should be reduced to the first quadrant.

But the point is: That is just an example of precise quadratic approximation on the whole interval. GPU vendors already have better ones.
We (or at least I) need higher precision, especially on each end of the interval.

If you have time to make an experiment it will be interesting to hear the results.

[QUOTE=The Little Body;1249472]…(or interpolate quadratics coefficients, instead “only” values computed by them, between two or more lookup entries ?)
[the current NV hardware implementation seem to use something like a piecewyse quadratic function where each piece, cf. quadratic’s coefficients, is given by an index/angle lookup]…[/QUOTE]
See above.

[QUOTE=The Little Body;1249472]=> is the quadratic’s coefficients table lookups used by NV strictly private or it is accessible somewhere ?
(in alls cases, I think that we can recompute ourselves the quadratic coefficients of each piece if we have values and their derivatives at the begin and the end of each “step/index lookup” [values can easily be computed using cos(x)/sin(x) and their derivative as easily by -sin(x)/cos(x) … But OK all the difficulty is to compute the goods quadratic’s coefficients from them ])[/QUOTE]
Why would we do that? NV already have fast implementation with quadratic interpolation built in hardware. Intel has even better (but Intel’s GPUs are part of CPU die, so probably the logic is different). What we could do is to propose more precise implementation in GLSL.

Lot of thanks for alls your explanations

"]First, that is not “my” formula. I have just picked it up from Jack G. Ganssle.

So thanks to Jack G. Ganssle too

I have too find anothers methods at http://www.jeremybouny.fr/2012/10/approximation-cosinus-et-sinus-exemple-en-c/ or http://devmaster.net/forums/topic/4648-fast-and-accurate-sinecosine/ but I haven’t tested it

(but don’t find on this page how he compute the approximations values, only graphics about the precision of the approximation)

Approximation should be used only on the interval [0…pi/2].
Why would you use any trigonometric approximation on any other interval? Everything should be reduced to the first quadrant.

Because it’s certainly more speed to have a fonction that return always a good value than an algo that have to use differents formulas for each quadrant with something like a “cascade” of if/else for to select the good quadrant ?

Why would we do that? NV already have fast implementation with quadratic interpolation built in hardware. Intel has even better (but Intel’s GPUs are part of CPU die, so probably the logic is different).

It’s for that this can to be integrated into anothers implementations like Mesa for to give the same generics possibilities on plateforms that don’t have an Intel or NVidia GPU for example

But the point is: That is just an example of precise quadratic approximation on the whole interval. GPU vendors already have better ones.
We (or at least I) need higher precision, especially on each end of the interval.

Same reasons as previous => for to be the more generic possible and to give the same possibilities around a maximum of GPU platforms or software implementations like Mesa

If you have time to make an experiment it will be interesting to hear the results.

Yes, of course

I cannot say that I can give something that is fonctionnal in somes hours but I think that this can be possible in a not too long delay
=> something like a small number of days/weeks, I don’t think that this can take more than one month to do
(but Ok, I can not predict in advance if this can be “better” [on speed and/or precision] than actuals hardware implementations like Intel or Nvidia solutions [and I have serious doubts about the possibilities to have only one of them but I like to dream too :)], the major advantage is than this can to be a generic method that can be virtually used/implemented on alls implementations)

``````if ( x &lt; -3.14159265 )
while ( x &lt; -3.14159265)
x += 6.28318531 ;
``````

No, no, no!

``````
if(x<-3.14159265||x>3.14159265)
x=(x+3.14159265)%6.28318531-3.14159265;//If you insist on never using 'pi' or 'tau'...

``````

Done wrapping

???

I have make a test with your method of wrapping :

``````
UltraFastSinus()            : 84 ms (min/moy/variance/max=(-0.056010/0.002069/0.036154/0.056010) [smaller=332.9 degrees, bigger=152.9 degrees]

``````

But I have this with my initial wrapping …

``````
UltraFastSinus()            : 41 ms (min/moy/variance/max=(-0.056010/0.002068/0.036154/0.056010) [smaller=333.0 degrees, bigger=27.0 degrees]

``````

I use this code :

``````
// Comment this #define for to have the "no, no, no" version
#define MY_WRAP 1

///////////////////////////////
// Tests ultra fast versions //
///////////////////////////////
void CheckUltraFastSinus()
{
int i;
PRECISION val, sinus, x, smaller, bigger;
PRECISION  ref, diff, moy = 0, variance=0, mini = 1.0f , maxi = -1.0f;

for( i = 0, val = -BENCHSTART; i < BENCHSIZE ; i++ , val += BENCHINC)
{
x = val;
ref = sin(val);

#ifdef WRAPPING
#ifdef MY_WRAP
// wrap input angle to -PI..PI
if ( x < -3.14159265 )
while ( x < -3.14159265)
x += 6.28318531 ;
#ifdef WRAPELSE
else
#endif /* WRAPELSE */
// wrap input angle to PI..-PI
if ( x >  3.14159265 )
while ( x >  3.14159265)
x -= 6.28318531;
#else /* FASTER WRAPPING ??? */
if( x < -3.14159265 || x > 3.14159265 )
x = fmod( x + 3.14159265 , 6.28318531) - 3.14159265;
#endif /* MY_WRAP */
#endif /* WRAPPING */

//compute sine
if (x < 0)
sinus = 1.27323954 * x + .405284735 * x * x;
else
sinus = 1.27323954 * x - 0.405284735 * x * x;

// check diff
diff = sinus - ref;
moy += diff;
variance += diff * diff;

if ( diff < mini )
{
mini = diff;
smaller = val;
}

if ( diff > maxi )
{
maxi = diff;
bigger = val;
}
}

moy /= i;
variance = sqrt(variance / i);
smaller  = smaller - (6.28318531 * floor(smaller/6.28318531));
bigger   = bigger  - (6.28318531 * floor(bigger /6.28318531));

//std::cout << "(min/moy/max=" << mini << "/" << moy << "/" << maxi << ")";
printf("(min/moy/variance/max=(%1.6f/%1.6f/%1.6f/%1.6f)", mini, moy, variance, maxi);
printf(" [smaller=%3.1f degrees, bigger=%3.1f degrees]", smaller * (360.0f/6.28318531),  bigger * (360.0f/6.28318531) );

}

void TestUltraFastSinus(char *title)
{
int i;
PRECISION val, sinus, x;

BeginTimer();

for( i = 0, val = -BENCHSTART; i < BENCHSIZE ; i++ , val += BENCHINC)
{
x = val;

#ifdef WRAPPING
#ifdef MY_WRAP
// wrap input angle to -PI..PI
if ( x < -3.14159265 )
while ( x < -3.14159265)
x += 6.28318531 ;
#ifdef WRAPELSE
else
#endif /* WRAPELSE */
// wrap input angle to PI..-PI
if ( x >  3.14159265 )
while ( x >  3.14159265)
x -= 6.28318531;
#else /* FASTER WRAPPING ??? */
if( x < -3.14159265 || x > 3.14159265 )
x = fmod( x + 3.14159265 , 6.28318531) - 3.14159265;
#endif /* MY_WRAP */
#endif /* WRAPPING */

//compute sine
if (x < 0)
sinus = 1.27323954 * x + 0.405284735 * x * x;
else
sinus = 1.27323954 * x - 0.405284735 * x * x;
}

EndTimer();

PrintTimer(title);

CheckUltraFastSinus();
}

``````

Note that I cannot use the % operator because I have this error with g++

error: invalid operands of types double and double to binary operator %
=> I think that the big slowdown is due to the fmod() function instead %
(how to work with the operator % with float or double types ?)

PS : I immediatly replace numericals values with some const or #define

I have found a nice .pdf file at http://hal.inria.fr/ensl-00000002/
=> this pdf seem very hardware oriented and I see that NVIDIA and INTEL are contributors for example
==> I test to understand methods explained in this pdf before to begin to dream about to improve this

Yes, that’s a HW implementation I was talking about.
You have found paper published at IEEE TRANSACTIONS ON COMPUTERS, VOL. 54, NO. 3, MARCH 2005.
There’s another version published at Computer Arithmetic, 2005. ARITH-17 2005. 17th IEEE Symposium on Computer Arithmetic (link).

Are tables size really a problem for a desktop plateform (such as a PC, a MAC or a UN*X workstation for example) ?
(of course, this is certainly the case for a smatphone because of his low power/“complexity” need)

I have make tests/benchmark with some approximations formulas on my computer and find that the best precision/speed is given with a very simple array lookup on a moderate size of 65K entries (cf. the TableSinus and TableCosinus benchs)

Here is how I compute/use “direct” tables lookups for to have fast approximations of cos(x) using TableCos(x), and sin(x) using TableSin(x)
(the sin/cos tables have here a size of 65K entries but this can to be modified using STEPS and STEPMOD definitions)

``````
#include "approximations.h"

#define M_PI 3.14159265358979323846

#define STEPS   65536
#define STEPMOD 65535

float Cos_table[STEPS+1];
float Sin_table[STEPS+1];

float rad2idx = (0.5f * (float)(STEPS)) / M_PI;

////////// Appriximation SIN / COS

void InitTables()
{
int i;

for( i = 0 ; i < STEPS ; i++ )
{
// radians = (float)(i) * (M_PI  / 180.0f);
radians = (float)(i) * (2.0f * M_PI  / (float)STEPS);
// printf("SinCos[%d]=(%f, %f)
", i, Sin_table[i], Cos_table[i] );
}
}

{
// float degrees;
// int index;

// degrees = fmod(radians * (180.0f / M_PI), 360.0f);

// degrees = radians * 0.5f * STEPS / M_PI;

// index = (int)(degrees) % STEPS;

// return Cos_table[index];
}

{
// float degrees;
// int   index;

// degrees = fmod(radians * (180.0f / M_PI), 360.0f);
// degrees = radians * (0.5f * STEPS / M_PI);

// index = (int)(degrees) % STEPS;

// return Sin_table[index];
}

////////////////////////////////////////
// Tests Table approximation functions //
////////////////////////////////////////
void TestTableSinus(char *title)
{
register int i;
register float val, sinus;
float smaller = 0, bigger = 0;
float  ref, diff, moy = 0, variance = 0, mini = 1.0f , maxi = -1.0f;

BeginTimer();

for( i = 0, val = -BENCHSTART ; i < BENCHSIZE ; i++ , val += BENCHINC)
{
// sinus    = TableSinus(val);
// sinus = Sin_table[ (int)(val* rad2idx) % STEPS ];
sinus = Sin_table[ (int)(val* rad2idx) & STEPMOD ];
}

EndTimer();

PrintTimer(title);

for( i = 0, val = -BENCHSTART ; i < BENCHSIZE ; i++ , val += BENCHINC)
{
ref = sin(val);
// sinus    = TableSinus(val);
// sinus = Sin_table[ (int)(val * rad2idx) % STEPS ];
sinus = Sin_table[ (int)(val* rad2idx) & STEPMOD ];

diff = sinus - ref;
moy  += diff;
variance += diff * diff;

if ( diff < mini )
{
mini = diff;
smaller = val;
}

if ( diff > maxi )
{
maxi = diff;
bigger = val;
}
}

moy /= i;
variance = sqrt(variance / i);
smaller  = smaller - (TWOPI * floor(smaller/TWOPI));
bigger   = bigger  - (TWOPI * floor(bigger /TWOPI));

// std::cout << "(min/moy/max=" << mini << "/" << moy << "/" << maxi << ")";
printf("(min/moy/variance/max)=(%1.6f/%1.6f/%1.6f/%1.6f)", mini, moy, variance, maxi);
printf(" [smaller=%3.1f degrees, bigger=%3.1f degrees]", smaller * (360.0f/TWOPI),  bigger * (360.0f/TWOPI) );

}

void TestTableCosinus(char *title)
{
register int i;
register float val, cosinus;
float smaller, bigger;
float  ref, diff, moy = 0, variance = 0, mini = 1.0f , maxi = -1.0f;

BeginTimer();

for( i = 0, val = -BENCHSTART ; i < BENCHSIZE ; i++ , val += BENCHINC )
{
// cosinus    = TableCosinus(val);
// cosinus = Cos_table[ (int)(val * rad2idx) % STEPS ];
cosinus = Cos_table[ (int)(val * rad2idx) & STEPMOD ];
}

EndTimer();

PrintTimer(title);

for( i = 0, val = -BENCHSTART; i < BENCHSIZE ; i++ , val += BENCHINC)
{
ref = cos(val);
// cosinus    = TableCosinus(val);
// cosinus = Cos_table[ (int)(val * rad2idx) % STEPS ];
cosinus = Cos_table[ (int)(val * rad2idx) & STEPMOD ];

diff = cosinus - ref;
moy += diff;
variance += diff*diff;

if ( diff < mini )
{
mini = diff;
smaller = val;
}

if ( diff > maxi )
{
maxi = diff;
bigger = val;
}
}

moy /= i;
variance = sqrt(variance / i);
smaller  = smaller - (TWOPI * floor(smaller/TWOPI));
bigger   = bigger  - (TWOPI * floor(bigger /TWOPI));

//std::cout << "(min/moy/max=" << mini << "/" << moy << "/" << maxi << ")";
printf("(min/moy/variance/max)=(%1.6f/%1.6f/%1.6f/%1.6f)", mini, moy, variance, maxi);
printf(" [smaller=%3.1f degrees, bigger=%3.1f degrees]", smaller * (360.0f/TWOPI),  bigger * (360.0f/TWOPI) );

}

``````

Here, the benchmark of alls (working) approximations that I have tested :

``````
Benchmark cos()/sin() optimization with 1000000 iterations
[with a stepsize of 0.00144 degrees]

Standard C/C++ sin()        : 77 ms
Standard C/C++ cos()        : 79 ms

inlined fast_precise_sin()  : 71 ms (min/moy/variance/max)=(-0.001091/-0.000002/0.000596/0.001091) [smaller=11.0 degrees, bigger=191.0 degrees]
inlined fast_precise_cos()  : 80 ms (min/moy/variance/max)=(-0.001092/0.000000/0.000595/0.001091) [smaller=78.9 degrees, bigger=258.9 degrees]

inlined fast_sin()          : 57 ms (min/moy/variance/max)=(-0.056011/0.000097/0.035797/0.056010) [smaller=333.0 degrees, bigger=152.9 degrees]
inlined fast_cos()          : 67 ms (min/moy/variance/max)=(-0.056011/0.000017/0.035749/0.056010) [smaller=242.9 degrees, bigger=297.0 degrees]

UltraFastPreciseSinus()     : 21 ms (min/moy/variance/max)=(-0.001091/-0.000002/0.000596/0.001091) [smaller=11.0 degrees, bigger=191.0 degrees]
UltraFastPreciseCosinus()   : 21 ms (min/moy/variance/max)=(-0.001090/-0.000000/0.000597/0.001090) [smaller=168.9 degrees, bigger=191.0 degrees]

FastCosinus_52s()           : 16 ms (min/moy/variance/max)=(-0.114787/-0.010931/0.025966/0.000007) [smaller=180.0 degrees, bigger=34.1 degrees]

UltraFastSinus()            : 12 ms (min/moy/variance/max)=(-0.056011/0.000097/0.035797/0.056010) [smaller=332.9 degrees, bigger=153.0 degrees]
UltraFastCosinus()          : 12 ms (min/moy/variance/max)=(-0.056011/-0.000057/0.035712/0.056010) [smaller=333.0 degrees, bigger=152.9 degrees]

LinearSinus()               : 252 ms (min/moy/variance/max)=(-0.000038/-0.000000/0.000020/0.000038) [smaller=87.5 degrees, bigger=266.5 degrees]
LinearCosinus()             : 252 ms (min/moy/variance/max)=(-0.000038/-0.000000/0.000020/0.000038) [smaller=0.5 degrees, bigger=177.5 degrees]

TableSinus()                : 12 ms (min/moy/variance/max)=(-0.000097/-0.000000/0.000040/0.000097) [smaller=1.6 degrees, bigger=180.6 degrees]
TableCosinus()              : 12 ms (min/moy/variance/max)=(-0.000097/-0.000000/0.000040/0.000097) [smaller=269.9 degrees, bigger=90.8 degrees]

``````

On the other side, I just begin to test/benchmark one piecewyse linear form of this, cf. LinearSinus() and LinarCosinus(), that use a linear combinaison of two consecutives entries but only about 360 entries (one entry by degree)
=> this is really very slow for the moment but I think that this can enough easily to be greatly accelerated by a factor from 5x to 10x and think too replace the linear combinaison by a minmax computation with minmax coefficients of each degree stored into Cosinus/Sinus tables entries instead finals cos/sin precomputed values

``````
#include "approximations.h"

float Cosinus_table[362];
float Sinus_table[362];

#define M_PI 3.14159265358979323846

#define STEPS 360

////////// Approximations of SIN / COS

void InitLinearTables()
{
int i;

for( i = 0 ; i < STEPS + 2 ; i++ )
{
radians = (float)(i) * (M_PI  / 180.0f);
// printf("SinCos[%d]=(%f, %f)
", i, Sinus_table[i], Cosinus_table[i] );
}
}

{
register float degrees, interp, c0, c1;

degrees = fmod(radians * 180.0f / M_PI, 360.0f);

c0 = Cosinus_table[(int)degrees];
c1 = Cosinus_table[(int)(degrees+1)];

interp = degrees - floor(degrees);

return c1 * interp + c0 * (1.0f - interp);
}

{
register float degrees, interp, s0, s1;

degrees = fmod(radians * 180.0f / M_PI, 360.0f);

s0 = Sinus_table[(int)degrees];
s1 = Sinus_table[(int)(degrees+1)];

interp = degrees - floor(degrees);

return s1 * interp + s0 * (1.0f - interp);
}

////////////////////////////////////////
// Tests linear approximation functions //
////////////////////////////////////////
void TestLinearSinus(char *title)
{
int i;
PRECISION val, sinus, smaller = 0, bigger = 0;
PRECISION  ref, diff, moy = 0, variance = 0, mini = 1.0f , maxi = -1.0f;

BeginTimer();

for( i = 0, val = -BENCHSTART; i < BENCHSIZE ; i++ , val += BENCHINC)
{
sinus    = LinearSinus(val);
}

EndTimer();

PrintTimer(title);

for( i = 0, val = -BENCHSTART; i < BENCHSIZE ; i++ , val += BENCHINC)
{
ref = sin(val);
sinus    = LinearSinus(val);

diff = sinus - ref;
moy  += diff;
variance += diff * diff;

if ( diff < mini )
{
mini = diff;
smaller = val;
}

if ( diff > maxi )
{
maxi = diff;
bigger = val;
}
}

moy /= i;
variance = sqrt(variance / i);
smaller  = smaller - (TWOPI * floor(smaller/TWOPI));
bigger   = bigger  - (TWOPI * floor(bigger /TWOPI));

// std::cout << "(min/moy/max=" << mini << "/" << moy << "/" << maxi << ")";
printf("(min/moy/variance/max)=(%1.6f/%1.6f/%1.6f/%1.6f)", mini, moy, variance, maxi);
printf(" [smaller=%3.1f degrees, bigger=%3.1f degrees]", smaller * (360.0f/TWOPI),  bigger * (360.0f/TWOPI) );

}

void TestLinearCosinus(char *title)
{
int i;
PRECISION val, cosinus, smaller, bigger;
PRECISION  ref, diff, moy = 0, variance = 0, mini = 1.0f , maxi = -1.0f;

BeginTimer();

for( i = 0, val = -BENCHSTART; i < BENCHSIZE ; i++ , val += BENCHINC)
{
cosinus    = LinearCosinus(val);
}

EndTimer();

PrintTimer(title);

for( i = 0, val = -BENCHSTART; i < BENCHSIZE ; i++ , val += BENCHINC)
{
ref = cos(val);
cosinus    = LinearCosinus(val);

diff = cosinus - ref;
moy += diff;
variance += diff*diff;

if ( diff < mini )
{
mini = diff;
smaller = val;
}

if ( diff > maxi )
{
maxi = diff;
bigger = val;
}
}

moy /= i;
variance = sqrt(variance / i);
smaller  = smaller - (TWOPI * floor(smaller/TWOPI));
bigger   = bigger  - (TWOPI * floor(bigger /TWOPI));

//std::cout << "(min/moy/max=" << mini << "/" << moy << "/" << maxi << ")";
printf("(min/moy/variance/max)=(%1.6f/%1.6f/%1.6f/%1.6f)", mini, moy, variance, maxi);
printf(" [smaller=%3.1f degrees, bigger=%3.1f degrees]", smaller * (360.0f/TWOPI),  bigger * (360.0f/TWOPI) );

}

``````

Hi,

What is now the better “low/not too high cost” FPGA starter kit for to begin to test/implement somes ideas/algos with a real hardware point of view ?

I have read an article at Open Silicium that speak about the ALTERA/TERASIC DE0-nano and the SPARTAN Papillo One for example

But their FPGA platforms seem to have a relatively “slow” clock frequency between 50 and 125 Mhz
=> this seem me really cheaps frequencies when compared to recents processors that work now at some GHz

Is this suffisant for to test if somes of my ideas/concepts can to be interesting or not on an hardware level ?
(I have “only” to know/admit/predict that the speed can to be multiplied by a factor of 10x or more, perhaps 100x ???, if implemented on a real recent industrial [co]processor )

What type of FPGA kit is recommended for to begin to test/implement this type of hardware implementation of trigonometry with something “not too slow”
(I don’t esper to have a very high clock frequency such as some Ghz finded into recents CPU/GPU but something like one or more hundreds of Mhz seem to me a ideal frequency for to begin)

I prefer devellop the VHDL/VERILOG on a Linux platform but I can of course use Windows or Mac OSX platforms development if necessary

I have make somes direct lookup benchmarks with various tables sizes and find this :

``````* the low precision approximation loose when the direct lookup can use more than 128 entries
``````
• the “high precision but speed” approximation loose when the direct lookup can use more than 8192 entries
``````
Benchmark cos()/sin() optimization with 1000000 iterations
[with a stepsize of 0.00144 degrees]

Standard C/C++ sin()        : 79 ms
Standard C/C++ cos()        : 81 ms

UltraFastPreciseSinus()     : 21 ms (min/moy/variance/max)=(-0.001091/-0.000002/0.000596/0.001091) [smaller=11.0 degrees, bigger=191.0 degrees]
UltraFastPreciseCosinus()   : 21 ms (min/moy/variance/max)=(-0.001090/-0.000000/0.000597/0.001090) [smaller=168.9 degrees, bigger=191.0 degrees]

UltraFastSinus()            : 12 ms (min/moy/variance/max)=(-0.056011/0.000097/0.035797/0.056010) [smaller=332.9 degrees, bigger=153.0 degrees]
UltraFastCosinus()          : 12 ms (min/moy/variance/max)=(-0.056011/-0.000057/0.035712/0.056010) [smaller=333.0 degrees, bigger=152.9 degrees]

TableSinus(   64)           : 12 ms (min/moy/variance/max)=(-0.098018/-0.000184/0.040093/0.098016) [smaller=0.0 degrees, bigger=185.6 degrees]
TableCosinus(   64)         : 12 ms (min/moy/variance/max)=(-0.098016/0.000024/0.039972/0.098015) [smaller=275.6 degrees, bigger=95.6 degrees]

TableSinus(  128)           : 12 ms (min/moy/variance/max)=(-0.049068/-0.000097/0.020060/0.049063) [smaller=0.0 degrees, bigger=180.0 degrees]
TableCosinus(  128)         : 11 ms (min/moy/variance/max)=(-0.049063/0.000013/0.019991/0.049062) [smaller=270.0 degrees, bigger=90.0 degrees]

TableSinus(  256)           : 11 ms (min/moy/variance/max)=(-0.024541/-0.000053/0.010037/0.024536) [smaller=0.0 degrees, bigger=180.0 degrees]
TableCosinus(  256)         : 11 ms (min/moy/variance/max)=(-0.024537/0.000007/0.009996/0.024536) [smaller=270.0 degrees, bigger=90.0 degrees]

TableSinus(  512)           : 11 ms (min/moy/variance/max)=(-0.012272/-0.000027/0.005019/0.012269) [smaller=0.0 degrees, bigger=179.3 degrees]
TableCosinus(  512)         : 11 ms (min/moy/variance/max)=(-0.012267/0.000003/0.004999/0.012269) [smaller=270.0 degrees, bigger=89.3 degrees]

TableSinus( 1024)           : 11 ms (min/moy/variance/max)=(-0.006136/-0.000013/0.002510/0.006137) [smaller=0.0 degrees, bigger=180.4 degrees]
TableCosinus( 1024)         : 12 ms (min/moy/variance/max)=(-0.006137/0.000002/0.002500/0.006136) [smaller=270.4 degrees, bigger=90.4 degrees]

TableSinus( 2048)           : 11 ms (min/moy/variance/max)=(-0.003068/-0.000007/0.001255/0.003069) [smaller=0.0 degrees, bigger=179.8 degrees]
TableCosinus( 2048)         : 12 ms (min/moy/variance/max)=(-0.003069/0.000001/0.001250/0.003068) [smaller=270.4 degrees, bigger=89.8 degrees]

TableSinus( 4096)           : 12 ms (min/moy/variance/max)=(-0.001535/-0.000003/0.000628/0.001535) [smaller=0.4 degrees, bigger=179.6 degrees]
TableCosinus( 4096)         : 12 ms (min/moy/variance/max)=(-0.001535/0.000000/0.000625/0.001535) [smaller=270.4 degrees, bigger=91.1 degrees]

TableSinus( 8192)           : 12 ms (min/moy/variance/max)=(-0.000768/-0.000002/0.000314/0.000768) [smaller=1.1 degrees, bigger=179.6 degrees]
TableCosinus( 8192)         : 12 ms (min/moy/variance/max)=(-0.000769/0.000000/0.000313/0.000768) [smaller=270.4 degrees, bigger=90.8 degrees]

TableSinus(16384)           : 12 ms (min/moy/variance/max)=(-0.000385/-0.000001/0.000157/0.000385) [smaller=0.1 degrees, bigger=179.6 degrees]
TableCosinus(16384)         : 13 ms (min/moy/variance/max)=(-0.000385/0.000000/0.000157/0.000385) [smaller=270.4 degrees, bigger=90.8 degrees]

TableSinus(32768)           : 12 ms (min/moy/variance/max)=(-0.000193/-0.000000/0.000079/0.000193) [smaller=0.1 degrees, bigger=179.6 degrees]
TableCosinus(32768)         : 12 ms (min/moy/variance/max)=(-0.000193/0.000000/0.000079/0.000193) [smaller=270.2 degrees, bigger=90.8 degrees]

TableSinus(65536)           : 11 ms (min/moy/variance/max)=(-0.000097/-0.000000/0.000040/0.000097) [smaller=1.6 degrees, bigger=180.6 degrees]
TableCosinus(65536)         : 12 ms (min/moy/variance/max)=(-0.000097/-0.000000/0.000040/0.000097) [smaller=269.9 degrees, bigger=90.8 degrees]

``````

Here is interesting parts of the code used for to handle the direct lookup way for TableSinus() and tableCosinus()

``````
#ifndef _APPROXIMATIONS_H_
#define _APPROXILATIONS_H_

...

////////////////////////////
/// Tables approximations ///
/////////////////////////////

#define MINSTEPS 64
#define MINMOD   63

#define MAXSTEPS 65536
#define MAXMOD  65355

extern  int  tablesSize;
extern  void InitTables(int size);
extern  void TestTableSinus(char *title);
extern  void TestTableCosinus(char *title);
extern  float Cos_table[];
extern  float Sin_table[];

...

``````
``````
#include "approximations.h"

float Cos_table[MAXSTEPS+1];
float Sin_table[MAXSTEPS+1];

int   tablesSize = MAXSTEPS;

float rad2idx = (0.5f * (float)(MAXSTEPS)) / M_PI;

////////// Appriximation SIN / COS

void InitTables( int size = MAXSTEPS)
{
int i;

for( tablesSize = MINSTEPS ; tablesSize < size ; tablesSize <<= 1 )
if( tablesSize * 2 > MAXSTEPS )
break;

rad2idx = (0.5f * (float)(tablesSize)) / M_PI;

for( i = 0 ; i <= size ; i++ )
{
// radians = (float)(i) * (M_PI  / 180.0f);
radians = (float)(i) * (2.0f * M_PI  / (float)(tablesSize));
// printf("SinCos[%d]=(%f, %f)
", i, Sin_table[i], Cos_table[i] );
}
}

void Renormalize2( float *x, float *y)
{
float nx, ny, dist;

nx = *x;
ny = *y;

dist = sqrt(nx*nx + ny*ny);

nx /= dist;
ny /= dist;

*x = nx;
*y = ny;
}

{
// float degrees;
// int index;

// degrees = fmod(radians * (180.0f / M_PI), 360.0f);

// degrees = radians * 0.5f * STEPS / M_PI;

// index = (int)(degrees) % STEPS;

// return Cos_table[index];
}

{
// float degrees;
// int   index;

// degrees = fmod(radians * (180.0f / M_PI), 360.0f);
// degrees = radians * (0.5f * STEPS / M_PI);

// index = (int)(degrees) % STEPS;

// return Sin_table[index];
}

////////////////////////////////////////
// Tests Table approximation functions //
////////////////////////////////////////
void TestTableSinus(char *title)
{
register int i;
register float val, sinus;
float smaller = 0, bigger = 0;
float  ref, diff, moy = 0, variance = 0, mini = 1.0f , maxi = -1.0f;

BeginTimer();

for( i = 0, val = -BENCHSTART ; i < BENCHSIZE ; i++ , val += BENCHINC)
{
// sinus    = TableSinus(val);
// sinus = Sin_table[ (int)(val* rad2idx) % STEPS ];
}

EndTimer();

PrintTimer(title);

for( i = 0, val = -BENCHSTART ; i < BENCHSIZE ; i++ , val += BENCHINC)
{
ref = sin(val);
// sinus    = TableSinus(val);
// sinus = Sin_table[ (int)(val * rad2idx) % STEPS ];

diff = sinus - ref;
moy  += diff;
variance += diff * diff;

if ( diff < mini )
{
mini = diff;
smaller = val;
}

if ( diff > maxi )
{
maxi = diff;
bigger = val;
}
}

moy /= i;
variance = sqrt(variance / i);
smaller  = smaller - (TWOPI * floor(smaller/TWOPI));
bigger   = bigger  - (TWOPI * floor(bigger /TWOPI));

// std::cout << "(min/moy/max=" << mini << "/" << moy << "/" << maxi << ")";
printf("(min/moy/variance/max)=(%1.6f/%1.6f/%1.6f/%1.6f)", mini, moy, variance, maxi);
printf(" [smaller=%3.1f degrees, bigger=%3.1f degrees]", smaller * (360.0f/TWOPI),  bigger * (360.0f/TWOPI) );

}

void TestTableCosinus(char *title)
{
register int i;
register float val, cosinus;
float smaller, bigger;
float  ref, diff, moy = 0, variance = 0, mini = 1.0f , maxi = -1.0f;

BeginTimer();

for( i = 0, val = -BENCHSTART ; i < BENCHSIZE ; i++ , val += BENCHINC )
{
// cosinus    = TableCosinus(val);
// cosinus = Cos_table[ (int)(val * rad2idx) % STEPS ];
}

EndTimer();

PrintTimer(title);

for( i = 0, val = -BENCHSTART; i < BENCHSIZE ; i++ , val += BENCHINC)
{
ref = cos(val);
// cosinus    = TableCosinus(val);
// cosinus = Cos_table[ (int)(val * rad2idx) % STEPS ];

diff = cosinus - ref;
moy += diff;
variance += diff*diff;

if ( diff < mini )
{
mini = diff;
smaller = val;
}

if ( diff > maxi )
{
maxi = diff;
bigger = val;
}
}

moy /= i;
variance = sqrt(variance / i);
smaller  = smaller - (TWOPI * floor(smaller/TWOPI));
bigger   = bigger  - (TWOPI * floor(bigger /TWOPI));

//std::cout << "(min/moy/max=" << mini << "/" << moy << "/" << maxi << ")";
printf("(min/moy/variance/max)=(%1.6f/%1.6f/%1.6f/%1.6f)", mini, moy, variance, maxi);
printf(" [smaller=%3.1f degrees, bigger=%3.1f degrees]", smaller * (360.0f/TWOPI),  bigger * (360.0f/TWOPI) );

}

``````
``````
for( size = MINSTEPS ; size <= MAXSTEPS ; size *= 2)
{
InitTables(size);
sprintf(title, "TableSinus(%5d)          ", tablesSize);
TestTableSinus(title); std::cout << "
";
sprintf(title, "TableCosinus(%5d)        ", tablesSize);
TestTableCosinus(title); std::cout << "

";
}

``````

=> now, I can begin to dream about using directs lookups into relatively smalls tables for to retrieve adapted minmax coefficients at each “step” and quickly compute an high precision approximation of sin/cos using this minmax coefficients with the modulated angle value
(I think than this can too work with other trigonometrics and/or transcendates functions such as exp, log or others hyperbolics sin/cos for example)

[QUOTE=The Little Body;1250161]Hi,
What is now the better “low/not too high cost” FPGA starter kit for to begin to test/implement somes ideas/algos with a real hardware point of view ?
[/QUOTE]

The absolutely best I can find: http://shop.ztex.de/product_info.php?products_id=62&language=en
Those who live in USA have much better options, but for us in Europe these are the only viable solutions I found .

I’d recommend first implementing your design, then test it with both Altera/Xilinx tools on various FPGAs you can buy, only then buy the one that suits your needs.
The 50MHz clock of that cheap Altera chip above is just the external interface, meanwhile you can run simple designs at 300MHz. (use internal PLLs to generate the 300MHz clock).
A minor note if you need SDRAM: Altera’s auto-generated implementation is crap, Xilinx MIG doesn’t support SDR.