Questions on OpenCL Built-in functions?

Gopal_HC · August 27, 2013, 5:25am

Hi,
I am trying to use the OpenCL built-in “rotate” function in one of my kernels, as shown below, but I get the following errors when compiling:

clBuildProgram Error for -11 Error Number
error: call to ‘rotate’ is ambiguous
W[0] = rotate(A, 5);
^~~~~
<built-in>:2784:22: note: candidate function
int OVERLOADABLE rotate(int, int);
^
<built-in>:2785:23: note: candidate function
uint OVERLOADABLE rotate(uint, uint);
^
<built-in>:2780:23: note: candidate function
char OVERLOADABLE rotate(char, char);
^
<built-in>:2781:24: note: candidate function
uchar OVERLOADABLE rotate(uchar, uchar);
^
<built-in>:2782:24: note: candidate function
short OVERLOADABLE rotate(short, short);
^
<built-in>:2783:25: note: candidate function
ushort OVERLOADABLE rotate(ushort, ushort);

similar for other datatype…
and so on…

The kernel I used for the demo is:
__kernel void demoKernel(__global unsigned int *d_io_2d, long max_size)
{
    long i = get_global_id(0);
    if (i >= max_size)
        return;

    unsigned int A;
    A = d_io_2d[i];

    unsigned int W[1];
    W[0] = rotate(A, 5);
}

My questions are:

1. How do I use OpenCL built-in functions? (In the example above, what am I actually missing?)
2. What is the purpose of using built-in functions?
3. Do they improve performance?

Thanks !!

utnapishtim · August 28, 2013, 10:04am

1. Try rotate(A, (uint)5)
2. Some of them can use dedicated hardware circuitry (e.g. rsqrt, mad). They also circumvent the fact that OpenCL C has no standard library.
3. At least they don’t degrade performance. You can expect that they are optimally implemented, i.e. you won’t be able to do better.
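
For readers wondering why the cast helps: in the original call, A is an unsigned int while the literal 5 is an int, so none of the rotate overloads listed in the build log matches both arguments exactly and the compiler reports the call as ambiguous. Casting the count (or writing the literal as 5u) makes both arguments uint. As a rough illustration of what the call then computes, here is a plain C sketch that runs on the host rather than on a device. The names rotate_left_5 and demo_kernel_host are hypothetical, and writing the result back into the buffer is an assumption, since the posted demo kernel computes W[0] and then discards it.

```c
#include <stdint.h>

/* Hypothetical host-side stand-in for rotate(A, (uint)5) on a 32-bit value. */
static uint32_t rotate_left_5(uint32_t a)
{
    /* Bits shifted out on the left re-enter on the right; 27 == 32 - 5. */
    return (a << 5) | (a >> 27);
}

/* Emulates the demo kernel across all work-items with an ordinary loop.
   Assumption: the result is stored back into the buffer, which the
   posted kernel does not do. */
static void demo_kernel_host(uint32_t *d_io_2d, long max_size)
{
    for (long i = 0; i < max_size; ++i)
        d_io_2d[i] = rotate_left_5(d_io_2d[i]);
}
```

Calling demo_kernel_host on a buffer holding 0x80000001 and 0x00000001 leaves 0x00000030 and 0x00000020 in it: the top bit wraps around to bit 4, and bit 0 moves up to bit 5.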

Gopal_HC · August 28, 2013, 10:21pm

[QUOTE=utnapishtim;29622]1. Try rotate(A, (uint)5)
2. Some of them can use dedicated hardware circuitry (e.g. rsqrt, mad). They also circumvent the fact that OpenCL C has no standard library.
3. At least they don’t degrade performance. You can expect that they are optimally implemented, i.e. you won’t be able to do better.[/QUOTE]

Thank you!

After trying rotate(A, (uint)5), my kernel compiled and I got the correct result.
My own implementation of the rotate function is:
uint rotate1(int n, uint x)
{
    return (x << n) | (x >> (32-n));
}
When I benchmarked my application with the built-in “rotate” function and with the user-defined “rotate1” function shown above, the built-in “rotate” was slightly slower than “rotate1”.
So how can we expect the built-ins to be optimally implemented?

utnapishtim · August 29, 2013, 1:31am

Please note that your implementation of rotate1() breaks down when n>32, whereas the built-in function rotate() is guaranteed to work for any n.
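
To illustrate the point about rotation counts on the host side: in plain C, shifting a 32-bit value by 32 or more bits is undefined, so a formula of the form (x << n) | (x >> (32-n)) is only safe there for 0 < n < 32. A minimal sketch of a variant that accepts any count by reducing it modulo 32 is given below. The name rotl32_any is hypothetical, and this is ordinary C used as a stand-in, not kernel code and not how any particular OpenCL implementation writes its built-in.

```c
#include <stdint.h>

/* Hypothetical helper: rotate x left by n bits, for any value of n. */
static uint32_t rotl32_any(uint32_t x, uint32_t n)
{
    n &= 31u;   /* reduce the count modulo 32 */
    /* (32 - n) & 31 turns the n == 0 case into a shift by 0 instead of 32,
       so neither shift ever reaches the full width of the type. */
    return (x << n) | (x >> ((32u - n) & 31u));
}
```

With this version a count of 0 or 32 returns x unchanged, and a count of 37 gives the same result as a count of 5.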

I have checked the assembly code produced by three calls in a row to rotate() with NVIDIA OpenCL.
Using built-in rotate(), 17 instructions are generated (and my NVIDIA hardware has no rotate instruction).
Using your function rotate1(), 19 instructions are generated.

So using built-in rotate() is safer and faster.

Furthermore, if this code were compiled for a device that has a rotate instruction, you would have to rely on the compiler to detect that “(x << n) | (x >> (32-n))” is in fact a rotation and can be replaced by that instruction.