I am trying to use the OpenCL built-in “rotate” function in one of my kernels, as shown below, but I get the following errors when compiling:
clBuildProgram Error for -11 Error Number
error: call to ‘rotate’ is ambiguous
W = rotate(A, 5);
<built-in>:2784:22: note: candidate function
int OVERLOADABLE rotate(int, int);
<built-in>:2785:23: note: candidate function
uint OVERLOADABLE rotate(uint, uint);
<built-in>:2780:23: note: candidate function
char OVERLOADABLE rotate(char, char);
<built-in>:2781:24: note: candidate function
uchar OVERLOADABLE rotate(uchar, uchar);
<built-in>:2782:24: note: candidate function
short OVERLOADABLE rotate(short, short);
<built-in>:2783:25: note: candidate function
ushort OVERLOADABLE rotate(ushort, ushort);
(similar candidates are listed for the other data types, and so on…)
The kernel I used for the demo is:
__kernel void demoKernel(__global unsigned int *d_io_2d, long max_size)
{
    long i = get_global_id(0);
    if(i >= max_size)
        return;
    unsigned int A;
    A = d_io_2d[i];
    unsigned int W;
    W = rotate(A, 5);
}
My questions are:
- How do I use OpenCL built-in functions? (In the example above, what am I actually missing?)
- What is the purpose of using built-in functions?
- Do they improve performance?
[QUOTE=utnapishtim;29622]1. Try rotate(A, (uint)5)
2. Some of them can use dedicated hardware circuitry (e.g. rsqrt, mad). They also circumvent the fact that OpenCL C has no standard library.
3. At least they don’t degrade performance. You can expect that they are optimally implemented, i.e. you won’t be able to do better.[/QUOTE]
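A note on why the cast in point 1 helps: the built-in is declared as gentype rotate(gentype v, gentype i), so both arguments must have the same type. A is a uint, but the literal 5 is an int, and there is no overload taking (uint, int), hence the "ambiguous" error. Writing (uint)5 or 5u makes both arguments uint. The host-side C11 sketch below only illustrates the types of the literals. TYPE_NAME and literal_type_check are hypothetical names made up for this example, and it assumes uint corresponds to unsigned int, as it does in OpenCL C.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: report the type of an expression as a string (C11 _Generic). */
#define TYPE_NAME(x) _Generic((x), \
    int: "int",                    \
    unsigned int: "uint",          \
    default: "other")

/* Shows that a bare 5 is an int, while (unsigned int)5 and 5u are unsigned. */
static void literal_type_check(void)
{
    unsigned int A = 1u;
    assert(strcmp(TYPE_NAME(A), "uint") == 0);               /* first argument of rotate  */
    assert(strcmp(TYPE_NAME(5), "int") == 0);                /* rotate(A, 5): (uint, int) */
    assert(strcmp(TYPE_NAME((unsigned int)5), "uint") == 0); /* rotate(A, (uint)5)        */
    assert(strcmp(TYPE_NAME(5u), "uint") == 0);              /* rotate(A, 5u)             */
}
```

So in the kernel either rotate(A, (uint)5) or rotate(A, 5u) should select the uint overload.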
- After trying rotate(A, (uint)5), my kernel compiled and I got the correct result.
- My implementation of the rotate function is:
uint rotate1(int n, uint x)
{
    return (x << n) | (x >> (32-n));
}
When I benchmarked my application with the built-in “rotate” function and with the user-defined “rotate1” function (shown above), I found that the built-in “rotate” was a little slower than “rotate1”.
So how can we expect the built-ins to be optimally implemented?
Please note that your implementation of rotate1() breaks down when n>32, whereas the built-in function rotate() is guaranteed to work for any n.
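To illustrate the point about the shift count, here is a minimal host-side sketch in plain C (not OpenCL C, whose shift rules may differ). In plain C, shifting a 32-bit value by 32 or more is undefined, so a hand-written rotate has to reduce the count first. rotl32 and rotl32_self_check are hypothetical names for this example. It is only a sketch of the idea, not how any vendor implements rotate().

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: rotate x left by n bits, with n reduced modulo 32
 * so that neither shift ever uses a count of 32 or more. */
static uint32_t rotl32(uint32_t x, uint32_t n)
{
    n &= 31u;                                   /* 37 behaves like 5, 32 like 0 */
    return (x << n) | (x >> ((32u - n) & 31u)); /* second mask handles n == 0   */
}

/* Small self-check of the wrap-around behavior. */
static void rotl32_self_check(void)
{
    assert(rotl32(0x12345678u, 4u) == 0x23456781u);  /* one hex digit to the left */
    assert(rotl32(0x12345678u, 32u) == 0x12345678u); /* full turn                 */
    assert(rotl32(0x12345678u, 37u) == rotl32(0x12345678u, 5u));
}
```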
I have checked the assembly code produced by three calls in a row to rotate() with NVIDIA OpenCL.
Using built-in rotate(), 17 instructions are generated (and my NVIDIA hardware has no rotate instruction).
Using your function rotate1(), 19 instructions are generated.
So using built-in rotate() is safer and faster.
Furthermore, if this code were compiled for a device that has a rotate instruction, then with rotate1() you would have to rely on the compiler to detect that “(x << n) | (x >> (32-n))” is in fact a rotation that can be turned into that instruction.