__kernel void deform( __constant float4 * inCage,
                      __constant uint * wnum,
                      __constant uint * wskip,
                      __constant float * weights,
                      __constant uint * cageID,
                      __global float4 * outMesh,
                      const uint nend)
{
    uint i = get_global_id(0);
    if (i < nend)
    {
        float4 pos = (float4) (0.0f, 0.0f, 0.0f, 0.0f);
        uint j = wskip[i];
        uint end = j + wnum[i];
        for (; j < end; j++)
            pos += inCage[cageID[j]] * weights[j];
        outMesh[i] = pos;
    }
}
Got the above code working correctly on the CPU, but the GPU outputs a mess and it's about 100x slower. My NDRange is over 30 000. I've got an Nvidia GTX 260, a quad-core Athlon and Windows XP 64-bit. I have ATI Stream installed and Nvidia's 195.39 ForceWare drivers (later ones output a mess that wasn't affected by the input in any way; these drivers at least react to the input by outputting a mess that changes).
My first thought was that the slowdown might be because it's all in global/constant memory. Unfortunately, I am at a loss as to how to get inCage into local memory nicely, since it's not possible to assign a __global pointer to a __local one. How do I copy inCage to local memory efficiently?
I've ditched all zero weights from the data, so the for-loop length depends on how many weights exist per i. Every element of cageID and weights is read only once in the whole NDRange, so it seems logical to keep them in __constant or __global memory.
Have you made sure that the GPU is not returning any error codes when you call clEnqueueNDRangeKernel()? It looks like you are using a lot of __constant arguments, and there are device limits on these (see CL_DEVICE_MAX_CONSTANT_ARGS and CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE). Try replacing __constant with __global in the source code and see if the code runs correctly this time.
Also, instead of passing that "nend" argument, I recommend changing the values you pass to clEnqueueNDRangeKernel() so that you only spawn "nend" work-items. Note that the global and local work sizes are passed as size_t pointers. Do something like this:

size_t global_size = nend;
errcode = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL, num_wait, wait_list, &event);
How to copy inCage to local memory efficiently?
If you are using OpenCL 1.1, then you can use async_work_group_copy() (look it up in the spec).
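For illustration, a sketch of how the kernel above might stage inCage into local memory with async_work_group_copy() (the built-in also appears in the OpenCL 1.0 specification, section 6.11.10). This assumes the whole cage fits in local memory (16 KB on a GTX 260, i.e. up to 1024 float4 values); the __local buffer and cageSize argument are additions not present in the original kernel, and the buffer is allocated from the host with clSetKernelArg(kernel, n, sizeof(cl_float4) * cage_size, NULL):

```c
__kernel void deform( __global const float4 * inCage,
                      __global const uint * wnum,
                      __global const uint * wskip,
                      __global const float * weights,
                      __global const uint * cageID,
                      __global float4 * outMesh,
                      __local float4 * localCage,   /* staged copy of the cage */
                      const uint cageSize,
                      const uint nend)
{
    /* The whole work-group cooperatively copies the cage, then waits
       until the copy has finished before reading from it. */
    event_t ev = async_work_group_copy(localCage, inCage, cageSize, 0);
    wait_group_events(1, &ev);

    uint i = get_global_id(0);
    if (i < nend)
    {
        float4 pos = (float4) (0.0f, 0.0f, 0.0f, 0.0f);
        uint j = wskip[i];
        uint end = j + wnum[i];
        for (; j < end; j++)
            pos += localCage[cageID[j]] * weights[j];
        outMesh[i] = pos;
    }
}
```

Note that every work-item in the group must reach the async_work_group_copy() and wait_group_events() calls, so they sit before the if (i < nend) guard.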
Thanks for your reply.
There were no error codes returned at any stage. I'm going to give __global a go, though I'm quite sure I've tried it before. As for nend and the if, I observed that on the CPU it's faster to have that if, set the local work size by hand, and make the global work size a multiple of the local work size. As far as I've understood, the global work size has to be a multiple of the local work size, or does it?
I just went through sample code where the global data was pulled into local memory by different threads, with a memory barrier to ensure all the data had been transferred. Is this a good way to go in case I have to settle for the OpenCL 1.0 spec? Hopefully I don't have to, as async_work_group_copy() looks rather simple to use.
As far as I've understood, the global work size has to be a multiple of the local work size, or does it?
The global work size has to be a multiple of the local work size, but you don't need to specify a local work size every time. Unless you need a particular size for some reason (usually due to your algorithm), simply pass NULL as the local_work_size argument of clEnqueueNDRangeKernel(). The driver will pick a value that is most suitable for your device.
I just went through sample code where the global data was pulled into local memory by different threads, with a memory barrier to ensure all the data had been transferred. Is this a good way to go in case I have to settle for the OpenCL 1.0 spec?
Yes, it's the right way to do it.
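For reference, the OpenCL 1.0 manual-copy pattern looks roughly like this inside the kernel: work-items stride over the array cooperatively, and a barrier makes the staged data visible to the whole group before anyone reads it (localCage and cageSize are assumed extra kernel arguments, as in the async variant):

```c
    /* OpenCL 1.0 fallback: cooperative strided copy plus a barrier.
       All work-items in the group must execute the barrier, so this
       goes before any early-out such as if (i < nend). */
    uint lid = get_local_id(0);
    uint lsz = get_local_size(0);
    for (uint k = lid; k < cageSize; k += lsz)
        localCage[k] = inCage[k];
    barrier(CLK_LOCAL_MEM_FENCE);
```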