CL_OUT_OF_RESOURCES problem when an expression gets too long?

billconan · November 28, 2009, 1:36am

hello guys,

i’m trying to put my runge kutta integration code on the gpu with opencl. here is an introduction to the algorithm, in case you are not familiar with runge kutta: http://en.wikipedia.org/wiki/Runge%E2%8 … ta_methods

but that is not important. the point is, one of the expressions is this one:

result.x=pos.x +(k1.x+k2.x*2.0f+k3.x*3.0f+k4.x)*(1.0f/6.0f);

where “result”, “pos”, “k1”, “k2”, “k3” and “k4” are all vectors defined like this:

typedef struct
{
float x;
float y;
float z;
} Vector;

my code compiles, but has a runtime error, code -5 CL_OUT_OF_RESOURCES at this line.

i’m guessing that i may use up all the registers of the gpu?

when i replace that line with a simpler equation, the code runs. but with the original equation, it doesn’t.

i don’t know what the problem is or how to solve it.
thanks

dbs2 · November 28, 2009, 4:05am

Sounds like a bug in the OpenCL implementation. Which one are you using?
The only other thing I can think of is to make sure you are checking the maximum kernel workgroup size if you are explicitly setting your local size. That size is determined by the GPU based on the number of registers used.

billconan · November 28, 2009, 3:04pm

hi, thank you.

writing opencl is so frustrating. i can’t tell if it’s a problem with my program or a problem with the driver.

i’m using the nvidia notebook beta driver, 190.189 something, on my macbook, which has an nv9400 card, under windows xp.

now the program is behaving really weirdly: the following will not run and gives me the CL_OUT_OF_RESOURCES error code.

tempIndex=((int)pos.x+1 + (int)pos.y*xSize + (int)pos.z*xSize*ySize)*3;
//sresult[241]=tempIndex;

l.x=vectorfieldbuffer[tempIndex];
l.y=vectorfieldbuffer[tempIndex+1];
l.z=vectorfieldbuffer[tempIndex+2];

however, if i uncomment “sresult[241]=tempIndex”, the program will run. how come? sresult is just an output array that i use to read back data for debugging.

is this a driver bug? i should really try cuda.

you are right, i didn’t check the maximum kernel workgroup size. i’m modifying an opencl demo, the vector add demo. that one explicitly sets the workgroup size to 256, and i didn’t change this setting. is this because the computation in vector add is much simpler than in my program, so my kernel uses up all the registers but the original demo doesn’t?

how do i debug opencl under visual studio? is there any way to run the code with a cpu emulator?

what about cuda? i wrote cuda a year ago, but i forgot everything. i thought i would need to re-learn everything anyway, so i picked up opencl. i should have stuck with cuda.

dbs2 · November 29, 2009, 7:18am

If you’re running on a macbook then you could also use Apple’s OpenCL which has both GPU and CPU OpenCL devices. I think you are correct in that you will find that Nvidia’s Cuda drivers are more mature at the moment than their OpenCL drivers. The Apple drivers seem to be somewhat more mature than the Nvidia ones, though, so you might try booting into OS X to see if it’s an Nvidia or CL problem.

affie · November 30, 2009, 2:31pm

You will get CL_OUT_OF_RESOURCES if the work-group size you use, i.e. the local_work_size argument to clEnqueueNDRangeKernel, causes the kernel to run out of registers or local memory.

I recommend you use the value returned by clGetKernelWorkGroupInfo(kernel, CL_KERNEL_WORK_GROUP_SIZE, …) as local_work_size argument value to clEnqueueNDRangeKernel.
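
As a rough sketch of that advice, the host-side arithmetic could look like the C below. It assumes the per-kernel limit has already been queried with clGetKernelWorkGroupInfo and is passed in as a plain number. The helper names pick_local_size and round_up_global are made up for illustration and are not part of the OpenCL API.

```c
#include <assert.h>
#include <stddef.h>

/* Clamp a requested local work size (e.g. the hard-coded 256 from a demo)
 * to the limit reported for this particular kernel.
 * kernel_max is assumed to be the value written by
 *   clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
 *                            sizeof(size_t), &kernel_max, NULL);
 * It is a plain parameter here so the sketch needs no OpenCL headers. */
size_t pick_local_size(size_t requested, size_t kernel_max)
{
    assert(requested > 0 && kernel_max > 0);
    return requested < kernel_max ? requested : kernel_max;
}

/* Round the global size up to the next multiple of the local size.
 * OpenCL 1.x expects global_work_size to be evenly divisible by
 * local_work_size when the latter is given explicitly. */
size_t round_up_global(size_t n, size_t local)
{
    assert(local > 0);
    size_t rem = n % local;
    return rem == 0 ? n : n + (local - rem);
}
```

The two results would then go to clEnqueueNDRangeKernel as local_work_size and global_work_size. Because the global size may be padded, the kernel would also need a guard so that work-items past the real element count return early.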

I also agree with dbs2: if you are running on a MacBook, I recommend using Apple’s OpenCL implementation.