Efficient use of memory in GPU

Hey guys,

I have been looking for answers for about a week now and can't find anything useful, so here goes.

I have a kernel that takes a global float* as an input parameter, and another as an output. Because of the large number of global memory accesses, the CPU runs the algorithm faster than the GPU, and I need it the other way around. I tried passing in a local float* to stage data from global memory in local memory, but the kernel fails, and the program outputs the exact same numbers it did the last time I ran it.

I tried this:

__kernel void simple(
	global const float* input1, //input
	global float* output, //output
	constant float* input3, //another input
	local float* tempArg, //temp array
	private int numData,
	private int numData2)
{
	int index = get_global_id(0);

	//for testing purposes
	tempArg[index] = index;

	output[index] = tempArg[index]; // this is where it breaks, giving me incorrect values
	//output[index] = index; //works, if I don't have the local arg in the kernel parameters
}

Is it because I am running out of memory, or is something else wrong? I am trying to make it faster, but it just keeps giving me crap values.


Declaring the temp buffer inside the kernel, rather than passing it in, works, but then I cannot pass its size as a variable; it has to be hardcoded. Is there a workaround for this?

If the amount of local memory you need changes dynamically, you will have to pass the local pointer as an argument to the kernel and use clSetKernelArg() to indicate the amount of local memory you need: pass the size in bytes as arg_size and NULL as arg_value.
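As a rough sketch of the host side, assuming a built cl_kernel for the kernel posted above: the helper name set_local_scratch and the work_group_size parameter are made up for illustration, and argument index 3 is assumed because tempArg is the fourth parameter of that kernel.

```c
#include <CL/cl.h>   /* on macOS the header is <OpenCL/opencl.h> */

/* Hypothetical helper: reserve one float of local memory per work-item.
 * Index 3 is assumed to be the position of the `local float* tempArg`
 * parameter in the kernel's argument list. */
static cl_int set_local_scratch(cl_kernel kernel, size_t work_group_size)
{
    size_t bytes = work_group_size * sizeof(cl_float);

    /* For a local argument, arg_value must be NULL; arg_size is the number
     * of bytes the runtime allocates for each work-group. */
    return clSetKernelArg(kernel, 3, bytes, NULL);
}
```

Because the size is an ordinary runtime value here, it can change from one launch to the next without rebuilding the kernel.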

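As for the kernel not working correctly, keep in mind that the amount of local memory available in the hardware is limited, typically to tens of KB, and that each work-group gets its own copy of it. Therefore doing something like “tempArg[get_global_id(0)] = foo;” will not work, because get_global_id() will typically return values far beyond the end of the local buffer. Index local memory with get_local_id() instead, and size the buffer according to the work-group size.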
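A minimal sketch of the test kernel with local memory indexed per work-group could look like the following. The kernel name simple_local and the bounds check against numData are additions for illustration, and the sketch assumes the host sized tempArg to hold at least get_local_size(0) floats.

```c
__kernel void simple_local(
    __global const float* input,
    __global float* output,
    __local float* tempArg,   /* assumed size: get_local_size(0) floats */
    int numData)
{
    int gid = get_global_id(0);  /* index into the global buffers */
    int lid = get_local_id(0);   /* 0 .. get_local_size(0) - 1 */

    /* Stage one element per work-item in local memory. */
    tempArg[lid] = (gid < numData) ? input[gid] : 0.0f;

    /* Wait until every work-item in the group has written its element.
     * The barrier sits outside any branch so all work-items reach it. */
    barrier(CLK_LOCAL_MEM_FENCE);

    if (gid < numData)
        output[gid] = tempArg[lid];
}
```

Note that this version only copies data through local memory, so it is unlikely to be faster by itself. Local memory tends to pay off when several work-items in a group read the same staged elements.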

You can query the amount of local memory available in your hardware with clGetDeviceInfo(…, CL_DEVICE_LOCAL_MEM_SIZE, …). You can also query the amount of local memory currently used by your kernel with clGetKernelWorkGroupInfo(…, CL_KERNEL_LOCAL_MEM_SIZE, …). The latter must be less than or equal to the former.
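Both queries could be combined into a small check along these lines. This is only a sketch: the function name local_mem_fits is hypothetical, and it assumes you already have a valid cl_device_id and a cl_kernel built for that device.

```c
#include <stdio.h>
#include <CL/cl.h>   /* on macOS the header is <OpenCL/opencl.h> */

/* Hypothetical helper: returns 1 if the kernel's local memory fits on the
 * device, 0 if it does not, and -1 if either query fails. */
static int local_mem_fits(cl_device_id device, cl_kernel kernel)
{
    cl_ulong device_bytes = 0;
    cl_ulong kernel_bytes = 0;
    cl_int err;

    err = clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE,
                          sizeof(device_bytes), &device_bytes, NULL);
    if (err != CL_SUCCESS) return -1;

    err = clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_LOCAL_MEM_SIZE,
                                   sizeof(kernel_bytes), &kernel_bytes, NULL);
    if (err != CL_SUCCESS) return -1;

    printf("local memory: kernel uses %llu of %llu bytes\n",
           (unsigned long long)kernel_bytes,
           (unsigned long long)device_bytes);
    return kernel_bytes <= device_bytes;
}
```

Call it after clSetKernelArg() has set the size of any local pointer arguments; until then, the runtime counts those arguments as 0 bytes.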

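Also, it’s a good idea to always check whether OpenCL API calls return an error code. It’s likely that clEnqueueNDRangeKernel() was returning an error code when you tried running that kernel. If the kernel never ran, the output buffer was never written, which would be consistent with seeing the same numbers as in your previous run.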
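Since OpenCL has no standard function for turning an error code into text, a small helper like the one sketched here can make failures visible. The names cl_error_name and cl_check are hypothetical, only a handful of launch-related codes are covered, and the fallback #defines exist only so the sketch compiles without the OpenCL headers; in real code, include <CL/cl.h> first and use its definitions.

```c
#include <stdio.h>

/* Fallback definitions so this sketch compiles on its own. The values match
 * the Khronos cl.h header, which normally provides these names. */
#ifndef CL_SUCCESS
#define CL_SUCCESS                   0
#define CL_OUT_OF_RESOURCES         -5
#define CL_INVALID_ARG_SIZE         -51
#define CL_INVALID_KERNEL_ARGS      -52
#define CL_INVALID_WORK_GROUP_SIZE  -54
#endif

/* Hypothetical helper: name a few error codes relevant to kernel launches. */
static const char *cl_error_name(int err)
{
    switch (err) {
    case CL_SUCCESS:                 return "CL_SUCCESS";
    case CL_OUT_OF_RESOURCES:        return "CL_OUT_OF_RESOURCES";
    case CL_INVALID_ARG_SIZE:        return "CL_INVALID_ARG_SIZE";
    case CL_INVALID_KERNEL_ARGS:     return "CL_INVALID_KERNEL_ARGS";
    case CL_INVALID_WORK_GROUP_SIZE: return "CL_INVALID_WORK_GROUP_SIZE";
    default:                         return "unrecognized OpenCL error";
    }
}

/* Hypothetical helper: report a failed call on stderr.
 * Returns 1 if err is CL_SUCCESS and 0 otherwise. */
static int cl_check(int err, const char *what)
{
    if (err == CL_SUCCESS)
        return 1;
    fprintf(stderr, "%s failed: %s (%d)\n", what, cl_error_name(err), err);
    return 0;
}
```

To use it, store the cl_int returned by each API call and pass it along with the call's name, for example cl_check(err, "clEnqueueNDRangeKernel"), and stop or clean up when it returns 0.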