Allocate array in a kernel of length known only at runtime

MaximS · April 11, 2012, 8:34am

An array arr is passed to the kernel with some data. Inside the kernel I need a temporary array of the same size as arr. How can I allocate it? I tried to pass the size of arr from host to the kernel and to allocate an array with

float tmp[arrSize];

but the compiler doesn’t accept this because arrSize is not a compile time constant. malloc() is not supported by OpenCL C.

How can I create a temporary array of the same size as an existing one inside a kernel?

Mark_Flamer · April 11, 2012, 1:45pm

My understanding is (and I’m no expert) that you can’t. I think you can create a buffer in local (shared) memory from host code and pass it as an argument to your kernel.

clSetKernelArg(kernel, 0, 16*sizeof(float), NULL);

Passing NULL as the last argument tells the runtime (not the compiler) to allocate space for 16 floats in fast local memory. The matching kernel parameter has to be declared as a __local pointer.

affie · April 11, 2012, 2:57pm

One option is to create the temporary array on the host as a cl_mem object of the same size as arr, using clCreateBuffer, and pass it to the kernel as an additional argument.

notzed · April 11, 2012, 3:51pm

option 1: recompile the code with a #define that matches the problem size, or some upper limit on the problem size.

option 2: use local memory, and manually make sure each work item is working on its own pool (use [local work id + index * local work size] as the index to avoid bank conflicts). This only works if the total amount is small enough to fit. You need to allocate arrSize * local work size so that each item has its own block.

option 3: pass in global memory, allocated on the host and big enough to fit, and manually make sure each work item is working on its own pool. Probably use similar indexing to the above so that accesses are coalesced, i.e. you need to allocate arrSize * global work size so that each work item has its own block.

option 1 is the easiest if you know the problem is bounded by some reasonable upper limit.
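For option 1, the #define does not have to be edited into the kernel source by hand; it can be supplied as a build option when the program is compiled at runtime. The following is only a sketch of formatting that option string in host C code. The macro name ARR_SIZE and the helper format_arr_size_option are made up for illustration, and the kernel is assumed to declare its array as float tmp[ARR_SIZE].

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Formats a build-options string such as "-D ARR_SIZE=1024".
 * ARR_SIZE is a hypothetical macro name that the kernel source is
 * assumed to use, e.g. `float tmp[ARR_SIZE];`.
 * Returns 0 on success, -1 if the buffer is too small. */
int format_arr_size_option(char *buf, size_t buf_len, size_t arr_size)
{
    assert(buf != NULL && buf_len > 0);
    int n = snprintf(buf, buf_len, "-D ARR_SIZE=%zu", arr_size);
    return (n < 0 || (size_t)n >= buf_len) ? -1 : 0;
}
```

The resulting string would be passed as the options argument of clBuildProgram. The program has to be rebuilt whenever the size changes, so this suits cases where the size changes rarely or is replaced by a fixed upper limit.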

option 3 is the closest to how a runtime implements option 1 internally - ‘private arrays’ are just private ranges of global memory.
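The arithmetic behind option 3 can be sketched in plain C. This assumes an interleaved layout in which element i of a work item's scratch array sits at (work item id + i * number of work items), so that neighbouring work items touch neighbouring addresses. The helper names scratch_bytes and scratch_index are hypothetical and not part of any OpenCL API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes of scratch needed so that every work item gets arr_size floats.
 * Returns 0 if the multiplication would overflow size_t. */
size_t scratch_bytes(size_t arr_size, size_t num_items)
{
    if (arr_size != 0 && num_items > SIZE_MAX / sizeof(float) / arr_size)
        return 0;
    return arr_size * num_items * sizeof(float);
}

/* Interleaved position of element i of work item `item`.
 * For a fixed i, consecutive work items map to consecutive slots. */
size_t scratch_index(size_t item, size_t i, size_t num_items)
{
    assert(item < num_items);
    return item + i * num_items;
}
```

On the host, scratch_bytes(arrSize, globalWorkSize) would be the size handed to clCreateBuffer. Inside the kernel the same index would be written as get_global_id(0) + i * get_global_size(0).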

Mark_Flamer · April 11, 2012, 4:19pm

notzed:

option 3 is the closest to how a runtime implements option 1 internally - ‘private arrays’ are just private ranges of global memory.

So, an array defined within a kernel like

float temp[12];

is actually allocated in global (slow) memory?

notzed · April 11, 2012, 4:45pm

It depends on how you access it. If you use fixed indices, or at least indices that are known at compile time, it should be registerised, provided it fits in the register file.

If you use dynamic indices or it is too big then yes, it goes into global memory - there’s nowhere else for it to go.

The only real private memory a GPU has is registers. The next closest thing is local memory, which can be used in a private way if you address it properly. I almost always use local memory in this way if I need an internal private array and I have space.
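Whether there is "space" comes down to a simple budget: every work item in the work-group needs its own slice of local memory. A rough host-side check could look like the sketch below. The helper names are hypothetical, and it assumes the kernel uses no other local memory. The local_mem_bytes value would come from querying CL_DEVICE_LOCAL_MEM_SIZE with clGetDeviceInfo.

```c
#include <assert.h>
#include <stddef.h>

/* Largest per-work-item array length (in floats) that fits when
 * local_size work items each keep a private slice in local memory. */
size_t max_private_floats(size_t local_mem_bytes, size_t local_size)
{
    assert(local_size > 0);
    return local_mem_bytes / (local_size * sizeof(float));
}

/* Returns 1 if arr_size floats per work item fit in local memory, else 0. */
int fits_in_local(size_t arr_size, size_t local_size, size_t local_mem_bytes)
{
    return arr_size <= max_private_floats(local_mem_bytes, local_size);
}
```

For example, with 32 KiB of local memory and 64 work items per group, each work item can hold at most 128 floats. If the check fails, a smaller work-group size or the global-memory approach is the fallback.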

This information is in the various programming guides from the vendors and has been mentioned on forums before, e.g. see section 4.9, page 4-43 of the AMD APP Programming Guide 1.3f - much of that is representative of all GPU hardware.

Mark_Flamer · April 11, 2012, 7:18pm

Thanks for the explanation