Passing memory pointers to kernels


I am trying to pass buffer objects to my kernel in an array.
I have a varying number of buffer objects on my GPU.
I want to pass those buffers to a kernel (they are all int arrays).
I also have a list of all the corresponding cl_mem objects.

Is there a way to do this?
My question is basically a duplicate of this:
Maybe it is possible with OpenCL 2.0+?

If not, what is a good way to work around it?

  • Maybe creating one big buffer and writing all the smaller buffers into it? That sounds like a lot of unused memory, because the number of buffers varies, as well as a lot of unnecessary memory writes. The small buffers are already on the GPU…

  • The other method mentioned in the link, changing the plain text, seems bad and is limited by the maximum parameter size etc.

  • Maybe with program-space variables?

Maybe you have got an idea or a comment on the methods?

Long story short, you cannot. I’d rather go with the second option. But it is one more problem my idea solves :smiley: As for the first option you mentioned, you can create one giant buffer that holds all the integer data of your application and allocate a lot of sub-buffers (watch the alignment) on top of it. Then supply the offsets and sizes as a kernel argument. This way you won’t need additional memory copies.
But I would discourage you from doing this, because GPU compilers don’t like dynamic pointer assignment. They want as much static data as possible. Look at how Microsoft worked around the lack of variadic templates: create a kernel with one argument, one with two arguments, etc.
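To make the "one giant buffer plus offsets and sizes" idea concrete, here is a minimal host-side sketch in plain C. It only models the layout; it is not OpenCL code. The names `chunk_desc`, `layout_chunks` and `sum_all_chunks` are made up for illustration, and `align` stands in for whatever alignment your device requires for sub-buffer origins (for real sub-buffers that is derived from CL_DEVICE_MEM_BASE_ADDR_ALIGN).

```c
#include <stddef.h>

/* Hypothetical descriptor: where one logical buffer lives inside the big one. */
typedef struct {
    size_t offset; /* index of the chunk's first int in the big array */
    size_t size;   /* number of ints in the chunk */
} chunk_desc;

/* Builds the offset table for chunks of the given sizes. Every chunk start
   is rounded up to a multiple of `align` elements (align must be >= 1).
   Returns the total number of ints the big array needs. */
size_t layout_chunks(const size_t *sizes, size_t n_chunks, size_t align,
                     chunk_desc *out)
{
    size_t next = 0;
    for (size_t c = 0; c < n_chunks; ++c) {
        next = (next + align - 1) / align * align; /* round up to alignment */
        out[c].offset = next;
        out[c].size = sizes[c];
        next += sizes[c];
    }
    return next;
}

/* Stand-in for a kernel body: sums every element of every chunk.
   On the device, `pool` would be a __global int* and `chunks` a table
   passed as an ordinary kernel argument. */
long sum_all_chunks(const int *pool, const chunk_desc *chunks, size_t n_chunks)
{
    long accum = 0;
    for (size_t c = 0; c < n_chunks; ++c) {
        const int *chunk = pool + chunks[c].offset;
        for (size_t i = 0; i < chunks[c].size; ++i)
            accum += chunk[i];
    }
    return accum;
}
```

The point of the table is that the kernel takes a fixed number of arguments (one pool pointer, one descriptor table, one count) regardless of how many logical buffers there are. The cost is the indirection through `offset`, which is the dynamic pointer arithmetic warned about above.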

Well, OK, that’s a bit disappointing.
I think I’ll use a mix of methods 1 and 2 then.
Thanks for your answer! I hope your idea gets realized in the next release :stuck_out_tongue:

I have fiddled around a bit with different methods.
I am not really able to use the second method though. I don’t know how to iterate through the different parameters.
Is it possible to access parameters by index or something like that? Otherwise I cannot figure out a way to do it with this method…
Or am I doing it wrong?

I’d suggest you rethink the structure of your application instead. No matter what you do, such an operation will be costly. Is there a way to split your operation into a few calls of a simpler kernel and then aggregate the results? If that is not feasible, here is the concept:

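#define MAX_NUMBER_OF_ARGS 3 // three buffer arguments in this example

typedef struct {
    __global int* current_ptr; // start of the buffer being read
    int argument;              // index of that buffer (0 = arg0)
    int element;               // index of the current element in it
} iterator;

iterator newIterator(__global int* arg0) {
    iterator output;
    output.current_ptr = arg0;
    output.argument = 0;
    output.element = 0;
    return output;
}

iterator nextIterator(iterator old, __constant int* sizes_of_buffers,
                      __global int* arg0, __global int* arg1, __global int* arg2) {
    iterator output;
    if (old.element + 1 < sizes_of_buffers[old.argument]) {
        // Still inside the current buffer: step to its next element.
        output = old;
        output.element = old.element + 1;
    } else {
        // Current buffer is exhausted: move on to the next argument.
        output.element = 0;
        output.argument = old.argument + 1;
        if (output.argument == MAX_NUMBER_OF_ARGS || sizes_of_buffers[output.argument] == 0) // array must be zero-terminated
            output.current_ptr = NULL;
        else switch (output.argument) {
            case 1: output.current_ptr = arg1; break;
            case 2: output.current_ptr = arg2; break;
            default: output.current_ptr = NULL; break;
        }
    }
    return output;
}

bool isNULL(iterator a) {
    return (a.current_ptr == NULL);
}

__global int* access(iterator a) {
    return a.current_ptr + a.element;
}

// sizes_of_buffers holds MAX_NUMBER_OF_ARGS + 1 entries.
__kernel void myKern(__constant int* sizes_of_buffers,
                     __global int* arg0, __global int* arg1, __global int* arg2)
{
    // the loop shown below goes here
}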

So, here is how it works. First, you create an iterator like so:
iterator i = newIterator(arg0);
And then you can write a loop like so:

int accum = 0;
while (!isNULL(i)) {
  accum += *access(i);
  i = nextIterator(i, sizes_of_buffers, arg0, arg1, arg2);
}

It’s going to be a pain to call nextIterator with lots of parameters, so you may want to replace the "sizes_of_buffers… " part with an @ symbol while you write the kernel. When you’re done with the kernel, use some tool (or simply the “Find and replace” of your IDE) to put the full argument list back where it belongs.

You’ll need to adjust this code to fit your needs (or to make it compile, for that matter), but hopefully the idea is clear enough. And I warn you once again, the compiler will not be happy.
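The @ substitution can also be done on the host at run time instead of in the IDE. Below is a small sketch in plain C, assuming you keep the kernel source as a string and expand the marker before handing the result to clCreateProgramWithSource. The function name `expand_marker` is made up for this example. '@' is a convenient marker because it is not a character OpenCL C uses outside of string and character literals.

```c
#include <stddef.h>
#include <string.h>

/* Copies `src` into `dst`, replacing every occurrence of `marker` with
   `expansion`. Returns 0 on success, or -1 if `dst` (capacity `cap`,
   including the terminating NUL) is too small. */
int expand_marker(const char *src, char marker, const char *expansion,
                  char *dst, size_t cap)
{
    size_t used = 0;
    size_t exp_len = strlen(expansion);

    if (cap == 0)
        return -1;

    for (; *src != '\0'; ++src) {
        if (*src == marker) {
            /* Need room for the expansion plus the final NUL. */
            if (used + exp_len >= cap)
                return -1;
            memcpy(dst + used, expansion, exp_len);
            used += exp_len;
        } else {
            /* Need room for this character plus the final NUL. */
            if (used + 1 >= cap)
                return -1;
            dst[used++] = *src;
        }
    }
    dst[used] = '\0';
    return 0;
}
```

With this, a line written as `i = nextIterator(i, @);` in the kernel source becomes the full call once `@` is expanded to `sizes_of_buffers, arg0, arg1, arg2`, and the same expansion string can be regenerated if the number of arguments changes.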

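Correction to my first version of the switch: the labels have to match the incremented argument index, so

    0: output.current_ptr = arg1; break;
    1: output.current_ptr = arg2; break;

is replaced with

    case 1: output.current_ptr = arg1; break;
    case 2: output.current_ptr = arg2; break;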

Ahh OK, so you’re basically using a huge switch statement to iterate through the arguments.
Thanks a lot for that great, detailed example!
As you suggested, I might rethink my application structure. I will change it so that I have one small kernel which gets executed many times, once for each memory chunk, and the work between the chunks will be done by the CPU.
I hope executing many kernels is not too slow. But the cost might not be too high:
Thanks a lot, Salabar!
Have a great weekend,


PS: If I get the chance later, I will compare the performance of the two methods and post the results here. :wink:

It’s nothing. Creating new threads is relatively cheap, especially if you use multiple queues. Don’t hesitate to write many kernels, since it improves readability. And it is infinitely faster than any other option.