Array size at runtime

I have a question, probably a simple one. I need to create a “work array” that is only needed during the execution of one “work item”. My understanding is that I can simply declare it as a sort of local variable inside of the kernel, for example:

float temp[1024];

This works just fine, so long as the array size is a constant. But as soon as I size it with a runtime value, such as:

float temp[bufferSize];

I start to get “out of resource” errors raised when I try to read the results back from the buffers.

My question is: what is the best way to create a small temp buffer? Does it need to be a parameter to the kernel? I’ve gotten it to work this way, but then I have to create a very large buffer to hold space for all of my work items.

Okay, I think I found the answer to my own question. When I run this on an AMD device, I actually get a compile error. (NVIDIA did not generate a compile error; it just would not work properly.)

Line 81: error: expression must have a constant value
float test[inputSize];

However, I still wonder: if I need a small buffer that is the same size for each “work item”, yet could change per run of the program, how is it best to create such a temp buffer?

If you know that the maximum size of that buffer is reasonably small, you could always allocate that maximum.

Other than that, you could compile multiple versions of the program with different maximum sizes for that array (you can use a #define for that). Since the array is private to each work-item, performance will suffer as that buffer gets larger.
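To make the #define idea concrete, here is a minimal host-side sketch in C. It only formats the option string; the macro name TEMP_SIZE and the helper make_build_options are hypothetical names picked for illustration. The assumption is that the kernel source declares its array as `float temp[TEMP_SIZE];`, so the size is a compile-time constant again.

```c
#include <assert.h>
#include <stdio.h>

/* Formats the option string for the OpenCL compiler so that the kernel
 * source can declare `float temp[TEMP_SIZE];` with a constant size.
 * TEMP_SIZE is a hypothetical macro name chosen for this sketch.
 * Returns 0 on success, -1 if the string did not fit in `out`. */
int make_build_options(char *out, size_t out_len, unsigned temp_size)
{
    assert(out != NULL && out_len > 0);
    int n = snprintf(out, out_len, "-D TEMP_SIZE=%u", temp_size);
    return (n < 0 || (size_t)n >= out_len) ? -1 : 0;
}
```

The resulting string would go in as the `options` argument of clBuildProgram. The program has to be rebuilt whenever the size changes, so it probably makes sense to keep the built programs around if only a handful of sizes ever occur.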

Alternatively, you could use local memory and manually make each work-item index into that local memory so that there’s no overlap between different work-items. Local memory is not as fast as private memory (the methods above) but it saves you the work of recompiling the program. It would look something like this:

__kernel void foo(uint buffer_elements_per_work_item, __local float* buffer_start)
{
    // Each work-item takes its own non-overlapping slice of the local buffer.
    __local float* work_item_private_buffer = buffer_start + buffer_elements_per_work_item * get_local_id(0);

    // From here on you can use work_item_private_buffer as if it was declared as this:
    // float work_item_private_buffer[buffer_elements_per_work_item];
}
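On the host side, the local buffer has to be sized for the whole work-group, not for one work-item. The following is a rough sketch in plain C of that arithmetic; the helper names are made up for this example, and it assumes a one-dimensional work-group and a float that is the same size as the kernel's float.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes of local memory one work-group needs when every work-item
 * gets `elements_per_work_item` floats. */
size_t local_buffer_bytes(size_t elements_per_work_item, size_t work_group_size)
{
    assert(work_group_size > 0);
    return elements_per_work_item * work_group_size * sizeof(float);
}

/* Offset (in elements) of the slice owned by one work-item. Mirrors
 * buffer_start + buffer_elements_per_work_item * get_local_id(0). */
size_t slice_offset(size_t elements_per_work_item, size_t local_id)
{
    return elements_per_work_item * local_id;
}

/* Returns 1 if the request fits in the device's local memory, else 0.
 * `device_local_mem_bytes` is assumed to come from clGetDeviceInfo
 * queried with CL_DEVICE_LOCAL_MEM_SIZE. */
int fits_in_local_memory(size_t elements_per_work_item,
                         size_t work_group_size,
                         size_t device_local_mem_bytes)
{
    return local_buffer_bytes(elements_per_work_item, work_group_size)
           <= device_local_mem_bytes;
}
```

The byte count would then be passed to clSetKernelArg for the __local argument (index 1 in foo above), with the value pointer set to NULL. Local memory is usually limited to a few tens of kilobytes per work-group, so it is worth checking the size before enqueueing the kernel.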