How to allocate arbitrarily sized temp memory per work item?

I’m trying to figure out the best way to allocate some temporary memory
per work item in my kernel. The temp memory needed varies between
roughly 1 KB and 64 KB, depending on the overall data dimensions being
processed at the time.

If I use PRIVATE memory, it seems it must be allocated statically,
which means I’d need to either:

A. always allocate the maximum size (not good, because the kernel
doesn’t usually need that much and it would slow down the typical
smaller case), or

B. make different versions of the kernel with different sizes of temp
memory (not good because this can result in a large matrix of kernels
since I already need different versions for other reasons) or

C. use a #define for the size and regenerate/rebuild the kernel each
time (not great because the size can change per kernel call and
rebuilding takes some time).

I could try using LOCAL memory instead, but I think my card (a GTX 285)
has at most 16 KB of local memory per work group, so even when I need
less than that limit, it would restrict my work-items per work-group (?)

So am I stuck with GLOBAL memory? (not good because it is much slower)

Or am I missing something that might work better?

Any suggestions?


For your use case, it looks like you will need to either use global memory and split it across work-items as necessary, or simply use a lot of private memory.

Is it possible to rearrange the algorithm into smaller pieces so that you can make do with local memory? Or at least move some data into local memory for a while, work on it there, then swap it back out to global memory.