I am porting code to the GPU that uses a huge circular buffer (typically 3 GB) organized as rows (of typically 64k).
In the CPU code it is implemented as an array of pointers, each pointing to a row of data.
When we need to rotate the buffer, we only rotate the small pointer array.
In the CPU case, the data itself is allocated as one block (of 3 GB), but on a GPU I must allocate the data in 4 buffers (due to the CL_DEVICE_MAX_MEM_ALLOC_SIZE limit).
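For reference, the CPU-side scheme described above can be sketched roughly as follows. This is only a minimal illustration, not the actual code: the names `ring_t`, `ring_init`, `ring_rotate` and `ring_free` are hypothetical, and the rotation direction (row 0 moves to the end) is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical names, for illustration only. */
typedef struct {
    float  *data;    /* one contiguous block holding all rows */
    float **row;     /* row[i] points at the start of logical row i */
    size_t  n_rows;
    size_t  row_len; /* number of floats per row */
} ring_t;

static int ring_init(ring_t *r, size_t n_rows, size_t row_len)
{
    r->data = calloc(n_rows * row_len, sizeof *r->data);
    r->row  = malloc(n_rows * sizeof *r->row);
    if (!r->data || !r->row) {
        free(r->data);
        free(r->row);
        return -1;
    }
    r->n_rows  = n_rows;
    r->row_len = row_len;
    for (size_t i = 0; i < n_rows; ++i)
        r->row[i] = r->data + i * row_len;
    return 0;
}

/* Rotate by one row: only the small pointer table moves, never the samples.
 * Assumed direction: the old logical row 0 becomes the last row. */
static void ring_rotate(ring_t *r)
{
    assert(r->n_rows > 0);
    float *first = r->row[0];
    memmove(r->row, r->row + 1, (r->n_rows - 1) * sizeof *r->row);
    r->row[r->n_rows - 1] = first;
}

static void ring_free(ring_t *r)
{
    free(r->data);
    free(r->row);
}
```

After `ring_rotate`, `r.row[0]` refers to what used to be row 1, and the cost is proportional to the number of rows rather than to the size of the data.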
If it were a single block, I could have used integers to give the row permutation instead of pointers,
e.g. accessing buf0[row[i]*65536+j] instead of buf[i][j], with initially buf[i]=buf0+i*65536.
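A minimal sketch of that integer-permutation variant, written as plain C so the indexing is explicit. The helper names `elem` and `rotate_rows` are hypothetical, and the row length of 65536 elements is simply taken from the expression above.

```c
#include <assert.h>
#include <stddef.h>

#define ROW_LEN 65536u /* elements per row, as in the expression above */

/* Address of element (i, j) when rows are reached through an integer
 * permutation table instead of a pointer table. */
static float *elem(float *buf0, const unsigned *row, size_t i, size_t j)
{
    assert(j < ROW_LEN);
    return &buf0[(size_t)row[i] * ROW_LEN + j];
}

/* Rotate the permutation by one: only the small integer table changes.
 * Assumed direction: the old logical row 0 becomes the last row. */
static void rotate_rows(unsigned *row, size_t n_rows)
{
    if (n_rows < 2)
        return;
    unsigned first = row[0];
    for (size_t i = 0; i + 1 < n_rows; ++i)
        row[i] = row[i + 1];
    row[n_rows - 1] = first;
}
```

Since the table holds offsets rather than addresses, it stays meaningful wherever the single block ends up, which is why this form would be easy to pass to a kernel.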
But if the data is allocated as 4 blocks (say float *buf0, *buf1, *buf2 and *buf3), is it possible to have a kernel initialize an array to buf0, buf0+16384, buf0+2*16384, …, buf1, buf1+16384, … and so on, and use this array of pointers in successive kernels?
(I.e. can we mix pointers to different cl_mem objects, all of the same type, within the same array, AND are the cl_mem object addresses constant between successive kernel calls? Or does the CL_DEVICE_MAX_MEM_ALLOC_SIZE limit mean that the address spaces are completely separate, so that there are no real pointers, only pointers within one given cl_mem object?)
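To make the intended table concrete, here is what it would look like in ordinary host C, where mixing pointers into four separate allocations is unproblematic. This is only an analogy under stated assumptions: `build_row_table` and `N_BLOCS` are hypothetical names, and the sketch says nothing about whether the equivalent is valid across cl_mem objects, which is exactly the question.

```c
#include <assert.h>
#include <stddef.h>

enum { N_BLOCS = 4 }; /* number of separate allocations, as in the question */

/* Fill rows[] with bloc[0], bloc[0]+row_len, bloc[0]+2*row_len, ...,
 * then bloc[1], bloc[1]+row_len, ... and so on.
 * rows must have room for N_BLOCS * rows_per_bloc pointers. */
static void build_row_table(float *const bloc[N_BLOCS], size_t rows_per_bloc,
                            size_t row_len, float **rows)
{
    for (size_t b = 0; b < N_BLOCS; ++b)
        for (size_t r = 0; r < rows_per_bloc; ++r)
            rows[b * rows_per_bloc + r] = bloc[b] + r * row_len;
}
```

With a row length of 16384 floats, entry 1 of the table is buf0+16384 and the first entry after the rows of buf0 is buf1, matching the sequence listed above.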
At least, is a construct like
and then working on the row of data in p[0…16383] valid?
Is there any way to store pointers such as p in an array that a new kernel could access as a __constant array, from which the pointers can be retrieved?