OpenCL Binary | Effective Memory Usage

Timo_Burggraf · June 10, 2012, 10:10pm

Hi Community!

Right know I try to handle a problem, and my google-Research did not work out. I do some calculations of Voronoi-Cells (nearest neighbor-problem). Therefore, I write into my Result array, which is quite big (Number of Points > 250.000).

I only want to store True/False to keep the information of closest neighbors. (i.e. Pnt 12 is Neighbour of Pnt 13 => 12/13 = 1)

I compressed this information to a uint8 linear array (Values 0-255) and I do bit manipulations to store 8 Neighbor Informations.

Furthermore the array grows with the Gaussian sum:

Matrix looks like:

0 1 2 3

0

1 0

2 0 1

3 0 1 0

Source:

nachbar = zeros(array_size, dtype=uint8)

nachbar_buf = cl.Buffer(ctx, mf.READ_WRITE | mf.USE_HOST_PTR, hostbuf=nachbar)

int index = (cur_compare_pnt-1)*cur_compare_pnt/2 + cur_pnt;

int main_index = index / 8;

int sub_index = index % 8;

nachbar_buf[main_index] |= 1<<sub_index;

I am not sure, if this was understandable… :shock: I hope so. My problem is, that this process is not vectorizable, since I write with different Point Tupels into the same uint8 Entry.

My question is: Is there a way to use a bool-linear- array in OpenCL? Or is there a better way at all?

notzed · June 11, 2012, 5:11pm

Is it even worth compressing? 250k is nothing on modern hardware. Of course, if this grouping makes the next stage of processing that much more efficient that is another matter.

But you have some options:

do a full bytes worth in each thread. less parallelism but with a big enough problem it should still be ok
use atomic_or to merge results from each group of N threads
use a reduction algorithm to perform the bit merging without atomics

For atomics you probably want to use local memory as a staging area, as with the reduction algorithm. The reduction algorithm might be better as atomics can be slow - although most hardware has specialised units which make certain operations fast.

Timo_Burggraf · June 11, 2012, 10:43pm

Hi notzed,

thanks for your answer. The Problem is not the 250k Points, the problem is that i have to store the nearest neighbor information. The amout of data is given by
size = 1/2 * N * ( N -1). This is for 250k ca. 31G Connections (so. ca. 31 GB)

Right now I try to work with semaphors and atomic in generell but somehow I am to unskilled to be successful:-) I simply dont understand the idea of barriers and local groups.

Does somebody has a nice tutorial for this kind of problems?

Timo_Burggraf · June 11, 2012, 10:50pm

Hi notzed,

“do a full bytes worth in each thread. less parallelism but with a big enough problem it should still be ok”

I was able to apply this idea to some other code, which creates large matrizes parallely.

Source:
__kernel void create_big_ones( __global unsigned char *big_ones)
{
int main_index = get_global_id(0);
for( int sub_index = 0; sub_index < 8; sub_index++)
{
int index = sub_index + main_index * 8;

        float id1_temp = sqrt(2 * index + 0.25f) + 0.5f;
        int id1 = (int) id1_temp;
        int id0 = index - (id1-1)*id1/2;
        
        //printf("%d %d %d\

",index,main_index,sub_index);
//printf("Points to check: %d %d %d
", id0, id1, index);

        int main_index = index / 8;
        if(id0 == 1)
        {
            big_ones[main_index] |= 1&lt;&lt;sub_index;
        }
    }

}

This code work perfectly parallel. But this idea is not appliable for my Nearest Neighbour problem, since I cannot check a specific connection (i.e. Point 12 to Point 14 ? Neighbors). I start the code with a specific point and get a list of Neighbors. So, I do not know which byte will be modified in advance.

:shock: I hope that this is somehow understandable to anouther person :shock:

Ciao and thanks
Timo