I’m doing a standard sum-of-ints algorithm. The minor twist is that I do this on an 2 dimensional matrix in which I sum each column individually.

The matrix is laid out in global memory in a column after column fashion.

__kernel void reduce(__global ushort *g_matrix,__global uint *g_sums,__local uint *s_local)

I use something like : g_matrix[col*number_of_rows+row] to index the g_matrix.

If I run my algorithm with a small input matrix of say 32*32 ushorts then it works just fine : I get an array with 32 sums, one for each column. If I scale things up though, e.g. 512*512 ushorts, then I seem to get a wrap-around after column 127. Now is 128 columns * 512 rows per column of course equal to 64 k, and as usual in any computing environment strange things happen if you cross the 16 bit boundary (and usually 32 bit as well)

I’ve tried to find a limit in the OpenCL definition which would explain this behavior but couldn’t find anything I considered related. What did I overlook ?

Thanks in advance

What is your local group size?

How do you calculate your local memory index?

My global workgroup size is 256*512 and my local workgroup size is 256*2. The problem doesn’t seem to be in indexing the local memory, but in global memory. When I try to index the g_matrix with anything over 65536 it seems to wrap around to reading at index 0 again. I’ve included the kernel code below.

```
__kernel void reduce(__global ushort *g_matrix,__global uint *g_line,__local uint *s_column)
{
// g_matrix : 0..511,0..511
// g_line : 0..511
// s_column : 0..255,0..1
size_t g_row=get_global_id(0); // 0..255
size_t g_row_work_size=get_global_size(0); // 256
size_t g_column=get_global_id(1); // 0..511
size_t g_column_work_size=get_global_size(1); // 512
size_t l_row=get_local_id(0); // 0..255
size_t l_row_work_size=get_local_size(0); // 256
size_t l_column=get_local_id(1); // 0..1
size_t l_column_work_size=get_local_size(1); // 2
ushort offset=l_row_work_size; // 256,128,64,32,16,8,4,2,1,0
// copy current columns from global to shared memory for faster access and
// do the first addition within a column in a single stroke
s_column[l_column*l_row_work_size+l_row]=
g_matrix[g_column*g_row_work_size*2+g_row]+
g_matrix[g_column*g_row_work_size*2+g_row+offset];
offset>>=1;
barrier(CLK_LOCAL_MEM_FENCE);
// do the remaining additions in the columns in shared memory
while(offset>0)
{
if(l_row<offset)
{
s_column[l_column*l_row_work_size+l_row]+=
s_column[l_column*l_row_work_size+l_row+offset];
}
offset>>=1;
barrier(CLK_LOCAL_MEM_FENCE);
}
// copy sum to global result memory
if(l_row==0)
{
g_line[g_column]=s_column[l_column*l_row_work_size];
}
}
```

Everything works fine for the first 128 columns, but after that things seem to wrap back again to column 0 : the results are identical although the matrix data at column 128 and beyond is different from that at column 0 (for testing I’m providing data which starts with 0 at (0,0) and increments by 1 for each next cell, row first).

// copy current columns from global to shared memory for faster access and

// do the first addition within a column in a single stroke

s_column[l_column*l_row_work_size+l_row]=*

g_matrix[g_columng_row_work_size*2+g_row]+*

g_matrix[g_columng_row_work_size*2+g_row+offset];

Are you sure you have to multiply by 2 here? With buffer size 256x512 (columns by rows I think) jump out of memory after 128.

Hi,

g_matrix is not 256*512, but 512*512 in size, but I’m throwing 256 threads at a time at each column of it, each of which copies two elements from global memory, 256 cells apart, adds the two values and stores the result in local memory.

So g_row_work_size is half the size of a column, and therefore I have to multiply by two to get the column size.

You can find the sizes of the arguments right after the kernel function definition.