# Problem with copying global to local memory

I have a problem with local memory: when I try to copy a global buffer C1[12] to local memory, something goes wrong.
C1 = {1,1,1,1,2,2,2,2,3,3,3,3}
But after the copy to the local array ‘C’, the elements of ‘C’ are as follows:
C = {{1,1,1,2}, (it should be 1,1,1,1)
{2,2,2,3}, (it should be 2,2,2,2)
{3,3,3,3}}

``````
#define BLOCK_SIZE 16
#define BLOCK_COL 3

__kernel void exmple1(const __global float *C1,
                      __global float *O,
                      const int col,
                      const int hard)
{
    int ar = get_global_id(0);

    __local float C[BLOCK_SIZE][BLOCK_COL];

    if (ar < col * hard) // col = 4, hard = 3
    {
        C[ar / col][ar % col] = C1[ar]; // col = 4
    }

    barrier(CLK_LOCAL_MEM_FENCE);

    if (ar < col * hard)
        O[ar] = C[ar / col][ar % col];
}

``````

The elements of the output buffer are:
0 = 1
1 = 1
2 = 1
3 = 2
4 = 2
5 = 2
6 = 2
7 = 3
8 = 3
9 = 3
10 = 3
11 = 3

So what is the problem here? Any ideas?

I see one problem …

Anyway, why are you using a 2d array in the first place if you’re just accessing it using linear addressing?

Integer division/mod (of a non-constant) can be really slow on a GPU: it might be implemented using shifts and loops.
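The output in the question is consistent with neighbouring rows of the local array overlapping: C is declared with BLOCK_COL = 3 columns while the kernel indexes it with col = 4, so in practice C[0][3] resolves to the same address as C[1][0]. The host-side C sketch below imitates this. `flat_offset` and `simulate_copy` are hypothetical helpers written only for illustration, and running the work-items in increasing order is an assumption, since a real device gives no ordering guarantee for colliding stores.

```c
#include <assert.h>

#define BLOCK_SIZE 16
#define BLOCK_COL 3 /* declared row width, as in the kernel in the question */

/* Flat offset of C[row][column] in an array declared as
   C[BLOCK_SIZE][BLOCK_COL]: consecutive rows are BLOCK_COL elements apart. */
static int flat_offset(int row, int column)
{
    return row * BLOCK_COL + column;
}

/* Host-side imitation of the kernel: each "work-item" ar stores c1[ar] at
   the offset that C[ar / col][ar % col] would resolve to, then reads it back.
   Work-items run in increasing order here (an assumption, see above). */
static void simulate_copy(const float *c1, float *out, int col, int hard)
{
    float local_mem[BLOCK_SIZE * BLOCK_COL] = {0};
    int n = col * hard;

    /* Keep every computed offset inside local_mem. */
    assert(col > 0 && hard > 0 && hard <= BLOCK_SIZE);
    assert(n <= BLOCK_SIZE * BLOCK_COL);

    for (int ar = 0; ar < n; ++ar)
        local_mem[flat_offset(ar / col, ar % col)] = c1[ar];

    /* barrier(CLK_LOCAL_MEM_FENCE) would sit here in the kernel */

    for (int ar = 0; ar < n; ++ar)
        out[ar] = local_mem[flat_offset(ar / col, ar % col)];
}
```

Called with col = 4, hard = 3 and the C1 from the question, `simulate_copy` yields 1,1,1,2,2,2,2,3,3,3,3,3, the same values as the reported output buffer. Changing BLOCK_COL to 4 makes every offset equal to ar, which removes the overlap.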

Wow, I didn’t notice that. Thank you!

> Anyway, why are you using a 2d array in the first place if you’re just accessing it using linear addressing?

Actually, C1 is really a 2D array of size [3][4], and I will use it later in other computations. So I was thinking of converting it to a 2D array, since I feel that writing C[i][j] is better than writing C[i*col+j] each time we reference an element of the array. Is that right?

> Integer division/mod (of a non-constant) can be really slow on a GPU: it might be implemented using shifts and loops.

Another issue I ran into recently: the kernel works well when I use one work-group, but with more than one I can’t understand the result. For example, if I use 2 work-groups and do the following:

``````
#define BLOCK_SIZE 16
#define BLOCK_COL 4

__kernel void exmple1(const __global float *C1,
                      __global float *O,
                      const int col,
                      const int hard)
{
    int ar = get_global_id(0);

    __local float C[BLOCK_SIZE][BLOCK_COL];

    if (ar < col * hard) // col = 4, hard = 3
    {
        C[ar / col][ar % col] = C1[ar]; // col = 4
    }

    barrier(CLK_LOCAL_MEM_FENCE);

    O[ar] = C[0][3]; // I updated this line; C[0][3] = 1
}
``````

The result differs between the 2 work-groups. For example, with 8 work-items (4 work-items in each work-group) the result is as follows:

0 = 1
1 = 1
2 = 1
3 = 1

4 = 0
5 = 0
6 = 0
7 = 0

The first 4 results are correct, but the other 4 are wrong, since they should be 1.
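A probable explanation for the zeros: `__local` memory is allocated per work-group, so the second group has its own copy of C, and only its own work-items (ar = 4..7) write to it. Those writes go to row 1, which leaves C[0][3] unwritten in that group. The host-side C sketch below imitates this under stated assumptions: BLOCK_COL is taken as 4, the copy of C is zero-filled to match the reported output (real local memory starts uninitialized), and both function names are hypothetical.

```c
#include <assert.h>

#define BLOCK_SIZE 16
#define BLOCK_COL 4 /* assumed: the 4-column variant of the kernel */

/* One work-group of the kernel: it owns a private copy of C, and only the
   work-items belonging to this group store into it. */
static void run_work_group(const float *c1, float *out, int group_id,
                           int local_size, int col, int hard)
{
    float C[BLOCK_SIZE][BLOCK_COL] = {{0}}; /* zero-fill is an assumption */
    int first = group_id * local_size;

    assert(col > 0 && col <= BLOCK_COL && hard <= BLOCK_SIZE);

    for (int ar = first; ar < first + local_size; ++ar)
        if (ar < col * hard)
            C[ar / col][ar % col] = c1[ar];

    /* barrier(CLK_LOCAL_MEM_FENCE) would sit here in the kernel */

    for (int ar = first; ar < first + local_size; ++ar)
        out[ar] = C[0][3];
}

/* One possible fix: each work-item loads several elements, striding by the
   group size, so every group fills its whole copy of C before reading. */
static void run_work_group_cooperative(const float *c1, float *out,
                                       int group_id, int local_size,
                                       int col, int hard)
{
    float C[BLOCK_SIZE][BLOCK_COL] = {{0}};
    int first = group_id * local_size;

    assert(col > 0 && col <= BLOCK_COL && hard <= BLOCK_SIZE);

    for (int lid = 0; lid < local_size; ++lid)
        for (int i = lid; i < col * hard; i += local_size)
            C[i / col][i % col] = c1[i];

    for (int ar = first; ar < first + local_size; ++ar)
        out[ar] = C[0][3];
}
```

With 2 groups of 4 work-items, `run_work_group` gives 1 for items 0..3 and 0 for items 4..7, as reported, while the cooperative variant gives 1 for all eight. In the kernel itself, the strided loop would start at `get_local_id(0)` and step by `get_local_size(0)`.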

Whether you write it or the compiler adds it, the GPU still has to perform the address-calculation arithmetic. It has no hardware support for 2D arrays; they are just a software construct.

But your loading of the data has to do much more arithmetic, because col is a kernel argument rather than a compile-time constant. You’re in effect converting a flat index to 2D, and then the compiler has to add the code to convert it from 2D back to 1D.
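To make the round trip concrete, here is a small host-side C sketch. It assumes a 3 x 4 tile with compile-time dimensions; `ROWS`, `COLS`, `load_flat`, `tile_at` and `round_trip` are hypothetical names chosen for illustration. The load uses the flat index directly, and 2D-style access is provided by a helper that does the same `row * COLS + column` arithmetic a compiler would generate for `C[row][column]`.

```c
#include <assert.h>

#define ROWS 3
#define COLS 4 /* compile-time constant: no runtime division is needed */

/* Load through the flat index: element ar goes to slot ar, with no
   division or modulo. */
static void load_flat(const float *c1, float *tile)
{
    for (int ar = 0; ar < ROWS * COLS; ++ar)
        tile[ar] = c1[ar];
}

/* 2D-style read on top of the flat storage. */
static float tile_at(const float *tile, int row, int column)
{
    assert(row >= 0 && row < ROWS && column >= 0 && column < COLS);
    return tile[row * COLS + column];
}

/* What C[ar / col][ar % col] amounts to when the row width equals col:
   split the flat index into (row, column), then fold it back again. */
static int round_trip(int ar, int col)
{
    int row = ar / col;
    int column = ar % col;
    return row * col + column;
}
```

For any non-negative ar and positive col, `round_trip(ar, col)` returns ar again, so the division and modulo in the load buy nothing. Later computations can still use (row, column) coordinates through a helper such as `tile_at`.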

[quote:coiyt44b]
Integer division/mod (of a non-constant) can be really slow on a GPU: it might be implemented using shifts and loops.