Problem with copying global to local memory

howaidi · September 14, 2011, 6:58am

I have a problem with local memory. When I try to copy a global buffer C1[12] to local memory, something goes wrong.
C1 = {1,1,1,1,2,2,2,2,3,3,3,3}
But after the copy into the local array ‘C’, its elements are as follows:
C = {{1,1,1,2}, (it should be 1,1,1,1)
{2,2,2,3}, (it should be 2,2,2,2)
{3,3,3,3}}


#define BLOCK_SIZE 16
#define BLOCK_COL 3

__kernel void exmple1(const __global float *C1,
                      __global float *O,
                      const int col,
                      const int hard)
{
    int ar = get_global_id(0);

    __local float C[BLOCK_SIZE][BLOCK_COL];

    if (ar < col * hard) // col = 4, hard = 3
    {
        C[ar / col][ar % col] = C1[ar]; // col = 4
    }

    barrier(CLK_LOCAL_MEM_FENCE);

    if (ar < col * hard)
        O[ar] = C[ar / col][ar % col];
}

The elements of the output buffer are:
0 = 1
1 = 1
2 = 1
3 = 2
4 = 2
5 = 2
6 = 2
7 = 3
8 = 3
9 = 3
10 = 3
11 = 3

So what is the problem here? Any ideas?

notzed · September 14, 2011, 12:03pm

I see one problem …

Anyway, why are you using a 2d array in the first place if you’re just accessing it using linear addressing?

Integer division/mod (of a non-constant) can be really slow on a GPU: it might be implemented using shifts and loops.
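
For anyone reading along, here is one reading of the output posted above, written as a small host-side C sketch rather than as OpenCL. It assumes the cause is that the local array is declared with BLOCK_COL = 3 columns while the kernel indexes it with col = 4, so column index 3 of one row lands on column 0 of the next. The function name copy_through_tile, the tile_cols parameter and MAX_COLS are made up for this illustration; the 2D array is modelled as flat storage so the out-of-range column is well defined on the host.

```c
#include <assert.h>

#define BLOCK_SIZE 16
#define MAX_COLS   8   /* upper bound used by this sketch only */

/* Models the kernel's copy-in and copy-out on the host.
   tile_cols plays the role of BLOCK_COL (the declared row width),
   col is the kernel argument used to split the flat index.
   C[row][column] addresses offset row * tile_cols + column, so a
   column index >= tile_cols spills into the following row. */
void copy_through_tile(const float *in, float *out, int n,
                       int col, int tile_cols)
{
    float tile[BLOCK_SIZE * MAX_COLS] = {0};

    assert(n >= 0 && col > 0 && tile_cols > 0 && tile_cols <= MAX_COLS);

    /* Before the barrier: every work item stores one element.
       The loop runs the work items in order, so when two of them hit
       the same slot the later one wins; on a real device that order
       is not defined. */
    for (int ar = 0; ar < n; ++ar) {
        int off = (ar / col) * tile_cols + (ar % col);
        assert(off < BLOCK_SIZE * MAX_COLS);
        tile[off] = in[ar];
    }

    /* After the barrier: every work item reads its element back. */
    for (int ar = 0; ar < n; ++ar) {
        int off = (ar / col) * tile_cols + (ar % col);
        out[ar] = tile[off];
    }
}
```

Called with col = 4 and tile_cols = 3 on the C1 values from the first post, this reproduces the reported 1,1,1,2,2,2,2,3,3,3,3,3; with tile_cols = 4 the data comes back unchanged.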

howaidi · September 15, 2011, 3:04pm

Wow, I didn’t notice that, thank you.

Anyway, why are you using a 2d array in the first place if you’re just accessing it using linear addressing?

Actually, C1 is really a 2D array of size [3][4], and I will use it later in other computations, so I thought I would convert it to a 2D array, since I feel that using C[i][j] is better than writing C[i+col*j] each time we want to reference an element of the array. Is that right?

Integer division/mod (of a non-constant) can be really slow on a GPU: it might be implemented using shifts and loops.

Can you please give me more hints about this?

Another issue I ran into recently: the kernel works well when I use one work group, but if I use more than one I can’t understand the result. For example, if I use 2 work groups and do the following:



#define BLOCK_SIZE 16
#define BLOCK_COL 4

__kernel void exmple1(const __global float *C1,
                      __global float *O,
                      const int col,
                      const int hard)
{
    int ar = get_global_id(0);

    __local float C[BLOCK_SIZE][BLOCK_COL];

    if (ar < col * hard) // col = 4, hard = 3
    {
        C[ar / col][ar % col] = C1[ar]; // col = 4
    }

    barrier(CLK_LOCAL_MEM_FENCE);

    O[ar] = C[0][3]; // I changed this line; C[0][3] should be 1
}
The result differs between the 2 work groups. For example, if I have 8 work items (4 work items in each work group), the result is as follows:

0= 1
1= 1
2= 1
3= 1

4= 0
5= 0
6= 0
7= 0


The first 4 results are correct, but the other 4 are wrong, since they should be '1'.

notzed · September 24, 2011, 12:49am

Whether you write it or the compiler adds it, the GPU will still have to perform the address calculation arithmetic. It will not have hardware support for 2D arrays; these are just a software construct.

But your loading of the data will have to do much more arithmetic since col is an argument. You’re in effect taking a flat index to 2d, and then the compiler has to add the code to convert it from 2d to 1d again.
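
To make the point about the extra arithmetic concrete, here is a small host-side C sketch (not OpenCL). It assumes a row-major 3 x 4 layout; the names copy_divmod, copy_flat and at are invented for this illustration. Both copies fill the same flat storage, but the first one splits the flat index with a divide and a modulo only to recombine it as row * COLS + column, which is the same address arithmetic the compiler emits for C[row][column].

```c
#include <assert.h>

#define ROWS 3
#define COLS 4

/* Flat index -> (row, column) -> flat index again.
   col is a run-time value, as in the kernel, so the divide and the
   modulo cannot be folded away at compile time. */
void copy_divmod(const float *src, float *dst, int col)
{
    assert(col == COLS);
    for (int ar = 0; ar < ROWS * COLS; ++ar) {
        int row = ar / col;
        int column = ar % col;
        dst[row * COLS + column] = src[ar];
    }
}

/* Same result using the flat index directly: no divide, no modulo. */
void copy_flat(const float *src, float *dst)
{
    for (int ar = 0; ar < ROWS * COLS; ++ar)
        dst[ar] = src[ar];
}

/* 2D-style read access on the flat storage, for later computations. */
float at(const float *tile, int row, int column)
{
    assert(row >= 0 && row < ROWS && column >= 0 && column < COLS);
    return tile[row * COLS + column];
}
```

So the data can be loaded with the flat index and still be read later as at(tile, i, j), which keeps the readable 2D form without paying for the index split during the load.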

Integer division/mod (of a non-constant) can be really slow on a GPU: it might be implemented using shifts and loops.

Can you please give me more hints about this?

Some hardware has no integer division hardware. It has to be done in software by the compiler. It works like long division, bit by bit. I really couldn’t tell you much more than that, and I don’t know if it’s of practical importance now I think about it.
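
As a rough picture of what "long division, bit by bit" means, here is a textbook shift-and-subtract (restoring) division in plain C. It is only a sketch of the general technique, not what any particular GPU compiler actually emits, and the name soft_udivmod is made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Unsigned 32-bit division by shift-and-subtract.
   Produces one quotient bit per loop iteration, so 32 iterations in
   total, each with a shift, a compare and possibly a subtract.
   Returns the quotient; stores the remainder in *rem if rem is not NULL. */
uint32_t soft_udivmod(uint32_t n, uint32_t d, uint32_t *rem)
{
    uint32_t q = 0;
    uint64_t r = 0;   /* 64-bit so the shift below cannot overflow */

    assert(d != 0);

    for (int bit = 31; bit >= 0; --bit) {
        /* Bring down the next bit of the dividend. */
        r = (r << 1) | ((n >> bit) & 1u);
        /* If the divisor fits, subtract it and set this quotient bit. */
        if (r >= d) {
            r -= d;
            q |= (uint32_t)1 << bit;
        }
    }
    if (rem != NULL)
        *rem = (uint32_t)r;
    return q;
}
```

For the kernel above, soft_udivmod(ar, col, &r) would yield ar / col as the return value and ar % col in r, which shows why a single divide plus modulo can cost dozens of simple instructions on hardware without a divider.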

Another issue I ran into recently: the kernel works well when I use one work group, but if I use more than one I can’t understand the result. For example, if I use 2 work groups and do the following:

It’s hard to see from your example what you’re trying to do, e.g. won’t the result depend on C1?
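
One possible explanation for the posted numbers, sketched on the host in plain C: __local memory is private to each work group, and the kernel indexes it with get_global_id(0). Assuming 2 groups of 4 work items and BLOCK_COL = 4, the second group only writes row 1 of its own copy of C and never touches C[0][3]. The names run_groups, cooperative and UNWRITTEN are invented for this illustration, and the strided load shown as the alternative is just one common pattern, not necessarily what was intended here.

```c
#include <assert.h>

#define BLOCK_SIZE 16
#define BLOCK_COL  4
#define UNWRITTEN  (-1.0f)  /* stands in for whatever __local memory held */

/* Host-side model of the second kernel.
   cooperative == 0: each work item stores C1[gid] at the position given
                     by its global id, as in the posted kernel.
   cooperative != 0: the work items of each group load all n elements
                     between them, striding by the group size. */
void run_groups(const float *c1, float *out, int n, int col,
                int num_groups, int group_size, int cooperative)
{
    assert(col > 0 && col <= BLOCK_COL && n >= 0 && n <= BLOCK_SIZE * col);

    for (int g = 0; g < num_groups; ++g) {
        float tile[BLOCK_SIZE][BLOCK_COL];   /* one copy per work group */

        for (int r = 0; r < BLOCK_SIZE; ++r)
            for (int c = 0; c < BLOCK_COL; ++c)
                tile[r][c] = UNWRITTEN;

        /* Everything before the barrier. */
        for (int lid = 0; lid < group_size; ++lid) {
            int gid = g * group_size + lid;
            if (cooperative) {
                for (int i = lid; i < n; i += group_size)
                    tile[i / col][i % col] = c1[i];
            } else if (gid < n) {
                tile[gid / col][gid % col] = c1[gid];
            }
        }

        /* Everything after the barrier. */
        for (int lid = 0; lid < group_size; ++lid)
            out[g * group_size + lid] = tile[0][3];
    }
}
```

With the posted indexing, the first group outputs 1 and the second outputs the UNWRITTEN marker (the 0 in the posted results would then just be what the device happened to hold). With the strided load, every group sees C[0][3] = 1.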

It can be frustrating when things don’t seem to be working, but for such simple examples you’ve probably just forgotten something or made a mistake.