Hello all

I am performing multiplication between two square bit-matrices using a specific formula (not the usual row/column dot product; I describe it below). My implementation works correctly, but it falls apart when I attempt tiling in local memory. Even though I understand the principles behind it (i.e., putting some of my data into local arrays, then using barriers to synchronize my work-items), I keep getting wrong results.

This is an example of how I set up the matrices:

```
int N = 256; // number of bits in a row/col (or any number evenly divisible by 64)
unsigned long long* A = new unsigned long long[N*N/64]; // stores bits packed into 64-bit integers
unsigned long long* B = new unsigned long long[N*N/64]; // e.g. one 256-bit row uses 4x64-bit integers
int* C = new int[N*N];
```

This is the actual formula:

Row i of matrix A is XORed with row j of matrix B. Yes, it is a row-to-row operation. Then I count the number of 1s in the result and assign that count to C[i][j].

This is how I launch the kernel:

```
const size_t global[2] = { (size_t)N, (size_t)N };
clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, NULL, 0, NULL, &event);
```

This is the actual kernel which works correctly:

```
__kernel void BitProduct(const int N, const __global ulong* A, const __global ulong* B, __global int* C)
{
    const int i = get_global_id(0); // row of A
    const int j = get_global_id(1); // row of B
    ulong sum = 0;
    for (int k = 0; k < N/64; k++)  // walk the 64-bit words of both rows
        sum += popcount(A[i*(N/64) + k] ^ B[j*(N/64) + k]);
    C[i*N + j] = (int)sum;
}
```
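For reference, this is roughly the shape I would expect a local-memory version to take. This is a sketch, not my failing attempt: it assumes a hypothetical tile size `TILE` that divides N, and that the host passes a local work size of {TILE, TILE} (which my launch code above does not currently do):

```
#define TILE 16 // assumed work-group edge; local size must be {TILE, TILE}

__kernel void BitProductTiled(const int N, const __global ulong* A, const __global ulong* B, __global int* C)
{
    const int i  = get_global_id(0);
    const int j  = get_global_id(1);
    const int li = get_local_id(0);
    const int lj = get_local_id(1);
    const int words = N / 64;

    // Cache one 64-bit word of each of the group's TILE rows of A and B.
    __local ulong Atile[TILE];
    __local ulong Btile[TILE];

    ulong sum = 0;
    for (int k = 0; k < words; k++) {
        // One slice of work-items fills each local array.
        if (lj == 0) Atile[li] = A[i*words + k];
        if (li == 0) Btile[lj] = B[j*words + k];
        barrier(CLK_LOCAL_MEM_FENCE); // wait until the tile is loaded

        sum += popcount(Atile[li] ^ Btile[lj]);
        barrier(CLK_LOCAL_MEM_FENCE); // keep the tile until everyone has read it
    }
    C[i*N + j] = (int)sum;
}
```

Note that every work-item in the group executes the same number of loop iterations, so the barriers are reached uniformly; putting a barrier inside divergent control flow is one classic way tiled kernels produce wrong results.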