Hello

I am new to OpenCL and want to parallelize some looping code thats doing lu factorization with the looping structure showed by exact code as below:

```
for(int k = 0; k < N-1; k++)
{
for(int i = k+1; i < N; i++)
S[i*N + k] = S[i*N + k] / S[k*N + k];
for(int j = k+1; j < N; j++)
for(int i = k+1; i < N; i++)
S[i*N + j] -= S[i*N + k] * S[k*N + j];
}
```

I have done with the simple opencl kernel with single work items (no groping). Thats following:

```
int IDx = get_global_id(0);
int IDy = get_global_id(1);
for(int k = 0; k < n-1; k++)
{
barrier(CLK_GLOBAL_MEM_FENCE);
if(IDy > k && IDx == k)
matrix[IDy*n + IDx] = matrix[IDy*n + IDx] / matrix[IDx*n + IDx];
barrier(CLK_GLOBAL_MEM_FENCE);
for(int j = k+1; j < n; j++)
{
if(IDy > k && IDx == j)
matrix[IDy*n + IDx] -= matrix[IDy*n + k] * matrix[k*n + IDx];
}
}
```

But I dont get correct results when compared to the serial code, this is my personal try for OpenCL kernel and I am still learning how this data parallel scheme in OpenCL works, Can you point out what I am doing wrong in the kernel?