Kernel synchronization problem

I need to run this code in OpenCL.

for (i = 0; i < M; i++)
{
    for (j = 0; j < N; j++)
    {
        tau[ f[k][i] ][ f[k][i+1] ] += Q/L[k];
    }
}

Here f is an M x (N+1) matrix and tau is an N x N matrix.

When I set the range to M x N and try to perform the computation, I need synchronization; otherwise one update can overwrite another. I tried to use atom_add, but the Nvidia implementation only supports int and long. I need float.

What should I do?

You haven’t quite provided enough information to give a good answer. We would need to understand the relationship between the variables tau, f, L and k and the multiple work-groups and work-items that are executing the code. We don’t even know whether these variables are in global or local memory. We also don’t know how large N and M are.

Assuming that multiple work-items in the same work-group need to write to tau, while different work-groups do not write to overlapping portions of tau, one solution is to store temporary results in local memory, use a reduction to accumulate them, and finally store the result in global memory. If local memory is not large enough, you can use global memory for the temporaries instead.
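To make the "temporaries plus reduction" idea concrete, here is a minimal sketch in plain host-side C rather than OpenCL C. It assumes one private accumulation buffer per row k of f (the name partial and both function names are made up for illustration), uses the small sizes from the example later in the thread, and sizes tau as (N+1) x (N+1) so that every index appearing in that example is valid. In a kernel, the outer loop over k in the first function would be the parallel dimension and partial would live in local or global memory.

```c
#define M 2          /* rows of f (illustrative size) */
#define N 3          /* updates per row; each row of f has N + 1 entries */
#define T (N + 1)    /* assumption: tau is T x T so that index N is valid */

/* Step 1: every row k accumulates into its own private copy, so no two
   rows ever write to the same memory location. */
static void accumulate_private(float partial[M][T][T],
                               const int f[M][N + 1],
                               float Q, const float L[M])
{
    for (int k = 0; k < M; k++)          /* parallel dimension in a kernel */
        for (int i = 0; i < N; i++)
            partial[k][f[k][i]][f[k][i + 1]] += Q / L[k];
}

/* Step 2: reduction. Each tau cell sums its M private values; cells are
   independent of each other, so this step needs no synchronization either. */
static void reduce_into_tau(float tau[T][T], float partial[M][T][T])
{
    for (int a = 0; a < T; a++)
        for (int b = 0; b < T; b++) {
            float sum = 0.0f;
            for (int k = 0; k < M; k++)
                sum += partial[k][a][b];
            tau[a][b] += sum;
        }
}
```

The cost of this approach is memory: the temporaries take M times the size of tau, which is why the choice between local and global memory depends on how large M and N are.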

Sorry, the code was wrong. Here is the right one:

for (k = 0; k < M; k++)
{
    for (i = 0; i < N; i++)
    {
        tau[ f[k][i] ][ f[k][i+1] ] += Q/L[k];
    }
}

I am only using global memory, with an N x M range. N is the dimension of my problem and M is the number of “solvers” (f has one row per solver).

If f is:

[0,1,2,3]
[0,1,2,3]

Then, when I update tau, the code will try to update tau[0][1] twice, then tau[1][2] twice, and so on. That’s why I need synchronization.
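The collision can be checked with a small serial reference in plain C. This is only a sketch: the writes counter and the function name are added for illustration, the values of Q and L are up to the caller, and tau is sized (N+1) x (N+1) here so that index 3 from the example f is valid.

```c
#define M 2          /* two rows of f, as in the example */
#define N 3          /* updates per row; each row of f has N + 1 entries */
#define T (N + 1)    /* assumption: tau is T x T so that index 3 is valid */

static const int f[M][N + 1] = {
    { 0, 1, 2, 3 },
    { 0, 1, 2, 3 }
};

/* Serial version of the loop. It also counts how many times each cell of
   tau is updated; any count above 1 marks a cell that parallel work-items
   would race on. */
static void update_tau_serial(float tau[T][T], int writes[T][T],
                              float Q, const float L[M])
{
    for (int k = 0; k < M; k++)
        for (int i = 0; i < N; i++) {
            int a = f[k][i];
            int b = f[k][i + 1];
            tau[a][b] += Q / L[k];
            writes[a][b] += 1;
        }
}
```

With this f, the cells tau[0][1], tau[1][2] and tau[2][3] each receive two updates, one from each row, and every other cell receives none.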

Thanks for the link, I will try to learn how to use local memory.
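For the float limitation of atom_add mentioned at the start of the thread, a commonly used workaround is to emulate a float atomic add with a compare-and-swap loop on the 32-bit pattern of the value. The sketch below shows the idea in host-side C11 with <stdatomic.h>, not in OpenCL C; the function names are made up for illustration. In a kernel, the same loop would typically be built on the integer compare-and-exchange atomic (atom_cmpxchg or atomic_cmpxchg, depending on the OpenCL version) applied to an unsigned int view of the float, assuming the device supports it.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

/* Adds `value` to the float whose raw bits are stored in *target and
   returns the value that was there before the add. */
static float atomic_add_float(_Atomic uint32_t *target, float value)
{
    uint32_t old_bits = atomic_load(target);
    uint32_t new_bits;
    float old_val, new_val;

    do {
        memcpy(&old_val, &old_bits, sizeof old_val);   /* bits -> float */
        new_val = old_val + value;
        memcpy(&new_bits, &new_val, sizeof new_bits);  /* float -> bits */
        /* If another thread changed *target in the meantime, the exchange
           fails, old_bits is refreshed with the current contents, and the
           add is retried on the fresh value. */
    } while (!atomic_compare_exchange_weak(target, &old_bits, new_bits));

    return old_val;
}

/* Reads the current float value out of the atomic cell. */
static float load_float(_Atomic uint32_t *target)
{
    uint32_t bits = atomic_load(target);
    float value;
    memcpy(&value, &bits, sizeof value);
    return value;
}
```

This keeps a single copy of tau, but heavily contended cells will retry often, so it may be slower than accumulating into temporaries and reducing.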