Where f is an M x (N+1) matrix and tau is an N x N matrix.
When I set the range to M x N and try to perform the computation, I need some form of synchronization, otherwise one operation can overwrite another. I tried to use atom_add, but the Nvidia implementation only allows int and long. I need float.
You haven’t provided quite enough information for a good answer. We would need to understand how the variables tau, f, L and k relate to the multiple work-groups and work-items that are executing the code. We don’t even know whether these variables are in global or local memory, or how large N and M are.
Assuming that multiple work-items in the same work-group need to write to the same elements of tau, while different work-groups do not write to overlapping portions of tau, one solution is to store each work-item's temporary result in local memory, use a reduction to accumulate those results, and finally have a single work-item store the total in global memory. If local memory is not large enough, you can keep the temporary results in a global memory buffer instead.
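To make the idea concrete, here is a rough sketch of the reduction pattern. Since I don't know your actual kernel, it is written as plain host-side C that emulates a single work-group: each loop over lid stands for what the work-items do in parallel, and the comments mark where a barrier(CLK_LOCAL_MEM_FENCE) would go in the real kernel. The names group_reduce, accumulate_tau, contrib and MAX_GROUP_SIZE are made up for illustration, and I am assuming the work-group size is a power of two.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical upper bound on the work-group size. */
#define MAX_GROUP_SIZE 256

/* Emulates one work-group summing its partial results.
 * contrib[lid] is the (hypothetical) partial result of work-item lid.
 * Assumes group_size is a power of two and <= MAX_GROUP_SIZE. */
float group_reduce(const float *contrib, size_t group_size)
{
    float scratch[MAX_GROUP_SIZE]; /* stands in for a __local float array */

    assert(group_size > 0 && group_size <= MAX_GROUP_SIZE);
    assert((group_size & (group_size - 1)) == 0); /* power of two */

    /* Phase 1: every work-item stores its own partial result. */
    for (size_t lid = 0; lid < group_size; ++lid)
        scratch[lid] = contrib[lid];
    /* kernel: barrier(CLK_LOCAL_MEM_FENCE); */

    /* Phase 2: tree reduction. In each step the first `stride`
     * work-items add the value that sits `stride` slots away. */
    for (size_t stride = group_size / 2; stride > 0; stride /= 2) {
        for (size_t lid = 0; lid < stride; ++lid)
            scratch[lid] += scratch[lid + stride];
        /* kernel: barrier(CLK_LOCAL_MEM_FENCE); */
    }

    return scratch[0];
}

/* Phase 3: only one work-item (lid == 0 in the kernel) touches global
 * memory, so no float atomic is needed inside the work-group. */
void accumulate_tau(float *tau, size_t idx,
                    const float *contrib, size_t group_size)
{
    tau[idx] += group_reduce(contrib, group_size);
}
```

In a real kernel the outer loops over lid disappear (lid comes from get_local_id(0)) and scratch becomes a __local buffer. This only avoids the race if no two work-groups write to the same element of tau, which is the assumption stated above.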