Let’s assume the following kernel:
#define nx (signed)get_global_id(0)
#define ny (signed)get_global_id(1)
#define Nx (signed)get_global_size(0)
#define Ny (signed)get_global_size(1)
__kernel void parallelSum(__global float* matrix, __global float* sum) {
    sum[0] += matrix[nx + Nx * ny]; // every work item reads and writes sum[0]
}
All work items have to access sum at some point. But this can happen only sequentially. So actually there is not much parallelization in this example, right?
It won’t work as intended, although it won’t throw any errors. I imagine that’s because, without synchronization of some sort, the work items race on the sum variable, so the accumulation never happens in anything resembling sequential order.
Actually, it does. I thought OpenCL took care of synchronization in such cases. Am I wrong?
You’re right. It doesn’t work like I thought. Thanks.
This is a reduction operation. There are ways to parallelize reductions, but they are obviously more complex than doing it serially. Search for “OpenCL parallel reduction” for ideas. “Divide and conquer” and use fast shared memory for partial results. Save the atomic operations for the last step.
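For illustration, here is a minimal sketch of what such a two-stage reduction kernel could look like. Everything here is an assumption rather than code from the thread: the kernel name partialSum, the 1D indexing, and the requirement that the host picks a power-of-two work-group size and passes a matching __local buffer.

```c
// Sketch only: assumes a power-of-two work-group size and a __local
// scratch buffer of that size allocated by the host via clSetKernelArg.
__kernel void partialSum(__global const float* input,
                         __global float* partial,
                         __local float* scratch)
{
    int gid = get_global_id(0);
    int lid = get_local_id(0);

    // Each work item copies one element into fast local (shared) memory.
    scratch[lid] = input[gid];
    barrier(CLK_LOCAL_MEM_FENCE);

    // Tree reduction within the work group: halve the active items each step.
    for (int offset = get_local_size(0) / 2; offset > 0; offset >>= 1) {
        if (lid < offset)
            scratch[lid] += scratch[lid + offset];
        barrier(CLK_LOCK_LOCAL_MEM_FENCE == 0 ? CLK_LOCAL_MEM_FENCE : CLK_LOCAL_MEM_FENCE);
    }

    // One work item per group writes that group's partial sum.
    if (lid == 0)
        partial[get_group_id(0)] = scratch[0];
}
```

The host (or a tiny follow-up kernel, possibly using atomics as the answer suggests) then adds up the partial array, which has only one entry per work group.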