How to perform atomic sums on floats

I am trying to get the following kernel to properly add up the global ids. Of course this is pointless, but it illustrates something I am trying to make work in a larger kernel. Basically my kernels perform a fair amount of calculations, but the end result that I want to get back is a small array of various totals. Performing these totals in parallel is the part that does not seem to be working. If I execute the following kernel with a fixed number of work units, I would like to always get the same result. Say for 100 work units, I would expect 0+1+2+3+ … + 99. However, every time I run the kernel, I get a different number.

Is this what mem_fence is attempting to solve? Or is there some other technique I need to use? The number I total needs to be floating point. I also tried putting the mem_fence

kernel void AtomicSum(
    global write_only float* c )
{
    int index = get_global_id(0);
    c[0] += (float)index;
}

You are right, I would also expect 0+1+2+3+…+99.

Did you try barrier(CLK_GLOBAL_MEM_FENCE); instead of mem_fence()?

The code is not right, and it does not perform the summation that you want. The problem is in the way you perform the operation: you are telling OpenCL that every workItem updates the same memory location, so when a workItem does its read, add and write, the value it read may already be stale because another workItem has changed it in the meantime. In other words, the updates are not performed sequentially. Basically there are two ways of doing this operation: one is to assign the whole summation to ONE workItem, and the other is to use some method of reduction, where you divide the operation into partial sums and then combine them.

No, it is doing what I want it to… I want it to update the same area of memory. Can’t OpenCL share memory? I thought that’s what mem_fence was for.

Basically I am trying to write the equivalent of:

float sum = 0;
for (int i = 0; i < 100; i++)
    sum += i;

So that’s why I am updating the same piece of memory. I perform a complex calculation thousands upon thousands of times, but I do not need to know the individual results, just the sum. Are you saying I need to allocate a very large buffer and never touch the same piece of memory twice? I got it to work that way, but it takes way too much memory.

No… you don’t understand me. Let’s take an example. You have one piece of memory for the result value, and another piece of memory with some values. When you sum all these values into the result, each workItem reads the result from memory and adds its own value. But the result it read at that moment has also been read by other workItems that are doing the same operation. So everyone computes a sum from the same old value, and then each one stores its own, incorrect, result, overwriting the others!
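To make the lost update concrete, here is a minimal sketch in plain host-side C (not OpenCL) that replays the worst-case interleaving described above in a single thread. The function names lost_update_sum and serial_sum are made up for this illustration, and a real device will interleave reads and writes unpredictably rather than in this fixed order, which is why the kernel returns a different number on each run.

```c
#include <stddef.h>

/* Replays the worst-case interleaving: every "work-item" reads the shared
   result before any of them has written it back. Hypothetical helper. */
float lost_update_sum(const float *values, size_t n)
{
    float result = 0.0f;   /* the single shared result cell, like c[0] */
    float stale = result;  /* the value that every work-item read */

    for (size_t i = 0; i < n; i++) {
        /* each work-item adds its value to the stale copy and stores it,
           overwriting whatever the previous work-item stored */
        result = stale + values[i];
    }
    return result;         /* only the last store survives */
}

/* What the questioner wants: each addition sees the previous result. */
float serial_sum(const float *values, size_t n)
{
    float result = 0.0f;
    for (size_t i = 0; i < n; i++)
        result += values[i];
    return result;
}
```

With the values 0, 1, 2, 3 the first function returns 3 (only the last store is kept), while the second returns 6.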

The atomic operations in the OpenCL specification do these operations in the correct way; in other words, they apply the updates to the same memory location one after another. They are slow.

Another thing: barriers are used to synchronize the workItems of the same workGroup. But the order in which the workItems execute after the barrier is not specified, so a barrier in your code does not help. All it says is that execution can go on past the barrier only once every workItem in the same workGroup has performed ALL the operations before the barrier.

You can use the atomic_cmpxchg function and a C union to achieve this for floating point.
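The compare-and-exchange idea can be sketched as follows. This is a host-side C11 analogue, not kernel code: it uses atomic_compare_exchange_weak from <stdatomic.h> where an OpenCL kernel would call atomic_cmpxchg on a volatile global unsigned int pointer, and the names float_bits, atomic_add_float and load_float are invented for the example. The float is stored as its 32-bit pattern, and the loop retries whenever another thread changed the cell between the read and the exchange.

```c
#include <stdatomic.h>
#include <stdint.h>

/* View the same 32 bits either as an integer (for the atomic exchange)
   or as a float (for the arithmetic). Hypothetical helper type. */
typedef union {
    uint32_t bits;
    float    value;
} float_bits;

/* Atomically performs *cell += operand, where *cell holds a float's bits. */
void atomic_add_float(_Atomic uint32_t *cell, float operand)
{
    float_bits expected, desired;

    expected.bits = atomic_load(cell);
    do {
        /* compute the new sum from the value we believe is in the cell */
        desired.value = expected.value + operand;
        /* the store happens only if the cell still holds expected.bits;
           on failure expected.bits is refreshed and the loop retries */
    } while (!atomic_compare_exchange_weak(cell, &expected.bits, desired.bits));
}

/* Reads the cell back as a float. */
float load_float(_Atomic uint32_t *cell)
{
    float_bits current;
    current.bits = atomic_load(cell);
    return current.value;
}
```

Every update is serialized on one memory cell, so this is correct but, as noted above for atomics in general, slow when many work-items hit the same address. Floating-point addition is also not associative, so for non-integer values the total can still vary slightly with the order in which the updates land.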

You can implement several “reduce” steps in your program to aggregate a large dataset in a parallel manner: each step combines partial results without concurrent writes to the same location, and the final result is produced by the last “reduce” step.
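The “reduce” steps can be sketched like this, again as plain host-side C rather than kernel code. The function name reduce_sum is made up for the illustration. The inner loop stands in for the workItems of one step: each index i writes only to its own slot, so no two of them touch the same location, and in a kernel a barrier would separate one step from the next.

```c
#include <stddef.h>

/* In-place pairwise reduction over partial[0..n-1]; returns the total.
   The array is overwritten. Hypothetical helper for illustration. */
float reduce_sum(float *partial, size_t n)
{
    if (n == 0)
        return 0.0f;

    /* each pass folds the upper half of the live range onto the lower half */
    for (size_t len = n; len > 1; ) {
        size_t half = (len + 1) / 2;   /* size of the range after this pass */

        /* in a kernel, every i here would be a separate work-item */
        for (size_t i = 0; i < len / 2; i++)
            partial[i] += partial[i + half];

        len = half;                    /* an odd middle element is carried over */
    }
    return partial[0];
}
```

For 100 inputs this takes 7 passes instead of 100 serialized updates, and the buffer only needs one slot per partial result, which can be one per workItem in a workGroup rather than one per calculation.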