Adding up a large array.

If at the end of a computation I want to add up all the elements in a one-dimensional array A, how would I best go about doing this?

I figured I could do this in O(log(n)) time, by halving the field repeatedly and doing
A[i] := A[i] + A[i + n/2]
for all i.

Here is the idea (serial) in pseudocode (for simplicity I’m assuming the size of the field is a power of two):

n := n / 2
while (n > 0) do
  for i = 1 to n do
    A[i] := A[i] + A[i+n]
  n := n / 2

I want each iteration of the for loop to be done by one work item, however, I am not sure which of the following two approaches would be best or most idiomatic here:

[li]Enqueing the kernel log(n) times with an ever decreasing work size.[/:m:18gjs991][/li][]Putting the while loop (with log(n) iterations) inside the kernel and having some sort of synchronisation point inside the loop so that the additions don’t conflict with each other.[/*:m:18gjs991][/ol]

NVIDIA has a really good whitepaper on reduction optimizations. Although it‘s for CUDA the concepts directly transfer to OpenCL and AMD has a helpful comparison chart of terms and functions between CUDA and OpenCL.

As Sean said, what you describe is called a parallel reduction.

Thank you for the replies.

That term was the pointer I needed.