Parallel Reduction combined with other Operations

I am trying to write a function that performs a few vector operations, which are followed by a reduction operation.

However, I am confused about how to incorporate the parallel reduction example (using work-groups), and I am not sure whether the reduction operation is correct in this case.

The kernel code is a hack, as I tried to incorporate a reduction example into an existing vector routine.

__kernel void vSubRed(__global float * a, __global float * b, __global float * c,
                      __local float * temp, __local float * local_sums,
                      __global float * partial_sums, const unsigned int count)
{
   int i = get_global_id(0);
   int group_id = get_group_id(0);
   int num_work_items = get_local_size(0);
   int local_id = get_local_id(0);
   float accum = 0.0f;
   float sum = 0.0f;
   float8 partial_sum_vec = 0.0f;
   int jstart = 0;
   int jend = 0;

   if(i < count)
   {
       a[i] = b[i] - a[i];
       c[i] = a[i];
       temp[i] = a[i] * a[i];

       jstart = (group_id * num_work_items + local_id) * count;
       jend = jstart + count;

       for(int j = jstart; j < jend; j += 8)
       {
           // Not sure how to assign the values from the vector temp to the float vector
           // there is probably a more elegant way than below
           partial_sum_vec.s0 = temp[j];
           partial_sum_vec.s1 = temp[j+1];
           partial_sum_vec.s2 = temp[j+2];
           partial_sum_vec.s3 = temp[j+3];
           partial_sum_vec.s4 = temp[j+4];
           partial_sum_vec.s5 = temp[j+5];
           partial_sum_vec.s6 = temp[j+6];
           partial_sum_vec.s7 = temp[j+7];

           // Add the eight lanes of the float8 into the scalar accumulator
           accum += partial_sum_vec.s0 + partial_sum_vec.s1 + partial_sum_vec.s2 + partial_sum_vec.s3 + partial_sum_vec.s4 +
                    partial_sum_vec.s5 + partial_sum_vec.s6 + partial_sum_vec.s7;
       }

       local_sums[local_id] = accum;

       if(local_id == 0)
       {
           for(int k = 0; k < num_work_items; k++)
               sum += local_sums[k];
           partial_sums[i] = sum;
       }
   }
}

Thanks in advance


There are a few problems with this code that I see:

  • If temp is a local memory region, you probably never want to index it with get_global_id(0) or i in your code. Every work-group gets its own copy of local memory, so you don’t index it with a global or group ID. You typically index it with a local ID (i.e. get_local_id(0)).
  • If you have a work-item that uses an element of local memory being written by another work-item, you’ll want a barrier (barrier(CLK_LOCAL_MEM_FENCE)) between the write and the read. I’m looking specifically at the line that reads “temp[i] = a[i] * a[i]” and the following loop that reads it.
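  • The population of the 8-wide vector partial_sum_vec would probably be more efficient if it was written as a single vector load, like this: “partial_sum_vec = vload8(0, temp + j);” (or, if temp + j is suitably aligned for a float8, “partial_sum_vec = *((__local float8 *)(temp + j));”). But this will depend on the compiler.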
  • The structure of your reduction isn’t quite right. There’s a number of ways to do this, but maybe you wanted each work item to be responsible for 8 elements of the input. If that’s the case, then perhaps temp should be a private array instead of a local array. Then each work item will have to load 8 elements into it. And then the advice in my first and second bullets is moot. :)
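
To make the shape of a work-group reduction a bit more concrete, here is a rough sketch in plain C that emulates it serially on the host. It is not your kernel and not the only way to structure this: GROUP_SIZE, group_sum_sq_diff and sum_sq_diff are names I made up for illustration, each loop over lid stands in for the work-items of one group running in parallel, and I am assuming one input element per work-item and a power-of-two group size. The comments mark where a real kernel would need barrier(CLK_LOCAL_MEM_FENCE).

```c
#include <stddef.h>

/* Hypothetical work-group size for this sketch; must be a power of two. */
#define GROUP_SIZE 8

/* Emulates one work-group: updates a and c for the group's slice and
 * returns the sum of (b[i] - a[i])^2 over that slice. */
float group_sum_sq_diff(float *a, const float *b, float *c,
                        size_t count, size_t group_id)
{
    /* Stands in for __local memory: one slot per work-item,
     * indexed by the local id, never by the global id. */
    float local_sums[GROUP_SIZE];

    /* Phase 1: every "work-item" handles one element. */
    for (size_t lid = 0; lid < GROUP_SIZE; lid++) {
        size_t gid = group_id * GROUP_SIZE + lid;
        float d = 0.0f;                 /* out-of-range items contribute 0 */
        if (gid < count) {
            d = b[gid] - a[gid];
            a[gid] = d;
            c[gid] = d;
        }
        local_sums[lid] = d * d;
    }
    /* Kernel: barrier(CLK_LOCAL_MEM_FENCE) here, outside any divergent
     * branch, so all local_sums entries are visible to the whole group. */

    /* Phase 2: tree reduction, halving the number of active items. */
    for (size_t stride = GROUP_SIZE / 2; stride > 0; stride /= 2) {
        for (size_t lid = 0; lid < stride; lid++)
            local_sums[lid] += local_sums[lid + stride];
        /* Kernel: another barrier(CLK_LOCAL_MEM_FENCE) after each step. */
    }

    /* Kernel: the work-item with local id 0 would store this to
     * partial_sums[group_id] (indexed by group, not by global id). */
    return local_sums[0];
}

/* Emulates the host side: add up one partial sum per work-group. */
float sum_sq_diff(float *a, const float *b, float *c, size_t count)
{
    size_t num_groups = (count + GROUP_SIZE - 1) / GROUP_SIZE;
    float total = 0.0f;
    for (size_t g = 0; g < num_groups; g++)
        total += group_sum_sq_diff(a, b, c, count, g);
    return total;
}
```

Note that in this layout the work-items past the end of the input still take part in the reduction (they just contribute zero), which is what lets every work-item reach the barriers.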

Hope this helps!