Question on local synchronization


I am looking at this kernel I found (and scratching my head):

__kernel void sqrt_sum(
          __global float * vec1,
          __global float * result)
{
  for (unsigned int stride = get_global_size(0)/2; stride > 0; stride /= 2)
  {
    if (get_global_id(0) < stride)
      vec1[get_global_id(0)] += vec1[get_global_id(0)+stride];
    barrier(CLK_GLOBAL_MEM_FENCE);
  }
  if (get_global_id(0) == 0)
    *result = sqrt(vec1[0]);
}

I understand the for-loop. The problem is the synchronization. The barrier is defined to work only within the same work-group: “All the work-items of a work-group must execute the barrier before any are allowed to continue execution beyond the barrier.”

Does this imply that, for this kernel to work, the work-group size must be equal to get_global_size(0), so that only one work-group is launched (running on a single compute unit)?
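To make the concern concrete, here is a minimal sketch in plain C that emulates the stride loop on the CPU. It is not ViennaCL code: both function names are hypothetical, and the second function assumes one particular (but legal) schedule in which each work-group runs the whole loop to completion before the next work-group starts, since nothing synchronizes across work-groups.

```c
#include <assert.h>

/* Hypothetical emulation: a single work-group, so the barrier covers every
   work-item and each stride step completes before the next one begins. */
float reduce_single_group(float *v, unsigned n)
{
    assert(n > 0 && (n & (n - 1)) == 0);      /* halving assumes a power of two */
    for (unsigned stride = n / 2; stride > 0; stride /= 2)
        for (unsigned id = 0; id < stride; ++id)   /* all work-items, then "barrier" */
            v[id] += v[id + stride];
    return v[0];
}

/* Hypothetical emulation: several work-groups of size `local`, executed one
   after another. The stride is still derived from the global size n. */
float reduce_groups_in_sequence(float *v, unsigned n, unsigned local)
{
    assert(local > 0 && n % local == 0);
    for (unsigned first = 0; first < n; first += local)          /* one work-group at a time */
        for (unsigned stride = n / 2; stride > 0; stride /= 2)
            for (unsigned id = first; id < first + local; ++id)
                if (id < stride)
                    v[id] += v[id + stride];   /* may read a value another group has not updated yet */
    return v[0];
}
```

For the input 1, 2, ..., 8 the single-group version returns 36, while reduce_groups_in_sequence with local = 2 returns 21, because the first group consumes vec1[2] and vec1[3] before the second group has updated them. In this emulation a split into only two groups still happens to give 36, since work-items in the upper half never write anything.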

The maximum work-group size is limited by CL_KERNEL_WORK_GROUP_SIZE, typically to 512. This means that the product x * y * z of the local sizes within one work-group cannot exceed 512, but it is possible to have x = 512, y = 1, z = 1.

This kernel thus works only for get_global_size(0) of at most 512 (for AMD), on the condition that a single work-group of the same size as get_global_size(0) is specified when clEnqueueNDRangeKernel is called?

get_global_size(0) == get_local_size(0) //??
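The condition can be written down as a small host-side check. This is only a sketch: single_group_launch_ok is a hypothetical helper, not part of OpenCL or ViennaCL, and max_wg_size is assumed to be the value the host queried with clGetKernelWorkGroupInfo and CL_KERNEL_WORK_GROUP_SIZE. The power-of-two test is my own addition, since repeated halving of the stride skips elements otherwise.

```c
#include <stddef.h>

/* Hypothetical helper: returns 1 if the reduction can be launched as one
   work-group, i.e. with local size == global size, and 0 otherwise. */
int single_group_launch_ok(size_t global_size, size_t max_wg_size)
{
    if (global_size == 0 || global_size > max_wg_size)
        return 0;                                   /* would need more than one work-group */
    return (global_size & (global_size - 1)) == 0;  /* stride halving needs a power of two */
}
```

If the check passes, the same value would be passed as both the global and the local work size to clEnqueueNDRangeKernel.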


I think your analysis is correct. Where did you get this kernel?

Thanks. The kernel is from the ViennaCL library.