Optimize global and local size vs WORK_GROUP_SIZE_MULTIPLE

Hi all

According to spec 1.0

  • CL_DEVICE_MAX_WORK_ITEM_SIZES : Maximum number of work-items that can be specified in each dimension of the work-group to [clEnqueueNDRangeKernel]
  • in my case this is 24

According to spec 1.1

  • CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE: Returns the preferred multiple of workgroup size for launch. This is a performance hint. Specifying a workgroup size that is not a multiple of the value returned by this query as the value of the local work size argument to [clEnqueueNDRangeKernel] will not fail to enqueue the kernel for execution unless the work-group size specified is larger than the device maximum
  • in my case 32 (analogous to warp size in CUDA)

Now, how can clEnqueueNDRangeKernel() take a local size that is a multiple of 32 when, at the same time, the global size (which is supposed to be a multiple of local) is capped at 24? I have certainly misunderstood something…

  1. The only combination I can get to work is {global = 32, local = 32}. I presume this is not efficient at all?
  2. Is {global = 32 * 24, local = 32} correct?
  3. Is {global = 24, local = 32} correct? (That is, do we really not have to make global a multiple of local, as reported elsewhere?)
  4. Why do some people report that global must be evenly divisible by local? Is that true?

If anybody could tell me what I am missing here, that would be a great help!
Thanks in advance,
best regards