Hi all
According to spec 1.0:
CL_DEVICE_MAX_WORK_ITEM_SIZES: Maximum number of work-items that can be specified in each dimension of the work-group to clEnqueueNDRangeKernel.
In my case this is 24.
According to spec 1.1:
CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE: Returns the preferred multiple of workgroup size for launch. This is a performance hint. Specifying a workgroup size that is not a multiple of the value returned by this query as the value of the local work size argument to clEnqueueNDRangeKernel will not fail to enqueue the kernel for execution unless the work-group size specified is larger than the device maximum.
In my case this is 32 (analogous to warp size in CUDA).
Now, how can clEnqueueNDRangeKernel() take a local size that is a multiple of 32 while, at the same time, the global size (which is supposed to be a multiple of local) is limited to 24? I have certainly misunderstood something…
- The only combination I can get to work is {global = 32, local = 32}. I presume this is not efficient at all?
- Is {global = 32 * 24, local = 32} correct?
- Is {global = 24, local = 32} correct? (That is, global does not have to be a multiple of local, contrary to what is reported elsewhere.)
- Why do some people report that global must be evenly divisible by local? Is that true?
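To make the divisibility question concrete, here is a minimal sketch in plain C. It assumes the OpenCL 1.x rule that the global work size must be evenly divisible by the local work size in each dimension. The helper names round_up_global and is_evenly_divisible are hypothetical and are not part of the OpenCL API; the sizes are just the numbers from this post.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not an OpenCL call): round the number of
   work-items n up to the next multiple of the local size, so that the
   padded global size is evenly divisible by local. */
size_t round_up_global(size_t n, size_t local)
{
    assert(local > 0);
    return ((n + local - 1) / local) * local;
}

/* Hypothetical check for one dimension, assuming the 1.x rule:
   returns 1 if global is a whole number of work-groups of size local,
   0 otherwise. */
int is_evenly_divisible(size_t global, size_t local)
{
    assert(local > 0);
    return global % local == 0;
}
```

Under that assumption, {global = 24, local = 32} would fail the check, {global = 32 * 24, local = 32} would pass, and round_up_global(24, 32) would give 32 as the padded global size. With a padded global size, the kernel would also need a bounds check (for example comparing get_global_id(0) against the real element count) so the extra work-items do nothing.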
If anybody could tell me what I am missing here, that would be a great help!
Thanks in advance
best regards