Please explain CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE

Reaver · October 9, 2013, 11:06am

Hi, everyone.

I’m pretty new to OpenCL and GPGPU.

I’ve read in different places that CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE is related to, or similar to, warps and wavefronts, but isn’t quite the same thing. Could someone explain the difference or point me to an explanation?
Also, is there another way of querying the “warp size” on a device that is not from AMD or NVIDIA?

(I’m sorry if this topic has been done to death, but I wasn’t able to find anything when searching the forum.)

Bilog · October 12, 2013, 4:26am

The preferred work-group size multiple is the value that the OpenCL platform thinks the local work-group size should be a multiple of to achieve optimal performance. On NVIDIA GPUs, this is always returned as the warp size, and on AMD GPUs it is always returned as the wavefront size. Work-items are always dispatched in warps/wavefronts at the hardware level, so a local work-group size that is not a multiple of the warp/wavefront size just wastes resources.
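To put a number on the waste, here is a minimal sketch in plain C. The function names `hw_groups_needed` and `idle_lanes` are made up for illustration, and `hw_width` stands for whatever warp/wavefront width the device reports; nothing here calls the OpenCL API.

```c
#include <stddef.h>

/* Number of warps/wavefronts the hardware has to schedule for one
 * work-group of local_size work-items (ceiling division). */
size_t hw_groups_needed(size_t local_size, size_t hw_width)
{
    return (local_size + hw_width - 1) / hw_width;
}

/* Lanes that are allocated but do no useful work because local_size
 * is not a multiple of hw_width. */
size_t idle_lanes(size_t local_size, size_t hw_width)
{
    return hw_groups_needed(local_size, hw_width) * hw_width - local_size;
}
```

For example, a work-group of 100 work-items on a device with a width of 32 needs 4 warps (128 lanes), so 28 lanes sit idle; a work-group of 128 leaves none idle.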

However, on other hardware (e.g. on CPUs) and/or with different compilers (e.g. vectorizing compilers) the actual preferred workgroup size multiple can be kernel dependent. For example, a kernel vectorized by hand that makes excellent use of the SSE/AVX instructions on x86 CPUs might get a preferred workgroup size multiple of 1 (since it fills the vector width of the CPU appropriately), while a kernel which is strictly scalar might get a preferred workgroup size multiple of, e.g. 4 or 8, so that the kernel dispatcher may coalesce multiple workitems in each workgroup to fill all the vector lanes of the CPU.

Regarding your other question: there are extensions to the device information fields both by AMD (CL_DEVICE_WAVEFRONT_WIDTH_AMD) and NVIDIA (CL_DEVICE_WARP_SIZE_NV) which you can use to query the wavefront/warp size, but their use is discouraged in favor of the kernel property.
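As a rough sketch of how the kernel property might be used on the host side: the two helpers below (hypothetical names `pick_local_size` and `padded_global_size`) are plain C arithmetic. I am assuming that `preferred_multiple` was obtained beforehand from clGetKernelWorkGroupInfo with CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE, and that `limit` is the largest local size you are willing or allowed to use, e.g. the value returned for CL_KERNEL_WORK_GROUP_SIZE.

```c
#include <stddef.h>

/* Largest local size <= limit that is a multiple of preferred_multiple.
 * If limit is smaller than the multiple (or the multiple is 0), fall
 * back to limit itself rather than returning 0. */
size_t pick_local_size(size_t limit, size_t preferred_multiple)
{
    if (preferred_multiple == 0)
        return limit;
    size_t rounded = (limit / preferred_multiple) * preferred_multiple;
    return rounded != 0 ? rounded : limit;
}

/* Smallest global size >= n that is a multiple of local_size, so that
 * the global size divides evenly into work-groups. */
size_t padded_global_size(size_t n, size_t local_size)
{
    return ((n + local_size - 1) / local_size) * local_size;
}
```

With a multiple of 8 and a limit of 30 this picks a local size of 24, and 1000 elements get padded to a global size of 1008; with a multiple of 1 the limit is used unchanged. Because the padded global size can exceed the real element count, the kernel needs a bounds check along the lines of `if (get_global_id(0) >= n) return;`.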

Reaver · October 28, 2013, 10:00am

Thank you for the insightful answer! I really appreciate the example of when this might return different values depending on the kernel implementation.