I’m getting great support here on the forum, but a moment ago I ran into a confusing situation, so I started another thread.
Theoretically I should be able to run my kernel with 512 work items in a workgroup. But when I specify local_work_size=512 for clEnqueueNDRangeKernel, no work is done and nearly 10,000 runs of my kernel finish surprisingly fast. When I decrease local_work_size to 256, the kernel runs normally and performs its computations.
Theoretically I should be able to run my kernel with 512 work items in a workgroup.
I believe you are confusing the maximum work-group size supported by the device with the maximum work-group size that can be used to run a particular kernel. A complex kernel typically needs more resources per work-item (registers, for example), so its maximum work-group size can be smaller than the maximum the device supports for a very simple kernel.
That’s why in OpenCL there are two different queries. One of them is clGetDeviceInfo(…, CL_DEVICE_MAX_WORK_GROUP_SIZE, …) – this is the maximum for the device. The other one is clGetKernelWorkGroupInfo(…, CL_KERNEL_WORK_GROUP_SIZE, …) – this one is the maximum value you can pass to clEnqueueNDRangeKernel() for this kernel.
Is CL_KERNEL_WORK_GROUP_SIZE guaranteed to be less than or equal to CL_DEVICE_MAX_WORK_GROUP_SIZE, or would one have to explicitly take the minimum of the two?
Technically speaking, I don’t think the spec guarantees that CL_KERNEL_WORK_GROUP_SIZE must be less than or equal to CL_DEVICE_MAX_WORK_GROUP_SIZE.
However, the spec describes CL_KERNEL_WORK_GROUP_SIZE as the “maximum work-group size that can be used to execute a kernel on a specific device given by <device>”, and that’s all the application cares about.
CL_DEVICE_MAX_WORK_GROUP_SIZE is pretty much irrelevant.
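To make that concrete, here is a minimal sketch of how an application might pick a safe local_work_size from the per-kernel limit alone. The helper name pick_local_size and its parameters are hypothetical, not part of OpenCL. It also assumes the OpenCL 1.x rule that the global work size must be evenly divisible by the local work size, so it returns the largest divisor of the global size that does not exceed the kernel limit.

```c
#include <stddef.h>

/* Hypothetical helper (not an OpenCL API): choose a local work size for a
 * one-dimensional NDRange.
 *
 *   global_size  total number of work items to launch
 *   requested    the local size the application would like to use
 *   kernel_max   the value reported for CL_KERNEL_WORK_GROUP_SIZE
 *
 * Returns the largest value that is <= requested, <= kernel_max, and that
 * divides global_size evenly. Returns 0 if no valid size exists. */
size_t pick_local_size(size_t global_size, size_t requested, size_t kernel_max)
{
    /* The per-kernel limit is the one that matters, so clamp to it. */
    size_t limit = requested < kernel_max ? requested : kernel_max;

    if (global_size == 0 || limit == 0)
        return 0;

    /* A work-group can never be larger than the whole NDRange. */
    if (limit > global_size)
        limit = global_size;

    /* Walk down until we find a size that divides global_size evenly. */
    for (size_t n = limit; n > 1; --n) {
        if (global_size % n == 0)
            return n;
    }
    return 1;
}
```

In real host code, kernel_max would come from clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE, sizeof(size_t), &kernel_max, NULL), and the result would be passed as local_work_size to clEnqueueNDRangeKernel(). It is also worth checking the return value of clEnqueueNDRangeKernel(): a kernel that appears to finish surprisingly fast without doing any work may simply not have been enqueued.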