On OS X 10.6.1 I get the following values on a Core2 Duo CPU:


I’d assume having at least 2 concurrent work items here, what am I missing? Is the CPU implementation not yet multi-threaded?


Local dimensions are used for synchronization (barriers, mem_fences). Even if your local workgroup size is 1, you can still get plenty of parallelism through your global size. (E.g., if your global dimensions are 1000x1000 on the CPU, you have a million work-items that the implementation is free to run in parallel.) You can see what OS X is doing by watching how many threads are running when you execute on the CPU.
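To make the global/local relationship concrete, here is a minimal sketch in plain C (not OpenCL host code) of how a 1D global index breaks down into a workgroup index and a local index. The names `work_item_pos`, `num_groups` and `locate` are made up for illustration, and the sketch assumes a zero global offset and a global size that is a multiple of the local size.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper type: where one work-item sits in a 1D range. */
typedef struct {
    size_t group_id;  /* which workgroup the work-item belongs to */
    size_t local_id;  /* its index inside that workgroup */
} work_item_pos;

/* Number of workgroups in a 1D range.
 * Assumes global_size is a multiple of local_size. */
static size_t num_groups(size_t global_size, size_t local_size)
{
    assert(local_size > 0);
    return global_size / local_size;
}

/* Split a global index into (group, local), assuming a zero global offset. */
static work_item_pos locate(size_t global_id, size_t local_size)
{
    work_item_pos p;
    assert(local_size > 0);
    p.group_id = global_id / local_size;
    p.local_id = global_id % local_size;
    return p;
}
```

With a local size of 1, every work-item ends up in its own workgroup with local index 0, so the amount of work that can be spread over cores stays the same. What you lose is the ability to synchronize between work-items.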

This is a problem if you want to implement reduction algorithms, which require synchronization within a workgroup. I believe AMD’s CPU implementation supports workgroup sizes of up to 1024, but it’s not clear to me what benefit (apart from implementing reductions and the like) this would give. On GPUs the workgroup size physically maps to how the kernel is executed, so it is very important there.

A low work group size (as in exactly one) on the CPU is also a problem when you try to debug programs on the CPU using printfs.
On Apple’s implementation you cannot take advantage of the CPU to debug your code, because you
cannot run the same code on both the GPU and the CPU.
It would be better to be able to do local sync with barriers on the CPU as well.


Thanks for the feedback. My question was probably not very well phrased, I apologise. I understand that there will be n concurrent workgroups depending on the particular ND-range. What I don’t understand is why Apple limits the max work group size to 1 on the CPU in the first place.


I suspect that implementing workgroups of size >1 on the CPU is difficult because every workitem in a group has to stop at each barrier. Either you make each workitem its own thread (very expensive), or you play some compiler tricks so that all the workitems execute on the same thread, with each one stopping when it gets to a barrier so that another workitem can be switched in. So it’s not nearly as easy as on the GPU, where the hardware takes care of that. For this reason I was impressed that AMD does support larger workgroup sizes with their CPU implementation.

There is a paper on executing CUDA programs on multi-core CPUs:
MCUDA: An Efficient Implementation of CUDA Kernels for Multi-core CPUs

What they do is to execute all “logical threads” in a work group sequentially in a single “CPU thread”. To tackle the synchronization problems mentioned by dbs2 they use loop fission (i.e. they split the loops at barriers).
In contrast to GPUs, having many threads on a CPU is not really desirable as it creates a lot of overhead. Usually you don’t want more threads than there are cores and thus workgroup size > 1 would probably improve performance on CPUs.
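As a rough illustration of the loop fission idea, here is a minimal sketch in plain C. It is not taken from the MCUDA paper or from any vendor's implementation. The kernel, `GROUP_SIZE` and `run_group_fissioned` are made up for the example. One CPU thread runs a whole workgroup by looping over the local ids, and the barrier becomes the boundary between two loops.

```c
#include <assert.h>
#include <stddef.h>

#define GROUP_SIZE 4  /* hypothetical workgroup size for this sketch */

/* Kernel being emulated, in OpenCL-style pseudocode:
 *
 *     local_buf[lid] = in[lid] * 2;
 *     barrier(CLK_LOCAL_MEM_FENCE);
 *     out[lid] = local_buf[(lid + 1) % GROUP_SIZE];
 *
 * Running each work-item to completion in turn would be wrong, because
 * work-item 0 would read local_buf[1] before work-item 1 has written it.
 */
static void run_group_fissioned(const int *in, int *out)
{
    int local_buf[GROUP_SIZE];  /* stands in for local memory */

    assert(in != NULL && out != NULL);

    /* Code before the barrier, executed for every work-item. */
    for (int lid = 0; lid < GROUP_SIZE; ++lid)
        local_buf[lid] = in[lid] * 2;

    /* The barrier sits here: the first loop has finished for all
     * work-items before the second loop starts. */

    /* Code after the barrier, executed for every work-item. */
    for (int lid = 0; lid < GROUP_SIZE; ++lid)
        out[lid] = local_buf[(lid + 1) % GROUP_SIZE];
}
```

For an input of {1, 2, 3, 4} this fills out with {4, 6, 8, 2}, which is what the kernel would produce with a real barrier. Barriers inside loops or conditionals, and private variables that have to survive across a barrier, need more work than this sketch shows.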

Hmm, this is interesting…
But why does my crappy P4 2.8 CPU (ATI OpenCL implementation) report:

CL_DEVICE_MAX_WORK_ITEM_SIZES: 1024 / 1024 / 1024

My guess is AMD implemented what dominik outlined, but I have no idea why other than for compatibility. (I can’t see a performance win on the CPU for that approach.)