Local dimensions define the workgroups within which work-items can synchronize (barriers, mem_fences). Even if your local workgroup size is 1, you can still get plenty of parallelism through your global size. (E.g., if your global dimensions are 1000x1000 on the CPU, then you have a million work-items that the runtime can spread across its threads.) You can see what OS X is doing by watching how many threads are running when you execute on the CPU.
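To make the global/local relationship concrete, here is a minimal plain-C sketch of the index arithmetic. It is not OpenCL API code, and it assumes a zero global offset and a global size that is a multiple of the local size. The helpers `num_groups` and `global_id` are hypothetical names; they only mirror what `get_num_groups` and `get_global_id` report inside a kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Number of workgroups along one dimension (mirrors get_num_groups).
   Assumption: global_size is an exact multiple of local_size. */
static size_t num_groups(size_t global_size, size_t local_size) {
    assert(local_size > 0 && global_size % local_size == 0);
    return global_size / local_size;
}

/* Global index of a work-item along one dimension
   (mirrors get_global_id, assuming a zero global offset). */
static size_t global_id(size_t group_id, size_t local_size, size_t local_id) {
    assert(local_id < local_size);
    return group_id * local_size + local_id;
}
```

With a global size of 1000x1000 and a local size of 1x1, `num_groups(1000, 1)` is 1000 in each dimension, so you get a million workgroups of one work-item each. The amount of work is the same as with larger workgroups; only the grouping changes.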
A workgroup size of 1 is a problem, though, if you want to implement reduction algorithms, which require synchronization within a workgroup. I believe AMD’s CPU implementation supports workgroup sizes of up to 1024, but it’s not clear to me what benefit (apart from implementing reductions and the like) this would give. On GPUs the workgroup size maps physically onto how the kernel is executed, so it is very important there.
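To show why reductions depend on the workgroup size, here is a rough sketch in plain C that emulates what one workgroup does in a typical tree reduction. It is a sequential model, not a real kernel: the inner loops stand in for the work-items, and each comment marked "barrier" is where a kernel would call `barrier(CLK_LOCAL_MEM_FENCE)`. The function name `workgroup_sum`, the `MAX_LOCAL_SIZE` limit, and the power-of-two restriction are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LOCAL_SIZE 1024  /* assumed upper bound for this sketch */

/* Emulates one workgroup summing local_size values with a tree reduction.
   Assumption: local_size is a power of two and at most MAX_LOCAL_SIZE. */
static float workgroup_sum(const float *input, size_t local_size) {
    float scratch[MAX_LOCAL_SIZE];  /* stands in for __local memory */
    assert(local_size > 0 && local_size <= MAX_LOCAL_SIZE);
    assert((local_size & (local_size - 1)) == 0);

    /* Each "work-item" copies its own element into local memory. */
    for (size_t lid = 0; lid < local_size; ++lid)
        scratch[lid] = input[lid];
    /* barrier: all copies must finish before anyone reads a neighbour */

    /* Halve the number of active work-items on every step. */
    for (size_t stride = local_size / 2; stride > 0; stride /= 2) {
        for (size_t lid = 0; lid < stride; ++lid)
            scratch[lid] += scratch[lid + stride];
        /* barrier: this step must finish before the next one starts */
    }
    return scratch[0];  /* the partial sum for this workgroup */
}
```

With a `local_size` of 1 the reduction loop never runs and the function just returns its single input, so every bit of combining has to happen across workgroups, for example in a second pass or on the host.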