Those are the per-dimension limits (CL_DEVICE_MAX_WORK_ITEM_SIZES), not the maximum total. You need to query CL_DEVICE_MAX_WORK_GROUP_SIZE to find the figure you want.
e.g. if the latter is 1024 (which afaict is the current limit for NVIDIA; for AMD I think it's only 256) you can use 32x32x1, 1024x1x1, etc, but nothing bigger: the product of the dimensions must not exceed the limit.
Please go read the parts of the OpenCL spec that talk about SMs and so on, as you don't seem to understand how they fit into the programming model. If your code requires global sync it has to run on a single SM, use atomics, or be invoked once per iteration: there simply isn't any other possibility.
PS if your code isn't broken, why are you asking? Just use it. And if it works, what were you timing it against to know this other version was so much slower?
There is no 'limit' to the size of problem you can solve this way: you can iterate each 'block' independently, it just cannot run concurrently.