Some suggestions!


1.) clEnqueueNDRangeKernel requires global_size to be divisible by local_size, but when the user does not specify local_size, at least AMD's implementation can internally pick a local size that does not divide global_size. Suggestion: provide a version of clEnqueueNDRangeKernel that takes the number of work-groups together with local_size, so that global_size is not required to be divisible by local_size. Intel even recommends leaving local_size unspecified, since the runtime will do a better job of choosing it for a given CPU architecture.
2.) Provide global synchronization. Currently, work that must be globally synchronized either requires launching a new kernel or has to run on a single compute unit (which makes local synchronization global as well), and is therefore very limited in size.
3.) Reduction is so commonly needed that I think it would make sense to provide a special kind of kernel to simplify code development. Something like:

__reduction __kernel float MyRed(float a, float b)
{
    return a + b;
}

The goal of every reduction is to reduce the input array(s) into one output value. And then:

clEnqueueNDReductionKernel(MyRedKernel, …);

clSetKernelArg sets the required (input array) arguments. The reduction factor with MyRed is always 2. The work-group count and local size are all determined internally.
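To make "reduction factor 2" concrete, here is a rough host-side sketch in plain C of what such a reduction kernel might do internally: every pass combines elements pairwise until one value is left. This only illustrates the idea and is not how any OpenCL runtime actually implements it; the names my_red and reduce_pairwise are made up for this example, and returning 0 for an empty input is my own assumption.

```c
#include <stddef.h>

/* Hypothetical user-supplied combiner, mirroring MyRed from the post. */
float my_red(float a, float b)
{
    return a + b;
}

/* Reduce data[0..n) in place with a binary combiner and return the result.
 * Each pass roughly halves the number of live elements (reduction factor 2).
 * On a device, the iterations of the inner loop would be independent
 * work-items, and the step between passes is where synchronization is needed. */
float reduce_pairwise(float *data, size_t n, float (*op)(float, float))
{
    if (n == 0) {
        return 0.0f; /* assumption: an empty input reduces to 0 */
    }
    while (n > 1) {
        size_t half = (n + 1) / 2; /* number of live elements after this pass */
        for (size_t i = 0; i + half < n; ++i) {
            data[i] = op(data[i], data[i + half]);
        }
        n = half;
    }
    return data[0];
}
```

Calling reduce_pairwise(a, 5, my_red) on an array holding 1, 2, 3, 4, 5 takes three passes (5 to 3 to 2 to 1 live elements) and returns 15. Note that the input array is overwritten, and that the combiner should be associative and commutative, because the order in which elements are paired differs from a simple left-to-right loop.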