It is hard to tell without seeing more code, but most likely your kernel uses so many resources (local memory, registers…) per work-item that a local work size of 1024 is not workable.
There are actually multiple hardware limits that restrict how far GPU programs can scale in terms of local work size. The most common are:
[ul]
[li]A hard limit on the local work size that applies to any kernel, both overall (CL_DEVICE_MAX_WORK_GROUP_SIZE) and along each dimension (CL_DEVICE_MAX_WORK_ITEM_SIZES).[/li]
[li]Local memory usage per work-group (CL_DEVICE_LOCAL_MEM_SIZE), which usually increases with local work size because most programs consume a fixed amount of local memory per work-item.[/li]
[li]Registers per work-group (not specified by standard OpenCL, though apparently you found an NVidia extension that reports it). Again, programs consume a fixed amount of registers per work-item, so usage grows with local work size.[/li]
[/ul]
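To make the interaction between these three limits concrete, here is a rough sketch in plain C. It does not call OpenCL at all: the `device_limits` and `kernel_cost` structs and all field names are made up for illustration, and the "fixed cost per work-item" model ignores things like register allocation granularity. On a real device, calling `clGetKernelWorkGroupInfo` with `CL_KERNEL_WORK_GROUP_SIZE` on your compiled kernel should give you the actual figure.

```c
#include <stddef.h>

/* Hypothetical summary of the device limits listed above. The first two
 * fields correspond to CL_DEVICE_MAX_WORK_GROUP_SIZE and
 * CL_DEVICE_LOCAL_MEM_SIZE; the register budget is NOT exposed by standard
 * OpenCL and is an assumed value here. */
typedef struct {
    size_t max_work_group_size;
    size_t local_mem_bytes;
    size_t registers_per_group;
} device_limits;

/* Hypothetical per-work-item resource cost of one kernel. */
typedef struct {
    size_t local_mem_bytes_per_item;
    size_t registers_per_item;
} kernel_cost;

/* Largest 1-D local work size that respects all three limits at once,
 * under the simplifying assumption that cost scales linearly per work-item. */
size_t max_local_work_size(device_limits dev, kernel_cost k)
{
    size_t limit = dev.max_work_group_size;

    if (k.local_mem_bytes_per_item > 0) {
        size_t by_local_mem = dev.local_mem_bytes / k.local_mem_bytes_per_item;
        if (by_local_mem < limit)
            limit = by_local_mem;
    }
    if (k.registers_per_item > 0) {
        size_t by_registers = dev.registers_per_group / k.registers_per_item;
        if (by_registers < limit)
            limit = by_registers;
    }
    return limit;
}
```

With an assumed device that allows 1024 work-items, 48 KiB of local memory and 32768 registers per work-group, a kernel that needs 40 registers per work-item is capped well below 1024 by registers alone, even though the device nominally supports 1024.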
Beyond these hard limits, it can also help performance to make sure that each compute unit can run as many work-groups as possible at the same time. This is, again, dictated by how many resources each work-group consumes.
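The same kind of back-of-the-envelope arithmetic applies here. The sketch below is a simplified model, not something OpenCL gives you directly: the function name, its parameters, and the idea of a fixed per-compute-unit cap on resident work-groups are all assumptions for illustration, and real hardware has more constraints than these.

```c
#include <stddef.h>

/* Estimate how many work-groups can be resident on one compute unit at once.
 * All parameters are hypothetical inputs you would have to obtain from vendor
 * documentation or a profiler:
 *   cu_local_mem, cu_registers    : resources available on one compute unit
 *   max_groups                    : assumed hardware cap on resident groups
 *   group_local_mem, group_registers : resources one work-group consumes */
size_t resident_groups_per_cu(size_t cu_local_mem, size_t cu_registers,
                              size_t max_groups,
                              size_t group_local_mem, size_t group_registers)
{
    size_t groups = max_groups;

    /* A resource that the kernel does not use cannot be the limiting factor. */
    if (group_local_mem > 0 && cu_local_mem / group_local_mem < groups)
        groups = cu_local_mem / group_local_mem;
    if (group_registers > 0 && cu_registers / group_registers < groups)
        groups = cu_registers / group_registers;

    return groups;
}
```

The point of the model is that trimming whichever resource is the current bottleneck lets more work-groups share a compute unit, until the next resource becomes the limit.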
Summary: If you want to use a local work size of 1024 (which may or may not be important, since most programs run better with a local work size slightly smaller than the maximum allowed by the device), it seems you will need to find a way to use either less local memory ("shared memory" in NVidia’s terms; this is something YOU control directly) or fewer registers (this is more compiler-dependent, but usually the more complex the kernel and the more private variables it has, the more registers it will need). The NVidia profiler can be used to tell you which is the bottleneck, but knowing NVidia’s usual track record when it comes to OpenCL, I wonder if there is a way to make it work for you here, or if it will tell you to get lost because you’re not using CUDA.
Note that this does not, of course, limit the global work size you can use in and of itself. You can very well have a global work size of 512 000 000 with a local work size of 512, as long as you do not exceed one of the device’s global resource consumption limits (e.g. global memory). Note also that changing the global work size does not, in and of itself, increase program parallelism: as soon as you fully occupy all of your device’s compute units, your program is as parallel as it can be, and extra performance is usually achieved by things like loop unrolling, reducing parallelism in order to more closely match your device capabilities, resolving local memory bank conflicts, etc.
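As a small illustration of how the two sizes relate, here is a sketch in plain C (the helper names are made up). It assumes the OpenCL 1.x rule that the global work size must be a multiple of the local work size, so a problem size that does not divide evenly is rounded up and the kernel has to ignore the extra work-items itself.

```c
#include <stddef.h>

/* Round a problem size up to the next multiple of the local work size.
 * Hypothetical helper; the kernel must skip work-items whose global id
 * is >= n_items when padding was added. */
size_t round_up_global_size(size_t n_items, size_t local_size)
{
    size_t groups = (n_items + local_size - 1) / local_size;
    return groups * local_size;
}

/* Number of work-groups launched for a given global/local size pair,
 * assuming global_size is already a multiple of local_size. */
size_t work_group_count(size_t global_size, size_t local_size)
{
    return global_size / local_size;
}
```

For the figures above, a global work size of 512 000 000 with a local work size of 512 simply means one million work-groups, which the device schedules onto its compute units a few at a time.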