clEnqueueNDRangeKernel with implicit or explicit local_size

I don’t know why I didn’t think about this sooner, but I would like a second opinion on this method.

Often the global_size for a clEnqueueNDRangeKernel() call is not an even multiple of preferred_work_group_size_multiple, or, even worse, it's a prime number, which is a problem if you explicitly pass the local_size. Option one is to pass global_size and pass NULL for local_size; how this is handled is implementation-dependent. Option two is to pad global_size so that it's a multiple of local_size (which is itself a multiple of preferred_work_group_size_multiple), pass the original desired global_size as a kernel argument, and do bounds checking in the kernel so the extra padding work-items are ignored. I had been doing this, but started to think it's really wasteful to do bounds checking when only a tiny fraction of the work-items should be ignored.
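Concretely, option two on the host side would look something like the sketch below (function and variable names, the argument index, and the cl_uint size type are just placeholders; error handling is abbreviated):

```c
#include <CL/cl.h>

/* Option two sketch: pad global_size up to a multiple of local_size and pass
 * the real size as a kernel argument for an in-kernel bounds check. */
static cl_int enqueue_padded(cl_command_queue queue, cl_kernel kernel,
                             size_t global_size, size_t local_size)
{
    /* Round the global size up to the next multiple of local_size. */
    size_t padded = ((global_size + local_size - 1) / local_size) * local_size;

    /* Real problem size, compared against get_global_id(0) inside the kernel. */
    cl_uint n = (cl_uint)global_size;
    cl_int err = clSetKernelArg(kernel, 0, sizeof(cl_uint), &n);
    if (err != CL_SUCCESS)
        return err;

    return clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                                  &padded, &local_size, 0, NULL, NULL);
}
```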

Option three recently dawned on me as I became more familiar with OpenCL: enqueue the kernel with zero offset and global_size_a equal to the greatest multiple of local_size_a (which again is a multiple of preferred_work_group_size_multiple) that is less than or equal to global_size, and then enqueue the kernel again with an offset of global_size_a, global_size_b = global_size - global_size_a, and local_size_b = global_size_b (which is less than local_size_a). Then there's no need for explicit bounds checking inside the kernel like in option two.
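On the host side, option three would look roughly like this (again only a sketch with placeholder names and abbreviated error handling):

```c
#include <CL/cl.h>

/* Option three sketch: launch the largest local_size_a-aligned chunk first,
 * then a second small launch covering the remainder, using a global offset. */
static cl_int enqueue_split(cl_command_queue queue, cl_kernel kernel,
                            size_t global_size, size_t local_size_a)
{
    size_t global_size_a = (global_size / local_size_a) * local_size_a;
    size_t global_size_b = global_size - global_size_a;  /* remainder, < local_size_a */
    cl_int err = CL_SUCCESS;

    if (global_size_a > 0)
        err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                                     &global_size_a, &local_size_a,
                                     0, NULL, NULL);

    if (err == CL_SUCCESS && global_size_b > 0) {
        size_t offset       = global_size_a;  /* start where the first launch ended */
        size_t local_size_b = global_size_b;  /* one undersized work-group */
        err = clEnqueueNDRangeKernel(queue, kernel, 1, &offset,
                                     &global_size_b, &local_size_b,
                                     0, NULL, NULL);
    }
    return err;
}
```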

Maybe option one does option three internally, but option three gives the developer the choice of local_size_a and local_size_b. Which option do you think is the best? Are there any problems or issues I didn’t mention with any of these options?

Launching an extra job is going to be >>> slower than a simple explicit bounds check inside the kernel. Small jobs are (relatively speaking) even worse: most of the SMs on the card will be sitting idle, and even the one SM doing some work probably won't have much to do and might not be able to hide any latencies.

"Branches are expensive" is a relative term; if it's just a single either/or exit taken uniformly by all threads, the cost is negligible. At worst the GPU just masks the results from that thread until exit (which costs essentially nothing extra), and at best it can retire (some of) the threads for use by other work-groups.
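For reference, the guard in question is just a single compare-and-exit at the top of the kernel, something like the sketch below (kernel name, arguments, and body are hypothetical):

```c
/* Kernel-side view of option two's bounds check: one compare-and-exit.
 * Only the padded work-items at the tail take the early return. */
__kernel void scale(__global float *buf, const uint n)
{
    size_t gid = get_global_id(0);
    if (gid >= n)
        return;          /* padded work-item: masked off, does no real work */
    buf[gid] *= 2.0f;    /* placeholder for the real work */
}
```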