OpenCL NDRange Global Size/Local Size


As far as I understand, the NDRange global size should be a multiple of the local size.
But if it's not, how does OpenCL handle it? (Or rather, does OpenCL handle it at all?)

I mean, how many "groups" of size local_size will be launched?

For example, which of these would be right for a global size of 1000 and a local_size of 512?
a) We'll have 2 groups of size 512.
b) We'll have 1 group of size 512.
c) We'll have 1 group of size 512 and 1 group of size 488.

And is this strictly defined by the standard, or is it implementation dependent?


Well, on the AMD implementation it looks like the kernel won't even launch, so I think that answers my question.

According to the OpenCL specification (1.x), when a local work size is specified and the global work size is not evenly divisible by it, clEnqueueNDRangeKernel fails and returns CL_INVALID_WORK_GROUP_SIZE.

It is defined by the standard. You must make it a multiple.
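
The divisibility rule can be checked on the host before enqueueing. Below is a minimal sketch in C; the helper name ndrange_is_uniform is hypothetical and not part of the OpenCL API, it only mirrors the rule described above.

```c
#include <stddef.h>

/* Hypothetical helper (not an OpenCL API call): returns 1 when every
   global size is a whole multiple of the matching local size, else 0. */
static int ndrange_is_uniform(size_t dims, const size_t *global, const size_t *local)
{
    for (size_t d = 0; d < dims; ++d) {
        /* A zero local size is never valid, and a non-zero remainder
           means the global size is not a multiple of the local size. */
        if (local[d] == 0 || global[d] % local[d] != 0)
            return 0;
    }
    return 1;
}
```

For the question's numbers (global size 1000, local size 512) this returns 0, so none of a), b) or c) applies: the enqueue is rejected rather than split into groups.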

The standard way of dealing with a desired global work size that is not a multiple is to pass the rounded-up value to clEnqueueNDRangeKernel, pass the desired global size to the kernel as parameters, and then check the global ID inside the kernel to see whether it falls inside the desired work size. For example, to process a 1920x1080 image with a 32x32 local work size, the global work size must be 1920x1088 (1080 rounded up to the next multiple of 32). The kernel might look like:

__kernel void Example_Kernel(
    __read_only   image2d_t imgSrc,
    __write_only  image2d_t imgDst,
    int       width,
    int       height)
{
    int x = get_global_id(0);
    int y = get_global_id(1);
    // Skip the padding work-items that fall outside the real image.
    if ((x < width) && (y < height))
    {
        ... // do work here
    }
}
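
On the host side, the rounded-up global size can be computed with a small helper. This is a minimal sketch in C; round_up and num_groups are illustrative names, not OpenCL API functions, and the local size is assumed to be non-zero.

```c
#include <stddef.h>

/* Round `value` up to the next multiple of `multiple` (assumes multiple > 0). */
static size_t round_up(size_t value, size_t multiple)
{
    size_t remainder = value % multiple;
    return remainder == 0 ? value : value + (multiple - remainder);
}

/* Number of work-groups along one dimension once the global size is padded. */
static size_t num_groups(size_t desired, size_t local)
{
    return round_up(desired, local) / local;
}
```

With a 32x32 local size, round_up(1920, 32) stays 1920 and round_up(1080, 32) gives 1088. For the original question, round_up(1000, 512) gives 1024, so 2 groups of 512 are launched and the last 24 work-items are filtered out by the bounds check in the kernel.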

To get started, you can leave the local work size unspecified (pass NULL) and let the runtime come up with one, but if you have odd or prime global sizes, it might use 1x1, which is not optimal.

Yeah, thanks, I did that :).

I'm working on an OpenCL "middleware", so I needed to know every possible combination, but it looks like rounding up works fine. That's good for me, I think; after all, it's the same approach as CUDA :).