clEnqueueNDRangeKernel max global_work_size

matrem · December 3, 2009, 1:33am

Is there a maximum for the global size other than 2^address_bits?

I tried to pass a little more than 100,000,000 and clEnqueueNDRangeKernel returns CL_INVALID_VALUE.

I'm working on Vista 64-bit with the latest nVidia drivers.

dbs2 · December 3, 2009, 2:09am

No. The spec puts no limits on the size except for the size of the type holding the data, which is a size_t. The OpenCL implementation is responsible for breaking up your requested global size into something that will work on the hardware. So if your size is not being accepted it sounds like a bug with the Nvidia implementation.

I would suggest you make sure that your global size is a multiple of 32, though. Otherwise the driver may be forced to use a local size that is non-optimal for the card and you will get substantially worse performance.
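As a minimal sketch of the padding suggested above, a global size can be rounded up to the next multiple of 32 on the host. The helper name `round_up` is hypothetical, not part of OpenCL, and the kernel would then need a bounds check (for example, returning early when its global id is past the real element count).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: round n up to the next multiple of m.
 * m must be non-zero. */
static size_t round_up(size_t n, size_t m)
{
    assert(m > 0);
    size_t rem = n % m;
    return rem == 0 ? n : n + (m - rem);
}
```

For instance, `round_up(100000001, 32)` gives a padded size that is a multiple of 32; the extra work-items simply do nothing if the kernel checks its index.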

matrem · December 3, 2009, 2:23am

Thanks.

To be precise, I use (134217728,1,1) for the global size and (512,1,1) for the local size.

Without any nVidia account, is it possible to report OpenCL driver bugs somewhere? Perhaps nVidia guys read this forum?

dbs2 · December 3, 2009, 3:50am

Did you verify via clGetKernelWorkGroupInfo that that local size is valid for your kernel? Since that’s the maximum the hardware supports, it will only be okay if your kernel is using very few registers. (Or you can just pass in NULL for the local size.)

matrem · December 3, 2009, 6:50am

Very interesting. I had actually used 512, the maximum my device supports. But now I’ve tried letting the implementation choose the local size, and the problem is the same.

If we use too many registers, shouldn’t the return value be CL_OUT_OF_RESOURCES?

dbs2 · December 3, 2009, 10:20am

You have to put in a number for the total local workgroup size that is less than or equal to what is returned by the clGetKernelWorkGroupInfo call. I believe you should get CL_INVALID_WORK_GROUP_SIZE if the size is not valid.
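One way to apply this rule, sketched under assumptions: take the per-kernel maximum (the value a clGetKernelWorkGroupInfo query for CL_KERNEL_WORK_GROUP_SIZE would report, passed in here as `max_wg`) and pick the largest local size that does not exceed it and still divides the global size evenly. The function name `pick_local_size` is hypothetical, and the example covers one dimension only.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: largest local size <= max_wg that evenly
 * divides global. max_wg stands in for the value queried from the
 * runtime; both arguments must be non-zero. */
static size_t pick_local_size(size_t global, size_t max_wg)
{
    assert(global > 0 && max_wg > 0);
    size_t local = max_wg < global ? max_wg : global;
    /* Walk down until the global size is a whole number of groups. */
    while (local > 1 && global % local != 0)
        --local;
    return local;
}
```

With a global size of 134217728 and a queried maximum of 512, this returns 512; if the kernel's limit were lower (say 384), it would fall back to 256.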

matrem · December 3, 2009, 10:38am

Thank you very much, I forgot this “clGetKernelWorkGroupInfo” call… It’s essential.

matrem · December 15, 2009, 1:56am

In fact I get CL_INVALID_VALUE as soon as the ratio of global size to local size is over 65535… I guess the driver stores this ratio in a variable that is too small :)
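The arithmetic behind this observation can be checked on the host. The sketch below assumes the limit is 65535 work-groups per dimension, as reported in this thread; that constant and the function names are illustrative and cannot be queried through the OpenCL API.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption taken from this thread: at most 65535 work-groups
 * per dimension on the affected driver. */
#define ASSUMED_MAX_GROUPS 65535u

/* Number of work-groups in one dimension (global must be a
 * multiple of local, and local must be non-zero). */
static size_t group_count(size_t global, size_t local)
{
    assert(local > 0 && global % local == 0);
    return global / local;
}

/* Returns 1 if the launch would exceed the assumed limit, else 0. */
static int exceeds_group_limit(size_t global, size_t local)
{
    return group_count(global, local) > ASSUMED_MAX_GROUPS;
}
```

For the sizes quoted earlier, 134217728 / 512 is 262144 work-groups, well over 65535, whereas a global size of 33553920 (65535 × 512) sits exactly at the limit.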

dbs2 · December 15, 2009, 4:56am

That sounds like a hardware limit that the Nvidia driver is not handling correctly. I would suggest filing a bug against them if you can.

matrem · December 15, 2009, 6:47am

I can’t without an account (I just tried again to create one…).

The 195.181 nVidia beta drivers don’t fix the problem.

Would it be possible to create a section on this forum for reporting implementation bugs, one that manufacturers could read?

matrem · December 17, 2009, 7:51am

The NVIDIA OpenCL guide says:

The maximum size of each dimension of a grid of thread blocks is 65535;

So there is a limit, set by nVidia, that isn’t defined by the specification and that we can’t query through the API… This constraint somewhat limits portability.

dbs2 · January 4, 2010, 12:19am

It is strange that you can use arbitrary sizes on Nvidia cards on MacOS X but not with Nvidia’s own drivers. All they would have to do is break up the runtime kernel execution into 65k-sized chunks, since each execution is necessarily independent.
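Until a driver does this itself, the same chunking can be sketched on the host. The code below only computes the chunk boundaries; it assumes a one-dimensional range, a global size that is a multiple of the local size, and the 65535-group limit mentioned above. The names `chunk_t`, `chunk_count` and `chunk_at` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* One slice of the full range: where it starts and how many
 * work-items it covers. */
typedef struct {
    size_t offset;
    size_t size;
} chunk_t;

/* Number of launches needed so that no launch has more than
 * max_groups work-groups of `local` work-items each. */
static size_t chunk_count(size_t global, size_t local, size_t max_groups)
{
    assert(local > 0 && max_groups > 0);
    size_t per_chunk = local * max_groups;
    return (global + per_chunk - 1) / per_chunk;
}

/* Boundaries of the i-th launch (0-based). */
static chunk_t chunk_at(size_t global, size_t local, size_t max_groups, size_t i)
{
    size_t per_chunk = local * max_groups;
    chunk_t c;
    assert(i < chunk_count(global, local, max_groups));
    c.offset = i * per_chunk;
    /* The last chunk may be shorter than the others. */
    c.size = (global - c.offset < per_chunk) ? global - c.offset : per_chunk;
    return c;
}
```

Each chunk would then be enqueued as its own clEnqueueNDRangeKernel call with `c.size` as the global size. Because OpenCL 1.0 requires the global_work_offset argument to be NULL, `c.offset` would have to be passed as an ordinary kernel argument and added to the global id inside the kernel.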