Should the work group size divide the max work group size for best performance?

openagent · December 8, 2014, 6:03pm

On my ATI Radeon HD 6750M
I get 6 max compute units and max work group size of 256.

The docs say the global size should be divisible by the local size.

Say I have 700 as my global size.

Looking at it from a hardware perspective, I am under the assumption that you can only sync threads within a single “compute unit”. So shouldn’t the max work group size, which is 256, be divisible by the local size for best performance?

Say I pick 4 as the local size, which divides 700; 256 is also divisible by 4, and there will be 175 work groups within a single compute unit. But if I pick, say, 100 as my work group size, then again 700 is divisible by 100 but 256 is not. So does that mean I will get 2 work groups per compute unit, and 56 threads in each compute unit will go unused because 256 - (100 x 2) = 56?

Thanks!

utnapishtim · December 9, 2014, 8:32am

If you try to execute a kernel with clEnqueueNDRangeKernel() with global_work_size = 256 and local_work_size = 100, you will get a CL_INVALID_WORK_GROUP_SIZE error.

The number of work-items specified by global_work_size must be evenly divisible by the size of work-group given by local_work_size.
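As a rough illustration of that rule, here is a minimal Python sketch of the size checks for a 1-D launch. The function name `is_valid_launch_1d` and its default limit of 256 are assumptions made for this example; it only mimics the arithmetic and does not call any OpenCL API.

```python
def is_valid_launch_1d(global_size, local_size, max_work_group_size=256):
    """Mimic the pre-OpenCL-2.0 size checks for a 1-D launch (illustrative only)."""
    # The local size must be positive and must not exceed the device limit.
    if local_size <= 0 or local_size > max_work_group_size:
        return False
    # The global size must be an exact multiple of the local size.
    return global_size % local_size == 0


print(is_valid_launch_1d(256, 100))  # False: 256 is not a multiple of 100
print(is_valid_launch_1d(700, 100))  # True: 7 work-groups of 100
print(is_valid_launch_1d(700, 256))  # False: 700 is not a multiple of 256
```

Note that the max work group size only acts as an upper bound here; the divisibility test is between the global size and the local size.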

openagent · December 9, 2014, 10:10am

256 is the work group size and 700 is the global size so it is evenly divisible. But my question is regarding max work group size and local size in a hardware perspective.

utnapishtim · December 17, 2014, 9:13am

If I understand correctly, you want to know what happens when global size=700 and local size=100.

Obviously you request 7 work-groups of 100 work-items each.

On AMD devices, work-items in a work-group are processed by wavefronts of 64 threads. So each work-group needs 2 wavefronts. The first wavefront will run all 64 threads, the second wavefront will run only 100-64=36 threads.

Wavefronts of the same work-group run on the same compute unit, so with 6 compute units, 5 of them will each run one work-group (2 wavefronts) and the last one will run two work-groups (4 wavefronts).

If you use a local size = 4, the hardware needs 175 wavefronts and each of them will run only 4 threads with 60 threads staying idle…

It is generally good practice to set the local size to a multiple of the wavefront size (64 on AMD, 32 on NVIDIA).
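The wavefront arithmetic above can be checked with a short Python sketch. The helper `wavefront_usage` is a hypothetical name for this example, and it assumes that each work-group is split into wavefronts of a fixed width (64 by default), as described in this post; real scheduling may differ between devices.

```python
import math


def wavefront_usage(global_size, local_size, wavefront_size=64):
    """Estimate wavefront counts and idle lanes for a 1-D launch (a rough model)."""
    work_groups = global_size // local_size
    # Each work-group occupies a whole number of wavefronts.
    waves_per_group = math.ceil(local_size / wavefront_size)
    lanes_per_group = waves_per_group * wavefront_size
    return {
        "work_groups": work_groups,
        "wavefronts": work_groups * waves_per_group,
        "idle_lanes_per_group": lanes_per_group - local_size,
        "lane_utilization": local_size / lanes_per_group,
    }


# local size 100: 7 work-groups, 14 wavefronts, 28 idle lanes per group
print(wavefront_usage(700, 100))
# local size 4: 175 work-groups, 175 wavefronts, 60 idle lanes per group
print(wavefront_usage(700, 4))
# local size 64 (global size padded to 704): 11 wavefronts, no idle lanes
print(wavefront_usage(704, 64))
```

With a local size of 100, the 28 idle lanes per group are the 64-36 unused threads of the second wavefront; with a local size that is a multiple of 64, no lanes are idle in this model.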

openagent · December 18, 2014, 11:39am

That was what I was after. Makes perfect sense. Thank you!

Dithermaster · December 22, 2014, 3:05pm

> 256 is the work group size and 700 is the global size so it is evenly divisible.
Um, no it’s not. 256 goes into 768 but not 700.
The common solution is to “round up” the global size to be an integer multiple of the work group size, pass the desired size as a parameter, and reject the out-of-range work items in your kernel. This assumes the multiple is large; I don’t think it would work well for small multiples (like 3).
You should benchmark with different work group sizes to find the one that runs the fastest.
Note: In OpenCL 2.0 the requirement for “multiple of” has been removed.
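The round-up approach described in this post can be sketched in Python as follows. The names `round_up` and `run_padded` are made up for this example, and the loop only simulates what each work-item would do; in a real kernel the loop index would come from `get_global_id(0)` and the guard would be an early return.

```python
def round_up(value, multiple):
    """Round value up to the next integer multiple of `multiple`."""
    return ((value + multiple - 1) // multiple) * multiple


def run_padded(data, local_size):
    """Simulate a padded 1-D launch that doubles each element of data."""
    n = len(data)                       # the desired size, passed as a parameter
    global_size = round_up(n, local_size)
    out = [0] * n
    for gid in range(global_size):      # stands in for get_global_id(0)
        if gid >= n:                    # reject out-of-range work-items
            continue
        out[gid] = data[gid] * 2
    return global_size, out


print(round_up(700, 256))               # 768
global_size, out = run_padded(list(range(700)), 256)
print(global_size, len(out), out[699])  # 768 700 1398
```

In this example 68 of the 768 simulated work-items do nothing, which is the cost of padding; benchmarking different work group sizes, as suggested above, shows whether that cost matters.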