I have an AMD Radeon HD 7970 card. The specs say that it has 32 compute units of size 32 each. When I query the CL_DEVICE_MAX_WORK_GROUP_SIZE value it indeed returns 32.

Recently, I was troubleshooting performance on a kernel I wrote using CodeXL from AMD. On one of the screens it said that the device is not fully utilized and also that the device limit for the maximum workgroup size is 256. I then queried the CL_KERNEL_WORK_GROUP_SIZE value and indeed it said 256.

I then increased the work group size from 32 to 256 when calling clEnqueueNDRangeKernel and it now runs about 4 times faster than before.

I don’t understand this. If the device is physically limited to 32 then how can it execute 256 work items?

From what I have seen, CL_DEVICE_MAX_WORK_GROUP_SIZE is 256 on a HD7970.

HD 7970 (codename “Tahiti”) has 32 CUs, each has 4 SIMD units. The SIMD unit is 16 ALUs wide, however the “logical” size of this SIMD is 64 workitems - the same decoded instruction executes for 4 cycles. (the explanation why the architecture is built this way is long…).

The meaning is that you should work with workgroup size which is a multiplication of 64 - 64, 128, 256, etc. Basically, it could do more (512, 1024) but was limited due to other reasons. Any non-64-multiplication will lead to under-utilization of the SIMD.
It looks weird to me that the value you got for CL_DEVICE_MAX_WORK_GROUP_SIZE was 32 - maybe you read a different field by mistake ?

If you want to read on this architecture, I recommend looking for articles/presentations which talk on GCN. (Tahiti was the first GCN Based GPU)