maxComputeWorkGroupInvocations is very low, even on an RTX 4060 Ti

Hi,

I’m trying to use a compute shader with X, Y dimensions, but I’m limited by the maxComputeWorkGroupInvocations, which is 1024, even on a GeForce RTX 4060 Ti. This means I can only work on a maximum of 32x32 (1024) compute shader invocations.

For example in a GLSL compute shader:
layout(local_size_x = 32, local_size_y = 32) in;

I encountered this 1024 limit on a lower-end GPU as well, so I specifically bought an RTX 4060 Ti for this purpose, but it’s still limited to 1024. I’m wondering if this is normal or if it’s an issue with the Vulkan API.

Regards

See here for how common this limit value is.
The maxComputeWorkGroupInvocations value puts a limit on the local group size (product of the local_size_{x,y,z} values in the shader). You dispatch multiples of these with the command vkCmdDispatch, so you can dispatch much larger numbers of compute shader invocations with one command (up to the limits given by maxComputeWorkGroupCount[0,1,2]).
Of course, invocations within the same local group have access to certain resources (e.g. shared memory) and communication/synchronization functionality that is not necessarily available between invocations in distinct local groups. I can't tell whether that is an issue for what you are trying to do.
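To make the dispatch arithmetic concrete, here is a minimal host-side sketch in C. It only computes how many local groups are needed per axis; the helper name `group_count` and the 4096x4096 problem size in the comment are assumptions for illustration, not something from the original post.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: number of local groups needed along one axis so that
   group_count * local_size >= total_items (ceiling division, written so the
   intermediate value cannot overflow). */
uint32_t group_count(uint32_t total_items, uint32_t local_size)
{
    assert(local_size > 0);
    return total_items / local_size + (total_items % local_size != 0 ? 1u : 0u);
}

/* Example use with a 32x32 local size (1024 invocations per group),
   assuming a 4096x4096 problem:
       vkCmdDispatch(commandBuffer,
                     group_count(4096, 32),   // 128 groups in X
                     group_count(4096, 32),   // 128 groups in Y
                     1);
   Each returned count must still fit within maxComputeWorkGroupCount. */
```

When the problem size is not a multiple of the local size, the last group along that axis is only partly used, so the shader would typically need a bounds check on its global index.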

The concept of “local work groups” (under various names) exists in other compute APIs as well and I would expect this to be something driven by hardware limits - in other words I would expect other APIs to have the same limits on the same hardware.

Thanks for the information and the graph. If I understand correctly, I should use group dispatch rather than a single local group, as maxComputeWorkGroupCount is much higher (X: 2147483647, Y: 65535, Z: 65535). It’s strange that Y and Z are lower than X, though. I was using a single local group to map a neural network, so I could use X and Y indexes more easily. It seems I need to rethink my solution then.
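For what it's worth, X and Y indexing can stay just as simple across many local groups. In GLSL, `gl_GlobalInvocationID` is defined as `gl_WorkGroupID * gl_WorkGroupSize + gl_LocalInvocationID`, component by component. The sketch below mirrors that rule on the CPU in C for a single axis; the function names and the row-major buffer layout are assumptions for illustration only.

```c
#include <assert.h>
#include <stdint.h>

/* CPU-side mirror of one component of the GLSL rule
   gl_GlobalInvocationID = gl_WorkGroupID * gl_WorkGroupSize + gl_LocalInvocationID.
   The name global_index is hypothetical. */
uint32_t global_index(uint32_t group_id, uint32_t local_size, uint32_t local_id)
{
    assert(local_id < local_size);
    return group_id * local_size + local_id;
}

/* Flatten a global (x, y) pair into a row-major buffer offset.
   Row-major storage with the given width is an assumed layout. */
uint32_t flat_index(uint32_t x, uint32_t y, uint32_t width)
{
    return y * width + x;
}
```

In the shader itself, reading `gl_GlobalInvocationID.xy` gives the same X and Y values directly, so the per-group split does not have to leak into the indexing logic.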