Sub-compute units

Hi, I was wondering if there is any way to get the maximum number of processors/stream processors/threads inside a single compute unit. I know that, for example, on the Radeon 4870 one OpenCL compute unit does not correspond to one thread/stream processor, i.e. one compute unit can kick off more than one thread at a time.

It would be useful to have some sort of mechanism to see how many sub-processors there are in a compute unit. For example, if I have an array of length 100 and I have 10 compute units, how do I know how to distribute the array across the 10 compute units? If, for example, there were 100 threads in one compute unit, it would be best to put the whole array into the first compute unit (local work size = 100 & global work size = 100) and use the other 9 for something else.

If I were to divide 100 equally across 10 compute units, each compute unit would waste its remaining 90 threads for that execution, right?

I hope you can see my problem. Any suggestions to overcome this? Are there any plans to have a device query for max sub-compute units?

I know that some Nvidia cards seem to have one stream processor corresponding to one OpenCL compute unit.


I don’t think you need to worry about this. Have a look at ‘warps’ on Nvidia hardware.

As long as you’ve got way more work-items than total compute cores (which you always should; say, on the order of 2000 or more work-items), they should be nicely distributed. Determining the optimal work-group size is a secondary optimization, and adjusting for architecture specifics like this is a tertiary one. (I’m not saying it is unneeded, just that it’s not something to worry about until later on.) In general, you want your work-group sizes to be multiples of 16 for Nvidia and multiples of 64 for AMD.
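
To make the "multiple of 64" rule of thumb concrete, here is a rough sketch of the host-side arithmetic. It is plain Python, not OpenCL API calls, and the helper names round_up and launch_shape are made up for illustration. It assumes OpenCL 1.x behaviour, where the global size has to be evenly divisible by an explicitly given local size, so the global size is padded up and the kernel is expected to skip the extra work-items with a bounds check.

```python
def round_up(n, multiple):
    """Smallest multiple of `multiple` that is >= n."""
    return ((n + multiple - 1) // multiple) * multiple


def launch_shape(n_items, local_size):
    """Return (global_size, num_work_groups, padded_items) for a 1-D launch.

    Assumption: padded work-items do nothing useful; the kernel would guard
    them with something like `if (get_global_id(0) >= n_items) return;`.
    """
    global_size = round_up(n_items, local_size)
    return global_size, global_size // local_size, global_size - n_items


# The 100-element array from the question, with a work-group size of 64:
print(launch_shape(100, 64))   # (128, 2, 28)
# A problem size closer to the "2000 or more work-items" suggestion:
print(launch_shape(2000, 64))  # (2048, 32, 48)
```

With only 100 items, 28 of the 128 launched work-items are padding; at 2000 items the padding is 48 out of 2048, which is why the tuning matters less once the problem is large.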

I thought it would be 32 on Nvidia GPUs, because that’s the warp size and thus work-items in a work-group are executed in a batch of 32. Don’t you waste resources if your work-group size is less than 32?
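
A quick way to put numbers on that question is sketched below. It is plain Python and the function name lane_utilization is hypothetical. It assumes that each work-group occupies a whole number of hardware batches (warps of 32 on Nvidia, taking the 64 quoted above as the batch width for AMD) and that a batch is never shared between work-groups.

```python
def lane_utilization(work_group_size, hw_width):
    """Fraction of hardware lanes doing useful work for one work-group.

    Assumption: the work-group is split into ceil(size / hw_width) batches,
    and the lanes left over in the last batch sit idle.
    """
    batches = -(-work_group_size // hw_width)  # ceiling division
    return work_group_size / (batches * hw_width)


print(lane_utilization(16, 32))   # 0.5  -> half of a 32-wide warp is idle
print(lane_utilization(32, 32))   # 1.0
print(lane_utilization(48, 32))   # 0.75 -> second warp is only half full
print(lane_utilization(10, 64))   # 0.15625 -> 10 items per group on a 64-wide batch
```

Under those assumptions a work-group of 16 on a 32-wide warp keeps only half the lanes busy, and the 10-items-per-compute-unit split from the original question would use under a sixth of a 64-wide batch.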