Workload on an ATI GPU: how many work-items should I launch?


The Programming Guide - ATI Stream Computing OpenCL™ uses the ATI Radeon HD 5870 as an example: it has 20 Compute Units, each with 16 stream cores, and each stream core has 5 Processing Elements, yielding 1600 Processing Elements in total.

But other parts of the guide introduce the notion of wavefronts, which suggests that I have to supply more work-items than there are stream cores, in the first place because each stream core runs a VLIW instruction.

How many work-items do I have to launch?

Should I count only the stream cores (320), or all of the Processing Elements (1600)?



For starters, that guide is wildly out of date; use the AMD APP programming guide - OpenCL instead. It hasn’t been called ATI Stream Computing for nearly 2 years.

The only important numbers regarding work items are the 20 compute units and the 64-wide wave-fronts (i.e. a work-group should contain a multiple of 64 work items to avoid wasting compute resources).

For example, if you had a work size of global=20*64, local=64, you would be running something on every compute unit. You then want some multiple of that (either a larger local size such as 128 or 256, or many more 64-wide groups) so that multiple wave-fronts execute concurrently, which hides memory latency (think of ‘hyper-threading’).
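To make that sizing concrete, here is a minimal sketch in plain C. The helper names `global_size_for` and `group_count` are made up for illustration and are not part of OpenCL; the 20 compute units and 64-wide wave-fronts are the HD 5870 figures from this thread. In real host code the resulting values would be what you pass as `global_work_size` and `local_work_size` to `clEnqueueNDRangeKernel`.

```c
#include <stddef.h>

/* Hypothetical helper (not part of OpenCL): the global work size that gives
   each compute unit `waves_per_cu` wave-fronts of `wavefront` work items. */
static size_t global_size_for(size_t compute_units, size_t wavefront,
                              size_t waves_per_cu)
{
    return compute_units * wavefront * waves_per_cu;
}

/* Hypothetical helper: how many work-groups a global size splits into.
   `local` is assumed to be a non-zero multiple of the wave-front width
   that divides `global` evenly. */
static size_t group_count(size_t global, size_t local)
{
    return global / local;
}
```

With these, `global_size_for(20, 64, 1)` is 1280 (20 groups of 64, one per compute unit), and `global_size_for(20, 64, 4)` is 5120, which could be launched as 80 groups of 64 or as 20 groups of 256.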

The 5 PEs are per work-item, i.e. per stream core. The 16 stream cores take 4 cycles to retire 1 instruction for a whole wave-front, working on 16 work items at a time while the other 3 groups of 16 are in flight, which gives you the 64-wide wave-front. You have no direct control over the PEs; they are just used by the compiler to do more work in fewer instruction slots. You can help by unrolling loops or doing vector operations, but it’s always a trade-off.
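The arithmetic behind those figures can be spelled out in a few lines of plain C. This is only a sketch: the constant and function names are invented for illustration, and the numbers are the HD 5870 figures quoted in this thread rather than anything queried from a device.

```c
/* Figures quoted in this thread for the Radeon HD 5870 (illustrative names). */
enum {
    HD5870_COMPUTE_UNITS       = 20,
    HD5870_STREAM_CORES_PER_CU = 16,
    HD5870_PES_PER_STREAM_CORE = 5,
    HD5870_CYCLES_PER_WAVE_OP  = 4  /* cycles to issue one wave-front instruction */
};

/* 20 compute units x 16 stream cores. */
static unsigned total_stream_cores(void)
{
    return HD5870_COMPUTE_UNITS * HD5870_STREAM_CORES_PER_CU;
}

/* Each stream core has 5 PEs, filled by the compiler with VLIW slots. */
static unsigned total_processing_elements(void)
{
    return total_stream_cores() * HD5870_PES_PER_STREAM_CORE;
}

/* 16 stream cores x 4 cycles per instruction = work items per wave-front. */
static unsigned wavefront_width(void)
{
    return HD5870_STREAM_CORES_PER_CU * HD5870_CYCLES_PER_WAVE_OP;
}
```

So the 320 and 1600 from the question are both real hardware counts, but the number that matters when choosing a work-group size is the wave-front width of 64.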

Usually the number of work items you use is data-dependent and independent of the hardware layout, e.g. use 1 work-item per pixel for graphics tasks.
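When the work size comes from the data, it will often not be a multiple of the local size, and OpenCL 1.x requires the global size to be evenly divisible by the local size. A common approach, sketched here in plain C with a hypothetical helper name, is to round the global size up and have the kernel skip the padding work items (for instance by comparing `get_global_id(0)` against the real element count).

```c
#include <stddef.h>

/* Hypothetical helper: round `n` up to the next multiple of `multiple`.
   `multiple` is assumed to be non-zero (e.g. the local size, 64). */
static size_t round_up_to_multiple(size_t n, size_t multiple)
{
    return ((n + multiple - 1) / multiple) * multiple;
}
```

For a 1000x700 image with one work-item per pixel and a local size of 64, `round_up_to_multiple(1000 * 700, 64)` gives 700032, so the last 32 work items fall outside the image and should do nothing.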