Optimisation tips for a fetch-intensive kernel on ATI

Dear OpenCL users, I recently ported a kernel from CUDA to OpenCL.
This kernel processes a 2D image (~512²) and, for each pixel, fetches ~8000 coordinates from global memory.
Then, for each pixel, it fetches ~8000 times from the 2D image using these coordinates.

The profiler says the bottleneck is memory fetches, not ALUs.
On an Nvidia GTX 570, the kernel has identical performance in CUDA and OpenCL.
When running on a Radeon 7850 (whose performance I think should be close to the GTX 570), the code is 5 times slower.

I changed my code to use shared (local) memory and reduce the number of global memory fetches.
Now the profiler says the bottleneck is ALU ops.
But the 7850 is still 2.5x slower than the GTX 570.

Any tips regarding:

  • the reason why ATI is slower for this kind of kernel
  • optimization of this kernel for ATI (my coordinates array is constant across all kernel launches)

PS: the 2D image is in fact a 32-bit greyscale picture.
I’m currently using a CL_R / CL_SIGNED_INT32 image format.
Could this explain the bad performance of my read_imagei() calls?

PPS: I changed this to CL_ARGB and updated the kernel to handle 4 consecutive pixels. Same performance :(

Thanks a lot for your help

I had one bit of code - a face detector - which took a big performance hit on AMD hardware: the slowdown going from a GTX 480 to an HD 6950 was of a similar order to yours.

And I think with some work I ended up about 2x faster than that - but still 3x slower than the 480.

I can’t remember the details fully, but I think it was a fight between memory bandwidth and empty slots in the VLIW instructions (and too many non-unrollable loops/branches). IIRC I couldn’t do more work per work-item because the memory bandwidth was already saturated, but doing just one lot of work left a lot of empty ALU slots.

I’ll be getting a 7970 soon so I’ll see how that compares. I expect it should ‘solve’ the ‘problem’, but it might not. This was only hobby code - fortunately none of my work stuff suffered and some ran faster.

Apart from local memory, using the constant cache is about the only other big thing to try. If it’s ALU-bound, remove loop unrolling or work with narrower data sizes (although you probably already are).

PS actually, I did have some monstrous performance (and correctness) problems with quite a lot of code, but those were all from using #pragma unroll, which triggered some bugs in the AMD compiler at the time (which may be fixed now?): removing them ALL was the easiest solution.

Thanks for your answer.
So what would be your explanation in terms of hardware differences? Are the NVidia cards taking fewer cycles per global read? Or is it the cache system that is better?

Many things can cause such a performance drop.

AMD (or ATI) cards suffer when switching between FETCH and ALU clauses at the binary level. Try to group your FETCH, ALU, and WRITE operations together respectively. This allows the compiler to reach a higher VLIW packing ratio. This efficiency can be checked with the offline Kernel Analyzer tool as well as the runtime AMD APP Profiler, which can shed a lot more light on why your kernel performs poorly.

The cache hit ratio could also be looked at as a source of performance degradation. Reorganizing your work-items’ READ/WRITE operations might help with that.

If you have to switch between clauses too often and memory operations cannot be hidden by multiple wavefronts in a work-group, you will get a high ALU Stall ratio.

Using only a single channel is not the best case scenario, since texture fetch units are tuned for 128-bit loads, but there are some optimizations for scalar 32-bit operations starting from HD6xxx.
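To make the 128-bit point concrete, here is a rough CPU-side sketch in plain C (not kernel code) of packing four consecutive 32-bit grey pixels into one four-channel texel. The layout and the names Texel4, pack_row, read_packed and fetches_for_run4 are made up for illustration, and the sketch only handles non-negative coordinates.

```c
#include <assert.h>
#include <stdint.h>

/* One 128-bit texel holding four consecutive 32-bit grey pixels.
   Hypothetical layout: channel c of texel t holds pixel 4*t + c. */
typedef struct { int32_t ch[4]; } Texel4;

/* Pack a row of `width` grey pixels into (width + 3) / 4 texels,
   zero-padding the unused channels of the last texel. */
void pack_row(const int32_t *grey, int width, Texel4 *out)
{
    assert(width >= 0);
    int texels = (width + 3) / 4;
    for (int t = 0; t < texels; ++t) {
        for (int c = 0; c < 4; ++c) {
            int x = 4 * t + c;
            out[t].ch[c] = (x < width) ? grey[x] : 0;
        }
    }
}

/* Read pixel x back: texel index is x / 4, channel is x % 4. */
int32_t read_packed(const Texel4 *row, int x)
{
    assert(x >= 0);
    return row[x / 4].ch[x % 4];
}

/* Texel fetches needed for 4 consecutive pixels starting at x:
   1 if x is aligned to 4, otherwise 2 because the run straddles
   two texels. */
int fetches_for_run4(int x)
{
    assert(x >= 0);
    return (x % 4 == 0) ? 1 : 2;
}
```

Note that with arbitrary coordinate offsets, three out of four start positions are unaligned, so under this layout the packed format does not necessarily reduce the number of fetches per work-item.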

Just a few ideas. Download AMD APP Profiler and profile your program to find the real bottleneck. Let us know if any of these helped.

Thanks for your answer.
I will definitely run the profiler asap.
I’m still compiling with the CUDA SDK; do I need to compile with the AMD SDK if I want to use the profiler?

Regarding fetch/ALU switches, is this considered a bad grouping?

for( int l=0; l<loclsize; ++l )
    sum += read_imagei(inData, sampler, (int2)(x+shared[l].x, y+shared[l].y));

Regarding single channel, I tried rgba (see my PPS in the original post) without luck.

If you are running on an AMD GPU you surely have an AMD OpenCL runtime installed, so that will do for the profiler. The tool will install a tab inside Visual Studio somewhere next to your Solution Explorer. You will have to launch your application from there and let it run until a few of your kernels have been launched.

Try to dump the kernel binary (or just compile your kernel with the AMD APP Kernel Analyzer) and see how long the FETCH and ALU clauses are. I fear this will simply alternate between the two, which is quite painful. Using somewhat more registers could help, but indeed, a reduction cannot be made many times faster.

I do not know how often you have to do this, or what percentage of the run time this summation takes, but do take a look at the ISA binary.

Here is the result of the profiler.
I can’t get the profiler to output occupancy. This option is checked, but the column is not present in the result table!

I packed my fetches in groups of 8 and saw a 50% increase in performance.
However, if I pack 12 or 16 of them, it’s slower.

                // 8 coordinates (16 ints of interleaved x,y offsets) per iteration
                for( int l=0; l<loclsize/2; l+=8 )
                {
                    // address computation (ALU) for the whole batch first
                    int t0x = x+shared[2*l];
                    int t0y = y+shared[2*l+1];
                    int t1x = x+shared[2*l+2];
                    int t1y = y+shared[2*l+3];
                    int t2x = x+shared[2*l+4];
                    int t2y = y+shared[2*l+5];
                    int t3x = x+shared[2*l+6];
                    int t3y = y+shared[2*l+7];
                    int t4x = x+shared[2*l+8];
                    int t4y = y+shared[2*l+9];
                    int t5x = x+shared[2*l+10];
                    int t5y = y+shared[2*l+11];
                    int t6x = x+shared[2*l+12];
                    int t6y = y+shared[2*l+13];
                    int t7x = x+shared[2*l+14];
                    int t7y = y+shared[2*l+15];
                    // then the 8 fetches back to back
                    sum += read_imagei(inData, sampler, (int2)(t0x, t0y))
                         + read_imagei(inData, sampler, (int2)(t1x, t1y))
                         + read_imagei(inData, sampler, (int2)(t2x, t2y))
                         + read_imagei(inData, sampler, (int2)(t3x, t3y))
                         + read_imagei(inData, sampler, (int2)(t4x, t4y))
                         + read_imagei(inData, sampler, (int2)(t5x, t5y))
                         + read_imagei(inData, sampler, (int2)(t6x, t6y))
                         + read_imagei(inData, sampler, (int2)(t7x, t7y));
                }
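A batched loop like this only covers every coordinate if the count is a multiple of 8. Below is a minimal CPU reference in plain C (not kernel code) that could be used to check a batched kernel against the simple loop, including a remainder loop. The names fetch_clamped, sum_simple, sum_batched and BATCH are made up for this sketch, and the clamp-to-edge behaviour is an assumption about the sampler, not something stated in the thread.

```c
#include <assert.h>
#include <stddef.h>

#define BATCH 8  /* coordinates handled per batched iteration */

/* Clamp-to-edge read from a row-major w x h image. Stand-in for
   read_imagei(); the clamping sampler mode is an assumption. */
int fetch_clamped(const int *img, int w, int h, int px, int py)
{
    if (px < 0) px = 0; else if (px >= w) px = w - 1;
    if (py < 0) py = 0; else if (py >= h) py = h - 1;
    return img[(size_t)py * w + px];
}

/* Simple loop: one address computation, one fetch, repeat.
   offs holds n interleaved (dx, dy) pairs. */
long sum_simple(const int *img, int w, int h, int x, int y,
                const int *offs, int n)
{
    long sum = 0;
    for (int l = 0; l < n; ++l)
        sum += fetch_clamped(img, w, h, x + offs[2 * l], y + offs[2 * l + 1]);
    return sum;
}

/* Batched loop: compute BATCH addresses, then do BATCH fetches,
   and finish any leftover coordinates one at a time. */
long sum_batched(const int *img, int w, int h, int x, int y,
                 const int *offs, int n)
{
    assert(n >= 0);
    long sum = 0;
    int l = 0;
    for (; l + BATCH <= n; l += BATCH) {
        int tx[BATCH], ty[BATCH];
        for (int k = 0; k < BATCH; ++k) {      /* address (ALU) phase */
            tx[k] = x + offs[2 * (l + k)];
            ty[k] = y + offs[2 * (l + k) + 1];
        }
        for (int k = 0; k < BATCH; ++k)        /* fetch phase */
            sum += fetch_clamped(img, w, h, tx[k], ty[k]);
    }
    for (; l < n; ++l)                         /* remainder, n % BATCH items */
        sum += fetch_clamped(img, w, h, x + offs[2 * l], y + offs[2 * l + 1]);
    return sum;
}
```

Both functions should return the same sum for any n, so comparing them on a small image is a cheap way to catch an off-by-one in the batch indexing before profiling on the GPU.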

What kind of workgroup size and topology (2D shape) are you using? When reading images, it’s typically much faster to have a square-ish shape (like 16x4) than a linear shape (64x1).
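One way to see why the shape matters, assuming the image is stored in small 2D tiles (a common texture layout, though the 4x4 tile size used below is made up and not a documented figure for any particular GPU): count how many tiles a work-group touches when each work-item reads its own pixel. The function name tiles_touched is hypothetical and this is plain C run on the CPU, not kernel code.

```c
#include <assert.h>

/* Count the distinct tile_w x tile_h tiles covered by a
   group_w x group_h block of work-items whose top-left pixel is
   (ox, oy), with one pixel read per work-item. */
int tiles_touched(int ox, int oy, int group_w, int group_h,
                  int tile_w, int tile_h)
{
    assert(ox >= 0 && oy >= 0);
    assert(group_w > 0 && group_h > 0 && tile_w > 0 && tile_h > 0);
    int tx0 = ox / tile_w;                  /* first tile column */
    int tx1 = (ox + group_w - 1) / tile_w;  /* last tile column  */
    int ty0 = oy / tile_h;                  /* first tile row    */
    int ty1 = (oy + group_h - 1) / tile_h;  /* last tile row     */
    return (tx1 - tx0 + 1) * (ty1 - ty0 + 1);
}
```

With 4x4 tiles and a group starting at the origin, a 64x1 group covers 16 tiles while 16x4 and 8x8 groups each cover 4, so the same 64 work-items pull in a quarter as many tiles. The real numbers depend on the hardware's actual tiling and cache line size, which this sketch does not model.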

I run a few different work sizes and take the best score.

It’s mostly the ALU packing on the VLIW units, and the way branches/fetches work (clauses). If you get bad ALU packing you can lose a lot of performance, and some code just can’t be changed to improve it.

This is why GCN departed from the VLIW ways of its predecessors - it should still be fine for graphics, but will help a lot for some non-graphics code.

My 7850 (Pitcairn) is using GCN, right?

ahh yeah, sorry missed that.