Convolution Example/Tutorial from AMD

Udeepta Bordoloi at AMD has posted the following convolution tutorial for OpenCL: … penCL.aspx

The tutorial focuses only on the CPU, but it includes a nice description of how to vectorize your kernel, and there is also a performance comparison with OpenMP. Unfortunately, the example does not use local memory, which is really important for performance on the GPU, but it's a good place to look for a non-trivial OpenCL example program.

Agreed about local memory; it is on my to-do list.

A friend of mine suggested that it would be better to vectorize by processing N pixels at a time, rather than vectorizing for each pixel. This would also allow you to use 16-element vectors and let the compiler take care of mapping them to the right size for the hardware.

The problem with that is that it all depends on how your data is stored in memory. Assuming your colour components are interleaved (as is normal), reading the red component of 16 pixels into a single vector will require gathering from non-contiguous locations, and similarly writing the result will require a scattered write.

I suspect there would be a penalty for that on various architectures.
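
To make the gather concrete, here is a minimal sketch in plain C (not OpenCL) of the access pattern. It assumes interleaved RGBA floats and a 16-wide vector; the names interleaved_offset and gather_channel are made up for illustration and do not come from the tutorial.

```c
#include <stddef.h>

/* Assumptions for illustration: 4 interleaved components (RGBA), 16 lanes. */
enum { CHANNELS = 4, LANES = 16 };

/* Offset of one component of one pixel in an interleaved buffer. */
size_t interleaved_offset(size_t pixel, size_t channel) {
    return pixel * CHANNELS + channel;
}

/* Scalar stand-in for a gather: collect one component of LANES
 * consecutive pixels. Each lane reads from an address CHANNELS
 * elements after the previous one, so the reads are not contiguous. */
void gather_channel(const float *image, size_t first_pixel,
                    size_t channel, float out[LANES]) {
    for (size_t lane = 0; lane < LANES; ++lane) {
        out[lane] = image[interleaved_offset(first_pixel + lane, channel)];
    }
}
```

Filling one 16-lane vector with red values this way touches elements 0, 4, 8, ... 60 of the buffer, i.e. a span of 64 floats to use 16 of them.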

Now vectorising to pack n m-component pixels into a single vector (e.g. 5 × 3-component or 4 × 4-component pixels in a vec16) might get you the best of both worlds.
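
A rough sketch of that packed layout, again in plain C rather than OpenCL and assuming interleaved RGBA floats: four whole pixels fill the 16 lanes, so a single filter tap becomes one contiguous 16-element read. The name accumulate_tap is hypothetical.

```c
#include <stddef.h>

/* Assumptions for illustration: 4 pixels of 4 components each per vector. */
enum { PIXELS_PER_VEC = 4, COMPONENTS = 4,
       VEC_WIDTH = PIXELS_PER_VEC * COMPONENTS };

/* Apply one convolution tap to 4 consecutive RGBA pixels at once.
 * Lane i holds component (i % COMPONENTS) of pixel (i / COMPONENTS),
 * so the 16 source elements are adjacent in memory. */
void accumulate_tap(float acc[VEC_WIDTH], const float *image,
                    size_t first_pixel, float weight) {
    const float *src = image + first_pixel * COMPONENTS;
    for (size_t lane = 0; lane < VEC_WIDTH; ++lane) {
        acc[lane] += weight * src[lane];
    }
}
```

In an OpenCL kernel the inner loop would presumably collapse to a float16 load (for example via vload16) and a multiply-add. With 3-component pixels only 15 of the 16 lanes carry data, so one lane is padding.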