ATI and NVIDIA from an OpenCL perspective

I have some experience with NVIDIA CUDA and recently switched to OpenCL. So far I have always targeted NVIDIA's architecture and optimized my kernels accordingly: coalesced memory access, no divergent branches within warps, avoiding bank conflicts in shared memory (local memory in OpenCL terminology), etc.

Now I would like to write OpenCL kernels that achieve optimal performance on both NVIDIA's and ATI's architectures. Is that even possible?
I don't know the ATI architecture or ATI Stream and have never used either. Is it similar to NVIDIA's? Do both require the same optimization techniques from the programmer? What are the main differences?
Thank you!

I think the architectures are similar in many respects. Both ATI Stream and NVIDIA CUDA are SIMD architectures, so divergent branches are expensive on both. Memory coalescing is also important on both architectures.
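To make the coalescing point a bit more concrete, here is a small host-side sketch in plain C (not OpenCL, and not a vendor model). It simply counts how many aligned memory segments a group of work-items touches when each one reads a single 4-byte float at index gid * stride. The function name `segments_touched`, the group size and the segment size are all assumptions chosen for illustration; real hardware on either vendor has its own rules.

```c
#include <stddef.h>

/* Count the distinct aligned segments touched when `group` consecutive
 * work-items each read one 4-byte float at element index gid * stride.
 * `segment_bytes` is an assumed transaction size, not a vendor figure.
 * Addresses grow with gid, so counting changes of segment index is enough. */
size_t segments_touched(size_t group, size_t stride, size_t segment_bytes)
{
    const size_t bytes_per_float = 4;   /* assumption: 32-bit floats */
    size_t count = 0;
    size_t last = (size_t)-1;           /* sentinel: no segment seen yet */

    for (size_t gid = 0; gid < group; ++gid) {
        size_t seg = (gid * stride * bytes_per_float) / segment_bytes;
        if (seg != last) {
            ++count;
            last = seg;
        }
    }
    return count;
}
```

With an assumed group of 32 work-items and 128-byte segments, a stride of 1 touches a single segment, while a stride of 32 touches 32 separate ones. That is the kind of access pattern that hurts on both architectures.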

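The major difference is probably that ATI's hardware is vector-based (each processing element is a VLIW unit that can issue several operations per instruction), whereas NVIDIA's is scalar-based. So to get the best performance on an ATI GPU, you want to vectorize your code, for example by working on OpenCL vector types such as float4 instead of single floats.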
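As a rough illustration of what vectorizing means here, this is a minimal OpenCL C sketch of the same operation written twice. The kernel names `scale_scalar` and `scale_vec4` are made up for this example, and the vector version assumes the element count is a multiple of 4 and that the buffers are suitably aligned for float4.

```c
/* Scalar version: each work-item handles one float.
 * Launch with a global size of n. */
__kernel void scale_scalar(__global const float *in,
                           __global float *out,
                           const float k)
{
    size_t i = get_global_id(0);
    out[i] = k * in[i];
}

/* Vectorized version: each work-item handles one float4 (four floats).
 * Launch with a global size of n / 4 (assumes n is a multiple of 4). */
__kernel void scale_vec4(__global const float4 *in,
                         __global float4 *out,
                         const float k)
{
    size_t i = get_global_id(0);
    out[i] = k * in[i];   /* the scalar k is applied to all four components */
}
```

The scalar form is the natural style coming from CUDA. The float4 form gives a vector-based GPU four independent operations per work-item to pack together. Whether it actually helps on a given device is something to measure rather than assume.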

Local memory works much the same way on both, although the trade-offs (e.g. when it is worth staging data in local memory) may differ.

There’s an OpenCL programming guide for ATI GPUs which explains things in more detail.

In general I'd say that it's not possible to write a single kernel that achieves optimal performance on both architectures. However, it may still be possible to get good performance on both with the same kernel.
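One way to keep a single kernel source for both vendors is to pick the vector width at build time. The sketch below is only an assumption about how one might structure it: the macro `WIDTH`, the typedef `real_t` and the kernel name `scale` are hypothetical. The host would pass something like "-D WIDTH=4" in the options string of clBuildProgram, possibly choosing the value from what clGetDeviceInfo reports for CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT.

```c
/* Hypothetical build-time switch: compile with "-D WIDTH=4" for
 * vector-based hardware, or leave it undefined (scalar) otherwise. */
#ifndef WIDTH
#define WIDTH 1
#endif

#if WIDTH == 4
typedef float4 real_t;
#elif WIDTH == 2
typedef float2 real_t;
#else
typedef float real_t;
#endif

/* Each work-item processes WIDTH floats.
 * Launch with a global size of n / WIDTH (assumes n is a multiple of WIDTH). */
__kernel void scale(__global const real_t *in,
                    __global real_t *out,
                    const float k)
{
    size_t i = get_global_id(0);
    out[i] = k * in[i];
}
```

This does not make the kernel optimal everywhere, but it lets the same source be tuned per device without maintaining two code paths.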