Comparison of OpenCL implementations


Is it possible to compare the different OpenCL implementations, e.g. those from NVidia, AMD and Apple? Are there any known differences in terms of performance?
If, for example, Apple has its own OpenCL compiler for GPUs, shouldn’t the compilers from NVidia and AMD perform better on their respective devices?

It might not be possible to answer this question accurately, but maybe someone has some experience with the different implementations.

Thanks in advance

I do not have an extensive list, but here is what I know from first-hand experience.
Apple’s implementation enumerates all devices present (CPU+GPU), so you can do true heterogeneous computation.
The Windows environment is in dire need of the ICD (Installable Client Driver) model, or an equivalent.
NVidia’s implementation has some nasty bugs in the compiler (it chokes on commented-out code, for example, and the error messages are horribly misplaced).
You need to hot-rod the NVidia SDK with the ATI headers and static libraries, or you will get runtime function signature mismatches.
I last used the ATI SDK when it was all CPU-based (in October), so I cannot comment on the newer hardware-based SDK.
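
If you want to check what a given implementation actually enumerates on your own machine, something like the following rough sketch may help. It assumes the third-party PyOpenCL binding (`pyopencl`) is installed, which is my own choice for brevity and not something anyone in this thread used. The function name `list_opencl_devices` is made up for illustration. On a machine without an OpenCL runtime it simply returns an empty list.

```python
def list_opencl_devices():
    """Return (platform name, device name, device kind) for every visible device."""
    try:
        # Assumption: the third-party PyOpenCL binding is available.
        import pyopencl as cl
    except (ImportError, OSError):
        return []  # binding or OpenCL runtime not installed

    try:
        platforms = cl.get_platforms()
    except cl.Error:
        return []  # no OpenCL platform is registered on this machine

    found = []
    for platform in platforms:
        try:
            devices = platform.get_devices()
        except cl.Error:
            continue  # skip a platform that cannot list its devices
        for device in devices:
            # device.type is a bitfield, so test the individual flags.
            if device.type & cl.device_type.GPU:
                kind = "GPU"
            elif device.type & cl.device_type.CPU:
                kind = "CPU"
            else:
                kind = "other"
            found.append((platform.name.strip(), device.name.strip(), kind))
    return found


if __name__ == "__main__":
    for platform_name, device_name, kind in list_opencl_devices():
        print(f"{platform_name}: {device_name} ({kind})")
```

Seeing both a CPU and a GPU entry under the same platform is what makes the heterogeneous setup described above possible. Seeing several platforms listed side by side is what the ICD model is meant to provide.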

So far, I have not been too impressed with the performance of any of them. This could be due more to the fact that I am still a newb than to anything else, although given how new and immature OpenCL is, the responsibility may well lie in multiple places…