I just compared the Intel implementation against AMD's, and it turned out that my code runs faster with the AMD APP SDK, even though I have Intel CPUs (PC and laptop). The Intel compiler says that the kernel couldn't be vectorized, but as far as I know the AMD compiler does not vectorize it either. Where's the problem?
Perhaps you could post your kernel so we can see how it compares across other implementations? I've seen plenty of disparity between the two SDKs, but usually in the other direction. For example, the latency of clEnqueueWriteBuffer with the AMD APP SDK on an Intel CPU is significantly higher.