OpenCL performance on NVIDIA GTX 260 and ATI Radeon HD

Hi, I wrote an OpenCL kernel that computes the dot product of two double arrays. This is the code:
__kernel void evaluate_product(__global const double *pFirstArray, const int n,
                               __global const double *pSecondArray, __global double *pOutArray)
{
    int gid = get_global_id(0);
    int size = get_global_size(0);
    if (gid >= 0 && gid < size) {
        double output = 0.0f;
        for (int k = 0; k < n; k++)
            output += pLocal[k] * pSecondArray[k];
        pOutArray[gid] = output;
    }
}

Why does this kernel take 30 ms on an NVIDIA GTX 260, while on an ATI Radeon HD 6900 it takes less than 10 ms?
Any ideas? Is there some optimization I could use in the kernel for the NVIDIA card?

Are you sure that’s your code? I don’t see how that can compile given that pLocal is never defined. I also don’t see how it can be computing a dot product, given that it outputs an array rather than a single value. I’d suggest you search on the internet for a tutorial about dot products in OpenCL, or just use the BLAS libraries that AMD provide (APPML).
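
To illustrate the point about a dot product producing a single value: the usual GPU pattern is a two-stage reduction, where each work-group sums its own slice into one partial result and the partials are then added together. The sketch below emulates that pattern on the CPU in plain C so it can be run without an OpenCL device. It is not the original poster's code and not a tuned kernel. The names reduce_group and dot_product and the GROUP_SIZE value are made up for illustration. If pLocal in the posted kernel was meant to be pFirstArray, every work-item there would instead compute the same full sum serially.

```c
#include <assert.h>
#include <stddef.h>

#define GROUP_SIZE 4  /* hypothetical work-group size; must be a power of two */

/* Emulates one work-group. Each "work-item" lid stores one product in
   scratch (this would be __local memory in a kernel), then a tree
   reduction halves the number of active items at each step until
   scratch[0] holds the sum for the whole group. */
static double reduce_group(const double *a, const double *b, size_t n, size_t group)
{
    double scratch[GROUP_SIZE];

    for (size_t lid = 0; lid < GROUP_SIZE; lid++) {
        size_t gid = group * GROUP_SIZE + lid;  /* plays the role of get_global_id(0) */
        scratch[lid] = (gid < n) ? a[gid] * b[gid] : 0.0;  /* pad past the end with 0 */
    }

    /* In a real kernel, barrier(CLK_LOCAL_MEM_FENCE) would separate the steps. */
    for (size_t stride = GROUP_SIZE / 2; stride > 0; stride /= 2)
        for (size_t lid = 0; lid < stride; lid++)
            scratch[lid] += scratch[lid + stride];

    return scratch[0];
}

/* Second stage: add up the per-group partial sums into one scalar. */
double dot_product(const double *a, const double *b, size_t n)
{
    assert((GROUP_SIZE & (GROUP_SIZE - 1)) == 0);

    size_t groups = (n + GROUP_SIZE - 1) / GROUP_SIZE;  /* round up */
    double total = 0.0;
    for (size_t g = 0; g < groups; g++)
        total += reduce_group(a, b, n, g);
    return total;
}
```

For example, calling dot_product on {1, 2, 3, 4} and {5, 6, 7, 8} with n = 4 gives 70. In an actual OpenCL version the final summation of the partial results would typically be done on the host or in a second, small kernel launch.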