Performance questions regarding image processing

Hello all! I am trying to write a FAST (corner detection algorithm) function in OpenCL, but I am finding that just copying the memory to the OpenCL buffer and running an empty kernel takes 1-2 milliseconds. I feel like I am doing something wrong (I’m pretty new to OpenCL), but I’m stumped. I was hoping someone could give me some direction or pointers.

    clEnqueueWriteBuffer(commands, input, 
                            CL_FALSE, 0, DATA_SIZE, 
                            Image->data(), 0, NULL, NULL);
    clEnqueueWriteBuffer(commands, outputSize, 
                            CL_FALSE, 0, sizeof(int), 
                            numResults, 0, NULL, NULL);

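    // Stride of the image data, threshold, and image height.
    // Collect the return codes so ErrorCheck sees the result of these calls.
    err  = clSetKernelArg(kernel, 3, sizeof(unsigned int), &Stride);
    err |= clSetKernelArg(kernel, 5, sizeof(unsigned char), &Threshhold);
    err |= clSetKernelArg(kernel, 6, sizeof(int), &Height);
    ErrorCheck(err, "Error: Failed to set kernel arguments! ");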


This particular piece of code takes 0.5-4 milliseconds (usually closer to 1) with exactly the same amount of data every time (a byte array of a 1280x720 image), which is troubling because the single-threaded CPU function takes 1 millisecond to run the whole FAST algorithm. Am I just not going to be able to match the speed of the CPU? Or am I passing data around wrong? I’d be glad to post any other pieces of code that may be relevant; I just didn’t want to flood the thread with my whole function XD

To compare execution times, you have to measure the kernel time alone, using the event system. If you compare the whole process on the CPU and the GPU, the CPU implementation might be faster for a small problem size, because the PCIe communication with the GPU takes some time too. One way to verify that your GPU is working fast is to increase the problem size.
Don’t try to compare 100x100 px images on the CPU and the GPU. There is too much constant overhead for the GPU to win that race.
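
For the event-based timing, here is a rough sketch of the arithmetic, not a drop-in implementation. It assumes the command queue was created with CL_QUEUE_PROFILING_ENABLE and that an event was passed as the last argument of clEnqueueNDRangeKernel. The `EventTimes` struct and both helper functions are hypothetical names made up for this example, and the OpenCL calls appear only in comments so the snippet compiles without the SDK.

```c
#include <assert.h>

/* Hypothetical container for the timestamps OpenCL records per event.
 * All values are device-clock nanoseconds. In real host code each field
 * would be filled with
 *   clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_<NAME>,
 *                           sizeof(cl_ulong), &field, NULL);
 * after the event has completed (e.g. after clWaitForEvents). */
typedef struct {
    unsigned long long queued_ns;  /* CL_PROFILING_COMMAND_QUEUED */
    unsigned long long submit_ns;  /* CL_PROFILING_COMMAND_SUBMIT */
    unsigned long long start_ns;   /* CL_PROFILING_COMMAND_START  */
    unsigned long long end_ns;     /* CL_PROFILING_COMMAND_END    */
} EventTimes;

static double ns_to_ms(unsigned long long ns)
{
    return (double)ns / 1.0e6;
}

/* Time the command spent executing on the device. This is the number
 * to compare against the CPU implementation's compute time. */
double exec_ms(const EventTimes *t)
{
    assert(t->end_ns >= t->start_ns);
    return ns_to_ms(t->end_ns - t->start_ns);
}

/* Time between the host enqueueing the command and the device starting
 * it. This is the roughly constant overhead that dominates small jobs. */
double launch_overhead_ms(const EventTimes *t)
{
    assert(t->start_ns >= t->queued_ns);
    return ns_to_ms(t->start_ns - t->queued_ns);
}
```

Doing the same with the events from the two clEnqueueWriteBuffer calls should show how much of the 1-2 ms is the transfer and how much is launch overhead. Note that with CL_FALSE the write calls return before the copy has finished, so a host-side timer around them may not measure the transfer itself.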