Normal execution time??

I’m working on an algorithm using OpenCL and I need to measure the execution time of its parallel and sequential versions. To do this, I’m using an external loop to iterate both codes and measure their times, but these are the results I have obtained:
Sequential: 3.06 s
Parallel: 269 s

The code that I’m using for the parallel version is:
for (i = 0; i < N; i++) { // N is really big, around a million, but is the same for both versions
    fitness = 0;
    // Launch the kernel
    ret = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL, &global_item_size, NULL, 0, NULL, NULL);
    // Blocking reads (CL_TRUE): each call waits until the data is back on the host
    ret = clEnqueueReadBuffer(command_queue, vdistance, CL_TRUE, 0, siz_mem_distance_code, distance_code, 0, NULL, NULL);
    ret = clEnqueueReadBuffer(command_queue, vsumatorio, CL_TRUE, 0, siz_mem_sumatorio, sumatorio, 0, NULL, NULL);
    // Sequential step: needs both kernel results
    fitness = (1/(*sumatorio)) + (*distance_code/12) + ((pow(*distance_code,2))/4) + ((pow(*distance_code,3))/6);
}

Before this piece of code, I create and initialize everything needed to run a program using OpenCL (platform, device, context, queue, buffers, kernel, …), and after this code, I release everything.
I have checked that the increase in time is due to reading both variables (distance_code and sumatorio) in each iteration, but I must do it because I have to obtain the fitness value, which is a sequential instruction and can only be executed once the kernel has finished. So, could you help me? What am I doing wrong?
I hope I have explained myself properly. Thanks in advance.
Best regards,

To get the correct execution time on the GPU only, look up events in the spec. That's the correct way to time a kernel. If you need the exact clocks, remember that GPUs mostly have a defined timer resolution.

Thank you for your reply clint3112!!
I forgot to mention a couple of things:

  • I’m not working with a GPU; I’m only working with the CPU.
  • To measure the execution time of both codes, I use:
    t_start = clock(); /* Start measuring time */
    for (i = 0; i < N; i++) {
        // Parallel code
    }
    t_finish = clock(); /* End measuring time */
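
A side note on the measurement above: on POSIX systems, clock() reports processor time consumed by the whole process, so with a multi-threaded CPU OpenCL runtime it may add up the time of all worker threads rather than show elapsed time. The following is a minimal sketch, assuming a POSIX system, that reads both a wall-clock and a CPU-time value so the two can be compared. The function names wall_seconds and cpu_seconds are made up for illustration and do not come from the original code.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L /* exposes clock_gettime on strict compilers */
#endif
#include <assert.h>
#include <time.h>

/* Elapsed (wall-clock) seconds read from a monotonic clock.
   Assumption: a POSIX system that provides CLOCK_MONOTONIC. */
static double wall_seconds(void)
{
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);
    assert(rc == 0);
    (void)rc; /* silences the unused warning when NDEBUG is defined */
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Processor time used so far by the whole process, in seconds.
   With several busy threads this can grow faster than wall time. */
static double cpu_seconds(void)
{
    return (double)clock() / CLOCKS_PER_SEC;
}
```

Calling both functions before and after the loop and subtracting gives two durations. If the CPU-time difference is much larger than the wall-clock difference, part of the reported slowdown comes from how the time is measured and not only from the OpenCL calls.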

If you want the execution time including all the overhead that the OpenCL calls generate, that is the right way to measure it. If you want to see whether the OpenCL computations are better optimized than yours, you should still sum up the event times.
Do you execute both calculations in that loop, yours and the ones from OpenCL? It could be that OpenCL has a lower priority than your code, so its execution would only start once the CPU is free from your code, while the timer keeps running the whole time.
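
Summing the event times could look roughly like the sketch below. It covers only the accumulation step and assumes that the start and end timestamps of each launch have already been read with clGetEventProfilingInfo, which reports them in nanoseconds. The type kernel_timer and its two functions are hypothetical names, not part of OpenCL or of the original code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: running total of per-launch kernel times.
   Assumed source of the timestamps, for a queue created with
   CL_QUEUE_PROFILING_ENABLE and an event obtained from the last
   argument of clEnqueueNDRangeKernel:
     clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START, ...)
     clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END, ...)
   Both values are cl_ulong counters in nanoseconds. */
typedef struct {
    uint64_t total_ns;      /* sum of (end - start) over all launches */
    unsigned long launches; /* number of launches recorded */
} kernel_timer;

/* Adds the duration of one kernel launch to the running total. */
static void kernel_timer_add(kernel_timer *t, uint64_t start_ns, uint64_t end_ns)
{
    assert(end_ns >= start_ns);
    t->total_ns += end_ns - start_ns;
    t->launches += 1;
}

/* Total kernel time in seconds, excluding enqueue and read-back overhead. */
static double kernel_timer_seconds(const kernel_timer *t)
{
    return (double)t->total_ns / 1e9;
}
```

Comparing this total with the time measured around the whole loop would show how much of the run is spent in the kernel itself and how much in launching it and reading the buffers back.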

Thanks for your answer.
By “both calculations”, do you mean the parallel and sequential code, or the values of both variables?
As for the variables, I calculate both inside the kernel, and I run the two versions of the code separately, in different terminals.