Measuring the discrepancy between sequential and parallel times


When I measure the time in my application, the parallel version with 2 threads running on the CPU takes 0.89 seconds, but the sequential version takes 7.813 seconds.
Both codes are exactly the same and use the same data structure.

I have already debugged to make sure that the execution really proceeds correctly, and everything runs well.

To collect the time, I put both the kernel call and the call to the function in the sequential code inside a for loop with 10000 iterations.

To measure only the execution, without compilation time, I put the call below before and after the for loop:
clock_t tempo_execucao_real_inicial = clock();
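
The measurement described above can be sketched in C roughly as follows. This is only an illustration under stated assumptions, not the code from the application: `work`, `now_wall`, `timing_t` and `time_loop` are hypothetical names, and `work` is a plain stand-in for the function being timed. The sketch records both `clock()` and a wall-clock timestamp around the loop, because `clock()` reports processor time used by the program, which is not necessarily the same as elapsed time, so the two values can differ.

```c
#include <stdio.h>
#include <time.h>

/* Hypothetical stand-in for the function being timed (not the real code). */
static double work(const double *data, int n) {
    double sum = 0.0;
    for (int i = 0; i < n; ++i) {
        sum += data[i] * data[i];
    }
    return sum;
}

/* Current wall-clock time in seconds, using C11 timespec_get. */
static double now_wall(void) {
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Hypothetical result record: both time bases plus a checksum. */
typedef struct {
    double cpu_seconds;   /* difference of clock() values */
    double wall_seconds;  /* difference of wall-clock timestamps */
    double checksum;      /* keeps the compiler from removing the loop */
} timing_t;

/* Times `iterations` calls of work() with both clocks. */
static timing_t time_loop(int iterations, const double *data, int n) {
    timing_t t;
    volatile double sink = 0.0;

    clock_t c0 = clock();
    double w0 = now_wall();

    for (int k = 0; k < iterations; ++k) {
        sink += work(data, n);
    }
    /* Assumption: if the loop body enqueued an OpenCL kernel instead,
       the queue would have to be drained here (for example with clFinish),
       since enqueueing a kernel returns before the kernel has finished. */

    clock_t c1 = clock();
    double w1 = now_wall();

    t.cpu_seconds = (double)(c1 - c0) / CLOCKS_PER_SEC;
    t.wall_seconds = w1 - w0;
    t.checksum = sink;
    return t;
}
```

Calling `time_loop(10000, data, n)` once for each variant and printing both fields with `printf` makes it visible whether the two time bases agree for that variant.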

Does OpenCL apply some treatment that optimizes memory access, or does it generate optimized code, that would produce results with this discrepancy?

Thank you very much,

Does anybody know about this? Or can someone tell me who to ask or where I can find out about it?
I really need an answer; all my research and the time I have spent on it depend on understanding what's happening.

Thank you very much,

Based upon what you have provided, I am unable to answer your post. Please post a link to sample code that shows your application. If your application is private, then include a simplified sample. Please include both host and kernel source with your timing points. Please briefly describe your hardware and software environment, including operating system, OpenCL vendor implementation, hardware card, etc.