Why is CPU execution time less than GPU time?

// Hello.cpp : Defines the entry point for the console application.

#include "stdafx.h"
#include "CL/cl.h"
#include <stdio.h>
#include <time.h>

#define DATA_SIZE 10

const char *KernelSource =
    "kernel void hello(global float *input, global float *output)\n"
    "{\n"
    "    size_t id = get_global_id(0);\n"
    "    output[id] = input[id] * input[id];\n"
    "}\n";

int main(void)
{
    clock_t start_time, end_time;
    cl_context context;
    cl_context_properties properties[3];
    cl_kernel kernel;
    cl_command_queue command_queue;
    cl_program program;
    cl_int err;
    cl_uint num_of_platforms = 0;
    cl_platform_id platform_id;
    cl_device_id device_id;
    cl_uint num_of_devices = 0;
    cl_mem input, output;
    size_t global;

    float inputData[DATA_SIZE] = {0,1,2,3,4,5,6,7,8,9};
    float results[DATA_SIZE] = {0};

    int i;

    // retrieve a list of available platforms
    if (clGetPlatformIDs(1, &platform_id, &num_of_platforms) != CL_SUCCESS)
    {
        printf("Unable to get platform_id\n");
        return 1;
    }

    // try to get a supported GPU device
    if (clGetDeviceIDs(platform_id, CL_DEVICE_TYPE_GPU, 1, &device_id,
                       &num_of_devices) != CL_SUCCESS)
    {
        printf("Unable to get device_id\n");
        return 1;
    }

    // context properties list - must be terminated with 0
    properties[0] = CL_CONTEXT_PLATFORM;
    properties[1] = (cl_context_properties)platform_id;
    properties[2] = 0;

    // create a context with the GPU device
    context = clCreateContext(properties, 1, &device_id, NULL, NULL, &err);

    // create a command queue using the context and device
    command_queue = clCreateCommandQueue(context, device_id, 0, &err);

    // create a program from the kernel source code
    program = clCreateProgramWithSource(context, 1,
                                        (const char **)&KernelSource, NULL, &err);

    // compile the program
    if (clBuildProgram(program, 0, NULL, NULL, NULL, NULL) != CL_SUCCESS)
    {
        printf("build error\n");
        size_t len;
        char buffer[4096];
        // get the build log
        clGetProgramBuildInfo(program, device_id, CL_PROGRAM_BUILD_LOG,
                              sizeof(buffer), buffer, &len);
        printf("----Build Log----\n%s\n", buffer);
        return 1;
    }

    // specify which kernel from the program to execute
    kernel = clCreateKernel(program, "hello", &err);

    // create buffers for the input and output
    input = clCreateBuffer(context, CL_MEM_READ_ONLY,
                           sizeof(float) * DATA_SIZE, NULL, NULL);
    output = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
                            sizeof(float) * DATA_SIZE, NULL, NULL);

    start_time = clock();

    // load data into the input buffer
    clEnqueueWriteBuffer(command_queue, input, CL_TRUE, 0,
                         sizeof(float) * DATA_SIZE, inputData, 0, NULL, NULL);

    // set the argument list for the kernel command
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &input);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &output);
    global = DATA_SIZE;

    // enqueue the kernel command for execution
    clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL, &global,
                           NULL, 0, NULL, NULL);
    clFinish(command_queue);

    // copy the results out of the buffer
    clEnqueueReadBuffer(command_queue, output, CL_TRUE, 0,
                        sizeof(float) * DATA_SIZE, results, 0, NULL, NULL);

    end_time = clock();

    // print the results
    for (i = 0; i < DATA_SIZE; i++)
        printf("%f ", results[i]);
    printf("\n");

    // cleanup - release OpenCL resources
    clReleaseMemObject(input);
    clReleaseMemObject(output);
    clReleaseProgram(program);
    clReleaseKernel(kernel);
    clReleaseCommandQueue(command_queue);
    clReleaseContext(context);

    printf("execution time is %f ms\n",
           (double)(end_time - start_time) * 1000.0 / CLOCKS_PER_SEC);
    return 0;
}


In this program, the CPU (3rd-generation Intel i5) execution time is 750 ms, while the Nvidia GPU takes 900 ms.

Why is this happening, when the GPU should take less time than the CPU?

The problem you are running into is that GPUs are designed to handle huge amounts of work, not small amounts. To give some more detail, there is a certain amount of overhead present in copying data from host memory to device memory, then there is a certain amount of overhead in launching a kernel, and finally some overhead in copying the results back. The kernel launch overheads are fairly constant, while the transfer overheads depend on the size of the data plus a constant overhead from the driver.

If you send a lot of work to the GPU then these overheads account for a proportionally smaller part of the processing time.

The other problem is that your problem size is so small it does not even fill a single streaming multiprocessor. To fully utilise a GPU, each work group should contain a few hundred threads and there should be a few hundred work groups.

With regard to our previous discussion on how many threads run in parallel, I should probably elaborate further. Each streaming multiprocessor executes a certain number of threads in parallel, 48 in your case in the best-case scenario. Let's say these threads reach a memory access instruction. Whether it targets global or local memory, it takes a certain amount of time to return the data to the thread. During that time, these threads all block, waiting for the data from memory. Rather than letting the hardware sit idle, the thread scheduler switches in threads from a different thread group, called a warp in Nvidia terminology. Each multiprocessor can keep track of the execution status of several hundred threads - up to 1536 on your GPU, if memory serves. That is why you need so many threads: to make sure the GPU does not sit idle.

Now for a comment on your chosen problem: squaring each element of a vector. This is not a good problem for a GPU, because the amount of arithmetic done per element is smaller than the number of memory accesses. PCIe bandwidth also hampers you here - it is actually the biggest shortcoming of your code. Using plain reads and writes, your maximum observed PCIe bandwidth might reach about 5 GB/s. Since a float is 4 bytes, you can send about 10^9 floats to the GPU each second. Since each element results in one floating-point operation, that means you will get about 10^9 floating-point operations per second, or 1 GFLOPS. Even a single core of your CPU can beat that, and it also has slightly less overhead, because the "transfer" from host memory to device memory is just a copy from one location in RAM to another, which is faster than sending data over the PCIe bus. You need many more operations per element of data before a GPU becomes worthwhile.