I want to measure the execution time of an OpenCL kernel in a loop. When I use clGetEventProfilingInfo() to measure the time for enqueueing, submitting and starting the kernel, I see a massive overhead between kernel submission and the start of execution on the device, compared to the actual execution time. I use the following structure to profile the kernel:

    cl_command_queue queue;
    queue = clCreateCommandQueue(context, device_id, CL_QUEUE_PROFILING_ENABLE, &err);

    clEnqueueWriteBuffer(queue, kernel_data_d, CL_TRUE, 0,
                         sizeof(cl_float) * kernel_data_size,
                         kernel_data_h, 0, NULL, NULL);

    for (int i = 0; i < N; ++i) {
        cl_event test;
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local, 0, NULL, &test);
        clWaitForEvents(1, &test);
        clFinish(queue);

        clGetEventProfilingInfo(test, CL_PROFILING_COMMAND_SUBMIT, sizeof(time_start), &time_start, NULL);
        clGetEventProfilingInfo(test, CL_PROFILING_COMMAND_START, sizeof(time_end), &time_end, NULL);
        time_passed_kernel += (time_end - time_start) / 1e6;

        clReleaseEvent(test);
    }

On average, it takes around 350 ms for execution to start after the kernel has been submitted, and only 61 ms for the actual execution. Is there a specific reason for this overhead? The data accessed by the kernel is already on the GPU, and the kernel has already been compiled.
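To make the comparison concrete, the four profiling timestamps of one event can be split into three intervals. This is only a minimal sketch, not the code I actually ran: the names `event_times_ns`, `event_breakdown_ms` and `breakdown` are made up for illustration, and the four fields are assumed to be filled by calling clGetEventProfilingInfo() with CL_PROFILING_COMMAND_QUEUED, CL_PROFILING_COMMAND_SUBMIT, CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END. It also assumes the timestamps of a single event never decrease.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the four timestamps (in nanoseconds)
   that clGetEventProfilingInfo() reports for one event. */
typedef struct {
    uint64_t queued;  /* CL_PROFILING_COMMAND_QUEUED */
    uint64_t submit;  /* CL_PROFILING_COMMAND_SUBMIT */
    uint64_t start;   /* CL_PROFILING_COMMAND_START  */
    uint64_t end;     /* CL_PROFILING_COMMAND_END    */
} event_times_ns;

/* Hypothetical result type: the three intervals in milliseconds. */
typedef struct {
    double queue_ms;   /* queued -> submit: time spent in the host-side queue */
    double launch_ms;  /* submit -> start:  the overhead asked about here     */
    double exec_ms;    /* start  -> end:    actual kernel execution           */
} event_breakdown_ms;

/* Convert the difference of two nanosecond timestamps to milliseconds. */
static double ns_to_ms(uint64_t from, uint64_t to) {
    return (double)(to - from) / 1e6;
}

/* Split one event's timestamps into the three intervals. */
static event_breakdown_ms breakdown(event_times_ns t) {
    /* Assumption: timestamps of a single event are non-decreasing. */
    assert(t.queued <= t.submit && t.submit <= t.start && t.start <= t.end);

    event_breakdown_ms b;
    b.queue_ms  = ns_to_ms(t.queued, t.submit);
    b.launch_ms = ns_to_ms(t.submit, t.start);
    b.exec_ms   = ns_to_ms(t.start, t.end);
    return b;
}
```

With this split, the 350 ms figure above corresponds to `launch_ms` and the 61 ms figure to `exec_ms`; `queue_ms` is not measured in my loop.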