OpenCL overhead on kernel invocation

I want to measure the execution time of an OpenCL kernel in a loop. When I use clGetEventProfilingInfo() to measure the time for enqueueing, submitting and starting the kernel, I see a massive overhead between kernel submission and the start of execution on the device, compared to the actual execution time. I use the following structure to profile the kernel:

cl_command_queue queue;
queue = clCreateCommandQueue(context, device_id, CL_QUEUE_PROFILING_ENABLE, &err);

clEnqueueWriteBuffer(queue, kernel_data_d, CL_TRUE, 0,
                     sizeof(cl_float) * kernel_data_size,
                     kernel_data_h, 0, NULL, NULL);
for (int i = 0; i < N; ++i) {

  cl_event test;

  clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local, 0, NULL, &test);

  clWaitForEvents(1 , &test);
  clFinish(queue);
  clGetEventProfilingInfo(test, CL_PROFILING_COMMAND_SUBMIT, sizeof(time_start), &time_start, NULL);
  clGetEventProfilingInfo(test, CL_PROFILING_COMMAND_START, sizeof(time_end), &time_end, NULL);
  time_passed_kernel += (time_end-time_start)/1e6;
  clReleaseEvent(test);

}

On average it takes around 350 ms from kernel submission until execution starts, but only 61 ms for the actual execution. Is there any specific reason for this overhead? The data accessed by the kernel is on the GPU and the kernel has been compiled.

First, I would add a clFlush() before the clWaitForEvents() and see if there’s any impact.
Afterwards:

The idea behind efficient kernel launching is to load the queue ahead of time, i.e. to have a few kernels “in-flight” in the queue at any given time. In your example this is not possible because you wait for the kernel to complete (both clWaitForEvents and clFinish do that) before launching the next kernel. Thus you have at most one kernel “in-flight” in the queue.

One approach I would try: allocate an array of N events and launch all N kernel instances without any waiting (i.e. without clFinish or clWaitForEvents inside the loop).

After the for loop, do a clFlush() and a clWaitForEvents() over all N events, or just do a clFinish(). After the clFinish(), all the events should have the timing info for each kernel launch.