Fermi: overlapping kernels from an in-order queue


I’ve been trying to profile some code by querying CL_PROFILING_COMMAND_START / CL_PROFILING_COMMAND_END on every event and adding up the times. However, I’m getting back overlapping intervals: in some cases 4 different kernels are supposedly running at once. I’m using a GTX 480, which does support multiple concurrent kernels, but I didn’t create the queue with CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, so I wouldn’t expect kernels from an in-order queue to overlap at all.
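To make the symptom concrete, here is a minimal sketch of how the overlap shows up once the timestamps are in hand. It assumes each event's start/end values (in nanoseconds, as returned by clGetEventProfilingInfo) have already been copied into an array. The type `interval_t` and the functions `sum_durations`, `union_duration` and `max_concurrency` are hypothetical names made up for this post, not part of OpenCL.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* One event's execution window in nanoseconds. Hypothetical helper type:
   the values would come from CL_PROFILING_COMMAND_START / _END queries. */
typedef struct {
    uint64_t start_ns;
    uint64_t end_ns;
} interval_t;

/* qsort comparator: order intervals by start time. */
static int cmp_start(const void *a, const void *b) {
    const interval_t *x = (const interval_t *)a;
    const interval_t *y = (const interval_t *)b;
    return (x->start_ns > y->start_ns) - (x->start_ns < y->start_ns);
}

/* Naive total: sum of (end - start). Double-counts any overlapping time. */
uint64_t sum_durations(const interval_t *iv, size_t n) {
    uint64_t total = 0;
    for (size_t i = 0; i < n; ++i)
        total += iv[i].end_ns - iv[i].start_ns;
    return total;
}

/* Time covered by at least one interval. Sorts iv in place. */
uint64_t union_duration(interval_t *iv, size_t n) {
    if (n == 0) return 0;
    qsort(iv, n, sizeof *iv, cmp_start);
    uint64_t total = 0;
    uint64_t cur_start = iv[0].start_ns;
    uint64_t cur_end = iv[0].end_ns;
    for (size_t i = 1; i < n; ++i) {
        if (iv[i].start_ns <= cur_end) {
            /* Overlaps (or touches) the current run: extend it. */
            if (iv[i].end_ns > cur_end) cur_end = iv[i].end_ns;
        } else {
            /* Gap: close the current run and start a new one. */
            total += cur_end - cur_start;
            cur_start = iv[i].start_ns;
            cur_end = iv[i].end_ns;
        }
    }
    return total + (cur_end - cur_start);
}

/* Largest number of intervals reported as active at the same instant.
   The maximum is always reached at some interval's start, so checking
   each start time is enough (O(n^2), fine for a handful of events). */
size_t max_concurrency(const interval_t *iv, size_t n) {
    size_t best = 0;
    for (size_t i = 0; i < n; ++i) {
        size_t active = 0;
        for (size_t j = 0; j < n; ++j)
            if (iv[j].start_ns <= iv[i].start_ns &&
                iv[i].start_ns < iv[j].end_ns)
                ++active;
        if (active > best) best = active;
    }
    return best;
}
```

If `sum_durations` comes out larger than `union_duration`, or `max_concurrency` is greater than 1, the reported intervals overlap. That is the situation I'm seeing, with `max_concurrency` reaching 4.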

Does anyone know whether the NVIDIA drivers are doing some kind of clever dependency analysis on my kernels to parallelise them, or are the profiling timestamps simply giving bogus results?