Here, get_time() uses gettimeofday() to get the current time in seconds as a double.
When the CPU is used as the OpenCL device, the difference between gpu_total_time and gpu_profiling_time makes sense.
However, when I use my GPU (AMD 6750M, on a MacBook Pro) the overhead is sometimes huge: 0.000619s compared to 0.032589s (~50x slower when measured from the host side).
The problem is consistently reproducible with specific kernels.
Here is the prototype of the kernel if it helps:
kernel void resize(
__read_only image2d_t src,
__write_only image2d_t dst,
Note that the problem does not exist on Windows with NVidia hardware (at least for the specific device that I tried).
It looks like the way you are measuring time on the host side is incorrect. Starting the clock before calling clEnqueueNDRangeKernel() and stopping it once it returns doesn’t measure the same thing as gpu_profiling_time.
clEnqueueNDRangeKernel() is analogous to ordering a pizza. It takes very little time. However, what you want to know is how long it takes to bake the pizza in the oven. That’s what gpu_profiling_time is giving you.
This has been discussed a few times in the past. I suggest using the search feature to find more information.
Sorry, I forgot to mention that I do call queue.finish() before calling get_time() the second time on the host side. I’m aware that clEnqueueNDRangeKernel() is a non-blocking operation.
However, the problem persists despite the use of queue.finish(). Something is really strange there.
I see the same thing and believe this first-time delay is normal and predominantly related to lazy buffer allocation on the compute device.
You didn’t post your kernel code / arguments, but I can speculate.
If you have an output buffer declared on the device, the implementation has no reason to actually allocate it until you run the kernel the first time.
Even the kernel code might not be moved over to the device until its first use, and then cached afterward.
That makes sense, thanks.
I think the specification should allow such lazy operations to be prevented by adding flags to the kernel and buffer constructors.
It might be the case that the application could do the allocation asynchronously, but not the running of the kernel itself. In such cases, the lazy approach is a waste of time.
How is the lazy approach a waste of time? It only happens once, and it has to happen once no matter what. If you’re doing any micro-benchmarks, it is a given that you cannot get reliable results unless you let the system ‘warm up’ first - i.e. do a couple of dummy runs.
BTW, the ‘lazy allocation’ is at the operating-system level, and beyond the scope of such a specification. Although a Unix process can be given a big virtual address space, those pages do not exist until they are accessed. This is not something a driver could change.