clEnqueueReadBuffer is incredibly slow when called infrequently

Details: I’m on OS X, Iris Pro GPU - and I’m fairly new to OpenCL.

I have a few different buffers created through clCreateBuffer and some kernel tasks that operate on them.

What I am trying to do is run my kernel tasks as many times as I can within 1/60th of a second, and then copy one of the buffers to host memory so that I can render the result. I don’t want to copy the buffer out after every kernel run, since the renderer only needs it once per frame.

Here is the weird thing. If I call clEnqueueReadBuffer() every time after running my kernel code, it takes about 6 milliseconds to complete. However, if I run my kernel code in a loop until 1/60 of a second has elapsed (so, many iterations) and then call clEnqueueReadBuffer(), it takes about 4 to 5 SECONDS to complete.

Why is this happening, and how can I avoid this massive hit?

Incidentally, I’m actually using the EasyCL wrapper, so this is what is actually being called.

void CLWrapper::copyToHost() {
    if(!onDevice) {
        throw std::runtime_error("copyToHost(): not on device");
    }

    cl_event event = NULL;

    // Blocking read (CL_TRUE): returns only after all previously enqueued work has finished.
    error = clEnqueueReadBuffer(*(cl->queue), devicearray, CL_TRUE, 0, getElementSize() * N, getHostArray(), 0, NULL, &event);
    cl_int err = clWaitForEvents(1, &event);
    if (err != CL_SUCCESS) {
        throw std::runtime_error("wait for event on copytohost failed with " + easycl::toString(err) );
    }
    deviceDirty = false;
}


Do you flush the command queue regularly after calling clEnqueueNDRangeKernel?
If not, you merely enqueue a lot of kernels, and their execution is deferred until clEnqueueReadBuffer is called (a blocking read performs an implicit flush of the command queue).
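A minimal sketch of what flushing regularly could look like, assuming the kernel launch and the flush are passed in as callables. The helper name enqueueWithFlush and its parameters are hypothetical (not part of OpenCL or EasyCL). In real code, enqueueKernel would wrap clEnqueueNDRangeKernel and flush would wrap clFlush on the same queue.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper for illustration only.
//   enqueueKernel: stands in for a call to clEnqueueNDRangeKernel(...)
//   flush:         stands in for clFlush(queue)
// Enqueues `iterations` kernels and flushes after every `flushEvery`
// of them, so work is handed to the device as it is enqueued.
// Returns the number of flushes issued.
template <typename EnqueueFn, typename FlushFn>
std::size_t enqueueWithFlush(std::size_t iterations, std::size_t flushEvery,
                             EnqueueFn enqueueKernel, FlushFn flush) {
    assert(flushEvery > 0);
    std::size_t flushes = 0;
    for (std::size_t i = 1; i <= iterations; ++i) {
        enqueueKernel();
        if (i % flushEvery == 0) {
            flush();
            ++flushes;
        }
    }
    // Submit any tail of kernels that did not fill a whole group.
    if (iterations % flushEvery != 0) {
        flush();
        ++flushes;
    }
    return flushes;
}
```

Flushing only submits the work sooner. It does not limit how much work is enqueued, so a loop that enqueues faster than the GPU executes can still build up a backlog that the blocking read has to wait for.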

utnapishtim is probably right. I assume you do not force the queue to flush.
You should probably profile your application with CodeXL, because the Timeline clearly shows such problems.

Thanks, that was definitely the issue.

Unfortunately, I’m on OS X (Mac), and the OpenCL tools seem to be lagging or missing on that platform.

Rather than using clFlush, which puts a bubble in the pipeline, you could take events from your kernels and keep a short pipeline of enqueued kernels (maybe 3-5). Before you enqueue more work, wait on the event from the kernel at the head of your pipeline. If you’ve got time for more work, enqueue more work; otherwise enqueue your read and display your frame, then repeat.

To put it in simpler terms, what you did was to spend 1/60 of a second making a list of work for the GPU to do, then sat and waited for it to finish, which took seconds. Instead you want to give it just a little work (3-5 items) and then wait for results to pop out the other side before adding more work to the list. When it’s close to the end of your 1/60 second window, stop adding more work and get your results. All of the OpenCL API needed to do this with events is available on macOS. Note: you might want to avoid using a profiling command queue. I’ve had crashes with those on macOS, at least when multiple threads are involved (it may be safe with single-threaded usage).
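The short pipeline described above can be sketched roughly as follows. This is only an illustration under stated assumptions: runFrame, enqueue, wait and timeLeft are hypothetical names, and the OpenCL calls are passed in as callables so the control flow stands on its own. In real code, enqueue would wrap clEnqueueNDRangeKernel and return the cl_event from its last parameter, wait would wrap clWaitForEvents followed by clReleaseEvent, and timeLeft would compare a clock against the frame deadline.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// Hypothetical helper for illustration only.
//   enqueue():  submits one kernel and returns its event
//   wait(ev):   blocks until that event's kernel has completed
//   timeLeft(): returns true while the frame budget is not used up
// Keeps at most `maxInFlight` kernels outstanding, so the backlog the
// final read has to wait for stays small. Returns kernels submitted.
template <typename EnqueueFn, typename WaitFn, typename TimeLeftFn>
std::size_t runFrame(std::size_t maxInFlight, EnqueueFn enqueue,
                     WaitFn wait, TimeLeftFn timeLeft) {
    assert(maxInFlight > 0);
    std::deque<decltype(enqueue())> inFlight;
    std::size_t submitted = 0;
    while (timeLeft()) {
        if (inFlight.size() == maxInFlight) {
            // Pipeline is full: wait for the oldest kernel first.
            wait(inFlight.front());
            inFlight.pop_front();
        }
        inFlight.push_back(enqueue());
        ++submitted;
    }
    // Out of time: drain what is left, then the caller reads the buffer.
    while (!inFlight.empty()) {
        wait(inFlight.front());
        inFlight.pop_front();
    }
    return submitted;
}
```

After runFrame returns, at most maxInFlight kernels had to finish during the drain, so the blocking read that follows should not have seconds of queued work ahead of it. The pipeline depth of 3-5 suggested above is a starting point to tune, not a fixed rule.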