Is OpenCL useful for my application?


I have started porting my C++ calculation to OpenCL.

I have now taken some measurements.

With the largest data set, the calculation takes about 1200 µs on my CPU
(measured with QueryPerformanceCounter).

Then I tested an empty OpenCL __kernel void, just to measure clEnqueueNDRangeKernel plus the clEnqueueMapBuffer that gets the result back.
That takes 2200-6000 µs on an NVIDIA Quadro GPU.

So the GPU runs at half the CPU's speed at best, purely because of the cost of starting the task and reading back the memory.

Is there a faster way to start a task?
Or is it possible to have a task that is already started but only executes once data is pushed to it?

First, you can avoid the copy by using CL_MEM_ALLOC_HOST_PTR buffers, which ask the implementation to allocate the buffer in host-accessible memory. This can perform worse for bandwidth-intensive kernels, though. Second, the launch time of a single empty kernel isn’t really telling: it includes things like lazy buffer allocation and sending a command buffer across PCI-E, so running 3 or 4 such kernels will make practically no difference to the total time. Third, if using host memory is not an option, you can try double buffering with multiple queues to hide transfer latency. That requires you to get all of the dependencies right, and it was reportedly broken in the NVIDIA OpenCL driver some time ago, so you have to be careful with it.
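To make the second point measurable, here is a minimal sketch of a timing harness that reports the first launch separately from the following ones. The names LaunchTiming and time_launches are made up for this example, and the callable is only a stand-in for your own "enqueue kernel, map buffer, wait" sequence. It uses std::chrono instead of QueryPerformanceCounter so that it stays portable.

```cpp
#include <chrono>
#include <cstddef>
#include <functional>

// Hypothetical result type: all times are in microseconds.
struct LaunchTiming {
    double first_us;   // first call, which also pays any one-time setup cost
    double steady_us;  // mean time of the remaining calls
};

// Calls `launch` `runs` times and times the first call separately.
// `launch` is a placeholder for "enqueue the kernel, map the buffer and
// wait until the device is done"; it must block until the work has finished,
// otherwise only the enqueue cost is measured.
LaunchTiming time_launches(const std::function<void()>& launch, std::size_t runs) {
    using clock = std::chrono::steady_clock;
    using micros = std::chrono::duration<double, std::micro>;

    LaunchTiming t{0.0, 0.0};
    if (runs == 0) {
        return t;  // nothing to measure
    }

    // First call: includes lazy allocation and similar one-time work.
    auto start = clock::now();
    launch();
    t.first_us = micros(clock::now() - start).count();

    // Remaining calls: averaged to estimate the steady-state cost per launch.
    if (runs > 1) {
        start = clock::now();
        for (std::size_t i = 1; i < runs; ++i) {
            launch();
        }
        t.steady_us = micros(clock::now() - start).count() /
                      static_cast<double>(runs - 1);
    }
    return t;
}
```

Since clEnqueueNDRangeKernel only queues the work, the callable you pass in should end with a blocking call such as clFinish or a blocking clEnqueueMapBuffer. If steady_us comes out far below first_us, most of the measured time was one-time setup rather than per-launch cost.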

Exactly. OpenCL isn’t good for a short, low-latency calculation. The buffer and command queue overhead will always take longer than just doing a simple calculation on the host. OpenCL is about large data-parallel calculations and problems that can be tweaked to look like that (and surely other things, but I’m trying to simplify).