First, you can avoid the copy by using CL_MEM_ALLOC_HOST_PTR buffers, although that can perform worse for bandwidth-intensive kernels. Second, the launch time of a single empty kernel isn’t really telling: it includes things like lazy buffer allocation and sending a command buffer across PCI-E, so running 3 or 4 such kernels will make practically no difference. Third, if using host memory is not an option, you can try double buffering with multiple queues to hide transfer latency. It requires you to get all of the dependencies right, and it was reportedly broken in the Nvidia OpenCL driver some time ago, so you have to be careful with it.
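To make the double-buffering point a bit more concrete, here is a rough timing model in plain C. It is not OpenCL API code, just a sketch of the scheduling idea: the function names, the two-buffer layout, and the per-chunk times are all hypothetical, and it assumes the device can run one transfer and one kernel at the same time.

```c
#include <assert.h>

/* Hypothetical timing model, not OpenCL API code. All times are in
   microseconds and are assumptions supplied by the caller. */

static long max_long(long a, long b) { return a > b ? a : b; }

/* One in-order queue: every chunk is uploaded, then processed. */
long serial_time_us(int chunks, long transfer_us, long kernel_us)
{
    assert(chunks > 0);
    return chunks * (transfer_us + kernel_us);
}

/* Two buffers, two queues: the upload of chunk i+1 overlaps the kernel
   running on chunk i. A buffer may only be overwritten after the kernel
   that read it has finished; that is the dependency you have to get right. */
long double_buffered_time_us(int chunks, long transfer_us, long kernel_us)
{
    long buffer_free[2] = {0, 0}; /* time at which each buffer may be reused */
    long transfer_done = 0;       /* time at which the copy engine is idle */
    long kernel_done = 0;         /* time at which the compute engine is idle */

    assert(chunks > 0);
    for (int i = 0; i < chunks; ++i) {
        int b = i % 2; /* alternate between the two buffers */

        /* The upload waits for the previous upload and for its buffer. */
        long transfer_start = max_long(transfer_done, buffer_free[b]);
        transfer_done = transfer_start + transfer_us;

        /* The kernel waits for the previous kernel and for its own upload. */
        long kernel_start = max_long(kernel_done, transfer_done);
        kernel_done = kernel_start + kernel_us;

        buffer_free[b] = kernel_done;
    }
    return kernel_done;
}
```

With made-up numbers of 3 chunks, 2 us per transfer and 3 us per kernel, the serial version takes 15 us and the overlapped one 11 us. In this model the transfers are only hidden as long as the kernel takes at least as long as a transfer; real drivers may behave differently, so measure before relying on it.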
Exactly. OpenCL isn’t good for low-latency, short calculations. The buffer and command queue overhead would always take longer than just doing the simple calculation on the host. OpenCL is about large data-parallel calculations and problems that can be reworked to look like that (and surely other things, but I’m trying to simplify).
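A back-of-the-envelope way to see this is a simple linear cost model, sketched below in plain C. The model and every name in it are hypothetical: it assumes a fixed per-launch overhead (buffers, queue, PCI-E) plus a constant per-item cost on each side, and all the numbers have to come from your own measurements.

```c
#include <assert.h>

/* Hypothetical cost model. Every value passed in is an assumption to be
   replaced with measured numbers. All times are in nanoseconds. */

long host_time_ns(long items, long host_ns_per_item)
{
    return items * host_ns_per_item;
}

long device_time_ns(long items, long fixed_overhead_ns, long device_ns_per_item)
{
    return fixed_overhead_ns + items * device_ns_per_item;
}

/* Smallest item count at which the device is strictly faster than the
   host, or -1 if the device never catches up under this model. */
long break_even_items(long fixed_overhead_ns, long host_ns_per_item,
                      long device_ns_per_item)
{
    assert(fixed_overhead_ns >= 0);
    if (device_ns_per_item >= host_ns_per_item)
        return -1; /* no per-item gain, so the overhead is never repaid */

    long gain_per_item = host_ns_per_item - device_ns_per_item;
    /* Need items * gain_per_item > fixed_overhead_ns. */
    return fixed_overhead_ns / gain_per_item + 1;
}
```

For example, with an assumed 100000 ns of overhead, 10 ns per item on the host and 1 ns per item on the device, `break_even_items(100000, 10, 1)` gives 11112 items. Below that, the host wins, which is the "short calculation" case described above.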