Running kernel on host vs device

In the Codeproject example:

// create data for the run
float* data = new float[DATA_SIZE];

// Create the device memory vectors
input = clCreateBuffer(context, CL_MEM_READ_ONLY, sizeof(float) * count, NULL, NULL);

// Transfer the input vector into device memory
err = clEnqueueWriteBuffer(commands, input, CL_TRUE, 0, sizeof(float) * count, data, 0, NULL, NULL);

// Set the arguments to the compute kernel
err = clSetKernelArg(kernel, 0, sizeof(cl_mem), &input);

// Execute the kernel
err = clEnqueueNDRangeKernel(commands, kernel, 1, NULL, &global, &local, 0, NULL, NULL);

My question: if I can choose between CL_DEVICE_TYPE_GPU and CL_DEVICE_TYPE_CPU, how would the kernel use data on the host when executing on the CPU? It seems to me that clSetKernelArg always points the kernel at &input, which is on the device, and that doesn’t make sense when running on the CPU.

Any clarification is much appreciated.

With the AMD and Intel OpenCL platform drivers, you can select an OpenCL device that is the CPU instead of the GPU.

The rest of OpenCL works just like it would with a GPU.
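To make that concrete, here is a minimal sketch of device selection with the OpenCL 1.x C API. It assumes at least one platform is installed; not every platform exposes a CPU device, so the error path matters.

```c
// Sketch: request a CPU device instead of a GPU one.
// Error handling is minimal for brevity.
#include <stdio.h>
#include <CL/cl.h>

int main(void) {
    cl_platform_id platform;
    cl_uint num_platforms = 0;
    if (clGetPlatformIDs(1, &platform, &num_platforms) != CL_SUCCESS ||
        num_platforms == 0) {
        fprintf(stderr, "no OpenCL platform found\n");
        return 1;
    }

    // Ask for a CPU device; pass CL_DEVICE_TYPE_GPU here to target the GPU
    // instead. Everything downstream (context, queue, buffers, kernels)
    // is identical for both device types.
    cl_device_id device;
    cl_int err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_CPU, 1, &device, NULL);
    if (err != CL_SUCCESS) {
        fprintf(stderr, "no CPU device on this platform (err %d)\n", err);
        return 1;
    }

    char name[256];
    clGetDeviceInfo(device, CL_DEVICE_NAME, sizeof(name), name, NULL);
    printf("using CPU device: %s\n", name);
    return 0;
}
```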

With your code, clEnqueueWriteBuffer copies data from one region of CPU memory to another, and when you then execute your kernel on the CPU, it accesses that copy.

If you know you are running on the CPU, using clEnqueueMapBuffer can be faster because memory isn’t copied, just ownership changes (when mapped you can access the buffer from your main code, when unmapped from kernels; the map and unmap calls are fast).
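A rough sketch of that map/unmap pattern, reusing `commands`, `input`, and `count` from the snippet quoted above (so this is a fragment, not a complete program):

```c
cl_int err;

// Blocking map for write access: after this returns, the host owns the
// buffer and can read/write it through `ptr` without any copy on a CPU device.
float *ptr = (float *)clEnqueueMapBuffer(
    commands, input, CL_TRUE,      // blocking map
    CL_MAP_WRITE,
    0, sizeof(float) * count,      // offset and size of the mapped region
    0, NULL, NULL, &err);

// While mapped, fill the buffer directly from host code.
for (size_t i = 0; i < count; ++i)
    ptr[i] = (float)i;

// Unmapping returns ownership to the device so kernels may use the buffer.
err = clEnqueueUnmapMemObject(commands, input, ptr, 0, NULL, NULL);
```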

Dithermaster, thanks very much for your response.

So only when the device type is set to CL_DEVICE_TYPE_GPU does clEnqueueWriteBuffer actually copy the data to the device over PCIe, causing the long delay?


Yes, for a GPU, clEnqueueWriteBuffer enqueues a command which asynchronously copies the data over the PCIe bus. If speed is paramount here, read the vendor documentation on how to maximize speed, for example by using pinned buffers. You could also switch to a model where you use clEnqueueMapBuffer, which always runs at full PCIe bandwidth.
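The pinned-buffer idea from the vendor guides usually looks something like the fragment below (a sketch, assuming `context`, `commands`, and `count` from the earlier snippets): allocate a host-accessible buffer with CL_MEM_ALLOC_HOST_PTR, map it to get a pointer the driver typically backs with pinned memory, and transfer through that pointer.

```c
cl_int err;

// Host-accessible staging buffer; drivers commonly back
// CL_MEM_ALLOC_HOST_PTR allocations with pinned (page-locked) memory.
cl_mem pinned = clCreateBuffer(context, CL_MEM_ALLOC_HOST_PTR,
                               sizeof(float) * count, NULL, &err);

// Device-side buffer that the kernel will actually read.
cl_mem device_buf = clCreateBuffer(context, CL_MEM_READ_ONLY,
                                   sizeof(float) * count, NULL, &err);

// Map the staging buffer and fill it from the host.
float *host_ptr = (float *)clEnqueueMapBuffer(
    commands, pinned, CL_TRUE, CL_MAP_WRITE,
    0, sizeof(float) * count, 0, NULL, NULL, &err);
for (size_t i = 0; i < count; ++i)
    host_ptr[i] = (float)i;

// Writes sourced from a pinned pointer tend to reach full PCIe bandwidth,
// since the driver can DMA directly without an extra internal copy.
err = clEnqueueWriteBuffer(commands, device_buf, CL_TRUE, 0,
                           sizeof(float) * count, host_ptr, 0, NULL, NULL);

err = clEnqueueUnmapMemObject(commands, pinned, host_ptr, 0, NULL, NULL);
```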

Dithermaster, thanks very much for the explanation.


Where is your information about the full PCIe bandwidth from? Do you have a source for that?

Thanks in advance,

From each manufacturer’s OpenCL documentation. They each have guides that recommend the fastest way to transfer data to their devices.

And where did you get those specs? I could not find them on the NVIDIA page. I also need to know what binary I will get from CL_PROGRAM_BINARY :?