About cache coherence between a CPU and a discrete GPU

Hi all,

I am studying how to set up interactive communication between a CPU and a discrete GPU in OpenCL.

I create a buffer (a single unsigned long) with the CL_MEM_USE_HOST_PTR flag and pass it to the kernel as an argument.

According to the CodeXL report, the buffer is pinned in host memory.

The GPU kernel is an infinite loop that polls the buffer; as soon as the buffer is nonzero, the loop breaks.

After enqueueing the kernel, the CPU-side program waits for a while and then sets the buffer to 1.

I expect the GPU to notice that the buffer is no longer zero and exit the infinite loop.

However, the loop never terminates.

I suspect this is because the GPU has cached the contents of the buffer, so the value it sees is still zero.

Does the hardware or the driver maintain cache coherence?

If not, can I flush the GPU cache?

Partial code:

main thread

 // Create a command queue
  cl_command_queue command_queue = clCreateCommandQueue(context, device_id,
                                                        CL_QUEUE_PROFILING_ENABLE ,

  cl_command_queue memop_queue = clCreateCommandQueue(context, device_id,
                                                        CL_QUEUE_PROFILING_ENABLE ,
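  // One 64-bit flag (cl_ulong matches the kernel's unsigned long); zero it so the kernel starts out waiting
  cl_ulong *buf_flag = malloc(sizeof(cl_ulong));
  *buf_flag = 0;

  cl_mem flags = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR, sizeof(cl_ulong), buf_flag, &ret);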
  ret = clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&flags);
  ret = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL,
                         &global_item_size, &local_item_size, 0, NULL,
     printf("kernel ack

Kernel part

__kernel void test_host(__global unsigned long *flags) {


   do something


This is not valid in OpenCL 1.x: the host and the device cannot access the same buffer at the same time. You must call clEnqueueMapBuffer to get CPU access and clEnqueueUnmapMemObject to hand access back to the device.

OpenCL 2.0 adds shared virtual memory (SVM), which does what you want, but there are no shipping implementations yet.

Some vendors do support on-demand device reads and writes of host memory over PCIe in OpenCL 1.x (enabled through vendor-specific flags at buffer creation); perhaps those would do what you want.