Questions about synchronization between the host and the kernel

I am using OpenCL 1.2 to develop a program, and I have a problem with synchronization between the host (CPU) and my kernel. In my program, the host does some calculations on a global array, which is shared with the kernel. The kernel should start only after the host finishes its operations on the array and passes it to the kernel, so I use a user event and clSetUserEventStatus() to control the ordering. However, it does not work. My code is below:

//initializing code

a = (cl_float *)clEnqueueMapBuffer(cmdQueue, buffer_a, CL_TRUE,
        CL_MAP_READ | CL_MAP_WRITE, 0, sizeof(cl_float) * N, 0, NULL, NULL, NULL);
err = clSetKernelArg(test, 0, sizeof(cl_mem), &buffer_a);
cl_event userEvt = clCreateUserEvent(context, &err);

for (int x = 0; x < N; x++)
    *(a + x) = x;

clSetUserEventStatus(userEvt, CL_COMPLETE);
err = clEnqueueNDRangeKernel(cmdQueue, test, 2, NULL, globalWorkSize,
        localWorkSize, 1, &userEvt, &fEvt);
clWaitForEvents(1, &fEvt);

//the kernel
const char *test[] = {
    "__kernel void test(__global float *a, __global float *d, __global float *out)\n"
    "{\n"
    //code…
    "}\n"
};

In this program, the values of array “a” loaded into the kernel “test” are wrong. It seems the kernel starts before the host finishes its operations on array “a”. However, if I add a Sleep() after the for loop, the values of array “a” loaded into the kernel are correct. The modified code is below:

//initializing the program

a = (cl_float *)clEnqueueMapBuffer(cmdQueue, buffer_a, CL_TRUE,
        CL_MAP_READ | CL_MAP_WRITE, 0, sizeof(cl_float) * N, 0, NULL, NULL, NULL);
err = clSetKernelArg(test, 0, sizeof(cl_mem), &buffer_a);
cl_event userEvt = clCreateUserEvent(context, &err);

for (int x = 0; x < 8; x++)
    *(a + x) = x;

Sleep(30);

clSetUserEventStatus(userEvt, CL_COMPLETE);
err = clEnqueueNDRangeKernel(cmdQueue, test, 2, NULL, globalWorkSize,
        localWorkSize, 1, &userEvt, &fEvt);
clWaitForEvents(1, &fEvt);

I also tried deleting the clSetUserEventStatus(userEvt, CL_COMPLETE) call, and that leads the program into a deadlock, waiting forever for userEvt to complete so that the test kernel can start and clWaitForEvents() can return. I am confused. It seems the host runs clSetUserEventStatus(userEvt, CL_COMPLETE) before it finishes the for loop, even though the order of the statements does not suggest this. Could anyone please tell me what is wrong with my code?

I was wondering how to synchronize the host and the kernel, and how to force a kernel to start only after a certain point in the host code. I would be grateful if anyone could help me figure this out.
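For reference, this is a minimal sketch of the gating I am trying to achieve, with the same names as above (my understanding of the pattern, not my actual code):

//enqueue the dependent command first, then signal the user event
cl_event userEvt = clCreateUserEvent(context, &err);
cl_event fEvt;

//the kernel is queued now, but cannot start until userEvt completes
err = clEnqueueNDRangeKernel(cmdQueue, test, 2, NULL, globalWorkSize,
        localWorkSize, 1, &userEvt, &fEvt);

for (int x = 0; x < N; x++)        //host work the kernel must wait for
    *(a + x) = x;

clSetUserEventStatus(userEvt, CL_COMPLETE);   //release the kernel
clWaitForEvents(1, &fEvt);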

Many thanks!
Tan

If you are mapping a buffer to initialise values on the host which are then used by a kernel on the device, you must un-map the buffer before enqueueing the kernel that uses it. This is all the synchronisation you need - there is no reason to try to manually synchronise with user events.

The typical code structure would look like this:

host_ptr = clEnqueueMapBuffer(..., buffer, ...);

// initialise values in host_ptr on the host

clEnqueueUnmapMemObject(..., buffer, host_ptr, ...);

clEnqueueNDRangeKernel(...);
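Fleshed out with the names from your post (context, cmdQueue, buffer_a, the kernel test, N and the work sizes are assumed to be set up already; error checking is abbreviated), a minimal sketch might look like this:

cl_int err;

//map for writing, then initialise on the host
cl_float *a = (cl_float *)clEnqueueMapBuffer(cmdQueue, buffer_a, CL_TRUE,
        CL_MAP_WRITE, 0, sizeof(cl_float) * N, 0, NULL, NULL, &err);

for (int x = 0; x < N; x++)
    a[x] = (cl_float)x;

//hand the buffer back to the device before any kernel touches it
err = clEnqueueUnmapMemObject(cmdQueue, buffer_a, a, 0, NULL, NULL);

err = clSetKernelArg(test, 0, sizeof(cl_mem), &buffer_a);
err = clEnqueueNDRangeKernel(cmdQueue, test, 2, NULL, globalWorkSize,
        localWorkSize, 0, NULL, NULL);

//block the host until the kernel has finished
clFinish(cmdQueue);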

Hi jprice, thanks for your help. I tried unmapping the buffer, but I still have some problems.

  1. I tried the code below

host_ptr = clEnqueueMapBuffer(…, buffer, …,sizeof(cl_float4),…);

*host_ptr=…;

clEnqueueUnmapMemObject(…, buffer, host_ptr, …);

clEnqueueNDRangeKernel(…);

This code gives me the right result. However, if I use two large buffers (code below), I get the wrong result again.

host_ptr1 = clEnqueueMapBuffer(…, buffer1, …, sizeof(cl_float)*768*512, …);
host_ptr2 = clEnqueueMapBuffer(…, buffer2, …, sizeof(cl_float)*768*512, …);

for(int x=0;x<768;x++)
    for(int y=0;y<512;y++)
    {
        *(host_ptr1 + x*512 + y) = …;
        *(host_ptr2 + x*512 + y) = …;
    }

clEnqueueUnmapMemObject(…, buffer1, host_ptr1, …);
clEnqueueUnmapMemObject(…, buffer2, host_ptr2, …);

clEnqueueNDRangeKernel(…); //buffer1 and buffer2 both used

In the kernel, the values in the buffers passed in are different every time. But if I add a Sleep(30) before clEnqueueUnmapMemObject(…, buffer1, host_ptr1, …), I get the correct values. I was wondering if there is any other reason that may affect the synchronisation?

  2. My code needs to call a kernel in a loop. The kernel has a parameter that is updated on the host every iteration, so I map and unmap the buffer inside the loop, as below:

host_ptr = clEnqueueMapBuffer(…, buffer, …, sizeof(cl_float4), …, &mEvt, NULL);

for(int x=0;x<100;x++)
{
    *host_ptr = …; //buffer updated with the data calculated by the kernel in the previous iteration

    clEnqueueUnmapMemObject(…, buffer, host_ptr, 1, &mEvt, &umEvt);

    clEnqueueNDRangeKernel(…, 1, &umEvt, &fEvt); //buffer used

    clWaitForEvents(1, &fEvt);

    //some calculations…

    host_ptr = clEnqueueMapBuffer(…, buffer, …, sizeof(cl_float4), …, 1, &fEvt, &mEvt, NULL);
}

In the first iteration of the loop, the buffer passed to the kernel is correct. In the second iteration, the result is sometimes correct and sometimes wrong. The more iterations it runs, the lower the chance of getting the correct result. Again, if a Sleep() is added before clEnqueueUnmapMemObject(…, buffer, host_ptr, 1, &mEvt, &umEvt), the number of correct results increases noticeably.

I am almost going mad over this synchronisation problem. Could you please give me some advice on how to figure it out? Many thanks!

[QUOTE=jprice;37862]If you are mapping a buffer to initialise values on the host which are then used by a kernel on the device, you must un-map the buffer before enqueueing the kernel that uses it. This is all the synchronisation you need - there is no reason to try to manually synchronise with user events.

The typical code structure would look like this:

host_ptr = clEnqueueMapBuffer(..., buffer, ...);

// initialise values in host_ptr on the host

clEnqueueUnmapMemObject(..., buffer, host_ptr, ...);

clEnqueueNDRangeKernel(...);

[/QUOTE]

Do you create a new event on each iteration? They cannot be reused. Also, you might consider using a CPU device for whatever you’re doing on the CPU side instead of ordinary host-side code. Never mind. Also, post your whole loop; it is unclear what exactly you are doing.
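For example, a sketch of how the loop might look with fresh events created in each iteration (names taken from your earlier posts; host_ptr is assumed to be a cl_float4*, and the blocking map removes the need for a separate map event):

host_ptr = (cl_float4 *)clEnqueueMapBuffer(cmdQueue, buffer, CL_TRUE,
        CL_MAP_READ | CL_MAP_WRITE, 0, sizeof(cl_float4), 0, NULL, NULL, NULL);

for (int x = 0; x < 100; x++)
{
    cl_event umEvt, fEvt;          //new events every iteration

    //... update *host_ptr on the host ...

    clEnqueueUnmapMemObject(cmdQueue, buffer, host_ptr, 0, NULL, &umEvt);
    clEnqueueNDRangeKernel(cmdQueue, test, 2, NULL, globalWorkSize,
            localWorkSize, 1, &umEvt, &fEvt);
    clWaitForEvents(1, &fEvt);

    //... some calculations ...

    host_ptr = (cl_float4 *)clEnqueueMapBuffer(cmdQueue, buffer, CL_TRUE,
            CL_MAP_READ | CL_MAP_WRITE, 0, sizeof(cl_float4), 0, NULL, NULL, NULL);

    clReleaseEvent(umEvt);         //don't leak this iteration's events
    clReleaseEvent(fEvt);
}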

Hi Salabar,

You are right, I did not create new events on each iteration. It is solved now that I create new events on each iteration. Thanks for your help.

I’d suggest you get rid of the events altogether, though. Unless you use an out-of-order queue, it doesn’t make sense to use events on a single device with a single queue. Perhaps bring your code here so we can help?
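For example, a minimal sketch of the same loop with no events at all (names from the earlier posts): in an in-order queue the unmap, kernel and map execute in submission order, and the blocking map is what synchronises the host with the device.

host_ptr = (cl_float4 *)clEnqueueMapBuffer(cmdQueue, buffer, CL_TRUE,
        CL_MAP_READ | CL_MAP_WRITE, 0, sizeof(cl_float4), 0, NULL, NULL, NULL);

for (int x = 0; x < 100; x++)
{
    //... update *host_ptr on the host ...

    clEnqueueUnmapMemObject(cmdQueue, buffer, host_ptr, 0, NULL, NULL);
    clEnqueueNDRangeKernel(cmdQueue, test, 2, NULL, globalWorkSize,
            localWorkSize, 0, NULL, NULL);

    //the blocking map returns only after the kernel above has finished
    host_ptr = (cl_float4 *)clEnqueueMapBuffer(cmdQueue, buffer, CL_TRUE,
            CL_MAP_READ | CL_MAP_WRITE, 0, sizeof(cl_float4), 0, NULL, NULL, NULL);
}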

Thanks Salabar. You are right, it is not necessary to use events with this in-order queue. I removed all the events, and the result is still correct. Now I only use events to estimate the kernel execution time.
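For reference, a minimal sketch of timing a kernel with a profiling event; the queue must be created with CL_QUEUE_PROFILING_ENABLE, and profQueue and device are placeholder names:

cl_int err;
cl_command_queue profQueue = clCreateCommandQueue(context, device,
        CL_QUEUE_PROFILING_ENABLE, &err);

cl_event kEvt;
err = clEnqueueNDRangeKernel(profQueue, test, 2, NULL, globalWorkSize,
        localWorkSize, 0, NULL, &kEvt);
clWaitForEvents(1, &kEvt);

cl_ulong start, end;               //device timestamps in nanoseconds
clGetEventProfilingInfo(kEvt, CL_PROFILING_COMMAND_START,
        sizeof(start), &start, NULL);
clGetEventProfilingInfo(kEvt, CL_PROFILING_COMMAND_END,
        sizeof(end), &end, NULL);
printf("kernel time: %.3f ms\n", (end - start) * 1e-6);
clReleaseEvent(kEvt);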