Simple question about clFinish()


Simple question related to the code below. It’s not real code and it doesn’t compile; it’s just a simple example.

cl_mem image0;
cl_mem image1;
cl_mem image2;
cl_mem buffer;


// Step1
f(image0, image1, 1);
f(image0, image2, 2);
g(image1, image2, buffer);

// Step2
f(image0, image1, 3);
f(image0, image2, 4);
g(image1, image2, buffer);

// Step3
f(image0, image1, 5);
f(image0, image2, 6);
g(image1, image2, buffer);

We basically have 3 images allocated on the device. The function f(input, output, p) applies a kernel that fills the output with values read from the input, given a parameter p. For instance, f could be a Gaussian smoothing where p would be the variance of the Gaussian kernel.

The function g takes two images as input and a buffer as output. In g, the kernel analyses the two inputs and writes something to the output buffer. For instance, g could detect the local maxima in both inputs.

Because we apply the “algorithm” 3 times here (3 steps) and because we re-use the same memory space at each step (image0 and image1, buffer grows at each step), I was thinking that maybe I should call clFinish() between each step. I’m afraid that if I don’t, step 2 may start before step 1 is finished, which would lead to incorrect behavior of function g in step 1.

What do you think?


I think you should have a look at OpenCL events, set up your task graph with in-order queues and do a blocking read at the end. But you should also read what Intel has to say about explicit synchronization on CPUs.

Do you append data to the buffer, or do you want to write to the same location in the buffer each time?

I append data to the buffer using function g at the end of each step. I’m afraid that the blocking read at the end won’t be enough but I’m not sure, hence my question.

If you set up everything correctly using events, the last blocking read will be sufficient. The blocking read is only needed to get the queued computation started and to wait for its result.

EDIT: Actually, if you use a single, in-order queue, events aren’t even necessary, because the enqueued commands are executed in order of submission and each one waits until all preceding commands are finished.

If I understand what you are saying, if I use a simple command queue (i.e. I don’t specify CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE), then everything I put in the queue will be executed in order, and a command will be executed only when the previous one is finished? If yes, then it’s perfect, I don’t have anything more to do. Can you confirm?

The standard is clear on this (emphasis mine):

If the CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE property of a command-queue is not set, the commands enqueued to a command-queue execute in order. For example, if an application calls clEnqueueNDRangeKernel to execute kernel A followed by a clEnqueueNDRangeKernel to execute kernel B, the application can assume that kernel A finishes first and then kernel B is executed. If the memory objects output by kernel A are inputs to kernel B then kernel B will see the correct data in memory objects produced by execution of kernel A. If the CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE property of a command-queue is set, then there is no guarantee that kernel A will finish before kernel B starts execution.
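
To make the guarantee quoted above concrete for the three-step example from the question, here is a toy model in plain C. It is not OpenCL host code and calls no OpenCL functions: the Device struct, model_f, model_g, run_commands and make_device are all hypothetical names, f is replaced by a simple multiplication by p, and g by appending the sum of both images. It only illustrates why running commands strictly in submission order keeps each g consistent, and what can go wrong when a later f overtakes an earlier g.

```c
#include <stddef.h>

#define PIXELS 3    /* toy image size */
#define NUM_CMDS 9  /* 3 steps x (f, f, g) */

/* Toy stand-in for device memory: three "images" and an append-only buffer. */
typedef struct {
    int image0[PIXELS];
    int image1[PIXELS];
    int image2[PIXELS];
    int buffer[NUM_CMDS];
    size_t buffer_len;
} Device;

typedef enum { CMD_F_TO_IMAGE1, CMD_F_TO_IMAGE2, CMD_G } CmdKind;

typedef struct {
    CmdKind kind;
    int p; /* parameter of f; ignored by g */
} Cmd;

/* The nine commands from the question, in submission order. */
static const Cmd submitted[NUM_CMDS] = {
    {CMD_F_TO_IMAGE1, 1}, {CMD_F_TO_IMAGE2, 2}, {CMD_G, 0}, /* step 1 */
    {CMD_F_TO_IMAGE1, 3}, {CMD_F_TO_IMAGE2, 4}, {CMD_G, 0}, /* step 2 */
    {CMD_F_TO_IMAGE1, 5}, {CMD_F_TO_IMAGE2, 6}, {CMD_G, 0}, /* step 3 */
};

/* Assumed stand-in for f(input, output, p): output[i] = input[i] * p. */
void model_f(const int *input, int *output, int p) {
    for (size_t i = 0; i < PIXELS; ++i) {
        output[i] = input[i] * p;
    }
}

/* Assumed stand-in for g(image1, image2, buffer): append the sum of both images. */
void model_g(Device *d) {
    int sum = 0;
    for (size_t i = 0; i < PIXELS; ++i) {
        sum += d->image1[i] + d->image2[i];
    }
    d->buffer[d->buffer_len++] = sum;
}

/* Run the submitted commands one at a time; order[i] is an index into
 * `submitted`. Passing 0..8 models an in-order queue; any other
 * permutation models a reordering that an out-of-order queue could allow. */
void run_commands(Device *d, const size_t order[NUM_CMDS]) {
    for (size_t i = 0; i < NUM_CMDS; ++i) {
        const Cmd *c = &submitted[order[i]];
        switch (c->kind) {
        case CMD_F_TO_IMAGE1: model_f(d->image0, d->image1, c->p); break;
        case CMD_F_TO_IMAGE2: model_f(d->image0, d->image2, c->p); break;
        case CMD_G:           model_g(d); break;
        }
    }
}

/* Fresh device with image0 = {1, 2, 3}; everything else is zeroed. */
Device make_device(void) {
    Device d = { .image0 = {1, 2, 3} };
    return d;
}
```

With the order 0, 1, 2, ..., 8 the buffer ends up as 18, 42, 66, one entry per step, each computed from that step’s images. With the order 0, 1, 3, 2, 4, 5, 6, 7, 8 (the first f of step 2 runs before the g of step 1) the first entry becomes 30, because g reads an image1 that step 2 has already overwritten. That is the hazard the question worries about, and the one the in-order guarantee rules out.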

Yeap, thank you!