clEnqueueReadBuffer is always blocking

clEnqueueReadBuffer( myQueue, myMem, CL_FALSE, 0, size, hostBuffer, 0, NULL, NULL );

I can’t get this call to be non-blocking…
In theory, since blocking is CL_FALSE, it should return immediately, but I can see that if I increase the value of size, the measured time of clEnqueueReadBuffer increases too; the call time depends on the size.

I have done a lot of tests to check this: with timers, varying the size, and passing a non-NULL cl_event as the last parameter. The printed contents of hostBuffer are always correct, even if I immediately start copying hostBuffer to another buffer in what should, in theory, be an asynchronous window. I get the correct result because clEnqueueReadBuffer apparently does not return until execution has completely finished. That is the behaviour I would expect if I used CL_TRUE for blocking, but that is not the case.
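For reference, here is a minimal sketch of the kind of timing test described above (my own illustration, not the original code; error checking is omitted, and queue, mem, size and hostBuffer are assumed to be set up elsewhere). If the call were truly non-blocking, the measured time should be tiny and independent of size:

================

#include <time.h>
#include <CL/cl.h>

/* Time a supposedly non-blocking read. On the affected driver the
   elapsed time grows with `size`, as if the call were blocking. */
double timeNonBlockingRead( cl_command_queue queue, cl_mem mem,
                            size_t size, void *hostBuffer,
                            cl_event *readEvent )
{
    struct timespec t0, t1;
    clock_gettime( CLOCK_MONOTONIC, &t0 );
    clEnqueueReadBuffer( queue, mem, CL_FALSE, 0, size,
                         hostBuffer, 0, NULL, readEvent );
    clock_gettime( CLOCK_MONOTONIC, &t1 );
    return ( t1.tv_sec - t0.tv_sec ) + ( t1.tv_nsec - t0.tv_nsec ) * 1e-9;
}

================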

I even reinstalled a couple of different NVIDIA drivers, with the same luck…
Is there any OpenCL initialization or configuration I must set correctly to allow it to work in a non-blocking way?
What am I missing?

I am having the same issue with clEnqueueWriteBuffer: even for non-blocking calls, the execution time increases with the block size. clEnqueueReadBuffer, however, seems to be working fine in my case.

Here are more details on the problem, with an example:

================

cl_event eventKernel;
cl_event eventRead;

clEnqueueNDRangeKernel( queue, kernel, 1, …, 0, NULL, &eventKernel );
printEventStatus( eventKernel );

clEnqueueReadBuffer( queue, mem, CL_FALSE, …, 0, NULL, &eventRead );
printEventStatus( eventRead );
printEventStatus( eventKernel );

================

The output of this code is:
“event kernel queued” <- ok
“event readbuffer complete” <- ???
“event kernel complete”

I’ve tried transfers of up to 5 megabytes with the same result. clEnqueueReadBuffer always blocks execution.

Is it a bug? Is it my fault? Am I missing something? Is it the expected behavior?
Can anyone shed some light on this?

Hi Asgard,

I have a silly question. Please do not mind if it is too silly.

In your code you only enqueue commands; at what point do you start issuing them?

Why don’t you need clFlush(), clFinish(), or clWaitForEvents() here? I am confused because the code gives the correct answer…

Thanks!

Best regards,
Mingcheng Chen
May 18th, 2012

Hi linyufly.

OK, maybe I have not explained the problem correctly. My complaint is not about the correctness of the result, but about concurrency: the call interrupts my CPU thread code.
Let me explain in detail…

clEnqueueNDRangeKernel() <- this method enqueues a kernel to be executed
clEnqueueReadBuffer() <- this method enqueues a copy operation from GPU to CPU ( device to host )
clFlush() <- this method issues all commands in the queue
clFinish() <- this method issues all commands in the queue AND blocks the CPU until they have completely executed ( synchronization point )

So it’s important to mention that GPU execution of the enqueued operations will begin when one of the following happens (see the sketch after this list):

a ) clFlush() is called… you force OpenCL to start executing the previously enqueued operations
b ) Some operations have a <blocking> parameter, like clEnqueueReadBuffer(); passing CL_TRUE for that parameter is like calling clFlush() too
c ) Other operations flush implicitly, like clWaitForEvents()
d ) clFinish() is called… you force OpenCL to start executing the previously enqueued operations AND it blocks your CPU thread until those executions are complete
e ) The driver decides to flush/finish automatically when certain circumstances are met ( enough kernels or operations enqueued, or other internal reasons, probably depending on your GPU specifications, number of cores, memory, etc. )
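
To illustrate, here is a minimal sketch of the asynchronous pattern that points a ) and c ) describe (my own example, under these assumptions: queue, kernel, mem, size, hostBuffer and globalSize already exist, and doOtherCpuWork() is a hypothetical placeholder for independent CPU work):

================

cl_event eventKernel, eventRead;

/* Enqueue the kernel, then a non-blocking read that waits on it
   via the event list; nothing has necessarily started executing yet. */
clEnqueueNDRangeKernel( queue, kernel, 1, NULL, &globalSize, NULL,
                        0, NULL, &eventKernel );
clEnqueueReadBuffer( queue, mem, CL_FALSE, 0, size, hostBuffer,
                     1, &eventKernel, &eventRead );

clFlush( queue );   /* case a ): submit the queued commands to the device */

doOtherCpuWork();   /* overlap CPU work with the GPU */

clWaitForEvents( 1, &eventRead );   /* case c ): hostBuffer is now safe to read */

================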

It’s important to mention the following method too:

clGetEventInfo() returns one of the following execution statuses:

CL_QUEUED (command has been enqueued in the command-queue),
CL_SUBMITTED (enqueued command has been submitted by the host to the device associated with the command-queue),
CL_RUNNING (device is currently executing this command),
CL_COMPLETE (the command has completed)
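
( For reference, a helper like the printEventStatus() used in the code below could look like this minimal sketch; this is my own guess at such a helper, not the author’s actual code: )

================

#include <stdio.h>
#include <CL/cl.h>

/* Query and print the current execution status of an event
   ( debugging helper; error checking omitted ). */
void printEventStatus( cl_event ev )
{
    cl_int status;
    clGetEventInfo( ev, CL_EVENT_COMMAND_EXECUTION_STATUS,
                    sizeof( status ), &status, NULL );
    switch ( status ) {
        case CL_QUEUED:    printf( "CL_QUEUED\n" );    break;
        case CL_SUBMITTED: printf( "CL_SUBMITTED\n" ); break;
        case CL_RUNNING:   printf( "CL_RUNNING\n" );   break;
        case CL_COMPLETE:  printf( "CL_COMPLETE\n" );  break;
        default:           printf( "error %d\n", status );
    }
}

================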

Now, back to my code:

================================================================

cl_event eventKernel;
cl_event eventRead;

clEnqueueNDRangeKernel( queue, kernel, 1, …, 0, NULL, &eventKernel );
printEventStatus( eventKernel );

clEnqueueReadBuffer( queue, mem, CL_FALSE, …, 0, NULL, &eventRead );
printEventStatus( eventRead );
printEventStatus( eventKernel );
( NOTE: printEventStatus() is a helper method I wrote myself that queries clGetEventInfo() and prints the status, for debugging purposes. )

================================================================

Explanation of the code:

I call clEnqueueNDRangeKernel(), which enqueues a kernel.
After that I query the generated event with clGetEventInfo(), and it reports CL_QUEUED, as expected.
This means the kernel has been queued, but not yet issued or executed.

Now I call clEnqueueReadBuffer() with the blocking parameter set to CL_FALSE.

According to http://www.khronos.org/registry/cl/sdk/ … uffer.html
"If blocking_read is CL_TRUE i.e. the read command is blocking, clEnqueueReadBuffer does not return until the buffer data has been read and copied into memory pointed to by ptr.
“If blocking_read is CL_FALSE i.e. the read command is non-blocking, clEnqueueReadBuffer queues a non-blocking read command and returns.”

What do I expect when I query the event generated by this call? I expect CL_QUEUED.

But what I find is CL_COMPLETE!
It means clEnqueueReadBuffer() is ignoring my CL_FALSE.

So briefly:

clEnqueueReadBuffer blocks my CPU thread. It:

flushes all previously queued operations,
starts issuing my kernel (in this case) and executes it on the GPU,
executes the read and transfers the data from GPU to CPU,
sets the events to the COMPLETE status…
and finally:

returns control to my CPU thread, which has been stopped all this time, breaking concurrency.

That’s my complaint. As you can see, I don’t even need to call clFlush to issue the commands, because clEnqueueReadBuffer does it. And according to the Khronos documentation it should not happen this way.

Hi Asgard,

Thank you very much for your detailed explanation!

I have observed the same result as you have.

Thanks again!

Best regards,
Mingcheng Chen
May 19th, 2012

Just to note that it works correctly with the Intel OpenCL SDK
[OpenCL CPU Runtime version - 2.0.0.31360]

The clEnqueueReadBuffer() method does not block my CPU execution when I use CL_FALSE, as I expected.

But whenever I install an NVIDIA driver and try it, it produces the unexpected behaviour.
On the other hand, I have no idea what happens on AMD; I can’t check it with my NVIDIA GPU.
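
For what it’s worth, one explanation often given for NVIDIA’s behaviour is that non-blocking transfers involving ordinary pageable host memory may be performed synchronously, while pinned host memory allows truly asynchronous DMA. Here is a sketch of allocating a pinned host buffer as a possible workaround (an assumption on my part, not a confirmed fix; context, queue and size are assumed to exist):

================

cl_int err;

/* Ask the driver for host-accessible ( typically pinned ) memory... */
cl_mem pinned = clCreateBuffer( context,
                                CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                                size, NULL, &err );

/* ...and map it to obtain a host pointer to use as the destination
   of non-blocking clEnqueueReadBuffer calls. */
void *hostPtr = clEnqueueMapBuffer( queue, pinned, CL_TRUE,
                                    CL_MAP_READ | CL_MAP_WRITE,
                                    0, size, 0, NULL, NULL, &err );

/* Unmap with clEnqueueUnmapMemObject( queue, pinned, hostPtr, ... )
   when finished. */

================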

linyufly, what platform have you used to check it? NVIDIA/AMD/Intel/…