I have a program (C code + OpenCL API) that performs 3 main operations inside a big for loop:
- send data to the GPU (clEnqueueWriteBuffer),
- process that data with a kernel (clEnqueueNDRangeKernel), and
- send the output back to the CPU (clEnqueueReadBuffer).
The dependency in each iteration is 1->2->3. Iterations do not interfere with each other.
The kernel runs long enough to hide the data transfer times.
In the figure, the top timeline shows everything executed in order.
The bottom timeline shows the overlap I want.
I implemented a double-buffer scheme (2 input buffers and 2 output buffers, used alternately), so that the data transfers for the next kernel execution can happen while the current kernel is running.
That is, I want to perform (1) of the penultimate line while (2) from the previous iteration is executing.
My goal is a total time that depends only on the kernel, with everything else “hidden”.
I use out-of-order execution on a single command queue, for a single GPU device.
I do not explicitly wait on any events.
I just make sure that no operation starts before the operations it depends on have finished. I do that using the events returned by every single operation I enqueue.
I have defined these dependencies:
- within the same iteration: 2 does not start before 1 finishes, and 3 does not start before 2 finishes.
- between iterations two apart (because of the double buffer, the kernel reuses the same buffers every 2 iterations): 1 does not start before the earlier 2 finishes (I can’t write to a buffer the kernel is still using), and 2 does not start before the earlier 3 finishes (I can’t run the kernel if I haven’t retrieved the previous output yet).
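To make the expected timing concrete, here is a minimal sketch in plain C (no OpenCL calls) that models the dependencies listed above. It assumes an idealized device where writes, kernels, and reads each run on their own engine and can overlap freely; whether a real driver behaves this way is exactly what is in question. The function names `serial_time` and `overlapped_time`, the `MAX_ITERS` limit, and the time units are hypothetical, chosen only for illustration.

```c
#include <assert.h>
#include <stdio.h>

#define MAX_ITERS 1024  /* hypothetical upper bound for this sketch */

static double max2(double a, double b) { return a > b ? a : b; }

/* Fully serialized execution: every operation waits for the previous one. */
double serial_time(int n, double t_write, double t_kernel, double t_read)
{
    return n * (t_write + t_kernel + t_read);
}

/*
 * Idealized overlapped execution with double buffering.
 * ASSUMPTION: writes, kernels, and reads each have their own engine,
 * and each engine handles one operation at a time, in iteration order.
 * Iterations i and i-2 share the same buffer pair (slot i % 2).
 */
double overlapped_time(int n, double t_write, double t_kernel, double t_read)
{
    double w_end[MAX_ITERS], k_end[MAX_ITERS], r_end[MAX_ITERS];

    assert(n >= 0 && n <= MAX_ITERS);

    for (int i = 0; i < n; i++) {
        double w_prev  = i >= 1 ? w_end[i - 1] : 0.0; /* write engine free   */
        double k_prev  = i >= 1 ? k_end[i - 1] : 0.0; /* compute engine free */
        double r_prev  = i >= 1 ? r_end[i - 1] : 0.0; /* read engine free    */
        double k_prev2 = i >= 2 ? k_end[i - 2] : 0.0; /* input buffer free   */
        double r_prev2 = i >= 2 ? r_end[i - 2] : 0.0; /* output buffer free  */

        /* 1: write must wait for the kernel that last used this input buffer */
        w_end[i] = max2(w_prev, k_prev2) + t_write;
        /* 2: kernel waits for its write and for the read of the old output */
        k_end[i] = max2(max2(w_end[i], k_prev), r_prev2) + t_kernel;
        /* 3: read waits for its kernel */
        r_end[i] = max2(k_end[i], r_prev) + t_read;
    }
    return n > 0 ? r_end[n - 1] : 0.0;
}
```

For example, with 4 iterations and made-up costs of 2, 10, and 3 time units for write, kernel, and read, `serial_time(4, 2, 10, 3)` gives 60 while `overlapped_time(4, 2, 10, 3)` gives 45, i.e. one write, four kernels, and one read. Profiling numbers close to the first value rather than the second would suggest the operations are being serialized despite the events.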
The problem is that, according to profiling, the loop takes the sum of all the operation times rather than just the kernel times (plus the overhead of the first and last iterations), which means the overlap is not happening.
It should also be noted that each data transfer is between 200 and 500 MB.
So it appears that my event synchronization is correct, but the overlap is not occurring. How would you solve this?
Any ideas? I read on a forum (probably Stack Overflow) that I would need more than one command queue to perform multiple operations at the same time, but the books I’ve read state that data transfers can occur simultaneously on a single queue (using DMA).