Overlapping data transfers and kernel execution

Hi there,

I thought that by creating two queues for one device, it would be possible to overlap data transfers and kernel execution.
An example:
I transfer the data for the first kernel to the device. Once it’s there, I start the first kernel. In the meantime I transfer the data for the second kernel to the device (to a different part of the device memory, of course).
In theory, it should be possible to overlap the execution of the first kernel with the data transfer for the second kernel, right?

I wrote a small test program and used profiling to see when the commands are executed. But even if the first kernel is running for quite a while, the second data transfer to the device only starts when the kernel execution has finished. Is that a limitation of the hardware? Or of the Nvidia implementation? Or am I missing something?
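For anyone repeating this test, here is a minimal sketch of how two sets of profiling timestamps could be checked for overlap. It is plain C rather than OpenCL host code. The struct cmd_span and the function overlap_ns are hypothetical names made up for this illustration. It assumes the start and end times were read with clGetEventProfilingInfo (CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END) from queues created with CL_QUEUE_PROFILING_ENABLE, which reports them in nanoseconds.

```c
#include <stdint.h>

/* Hypothetical helper type: start and end timestamps of one command in
 * nanoseconds, e.g. as read with clGetEventProfilingInfo using
 * CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END. */
struct cmd_span {
    uint64_t start_ns;
    uint64_t end_ns;
};

/* Returns how many nanoseconds the two commands ran concurrently,
 * or 0 if one of them finished before the other one started. */
uint64_t overlap_ns(struct cmd_span a, struct cmd_span b)
{
    uint64_t latest_start = a.start_ns > b.start_ns ? a.start_ns : b.start_ns;
    uint64_t earliest_end = a.end_ns < b.end_ns ? a.end_ns : b.end_ns;

    return earliest_end > latest_start ? earliest_end - latest_start : 0;
}
```

A result of 0 for the first kernel and the second transfer corresponds to the behaviour described above: the transfer only starts once the kernel has finished.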


I just realized that it’s apparently not possible to issue enqueue commands asynchronously (i.e. non-blocking). A clEnqueueWriteBuffer() call takes the same time with and without blocking. Enqueuing a kernel execution also only seems to return after the kernel has been executed…

I’m using the Nvidia SDK, so I guess it’s a limitation of that. Has anyone else had the same problem?


What “execution ordering” did you specify when you created your command queue (clCreateCommandQueue)? That is, did you set CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE or not?

I didn’t specify anything. I thought that CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE means that commands WITHIN a queue can be reordered. What I want, however, is for commands in DIFFERENT queues to overlap while commands within a single queue are still executed in order.

Queues should be asynchronous. On Mac OS X, at any rate, the non-blocking commands will return virtually instantly while the blocking ones will wait. If Nvidia’s driver is not working that way then they have a serious performance bug. :(

You should certainly be able to get data movement and computation to be scheduled together by using two queues, but whether the runtime will actually overlap them depends entirely on the implementation. A lot of cards have DMA engines that can support this, but I don’t know of any vendors that are actually using this. If you use an out-of-order queue, the runtime should be able to do the same thing.

You’re right. I had the opportunity to run my program on a MacBook, and non-blocking memory commands returned immediately. But with the Nvidia implementation, the blocking parameter seems to be ignored…
However, kernel computation and data transfers weren’t overlapped on the MacBook either, though that may just be because it didn’t have a dedicated graphics card.

What exactly do you mean by “using”? Do you mean in terms of OpenCL or in general? If the cards have DMA engines, then why shouldn’t they be used?

What I mean by not using DMA engines is that the only way that I’m aware of to overlap compute and transfer on current generation cards is to use one of the DMA engines on the card to do the transfer while the kernel is running. There are, unfortunately, a lot of limitations on how these can be used since they were really designed for efficient graphics. I don’t know of any implementations today that use them to allow you to overlap transfers and compute. If the Nvidia OpenCL driver is so broken as to not allow non-blocking commands, they obviously aren’t doing this.

However, you may not need to have separate queues to future-proof your design. If you have an out-of-order command queue, the runtime should be free to optimize the scheduling of the commands as best it can. So your best bet would be to just try to use an out-of-order queue if it’s available, and hope the runtime does the right thing. (I.e., if it doesn’t do the right thing with an out-of-order queue, I doubt having two queues is going to make a difference.)

OK, I see. Thanks a lot for your reply!

I believe part of the issue is that the path of least resistance for overlapping transfers and execution is to create two command queues for a single device and then spawn one CPU thread per queue to issue the commands. Out-of-order command queues, while clever, are much more difficult to program in the host code.

I agree that using two queues with one CPU thread each is probably a safe way because it doesn’t rely on the OpenCL implementation to support stuff like non-blocking reads/writes.
However, I think that compared to CUDA, the OpenCL approach of having queues and therefore avoiding the need for multiple CPU threads (e.g. when using more than one device) is quite elegant and actually makes it easier to write the host code because you don’t have to worry about synchronization etc.

Having separate command queues on separate threads is just as difficult to program as an out-of-order command queue in OpenCL. (Both need the same level of synchronization/dependencies to get correct operation; in one case it is through cl_events, in the other through OS-locks.)

Keep in mind that all of the command queues go to the same device in the runtime, so if it is possible to overlap these at the device level, using an out-of-order queue should do it. (Indeed, if it’s possible then a good runtime should do it regardless of whether the queue is out-of-order as long as it doesn’t have any dependency issues.) If the out-of-order queue doesn’t do this, then there’s no reason to believe separate CPU threads will do it. (They will most likely just re-order in some internal queue in the runtime.)
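To put rough numbers on this argument, here is a toy model in plain C. It is not OpenCL code and says nothing about what any real runtime does. All names (struct command, simulate, has_copy_engine) and all durations are made up for illustration. The depends_on field stands in for an event wait list, and has_copy_engine stands for "the device and runtime are able to run a transfer next to a kernel". Commands are handed to the device in array order, regardless of how many queues or host threads produced them.

```c
#include <stddef.h>

/* Which hardware unit a command needs (hypothetical model). */
enum engine { ENGINE_COPY = 0, ENGINE_COMPUTE = 1 };

struct command {
    enum engine engine;  /* transfer or kernel */
    double duration_ms;  /* assumed run time of the command */
    int depends_on;      /* index of an EARLIER command, or -1 for none */
};

/* Fills finish_ms[i] with the completion time of each command and returns
 * the total time. If has_copy_engine is 0, transfers and kernels share one
 * unit and therefore serialize; otherwise each kind has its own unit. */
double simulate(const struct command *cmds, size_t n,
                int has_copy_engine, double *finish_ms)
{
    double unit_free_ms[2] = {0.0, 0.0};
    double total_ms = 0.0;

    for (size_t i = 0; i < n; ++i) {
        int unit = has_copy_engine ? (int)cmds[i].engine : 0;
        double start_ms = unit_free_ms[unit];

        /* A command cannot start before the command it depends on is done. */
        if (cmds[i].depends_on >= 0 && finish_ms[cmds[i].depends_on] > start_ms)
            start_ms = finish_ms[cmds[i].depends_on];

        finish_ms[i] = start_ms + cmds[i].duration_ms;
        unit_free_ms[unit] = finish_ms[i];
        if (finish_ms[i] > total_ms)
            total_ms = finish_ms[i];
    }
    return total_ms;
}
```

For the example from the first post (write 1, kernel 1 waiting on write 1, write 2, kernel 2 waiting on write 2, with 10 ms per transfer and 40 ms per kernel), this model gives 100 ms without a separate copy engine and 90 ms with one. The dependencies are the same in both cases; only the device's ability to overlap changes the result.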

I would not recommend having multiple command queues unless it simplifies your program, which would imply that they are truly independent.