OpenCL Dual Copy Engines

Before I waste more time trying to figure this out, I had a quick question.

In OpenCL, is it possible to pass input data, execute a kernel, and read output data back out at the same time?
There is some start up time required (2 input data portions as well as 1 kernel execution) before they can all operate at the same time.
I’m currently trying to implement this using 3 queues.

I'm using a Quadro K5100M.

If you can give me any insight on whether this is possible or not, I would greatly appreciate it.
If you know whether it is possible in CUDA, that would be nice to know as well.

Thank you,


Sure, this is one of the use cases, but NVIDIA used to serialize commands submitted from different command queues. I don’t know if this has been fixed by now.

So I went ahead and tried it, and I was able to get it to work.
I’m running into some other issues regarding mapping/unmapping buffers but I’m sure I’ll get those resolved.

The method that I went with is inputting, running, and outputting one data set on one queue and just copying that process over multiple queues.

A simple text based description.

The other method I attempted where input is on queue 0, kernels are on queue 1, and output is on queue 2 has yet to work.
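To make the first method (one data set per in-order queue, replicated across several queues) easier to reason about, here is a rough timing model in Python. It is only a sketch and makes no OpenCL calls. It assumes one host-to-device copy engine, one compute engine, and one device-to-host copy engine, each handling one command at a time, and equal stage durations by default. The function name `pipeline_makespan` and its parameters are made up for illustration.

```python
def pipeline_makespan(num_sets, num_queues, t_in=1.0, t_kernel=1.0, t_out=1.0):
    """Toy model: time at which the last output transfer finishes.

    Each data set goes through input -> kernel -> output on one in-order queue.
    Data sets are assigned to queues round-robin. Assumed hardware: one engine
    per stage type, each busy with at most one command at a time.
    """
    stages = (("h2d", t_in), ("compute", t_kernel), ("d2h", t_out))
    engine_free = {name: 0.0 for name, _ in stages}  # when each engine is idle again
    queue_free = [0.0] * num_queues                  # when each queue's last command ends
    finish = 0.0
    for i in range(num_sets):
        q = i % num_queues
        for name, duration in stages:
            # A command starts once its queue (in-order) and its engine are both free.
            start = max(queue_free[q], engine_free[name])
            end = start + duration
            queue_free[q] = end
            engine_free[name] = end
        finish = max(finish, queue_free[q])
    return finish


print(pipeline_makespan(6, 1))  # 18.0: a single in-order queue serializes everything
print(pipeline_makespan(6, 3))  # 8.0: three queues keep all three engines busy
```

In this model with three queues, input, kernel, and output first run simultaneously at t = 2, after two inputs and one kernel have completed, which matches the startup cost described in the original question. Real drivers may schedule differently, so treat the numbers as an upper bound on the benefit.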

You can use queue 1 for both input and output, by the way. You only have one PCI-E, don’t you?

Yes I only have 1 PCI-E.

But I want to overlap input/compute/output.
Using 1 queue for both input and output wouldn’t allow me to overlap input and output data transfer due to the way In-Order Queues work. (AFAIK)

Also an interesting note, when the input and output are transferring at the same time, there is a slight slowdown in transfer speed.
Generally, I get 11.5 Gbps when only one transfer is occurring, but when two are occurring at the same time, I get speeds of roughly 10.5 Gbps (input) and 9.5 Gbps (output).
Overall, a speedup still exists but not quite 2X.
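As a quick check on the "not quite 2X" figure, the aggregate transfer rate can be compared against the single-direction rate. This is a back-of-the-envelope sketch that assumes the quoted rates are steady-state and that both directions stay busy the whole time; the function name `overlap_speedup` is hypothetical.

```python
def overlap_speedup(single_rate, input_rate, output_rate):
    """Ratio of combined bidirectional throughput to one-direction throughput.

    All rates must use the same unit (the unit itself cancels out).
    """
    return (input_rate + output_rate) / single_rate


# Rates quoted above: 11.5 alone, 10.5 (input) and 9.5 (output) when overlapped.
speedup = overlap_speedup(11.5, 10.5, 9.5)
print(round(speedup, 2))  # 1.74
```

So with these numbers the overlapped transfers move about 1.74 times as much data per unit time as a single transfer, rather than the ideal 2 times.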

I have an open source project where you can try driver-controlled pipelining and event-controlled pipelining for separable kernels (both can upload, download, and compute at the same time for all stages, per device), as well as device-to-device pipelining for non-separable kernels. The device-to-device version currently only overlaps host transitions with device computes (computes are serial with PCI-E movements); I will upgrade it later so that it overlaps everything, including PCI-E.

The driver-controlled one uses 16 queues, so you can try at least 16 blobs overlapped at different stages (read, write, compute).

The event-controlled one uses 6 queues (a read queue, a write queue, and a compute queue, with all of these duplicated).

The device-to-device pipeline uses only 1 queue per device but overlaps array copies between devices (through RAM, for compatibility) with computation on all devices.

It's in C#, though. I was getting speedups with an HD7870 but not with an R7-240. Low-end cards are not given the ability to overlap computes with movements.