Slaving GPUs...

Is it possible in OpenCL to communicate GPU to GPU over PCIe? I understand that the CPU runs the “host” that does all of the cleanup and some computation, and the GPUs are used almost as co-processors with their own “localities”, but is that as far as it can go?

I know that I run a Crossfire setup, where the GPUs are linked together, but could it be better to NOT run them in Crossfire, and instead run a “sub-host” on one GPU and let it do some “local hosting” to offload/speed up some of the CPU-to-GPU work? Maybe dual-GPU cards could have one “master” and one “slave” chip, determined by the OpenCL app, running different threads, like all of its read/write-to-global duties, almost like a GPU cacher.

Or, maybe I just drank a few too many? LOL :shock:

You can use the command ‘clEnqueueCopyBuffer’ to copy between two different devices of the same context; however, I assume that current OpenCL implementations copy the buffer via system memory. OpenCL does not explicitly expose peer-to-peer PCIe memory transfer functionality.
Most hardware vendors refrain from direct memory transfers between GPUs because not all chipsets support peer-to-peer functionality. Additionally, it requires exposing the resource over the PCIe BAR, which is stressed as it is.
If you have NVIDIA hardware you can hide the overhead of memory transfers using their concurrency paradigm (i.e., perform memory transfers on one command queue concurrently with kernel execution on a different command queue).

Does anyone know if ATI has implemented similar functionality?

Are you sure this works? I once tried to overlap data transfer and kernel execution on an NVIDIA GPU, but without success (I used events to see when the commands were actually executed).
And if it works, can’t you simply use an out-of-order queue instead of two queues?

Yes, I am sure it works. I managed to implement a concurrently executing application.
There are a few important implementation details:
1.) Create a system memory buffer using ‘clCreateBuffer’ with the ‘CL_MEM_ALLOC_HOST_PTR’ flag.
2.) Map the system memory buffer with ‘clEnqueueMapBuffer’.
3.) Write to or read from the device buffers using ‘clEnqueueReadBuffer’/‘clEnqueueWriteBuffer’ (use the pointer from stage 2).

‘clEnqueueCopyBuffer’ will not execute concurrently.


Tzachi Cohen

I’m a bit confused here. If you use CL_MEM_ALLOC_HOST_PTR you are forcing the device to keep a backing of the memory on the host, which would appear to defeat the point of doing a device-to-device copy in the first place. Also, if you map a buffer and then call enqueue read or write before unmapping it, you’d appear to be in the “undefined behavior” category of the spec. (You have to unmap before doing other operations on the buffer.) Perhaps I’m missing something in your explanation?

My understanding is that it is up to the driver vendor to implement things like clEnqueueCopy as efficiently as they can. I would hope that if Nvidia’s cards support the PCIe DMA they would use it, and otherwise revert to the CPU-based copy.

I did not try to perform peer-to-peer copies. It is my understanding that with OpenCL, GPU-to-GPU transfers can only be performed via system memory.
My objective is to minimize memory transfer times.

While it is illegal to launch a kernel that uses a mapped memory object, you can do the following:

1.) Map resource A.
2.) Call ‘clEnqueueWriteBuffer’ for resource B with resource A’s mapping pointer.
3.) Unmap resource A.
4.) Use resource B with some kernel.

The cool trick is that stages 1-3 can be performed concurrently with the execution of a different, unrelated kernel on the same GPU, so I am completely hiding the transfer times.
If I naively map resource B -> write data to the buffer -> unmap it -> use it with a kernel, then the memory transfer will not be asynchronous with respect to the GPU.

From the tests I performed, I concluded that ‘clEnqueueCopyBuffer’ isn’t concurrent. I don’t know why.

Sounds like this is a performance bug in the OpenCL library. The copy should be async if the hardware supports it. If you copy through system memory you are forcing a GPU->CPU->GPU transfer, when you could (if the hardware and software support it) get GPU->GPU via ‘clEnqueueCopyBuffer’. Can you file a bug with the vendor?

The OpenCL standard specifies that every ‘Enqueue’ command should execute asynchronously with respect to the CPU (i.e., the function call returns control before the command is actually executed). However, the spec (to the best of my knowledge) does not require any asynchronous execution on the GPU itself (i.e., the GPU does not have to be able to process more than one command at the same time).

The ability to process two commands at the same time is (to the best of my knowledge) a unique NVIDIA feature, which gives their products added value (and I am not an NVIDIA groupie).

I did not try to prove that ‘clEnqueueCopyBuffer’ executes via system memory when multiple GPUs are involved, but I am willing to bet $50 on it.