One context, multiple devices... what about output buffers?


I now have access to multiple GPU devices, and my algorithm can easily be scaled across several devices. However, after reading the docs and several web discussions, I’m still confused about how output buffers work when a single context is shared by several GPU devices.

My current code is for one device only, so I’m trying to figure out how to modify it. It uses a pinned output buffer because I found a noticeable performance gain compared to a normal buffer.

As I said, the algorithm can be easily scaled, but… if I’m understanding it correctly, OpenCL assumes that a buffer in a context is exactly the same buffer for all devices, and, moreover, the OpenCL implementation is even allowed to track which device holds the most up-to-date version of the buffer… this complicates things in a way my brain cannot hold easily.

I’m happy that all input buffers are shared for all devices, so the single context scenario fits wonderfully.

However, the output buffer is a problem… I need a different output buffer per device. And with pinned memory. Can I get a different output buffer for each device?

Otherwise, I feel lost… how am I supposed to take advantage of several devices if their different results are going to be lost because the output buffer is the same for all of them?

Create a separate output buffer for each device and merge them afterwards (don’t forget to create the input buffers as read-only so the runtime isn’t confused about which device holds the latest copy). If the output buffer is the final result of the computation, merging does not necessarily mean copying. If it’s not, make sure the memory transfer time doesn’t outweigh the gain in compute time.
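To make "one output buffer per device, merged afterwards" concrete, here is a minimal sketch of how the result could be divided. It assumes the output is a flat array split into contiguous slices and that the devices are similar enough for an even split; neither is stated in the thread. `DeviceSlice` and `slice_for_device` are hypothetical names, not part of OpenCL.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper type: the contiguous part of the final result
 * that one device is responsible for. */
typedef struct {
    size_t offset; /* index of the slice's first element in the merged result */
    size_t count;  /* number of elements this device produces */
} DeviceSlice;

/* Split `total` output elements across `device_count` devices as evenly
 * as possible. The first (total % device_count) devices get one extra
 * element, so the slices cover the whole range without overlapping. */
static DeviceSlice slice_for_device(size_t total, size_t device_count,
                                    size_t device_index)
{
    assert(device_count > 0 && device_index < device_count);

    size_t base  = total / device_count;
    size_t extra = total % device_count;

    DeviceSlice s;
    s.count  = base + (device_index < extra ? 1 : 0);
    s.offset = device_index * base
             + (device_index < extra ? device_index : extra);
    return s;
}
```

Each device’s output buffer would then hold `count` elements, and its contents belong at `offset` in the final array. With 10 elements and 3 devices this gives slices of 4, 3 and 3 elements starting at 0, 4 and 7.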

I guess this requires creating a different kernel (with the same source code, but a different kernel object) for each device, doesn’t it? Otherwise, I don’t see how I could assign a different output buffer as the kernel argument on each device, because kernel arguments are assigned per-kernel rather than per-device.

I believe the same kernel can be enqueued on every device that its program was built for. You create the kernel, set its arguments, and enqueue it on device 1’s queue. Then you set new arguments and enqueue it on device 2’s queue. But if your solution is easier to fit into your code, you might as well go for it.

Thanks a lot. I didn’t know it was allowed to call clSetKernelArg() in that way; I thought changing the arguments could affect all queues that still have non-flushed kernel executions on them.
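The reason this works, as far as I understand the specification, is that the argument values are captured when the kernel is enqueued, so a later clSetKernelArg() only affects later enqueues. The sketch below is a toy model of that behaviour in plain C, not OpenCL code: `ToyKernel`, `ToyQueue` and the `toy_*` functions are made-up stand-ins for `cl_kernel`, `cl_command_queue`, clSetKernelArg() and clEnqueueNDRangeKernel().

```c
#include <assert.h>

#define MAX_COMMANDS 8

/* Toy stand-ins, NOT OpenCL types. */
typedef struct {
    int output_buffer_id;               /* current value of the "output" argument */
} ToyKernel;

typedef struct {
    int recorded_output[MAX_COMMANDS];  /* argument value captured per command */
    int count;                          /* number of commands enqueued so far */
} ToyQueue;

/* Plays the role of clSetKernelArg: changes the value that later
 * enqueues will use. Commands already enqueued are not touched. */
static void toy_set_output(ToyKernel *k, int buffer_id)
{
    k->output_buffer_id = buffer_id;
}

/* Plays the role of clEnqueueNDRangeKernel: copies the current
 * argument value into the command being recorded. */
static void toy_enqueue(ToyQueue *q, const ToyKernel *k)
{
    assert(q->count < MAX_COMMANDS);
    q->recorded_output[q->count++] = k->output_buffer_id;
}

/* One shared kernel alternating between two device queues,
 * as described in the post above. */
static void dispatch_alternating(ToyKernel *k, ToyQueue *q1, ToyQueue *q2,
                                 int runs)
{
    for (int i = 0; i < runs; ++i) {
        toy_set_output(k, 1);   /* output buffer #1 for device #1 */
        toy_enqueue(q1, k);
        toy_set_output(k, 2);   /* output buffer #2 for device #2 */
        toy_enqueue(q2, k);
    }
}
```

After `dispatch_alternating`, every command in the first queue has recorded buffer #1 and every command in the second has recorded buffer #2, even though the kernel’s argument kept changing in between. One caveat with the real API: setting arguments on the same kernel object from several host threads at once is not safe, so this pattern assumes a single dispatching thread.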

Anyway, in my case, each device needs to execute the kernel several times, and I take advantage of only changing the arguments whose value is actually different across runs, in order to avoid unnecessary I/O overhead. So, if I enqueue the kernel for device #1, setting the output to buffer #1, then enqueue it for device #2, setting the output to buffer #2, then enqueue the second run for device #1 setting the output again to buffer #1, I’m afraid there could be some overhead for re-setting the buffer #1.

In other words, I think I’m going to create a different kernel for each device, so I don’t need to re-set the output buffer in each run.