I suppose most people are aware of this problem: when running an OpenCL program on multiple NVIDIA GPUs, creating a single context with multiple queues (one queue per device) serializes kernel execution. The only way I found to get around this is to create multiple contexts, one per device, with one queue/program per context, and run them in multiple parallel host threads.
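In case it helps others, the multi-context workaround I describe above looks roughly like this (a minimal sketch, not my actual code; the identifiers `run_on_device`, `launch_all`, `ndev`, etc. are illustrative, and program build/buffer setup is elided):

```c
/* Sketch of the workaround: one context + one queue per device,
 * driven from parallel host threads. Error checks omitted. */
#include <CL/cl.h>
#include <pthread.h>

typedef struct { cl_device_id dev; } worker_arg;

static void *run_on_device(void *p) {
    cl_device_id dev = ((worker_arg *)p)->dev;
    cl_int err;
    /* each thread gets its OWN context/queue/program/buffers */
    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, &err);
    cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, &err);
    /* ... build program, create buffers, set args, enqueue kernel ... */
    clFinish(q);
    clReleaseCommandQueue(q);
    clReleaseContext(ctx);
    return NULL;
}

/* launch one host thread per GPU so the kernels overlap */
void launch_all(cl_device_id *devices, unsigned ndev) {
    pthread_t th[16];
    worker_arg wa[16];
    for (unsigned i = 0; i < ndev; i++) {
        wa[i].dev = devices[i];
        pthread_create(&th[i], NULL, run_on_device, &wa[i]);
    }
    for (unsigned i = 0; i < ndev; i++)
        pthread_join(th[i], NULL);
}
```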
For example, here is an earlier report of this issue:
I just tested on a newer NVIDIA driver (418.56) with 2x Titan Vs, and I still see the same behavior.
In comparison, AMD's and Intel's OpenCL implementations allow concurrent execution on multiple queues under the same context.
I would like to check with this forum: is there a solution, as of 2020, to run a single kernel concurrently on multiple NVIDIA GPUs without needing to create multiple contexts? There are huge overheads associated with duplicating memory buffers across multiple contexts.
It turns out that the shared read-only (RO_MEM) buffers had caused the serialization of the kernels! It was not the fault of the single context, as I had always thought.
After duplicating those RO_MEM buffers for each device and assigning the duplicated buffer pointers to each kernel (i.e. clSetKernelArg(mcxkernel[i], ..., (void*)(buf+i))), I was able to get concurrent execution on NVIDIA GPUs, with no need for multi-threading or multiple contexts.
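For anyone hitting the same problem, here is a minimal sketch of the fix (assumed names: `ctx` is the single shared context, `queue[i]` and `mcxkernel[i]` are the per-device queues and kernel objects, arg index 0 holds the read-only buffer; error checks omitted):

```c
/* Sketch: duplicate the read-only buffer per device inside ONE context,
 * so the NVIDIA runtime no longer serializes the kernels.
 * Sizes and identifiers are illustrative. */
#include <CL/cl.h>

void launch_concurrent(cl_context ctx, cl_command_queue *queue,
                       cl_kernel *mcxkernel, unsigned ndev,
                       const void *hostdata, size_t datasize,
                       size_t gsize, size_t lsize) {
    cl_int err;
    cl_mem buf[16];
    for (unsigned i = 0; i < ndev; i++) {
        /* one read-only copy per device instead of one shared buffer */
        buf[i] = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                datasize, (void *)hostdata, &err);
        /* point each device's kernel at its own copy */
        clSetKernelArg(mcxkernel[i], 0, sizeof(cl_mem), (void *)(buf + i));
        /* enqueue asynchronously on each device's queue */
        clEnqueueNDRangeKernel(queue[i], mcxkernel[i], 1, NULL,
                               &gsize, &lsize, 0, NULL, NULL);
    }
    /* wait on all queues; kernels now overlap across GPUs */
    for (unsigned i = 0; i < ndev; i++) clFinish(queue[i]);
    for (unsigned i = 0; i < ndev; i++) clReleaseMemObject(buf[i]);
}
```

The key change is that each kernel object receives its own cl_mem copy via clSetKernelArg, rather than all kernels sharing one read-only buffer.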