I suppose most people are aware of this problem: when running an OpenCL program on multiple NVIDIA GPUs, creating a single context with multiple command queues (one queue per device) serializes execution across the devices. The only workaround I have found is to create multiple contexts, one per device, each with its own queue and program, and drive them from parallel host threads.
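For anyone who hasn't tried it, here is a rough sketch of that workaround (untested on real hardware; the kernel build and buffer setup are omitted, and the fixed-size device arrays are just for illustration). Each host thread owns a private context and queue for its device, which is what lets NVIDIA's runtime execute them concurrently:

```c
/* Workaround sketch: one OpenCL context per NVIDIA GPU, one host
 * thread per context, so kernels on different devices can overlap. */
#include <CL/cl.h>
#include <pthread.h>
#include <stdio.h>

typedef struct {
    cl_device_id device;
} worker_arg;

static void *run_on_device(void *p) {
    worker_arg *w = (worker_arg *)p;
    cl_int err;

    /* One context and one queue per device -- nothing shared between
     * threads, which is what forces the buffer duplication overhead. */
    cl_context ctx = clCreateContext(NULL, 1, &w->device, NULL, NULL, &err);
    cl_command_queue q = clCreateCommandQueue(ctx, w->device, 0, &err);

    /* ... build the program, create this context's own copies of the
     * buffers, set kernel args, and enqueue the kernel here ... */

    clFinish(q);
    clReleaseCommandQueue(q);
    clReleaseContext(ctx);
    return NULL;
}

int main(void) {
    cl_platform_id platform;
    cl_uint ndev = 0;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 0, NULL, &ndev);
    if (ndev > 16) ndev = 16;

    cl_device_id devices[16];
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, ndev, devices, NULL);

    pthread_t threads[16];
    worker_arg args[16];
    for (cl_uint i = 0; i < ndev; ++i) {
        args[i].device = devices[i];
        pthread_create(&threads[i], NULL, run_on_device, &args[i]);
    }
    for (cl_uint i = 0; i < ndev; ++i)
        pthread_join(threads[i], NULL);
    return 0;
}
```

Note that `clCreateCommandQueue` is the pre-2.0 entry point, which is fine on NVIDIA's OpenCL 1.2 implementation.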
For example, here is an earlier report of this issue:
I just tested with a newer NVIDIA driver (418.56) on 2x Titan V GPUs, and I still see the same behavior.
In comparison, AMD's and Intel's OpenCL implementations allow concurrent execution on multiple queues under a single context.
I would like to check with this forum: is there a way, as of 2020, to run a single kernel concurrently across multiple NVIDIA GPUs without creating multiple contexts? Duplicating memory buffers across contexts carries a huge overhead.