[Solved] Sequential execution on NVIDIA OpenCL using multiple GPUs

I suppose most people are aware of this problem - when running an OpenCL program on multiple NVIDIA GPUs, creating a single context with multiple queues (one queue per device) serializes the execution. The only way I found to get around this is to create multiple contexts, one per device, and one queue/program per context, and run this in multiple parallel threads.

for example, here is an earlier report of this issue:

I just tested on a newer NVIDIA driver (418.56) with 2x Titan Vs, and I still see the same behavior.

In comparison, AMD/Intel’s OpenCL allows concurrent executions on multiple queues under the same context.

I would like to check with this forum to see if there is a solution, as of 2020, for running a single kernel concurrently on multiple NVIDIA GPUs without needing to create multiple contexts. Duplicating memory buffers over multiple contexts adds a huge overhead.

This is strange - I just tried the “OpenCL Simple Multi-GPU” example in the NVIDIA OpenCL SDK:

it appears that in the sample code, the kernel was executed in parallel on multiple GPUs, even though there is only a single context. But my similarly structured code was serialized for some reason :(

here is a comparison between the launching + waiting part of the SDK example and my code

does anyone see a major difference that prevents my kernel from running in parallel?

to see that the execution is serialized, you can run

git clone https://github.com/fangq/mcxcl.git
cd mcxcl/src
make clean
make
../bin/mcxcl --bench cube60 -G 1 -n 1e7  # running 1e7 photons using 1st GPU
../bin/mcxcl --bench cube60 -G 11 -n 1e7  # running 1e7 photons using 1st+2nd GPUs

on an NVIDIA system with multiple GPUs, the execution time of the last command is the same as that of the first one. I would expect it to take half as long if the execution were concurrent.

never mind. mystery solved!

it turns out that the shared RO_MEM buffers had caused the serialization of the kernels! it was not the fault of using a single context, as I had always assumed.

after duplicating those RO_MEM buffers for each device and assigning the duplicated buffer pointers to each kernel (i.e. clSetKernelArg(mcxkernel[i],... (void*)(buf+i)) ), I was able to get concurrent execution on NVIDIA GPUs, with no need for multi-threading/multi-context.