I suppose most people are aware of this problem: when running an OpenCL program on multiple NVIDIA GPUs, creating a single context with multiple queues (one queue per device) serializes the execution. The only workaround I have found is to create multiple contexts, one per device, with one queue/program per context, and drive them from parallel host threads.
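To make the workaround concrete, here is a minimal sketch of that multi-context setup (not my actual code; error checking omitted, and the per-context program/buffer creation is only indicated in comments):

```c
/* Sketch: the multi-context workaround -- one context + one queue per
 * device, each later driven by its own host thread. */
#include <CL/cl.h>

#define MAX_DEV 8

int main(void) {
    cl_platform_id platform;
    cl_device_id dev[MAX_DEV];
    cl_uint ndev;

    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, MAX_DEV, dev, &ndev);

    cl_context ctx[MAX_DEV];
    cl_command_queue q[MAX_DEV];
    for (cl_uint i = 0; i < ndev; i++) {
        /* one context per device -- this is what forces every buffer and
         * program to be duplicated once per context */
        ctx[i] = clCreateContext(NULL, 1, &dev[i], NULL, NULL, NULL);
        q[i]   = clCreateCommandQueue(ctx[i], dev[i], 0, NULL);
        /* ... clCreateProgramWithSource/clCreateBuffer per context,
         * then launch from one host thread per device ... */
    }
    return 0;
}
```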
For example, here is an earlier report of this issue:
I just tested on a newer NVIDIA driver (418.56) with 2x Titan V GPUs, and I still see the same behavior.
In comparison, AMD's and Intel's OpenCL implementations allow concurrent execution on multiple queues under the same context.
I would like to check with this forum whether there is a solution, as of 2020, for running a single kernel on multiple NVIDIA GPUs concurrently without creating multiple contexts. There are huge overheads associated with duplicating memory buffers across multiple contexts.
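For comparison, this is a sketch of the single-context layout I would like to use (placeholder code, error checking omitted): one context spanning all devices, one queue per device, so buffers created in the context can in principle be shared instead of duplicated.

```c
/* Sketch: desired single-context layout -- one context over all GPUs,
 * one command queue per device. */
#include <CL/cl.h>

#define MAX_DEV 8

int main(void) {
    cl_platform_id platform;
    cl_device_id dev[MAX_DEV];
    cl_uint ndev;

    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, MAX_DEV, dev, &ndev);

    /* a single context spanning all devices */
    cl_context ctx = clCreateContext(NULL, ndev, dev, NULL, NULL, NULL);

    cl_command_queue q[MAX_DEV];
    for (cl_uint i = 0; i < ndev; i++)
        q[i] = clCreateCommandQueue(ctx, dev[i], 0, NULL);

    /* buffers/programs created against ctx are visible to every queue;
     * this is the setup that NVIDIA's driver serializes in my tests */
    return 0;
}
```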
This is strange - I just tried the "OpenCL Simple Multi-GPU" example in the NVIDIA OpenCL SDK:
In the sample code, the kernel appears to execute in parallel on multiple GPUs, even though there is only a single context. But my similarly structured code is serialized for some reason.
Here is a comparison between the launching + waiting part of the SDK example and my code.
Does anyone see a major difference that prevents my kernel from running in parallel?
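For reference, the launch + wait pattern in both versions roughly follows the shape below (a sketch, not the actual SDK or mcxcl code; `queue`, `kern`, `ndev`, and the work sizes are placeholders). The key property is that every kernel is enqueued and flushed before any wait is issued, so the driver has the chance to overlap execution across devices:

```c
/* Sketch: enqueue all kernels first, flush each queue, wait only at the
 * end -- the standard pattern for overlapping multi-GPU execution. */
cl_event ev[MAX_DEV];
size_t gsize = 1 << 20, lsize = 64;  /* placeholder work sizes */

for (cl_uint i = 0; i < ndev; i++) {
    clEnqueueNDRangeKernel(queue[i], kern[i], 1, NULL,
                           &gsize, &lsize, 0, NULL, &ev[i]);
    clFlush(queue[i]);       /* push the work to device i immediately */
}
clWaitForEvents(ndev, ev);   /* block only after everything is in flight */
```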
To see that the execution is serialized, you can run
git clone https://github.com/fangq/mcxcl.git
cd src
make clean
make
../bin/mcxcl --bench cube60 -G 1 -n 1e7 # running 1e7 photons using 1st GPU
../bin/mcxcl --bench cube60 -G 11 -n 1e7 # running 1e7 photons using 1st+2nd GPUs
On an NVIDIA system with multiple GPUs, the execution time of the last command is the same as that of the first. I would expect it to be halved if the execution were concurrent.
It turns out that the shared RO_MEM buffers were causing the serialization of the kernels! It was not the fault of the single context, as I had always assumed.
After duplicating those RO_MEM buffers for each device and assigning the duplicated buffer pointers to each kernel (i.e. clSetKernelArg(mcxkernel[i], ..., (void*)(buf+i))), I was able to get concurrent execution on NVIDIA GPUs, with no need for multi-threading or multiple contexts.
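The fix can be sketched as follows (placeholder code: `ctx`, `hostdata`, `sz`, `ndev`, and `mcxkernel` stand in for the real variables, and the argument index is illustrative). Each device gets its own copy of the read-only buffer inside the one shared context, and that per-device copy is bound to that device's kernel instance:

```c
/* Sketch: duplicate each read-only buffer once per device within the
 * single shared context, then bind copy i to kernel instance i. */
cl_mem buf[MAX_DEV];
for (cl_uint i = 0; i < ndev; i++) {
    /* one copy per device -- sharing a single RO buffer across devices
     * was what serialized the kernels on NVIDIA's driver */
    buf[i] = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                            sz, hostdata, NULL);

    /* bind device i's copy to device i's kernel, as in the post:
     * clSetKernelArg(mcxkernel[i], ..., (void*)(buf+i)) */
    clSetKernelArg(mcxkernel[i], 0, sizeof(cl_mem), (void*)(buf + i));
}
```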