I suppose most people are aware of this problem: when running an OpenCL program on multiple NVIDIA GPUs, creating a single context with multiple queues (one queue per device) serializes the execution. The only workaround I have found is to create multiple contexts, one per device, with one queue/program per context, and drive them from parallel host threads.
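To make the workaround concrete, here is a minimal sketch of that multi-context setup (not my actual code; error checking omitted, and the per-context program/buffer creation is only indicated in comments):

```c
/* Sketch: the multi-context workaround -- one context + one queue per
 * device, each later driven by its own host thread. */
#include <CL/cl.h>

#define MAX_DEV 8

int main(void) {
    cl_platform_id platform;
    cl_device_id dev[MAX_DEV];
    cl_uint ndev;

    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, MAX_DEV, dev, &ndev);

    cl_context ctx[MAX_DEV];
    cl_command_queue q[MAX_DEV];
    for (cl_uint i = 0; i < ndev; i++) {
        /* one context per device -- this is what forces every buffer and
         * program to be duplicated once per context */
        ctx[i] = clCreateContext(NULL, 1, &dev[i], NULL, NULL, NULL);
        q[i]   = clCreateCommandQueue(ctx[i], dev[i], 0, NULL);
        /* ... clCreateProgramWithSource/clCreateBuffer per context,
         * then launch from one host thread per device ... */
    }
    return 0;
}
```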
For example, here is an earlier report of this issue:
I just tested on a newer NVIDIA driver (418.56) with 2x Titan V GPUs, and I still see the same behavior.
In comparison, AMD's and Intel's OpenCL implementations allow concurrent execution on multiple queues under the same context.
I would like to check with this forum whether there is a solution, as of 2020, for running a single kernel on multiple NVIDIA GPUs concurrently without creating multiple contexts. There are huge overheads associated with duplicating memory buffers across multiple contexts.
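For comparison, this is a sketch of the single-context layout I would like to use (placeholder code, error checking omitted): one context spanning all devices, one queue per device, so buffers created in the context can in principle be shared instead of duplicated.

```c
/* Sketch: desired single-context layout -- one context over all GPUs,
 * one command queue per device. */
#include <CL/cl.h>

#define MAX_DEV 8

int main(void) {
    cl_platform_id platform;
    cl_device_id dev[MAX_DEV];
    cl_uint ndev;

    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, MAX_DEV, dev, &ndev);

    /* a single context spanning all devices */
    cl_context ctx = clCreateContext(NULL, ndev, dev, NULL, NULL, NULL);

    cl_command_queue q[MAX_DEV];
    for (cl_uint i = 0; i < ndev; i++)
        q[i] = clCreateCommandQueue(ctx, dev[i], 0, NULL);

    /* buffers/programs created against ctx are visible to every queue;
     * this is the setup that NVIDIA's driver serializes in my tests */
    return 0;
}
```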
This is strange - I just tried the "OpenCL Simple Multi-GPU" example in the NVIDIA OpenCL SDK:
In the sample code, the kernel appears to execute in parallel on multiple GPUs, even though there is only a single context. But my similarly structured code is serialized for some reason.
Here is a comparison between the launching + waiting part of the SDK example and my code.
Does anyone see a major difference that prevents my kernel from running in parallel?
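For reference, the launch + wait pattern in both versions roughly follows the shape below (a sketch, not the actual SDK or mcxcl code; `queue`, `kern`, `ndev`, and the work sizes are placeholders). The key property is that every kernel is enqueued and flushed before any wait is issued, so the driver has the chance to overlap execution across devices:

```c
/* Sketch: enqueue all kernels first, flush each queue, wait only at the
 * end -- the standard pattern for overlapping multi-GPU execution. */
cl_event ev[MAX_DEV];
size_t gsize = 1 << 20, lsize = 64;  /* placeholder work sizes */

for (cl_uint i = 0; i < ndev; i++) {
    clEnqueueNDRangeKernel(queue[i], kern[i], 1, NULL,
                           &gsize, &lsize, 0, NULL, &ev[i]);
    clFlush(queue[i]);       /* push the work to device i immediately */
}
clWaitForEvents(ndev, ev);   /* block only after everything is in flight */
```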
To see that the execution is serialized, you can run
git clone https://github.com/fangq/mcxcl.git
cd src
make clean
make
../bin/mcxcl --bench cube60 -G 1 -n 1e7 # running 1e7 photons using 1st GPU
../bin/mcxcl --bench cube60 -G 11 -n 1e7 # running 1e7 photons using 1st+2nd GPUs
On an NVIDIA system with multiple GPUs, the execution time of the last command is the same as that of the first. I would expect it to be halved if the execution were concurrent.
It turns out that the shared RO_MEM buffers were causing the serialization of the kernels! It was not the fault of the single context, as I had always assumed.
After duplicating those RO_MEM buffers for each device and assigning the duplicated buffer pointers to each kernel (i.e. clSetKernelArg(mcxkernel[i], ..., (void*)(buf+i))), I was able to get concurrent execution on NVIDIA GPUs, with no need for multi-threading or multiple contexts.
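The fix can be sketched as follows (placeholder code: `ctx`, `hostdata`, `sz`, `ndev`, and `mcxkernel` stand in for the real variables, and the argument index is illustrative). Each device gets its own copy of the read-only buffer inside the one shared context, and that per-device copy is bound to that device's kernel instance:

```c
/* Sketch: duplicate each read-only buffer once per device within the
 * single shared context, then bind copy i to kernel instance i. */
cl_mem buf[MAX_DEV];
for (cl_uint i = 0; i < ndev; i++) {
    /* one copy per device -- sharing a single RO buffer across devices
     * was what serialized the kernels on NVIDIA's driver */
    buf[i] = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                            sz, hostdata, NULL);

    /* bind device i's copy to device i's kernel, as in the post:
     * clSetKernelArg(mcxkernel[i], ..., (void*)(buf+i)) */
    clSetKernelArg(mcxkernel[i], 0, sizeof(cl_mem), (void*)(buf + i));
}
```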