Tesla C2050 - OpenCL - Kernel Concurrency Issue

Hi All,

This problem has a complex background, so I will abstract as much as possible. I’m posting here as well as on the OpenCL forums because the problems occur when using the NVIDIA driver/interface:

  • I have created software for a specialized application, but it is just a glorified RK2 (Runge-Kutta 2) interpolator. Use your imagination; I don’t think the specifics are important.
  • Ubuntu 12.04 LTS - Xeon Server CPU - 24GB RAM
  • I have 4 Tesla C2050s. One of these is also rendering system graphics.
  • There is some static “vector field”-type data which is loaded on all 4 devices. A process then estimates (this is AD HOC and user tunable) the VRAM remaining for dynamic data (the RK2 “paths”). Out of 6GB VRAM, we sometimes attempt to use ~80% of it (never 100%, as there may be fragmentation issues…)
  • Host side, 4 separate threads (std::thread) control 4 separate handlers (classes that manage the GPU kernel queueing, etc.) which have access to 4 separate FIFOs that dictate the RK2 parameters (TL;DR - each thread is as independent as possible when it comes to host-side operations)
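
For reference, the VRAM budgeting step amounts to roughly the following. This is a simplified sketch, not the actual code: `max_paths` and its parameters are placeholder names, and it assumes the ~80% factor is applied to whatever is left after the static data is resident (the total could come from querying CL_DEVICE_GLOBAL_MEM_SIZE).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: how many RK2 paths fit in the memory left over
// once the static field data is resident on the device.
// `usable_fraction` is the ad hoc, user-tunable safety factor (e.g. 0.8).
std::size_t max_paths(std::uint64_t total_vram_bytes,
                      std::uint64_t static_bytes,
                      std::uint64_t bytes_per_path,
                      double usable_fraction)
{
    assert(usable_fraction > 0.0 && usable_fraction <= 1.0);
    assert(bytes_per_path > 0);

    // Nothing is left for dynamic data if the static data fills the device.
    if (static_bytes >= total_vram_bytes) return 0;

    const std::uint64_t free_bytes = total_vram_bytes - static_bytes;
    // Keep a margin below 100% to leave room for fragmentation.
    const auto budget =
        static_cast<std::uint64_t>(free_bytes * usable_fraction);
    return static_cast<std::size_t>(budget / bytes_per_path);
}
```

For example, `max_paths(1000, 200, 10, 0.5)` leaves 800 bytes free, budgets 400 of them, and returns 40 paths.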

However, even though everything is multi-threaded host-side and works from a “correctness” standpoint (the final output data has been validated and is numerically sound), I am encountering some bugs that cause massive computation penalties (as in DAYS of extra run time).

What seems to be happening, according to the gDEBugger tool, is that at most 2 GPUs are ever active at the same time (most of the time it is only one). I am now wondering whether the instantiation of the cl::Context and cl::CommandQueue objects, or something related to how the graphics driver handles OpenCL calls, is causing some sort of bottleneck.
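
To put a number on this independently of gDEBugger, one option would be to record a start/end timestamp for every kernel (for example from OpenCL event profiling) and compute the peak overlap. A minimal sketch, assuming the timestamps have already been gathered into one list; `Interval` and `max_concurrency` are hypothetical names, not part of my code:

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// One busy interval on some device: [start, end) in nanoseconds.
struct Interval {
    std::uint64_t start;
    std::uint64_t end;
};

// Returns the largest number of intervals that overlap at any instant.
int max_concurrency(const std::vector<Interval>& busy)
{
    // Turn each interval into a +1 event at its start and a -1 at its end.
    std::vector<std::pair<std::uint64_t, int>> events;
    events.reserve(busy.size() * 2);
    for (const Interval& iv : busy) {
        events.emplace_back(iv.start, +1);
        events.emplace_back(iv.end, -1);
    }

    // Sort by time. At equal times -1 sorts before +1, so kernels that
    // run back to back are not counted as overlapping.
    std::sort(events.begin(), events.end());

    int active = 0;
    int best = 0;
    for (const auto& e : events) {
        active += e.second;
        best = std::max(best, active);
    }
    return best;
}
```

With 4 devices kept busy, the result should reach 4; a result of 1 or 2 over a long run would match what the debugger shows.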

Here is how these objects are instantiated, first the context/devices:

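void OclEnv::OclInit()
{
  // Select the first platform; the property list ends with a 0 terminator.
  cl_context_properties con_prop[3] =
  {
    CL_CONTEXT_PLATFORM,
    (cl_context_properties) (this->ocl_platforms[0]) (),
    0
  };

  // One context shared by every GPU on that platform.
  this->ocl_context = cl::Context(CL_DEVICE_TYPE_GPU, con_prop);

  this->ocl_devices = this->ocl_context.getInfo<CL_CONTEXT_DEVICES>();
}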

Then the CommandQueues:

void OclEnv::NewCLCommandQueues()

  for (unsigned int k = 0; k < this->ocl_devices.size(); k++ )
    std::cout<<"Create CommQueue, Kernel, Device: "<<k<<"


Essentially, for d devices in total, the command queues, kernels, cl::Buffers, etc. are each held in a std::vector of size d. In host memory, there exists one object of this type per device in use.

Then, once the initial loading is finished, computation begins:

  for (int i = 0; i < num_dev; ++i)

    gpu_managers[i] = new std::thread(

  while (particles_fifo->count())
    printf("Processed %i/%i.\r", particles_fifo->count()/2, total_particles);

  for (int i = 0; i < num_dev; ++i)

That first statement just links (passes pointers to) all of the cl:: objects and other used objects to each of the handler instances; then they are simply put in threads and away they go (but not really, as per this post…). I know that center condition looks funny, but it shouldn’t affect anything since it is only invoked once per second in this thread and not in the actual handler threads.
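
Stripped of all OpenCL specifics, the launch/join pattern looks roughly like the sketch below. It is an illustration only: `WorkFifo` and `run_workers` are hypothetical stand-ins (a mutex-guarded queue instead of my particle FIFO, a counter instead of kernel enqueueing), and the progress loop is left out.

```cpp
#include <cstddef>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Hypothetical stand-in for the particle FIFO: a mutex-guarded queue.
class WorkFifo {
public:
    void push(int item) {
        std::lock_guard<std::mutex> lk(m_);
        q_.push(item);
    }
    // Pops one item into `out`; returns false when the queue is empty.
    bool try_pop(int& out) {
        std::lock_guard<std::mutex> lk(m_);
        if (q_.empty()) return false;
        out = q_.front();
        q_.pop();
        return true;
    }
    std::size_t count() const {
        std::lock_guard<std::mutex> lk(m_);
        return q_.size();
    }
private:
    mutable std::mutex m_;
    std::queue<int> q_;
};

// Launches one worker per device, lets each drain the FIFO, then joins
// them all. Returns how many items each worker processed.
std::vector<std::size_t> run_workers(WorkFifo& fifo, int num_dev)
{
    std::vector<std::size_t> processed(static_cast<std::size_t>(num_dev), 0);
    std::vector<std::thread> workers;
    workers.reserve(processed.size());

    for (int i = 0; i < num_dev; ++i) {
        workers.emplace_back([&fifo, &processed, i] {
            int item;
            while (fifo.try_pop(item)) {
                // A real handler would enqueue kernels on device i here.
                ++processed[i];  // each worker writes only its own slot
            }
        });
    }
    for (std::thread& t : workers) t.join();
    return processed;
}
```

Logging the per-worker counts in the real application would show whether work is actually being spread across all devices, or whether one or two handlers end up doing nearly everything.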

If someone would like to pull the code and look at it, here it is:


However, you will not be able to run it without some large input data files, which I do not wish to host (but if you are curious, I will be happy to share them should you message me).

Any help is appreciated and will be given the proper acknowledgement. This is open source software and we hope to put it in use in one or more research applications.