Parallel kernel on multiple devices

What is the current state of launching a kernel on two devices in parallel, with each device computing part of the results to reduce the total time (compared with a single device)?

Does SYCL provide a mechanism to do this efficiently?

Currently, with this strategy, the overheads are high:

1. launch two threads
2. each thread contains one SYCL scope
3. each scope performs its kernels/data transfers
4. the scopes synchronize with each other (mutex/threads)
5. loop back to step 3 if more work needs to be computed by any thread/scope/device

The observed behavior is that the computation time (data transfer + kernel) scales roughly as expected, but everything else (queue creation, scope setup, function calls, …) costs more than when using a single device. Therefore, we need huge problem sizes to compensate (and sometimes not even then).

Edit: here is an example. To simplify, I removed step 5 (the loop), but the idea is the same. I want to know whether this is the expected way to use 2 devices in parallel with SYCL (to achieve the best performance):

void do_gaussian_cpu(bool cpu, Opts opts, Gaussian* gaussian){
  // using opts and gaussian

  { // oneapi scope
    cpu_selector sel;

    size_t offset = 0, size = 0; // this device's chunk, computed below

    {
      std::lock_guard<std::mutex> lk(*opts.m);
      // splitting the workload:
      // the first chunk goes to the first device to enter, etc.
      // calculate the offset and size for this work chunk
    }

    queue q(sel);

    auto R = sycl::range<1>(size); // its partial size
    auto Rinput = sycl::range<1>(gaussian->_total_size);
    auto Rfilter = sycl::range<1>(gaussian->_filter_total_size);

    sycl::buffer<cl_float, 1> buf_filterWeight(gaussian->_b.data(), Rfilter);
    sycl::buffer<cl_uchar4, 1> buf_blurred((gaussian->_c.data() + offset), R);

    auto submit_event = q.submit([&](handler &h) {
                                   // ... the same as before ...
                                 });
    submit_event.wait();
  }
}
void do_gaussian_gpu(bool cpu, Opts opts, Gaussian* gaussian){
  // using opts and gaussian

  { // oneapi scope
    gpu_selector sel;

    size_t offset = 0, size = 0; // this device's chunk, computed below

    {
      std::lock_guard<std::mutex> lk(*opts.m);
      // splitting the workload:
      // the first chunk goes to the first device to enter, etc.
      // calculate the offset and size for this work chunk
    }

    queue q(sel);

    auto R = sycl::range<1>(size);  // its partial size
    auto Rinput = sycl::range<1>(gaussian->_total_size);
    auto Rfilter = sycl::range<1>(gaussian->_filter_total_size);

    sycl::buffer<cl_float, 1> buf_filterWeight(gaussian->_b.data(), Rfilter);
    sycl::buffer<cl_uchar4, 1> buf_blurred((gaussian->_c.data() + offset), R);

    auto submit_event = q.submit([&](handler &h) {
                                   // ... the same as before ...
                                 });
    submit_event.wait();   
  }
}

int main(){
  // using two independent funcs, one for cpu, the other for gpu
  if (use_cpu_and_gpu){
    std::thread t1(do_gaussian_cpu, true, opts, &gaussian);
    do_gaussian_gpu(false, opts, &gaussian);
    t1.join();
  }
}

You don’t need to use a thread for the CPU execution. You just need to set up a queue for each device and then submit your kernel to each queue to run the executions in parallel. This lesson/presentation might help you with your understanding of how this works.
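For illustration, something like the following sketch (made-up kernel and names, assuming both a CPU and a GPU device are present): two queues created in one host thread, each writing a non-overlapping half of one host array through its own buffer, so the runtime sees no dependence between the two submissions.

#include <CL/sycl.hpp>
#include <vector>
using namespace cl::sycl;

int main() {
  constexpr size_t N = 1 << 20;
  std::vector<float> out(N, 0.0f);

  // one queue per device, both created in the same host thread
  queue q_cpu{cpu_selector{}};
  queue q_gpu{gpu_selector{}};

  {
    // separate buffers over non-overlapping halves: no data dependence
    buffer<float, 1> b_lo(out.data(), range<1>(N / 2));
    buffer<float, 1> b_hi(out.data() + N / 2, range<1>(N / 2));

    q_cpu.submit([&](handler& h) {
      auto a = b_lo.get_access<access::mode::discard_write>(h);
      h.parallel_for<class lo_half>(range<1>(N / 2),
                                    [=](id<1> i) { a[i] = 2.0f * i[0]; });
    });
    q_gpu.submit([&](handler& h) {
      auto a = b_hi.get_access<access::mode::discard_write>(h);
      h.parallel_for<class hi_half>(range<1>(N / 2),
                                    [=](id<1> i) { a[i] = 2.0f * (i[0] + N / 2); });
    });

    // submit() returns immediately, so the two kernels are free to overlap
    q_cpu.wait();
    q_gpu.wait();
  } // buffer destruction writes the results back to `out`
}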

Thanks.

I read it, and it doesn’t show any examples of parallel execution.

I think I read somewhere that if the SYCL scope has two queues for independent devices and there are no data dependencies, the runtime will execute them in parallel, but I don’t know how to verify that the region is actually computed in parallel (apart from measuring the total time and comparing it with the expected one). What I am finding is high overheads.
Also, I don’t know if the runtime/compiler will be smart enough to handle complex data workloads, where I split different regions of the same memory. That’s why I tried the multithreaded approach, being explicit with two independent scopes. But I would like to know the most efficient way for SYCL to partition data and compute it on different devices in parallel (not different kernels, but the same kernel on both).

Also, I used the CPU as an example, but it could be any other combination, such as 2 GPUs.

Hi,

A good SYCL runtime should be able to launch different kernels (or different instances of the same kernel) concurrently on different devices, provided there are no data or other dependencies between those executions. The specification doesn’t require this, but it’s reasonable to expect.

If you see overheads, the next step is to use profiling tools or layers (e.g. the Intercept Layer for OpenCL Applications, for OpenCL devices).
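As a quick first check that doesn’t require external tools, you could also compare the per-kernel timestamps that the SYCL runtime records when profiling is enabled. A sketch (with the caveat that different backends may not report comparable timestamps, so the comparison is only meaningful when the devices share a timebase):

#include <CL/sycl.hpp>
#include <iostream>
using namespace cl::sycl;

int main() {
  // enable_profiling makes the runtime record per-command timestamps
  queue qa{cpu_selector{}, property::queue::enable_profiling{}};
  queue qb{gpu_selector{}, property::queue::enable_profiling{}};

  event ea = qa.submit([&](handler& h) {
    h.parallel_for<class ka>(range<1>(1 << 22), [=](id<1>) {});
  });
  event eb = qb.submit([&](handler& h) {
    h.parallel_for<class kb>(range<1>(1 << 22), [=](id<1>) {});
  });
  ea.wait();
  eb.wait();

  auto sa = ea.get_profiling_info<info::event_profiling::command_start>();
  auto fa = ea.get_profiling_info<info::event_profiling::command_end>();
  auto sb = eb.get_profiling_info<info::event_profiling::command_start>();
  auto fb = eb.get_profiling_info<info::event_profiling::command_end>();

  // the two kernels ran concurrently iff their time intervals overlap
  std::cout << ((sa < fb && sb < fa) ? "overlap" : "no overlap") << "\n";
}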

splitting different regions of the same memory

If both kernels operate on the same buffer, there will be a data dependence. In general, an implementation can’t know which addresses within a buffer a kernel will update, so a data dependence is created in the task graph to guarantee correctness. The buffer may also need to be copied between the two devices between the kernel executions.

One mechanism that addresses this in SYCL is the sub-buffer concept. You can split a buffer into sub-buffers and create accessors to the sub-buffers in different kernels. The primary use case of sub-buffers is exactly this class of problem, where different kernels want to operate concurrently on different, non-overlapping regions of the same buffer.
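A sketch of how that could look for a halved workload (assuming the split offset meets the devices’ sub-buffer alignment requirements; the names are illustrative):

#include <CL/sycl.hpp>
#include <vector>
using namespace cl::sycl;

int main() {
  constexpr size_t N = 1 << 20;
  std::vector<float> data(N, 1.0f);

  buffer<float, 1> whole(data.data(), range<1>(N));
  // two non-overlapping sub-buffers over the halves of `whole`
  buffer<float, 1> lo(whole, id<1>(0), range<1>(N / 2));
  buffer<float, 1> hi(whole, id<1>(N / 2), range<1>(N / 2));

  queue qa{cpu_selector{}};
  queue qb{gpu_selector{}};

  qa.submit([&](handler& h) {
    auto a = lo.get_access<access::mode::read_write>(h);
    h.parallel_for<class lo_k>(range<1>(N / 2), [=](id<1> i) { a[i] *= 2.0f; });
  });
  qb.submit([&](handler& h) {
    auto a = hi.get_access<access::mode::read_write>(h);
    h.parallel_for<class hi_k>(range<1>(N / 2), [=](id<1> i) { a[i] *= 2.0f; });
  });

  // the kernels touch disjoint sub-buffers, so no dependence is created
  qa.wait();
  qb.wait();
}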

If you have buffers that are only read from (or only written to), make sure you’re using the appropriate access modes on the accessors. That gives the SYCL runtime additional information, which may enable more task-graph optimizations and result in fewer data dependencies.
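For example, a sketch with made-up names (not your Gaussian code): a read accessor on shared filter-like weights and a discard_write accessor on the output tell the runtime that the output’s old contents need not be copied to the device, and that other kernels reading the same weights need not be serialized against this one.

#include <CL/sycl.hpp>
#include <vector>
using namespace cl::sycl;

int main() {
  std::vector<float> weights(64, 0.1f);
  std::vector<float> out(1024, 0.0f);
  buffer<float, 1> bw(weights.data(), range<1>(weights.size()));
  buffer<float, 1> bo(out.data(), range<1>(out.size()));

  queue q{default_selector{}};
  q.submit([&](handler& h) {
    // read: this kernel never writes bw, so concurrent readers are fine
    auto w = bw.get_access<access::mode::read>(h);
    // discard_write: previous contents of bo are not copied to the device
    auto r = bo.get_access<access::mode::discard_write>(h);
    h.parallel_for<class modes>(range<1>(out.size()),
                                [=](id<1> i) { r[i] = w[i[0] % 64]; });
  });
  q.wait();
}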

What is the current state of launching a kernel on two devices in parallel, with each device computing part of the results to reduce the total time (compared with a single device)?

SYCL fully supports this. Whether there is a benefit depends very much on the application and the runtime: how data needs to be shared or copied between devices, whether copies to multiple devices have to share the same communication bus or have independent bandwidth to each device, and other factors.

As a next step, you should definitely use profiling tools or layers to check where the time in your application is being spent. That should provide insight into where the overhead you’re seeing is coming from.