What is the current state of launching a kernel on two devices in parallel, each one computing part of the results to reduce the total time (compared with a single device)?
Does SYCL provide a mechanism to do this efficiently?
Currently, with this strategy, the overheads are high:
1. launch two threads
2. each thread contains one SYCL scope
3. each scope performs its kernels/data transfers
4. each scope synchronizes with the other (mutex/threads)
5. loop back to 3 if more work remains for any thread/scope/device
The observed behavior is: the computation time (data transfer + kernel) scales roughly proportionally with the split, but the rest (queue construction, scope setup, function calls, …) is higher than when using a single device. Therefore, we need huge problems to compensate (and sometimes not even then).
Edit: here is an example. To simplify, I removed step 5 (the loop), but the idea is the same. I want to know whether this is the expected way to use two devices in parallel with SYCL (to achieve the best performance):
void do_gaussian_cpu(bool cpu, Opts opts, Gaussian* gaussian) {
  // using opts and gaussian
  { // oneapi scope
    sycl::cpu_selector sel;
    size_t offset, size; // set under the lock below
    {
      std::lock_guard<std::mutex> lk(*opts.m);
      // splitting the workload:
      // the first chunk is for the first device to enter, etc.
      // calculate the offset and size for this work chunk
    }
    sycl::queue q(sel);
    auto R = sycl::range<1>(size); // this device's partial size
    auto Rinput = sycl::range<1>(gaussian->_total_size);
    auto Rfilter = sycl::range<1>(gaussian->_filter_total_size);
    sycl::buffer<cl_float, 1> buf_filterWeight(gaussian->_b.data(), Rfilter);
    sycl::buffer<cl_uchar4, 1> buf_blurred(gaussian->_c.data() + offset, R);
    auto submit_event = q.submit([&](sycl::handler& h) {
      // ... the same as before ...
    });
    submit_event.wait();
  }
}
void do_gaussian_gpu(bool cpu, Opts opts, Gaussian* gaussian) {
  // using opts and gaussian
  { // oneapi scope
    sycl::gpu_selector sel;
    size_t offset, size; // set under the lock below
    {
      std::lock_guard<std::mutex> lk(*opts.m);
      // splitting the workload:
      // the first chunk is for the first device to enter, etc.
      // calculate the offset and size for this work chunk
    }
    sycl::queue q(sel);
    auto R = sycl::range<1>(size); // this device's partial size
    auto Rinput = sycl::range<1>(gaussian->_total_size);
    auto Rfilter = sycl::range<1>(gaussian->_filter_total_size);
    sycl::buffer<cl_float, 1> buf_filterWeight(gaussian->_b.data(), Rfilter);
    sycl::buffer<cl_uchar4, 1> buf_blurred(gaussian->_c.data() + offset, R);
    auto submit_event = q.submit([&](sycl::handler& h) {
      // ... the same as before ...
    });
    submit_event.wait();
  }
}
int main() {
  // using two independent funcs, one for the CPU, the other for the GPU
  if (use_cpu_and_gpu) {
    std::thread t1(do_gaussian_cpu, true, opts, &gaussian);
    do_gaussian_gpu(false, opts, &gaussian); // GPU work on the main thread
    t1.join();
  }
  return 0;
}