Callbacks in SYCL

Can SYCL use callbacks like OpenCL?

The purpose is to:

  1. enqueue a kernel operation
  2. the runtime asynchronously calls a callback function when the kernel finishes
  3. first callback: do some management and enqueue a read operation
  4. the runtime asynchronously calls a second callback function when the read operation finishes
  5. second callback: do some management, then continue at 1 or do other work.

How can this be done with SYCL, and what would be the most performant way? I need to make these decisions and do this management as soon as each operation finishes, and then dynamically enqueue new operations.

In SYCL 1.2.1 it’s not possible to use callbacks directly, but you can do it with the underlying OpenCL objects. It should also be possible to achieve this without relying on callbacks at all, for example by dedicating a thread which blocks on memory reads (i.e. a host accessor) or simply waits for the kernel to finish.

In SYCL 2020 you can enqueue functions to run on the host using a host task. The caveat is that the function is enqueued onto the DAG, so it is subject to data dependencies and is not guaranteed to keep its submission order; the runtime may reorder it. In practice, if you enqueue the host task immediately after enqueueing the kernel, it will generally run right after the kernel completes. If you want to guarantee no reordering, you can use an in-order queue (also introduced in SYCL 2020). This does mean waiting for a SYCL implementation with host task support.
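As a rough sketch of the host task approach (assuming a SYCL 2020 implementation; q and buf_c are the objects from the example later in this thread, and the exact accessor spelling for use inside a host task may differ between implementations):

```cpp
q.submit([&](sycl::handler &h) {
  // The read dependency on buf_c makes the host task a node in the DAG
  // that runs after whatever kernel last wrote buf_c.
  auto c = buf_c.get_access<sycl::access::mode::read>(h);
  h.host_task([=] {
    // "Callback" body: runs on the host once the kernel has finished.
    // Do your management here and decide what to enqueue next.
  });
});
```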

Hi,

Thank you.

How can you block on memory reads (host accessor)?

And maybe related, how can I get the time taken by the data transfers (reading and writing) and by the kernel, independently?

Example:

auto submit_event = q.submit([&](handler &h) {
  auto a = buf_a.get_access<sycl::access::mode::read>(h);
  auto b = buf_b.get_access<sycl::access::mode::read>(h);
  auto c = buf_c.get_access<sycl::access::mode::write>(h);

  h.parallel_for<class test_event>(R, [=](item<2> item) {
      auto sum = 0;
      for (size_t k = 0; k < N; ++k) {
        const auto a_ik = a[{item[0], k}];
        const auto b_kj = b[{k, item[1]}];
        sum += a_ik * b_kj;
      }
      c[item] = sum;
    });

});

submit_event.wait(); 

I assume submit_event.wait() waits for the data write + compute kernel + data read as a whole. Then with submit_event.get_profiling_info<sycl::info::event_profiling::command_start> and so on, I can get profiling info for all the operations together, but I don’t know how to distinguish between the data transfers and the kernel.

Hi @user1900,

How can you block on memory reads (host accessor)?

A host accessor blocks access to the buffer for its lifetime: when it is constructed, it waits for any kernels accessing the buffer to complete before the constructor returns, and any other access to the buffer then waits until the host accessor is destroyed. However, there is no way to measure the time taken for the data transfer in this case.
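For example, using the buffers from your snippet (a sketch; the scope is what controls the accessor's lifetime):

```cpp
{
  // Constructing a host accessor without a handler argument blocks until
  // any kernels accessing buf_c have completed.
  auto host_c = buf_c.get_access<sycl::access::mode::read>();
  // ... inspect the results here, decide what to enqueue next ...
} // buf_c becomes available to the device again once the accessor is destroyed
```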

And maybe related, how can I get the time taken by the data transfers (reading and writing) and by the kernel, independently?

To separate the data transfers from the kernel execution you can use the asynchronous copy command, which can be invoked from within a command group just like parallel_for. To do this, you would create the buffers uninitialized and copy the data from the host pointers asynchronously, for example:

auto copy_a_event = q.submit([&](handler &h) {
  auto a = buf_a.get_access<sycl::access::mode::write>(h);
  h.copy(host_a, a);
});

auto copy_b_event = q.submit([&](handler &h) {
  auto b = buf_b.get_access<sycl::access::mode::write>(h);
  h.copy(host_b, b);
});

auto kernel_event = q.submit([&](handler &h) {
  auto a = buf_a.get_access<sycl::access::mode::read>(h);
  auto b = buf_b.get_access<sycl::access::mode::read>(h);
  auto c = buf_c.get_access<sycl::access::mode::write>(h);

  h.parallel_for<class test_event>(R, [=](item<2> item) {
      auto sum = 0;
      for (size_t k = 0; k < N; ++k) {
        const auto a_ik = a[{item[0], k}];
        const auto b_kj = b[{k, item[1]}];
        sum += a_ik * b_kj;
      }
      c[item] = sum;
    });
});

auto copy_c_event = q.submit([&](handler &h) {
  auto c = buf_c.get_access<sycl::access::mode::read>(h);
  h.copy(c, host_c);
});

q.wait();

Now you have separate events copy_a_event, copy_b_event and copy_c_event, which can be queried for profiling information independently of kernel_event.
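To actually read the timings, note that the queue must be constructed with profiling enabled, e.g. sycl::queue q{sycl::property::queue::enable_profiling{}}; a sketch of the queries (the timestamps are in nanoseconds):

```cpp
// Kernel execution time from its own event.
auto k_start = kernel_event.get_profiling_info<
    sycl::info::event_profiling::command_start>();
auto k_end = kernel_event.get_profiling_info<
    sycl::info::event_profiling::command_end>();

// Transfer time for one of the copies, queried independently.
auto c_start = copy_a_event.get_profiling_info<
    sycl::info::event_profiling::command_start>();
auto c_end = copy_a_event.get_profiling_info<
    sycl::info::event_profiling::command_end>();

double kernel_ms = (k_end - k_start) * 1e-6;
double copy_a_ms = (c_end - c_start) * 1e-6;
```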

I hope this helps.

Gordon