OpenCL on APU

I have an AMD Kaveri APU (4 CPU + 8 GPU). It is recognized as 2 devices: 4 CPU and 8 GPU (CL_DEVICE_TYPE_CPU and CL_DEVICE_TYPE_GPU). First, I want to distribute a task among the GPUs. After running the program for a certain number of iterations on the GPUs, I want it to run on the CPUs. How can I do that?

I have a few questions:

Q1. How can I distribute the task among multiple GPUs in the APU? I want the GPUs to work in parallel. There is no concept of device fission for GPUs, I guess. Then how can I do this?

Q2. How can I distribute the task among multiple CPUs in the APU? Do I need to use device fission for the multicore CPU portion of the APU?

Q3. How do I synchronize between the GPUs and the CPUs? Is it a good idea to use events?

I think you misunderstand the purpose of device fissioning (DF).

Normally, if you give OpenCL a GPU or CPU compute task, it will fire that task off on as many computational elements as it takes to execute it as fast as possible. Or more to the point, OpenCL decides for itself how to partition the available computational resources for the various tasks you give it.

DF simply gives you more explicit control over such partitioning.

But the tasks will be distributed in both cases. The computational units will always operate in parallel on tasks. It’s just that, with DF, you get some say in how that gets done.
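If you do end up wanting to carve up the work yourself, for example across several command queues (CPU sub-devices from fission, or the GPU queue plus the CPU queue), the usual trick is to give each queue its own slice of the NDRange. Here is a minimal sketch of the arithmetic in plain C. The helper split_range is a made-up name, not part of OpenCL, and it assumes a 1-D range whose total size is a multiple of the work-group size. For a single enqueue on one device none of this is needed, since the runtime already spreads the work over all compute units.

```c
#include <stddef.h>

/* Hypothetical helper (not part of OpenCL): compute the slice of a 1-D
 * NDRange that queue number `index` out of `parts` should process.
 * Every slice is a multiple of `local_size`, so it stays a valid
 * global size for that work-group size. Whole work-groups that do not
 * divide evenly go to the first queues, one each.
 * Assumes total % local_size == 0, parts > 0 and index < parts. */
static void split_range(size_t total, size_t local_size,
                        size_t parts, size_t index,
                        size_t *offset, size_t *count)
{
    size_t groups = total / local_size;  /* work-groups overall */
    size_t base   = groups / parts;      /* groups every queue gets */
    size_t extra  = groups % parts;      /* leftover groups */
    size_t mine   = base + (index < extra ? 1 : 0);
    size_t before = index * base + (index < extra ? index : extra);

    *offset = before * local_size;       /* first work-item of the slice */
    *count  = mine * local_size;         /* number of work-items in it */
}
```

The resulting offset and count would then go into the global_work_offset and global_work_size arguments of clEnqueueNDRangeKernel on each queue (a non-NULL global_work_offset needs OpenCL 1.1 or later). If there are more queues than work-groups, some slices come out empty and should simply be skipped.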

Thanks for your answer, Alfonse. Can you please explain a bit more? Is it possible to know or debug how OpenCL partitions the work across the multiple GPUs? If I need explicit control, I am afraid I will not be able to get it on the GPUs.

I am also struggling with APUs, as I could not find good example code for them. Can you please point me to some?

Not directly through OpenCL. There may be OpenCL debugging systems that can allow it, but OpenCL itself does not.

I am also struggling with APUs, as I could not find good example code for them. Can you please point me to some?

Try reading the AMD OpenCL Optimization Guide. tl;dr: use clEnqueueMapBuffer instead of clEnqueueWriteBuffer/clEnqueueReadBuffer to allow zero-copy host-device transfers, and create buffers that the CPU does not need to read with the CL_MEM_USE_PERSISTENT_MEM_AMD flag so the GPU does not have to update the CPU’s cache. Things like bank conflicts on the GPU side and cache misses on the CPU side are far more important for performance, though.