I have a program that does heavy work on the host side and enqueues a lot of kernels (for example, 50 kernels for a single reduction). The host-side enqueueing adds too much latency, while the device side is much faster even though it is a low-end GPU. To overcome this, is it safe to copy a command queue into another command queue (I don't know how) and then re-enqueue everything from that copy into the original queue with a single API call? If so, how can I do it? Some programs, such as fluid advection, need hundreds of kernels at each step.
https://i.snag.gy/Q1bhUB.jpg This profile shows that the CPU-side enqueue operations take nearly 1/3 of the total cycle, and that the big blue blob already runs at the maximum PCIe bandwidth (2.78 GB/s on a PCIe 2.0 x8 link).
The kernel-only device part takes no more than 1.5 ms for a 4M-element float reduction (sum).
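For scale, here is a rough back-of-the-envelope sketch of how long one transfer of the input would take at the measured bandwidth. It assumes that "4M" means 4 × 1024 × 1024 floats, that the buffer crosses the bus once, and that 1 GB is 1e9 bytes; the helper names `buffer_bytes` and `transfer_ms` are made up for this illustration and are not part of any OpenCL API.

```c
#include <stddef.h>

/* Size in bytes of a buffer holding `elements` single-precision floats. */
static size_t buffer_bytes(size_t elements)
{
    return elements * sizeof(float);
}

/* Milliseconds needed to move `bytes` over a link that sustains
 * `gb_per_s` gigabytes per second (assumption: 1 GB = 1e9 bytes). */
static double transfer_ms(size_t bytes, double gb_per_s)
{
    return (double)bytes / (gb_per_s * 1e9) * 1e3;
}
```

Calling `transfer_ms(buffer_bytes(4u * 1024u * 1024u), 2.78)` gives about 6 ms under these assumptions, which is several times the 1.5 ms the kernels themselves take.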
Thank you for your time.