In an experiment with OpenCL for the voxlogica project (search for voxlogica on Google if you wish, as I can’t include links), we needed to run a kernel asynchronously many times, with each iteration depending on the previous one (using events). The algorithm works like a cellular automaton, so it requires global synchronization between iterations. I would not be surprised if that made it slow on the GPU, but it is also slow on the CPU side, and that is what surprises me: even though the work is offloaded to the GPU, the CPU still spends a lot of time waiting between iterations.
The code (it’s F#, but read it as pseudocode if you wish!):
    for i = 0 to 650 do
        queue.Execute(kernel.[1], null, [|int64 img.BaseImg.Width; int64 img.BaseImg.Height|], null, events)
        if i % 2 = 0 then
            kernel.[1].SetMemoryArgument(1, obufs.[0])
            kernel.[1].SetMemoryArgument(2, obufs.[1])
        else
            kernel.[1].SetMemoryArgument(1, obufs.[1])
            kernel.[1].SetMemoryArgument(2, obufs.[0])
Is this because the GPU is sequential, so queue.Execute needs to wait? Or maybe we’re doing something wrong on our side?
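To make the intent of the loop clearer, here is a minimal CPU-only sketch of the same ping-pong scheme in plain F#. The `step` function, the 4x3 grid, and the max-of-neighbours rule are hypothetical stand-ins for our actual kernel, not the voxlogica code, and no OpenCL is involved:

    // Hypothetical stand-in for the OpenCL kernel: a tiny 4x3 grid stored row by row.
    let width, height = 4, 3

    // One iteration: each cell becomes the maximum of itself and its 4 neighbours.
    let step (src: int[]) (dst: int[]) =
        for y in 0 .. height - 1 do
            for x in 0 .. width - 1 do
                let mutable m = src.[y * width + x]
                if x > 0 then m <- max m src.[y * width + x - 1]
                if x < width - 1 then m <- max m src.[y * width + x + 1]
                if y > 0 then m <- max m src.[(y - 1) * width + x]
                if y < height - 1 then m <- max m src.[(y + 1) * width + x]
                dst.[y * width + x] <- m

    // Runs `iterations` steps, swapping the roles of the two buffers each time,
    // and returns the buffer that holds the final result.
    let run (iterations: int) (initial: int[]) =
        let bufs = [| Array.copy initial; Array.zeroCreate initial.Length |]
        for i in 0 .. iterations - 1 do
            // Even iterations read bufs.[0] and write bufs.[1]; odd ones do the opposite,
            // so iteration i always reads what iteration i - 1 wrote.
            let src, dst =
                if i % 2 = 0 then (bufs.[0], bufs.[1]) else (bufs.[1], bufs.[0])
            step src dst
        bufs.[iterations % 2]

In this sketch each step is a plain synchronous call, so the dependency between iterations is trivially respected; in the real code that ordering is what the events are supposed to provide.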
Also, a separate question: is there a known fast OpenCL implementation of connected component labelling for images that we could reuse in our free and open source project?