In an experiment using OpenCL for the voxlogica project (search for voxlogica on Google if you wish, as I can’t include links), we need to run a kernel asynchronously many times, with each iteration depending on the previous one (using events). The algorithm works like a cellular automaton, so it requires global synchronization between iterations, and I would not be surprised if that made it slow on the GPU. What surprises me is that it is slow on the CPU side: even though the work is offloaded to the GPU, the CPU still spends a lot of time waiting between iterations.
The code (it’s F#, but read it as pseudocode if you wish!):
for i = 0 to 650 do
    queue.Execute(kernel., null, [|int64 img.BaseImg.Width; int64 img.BaseImg.Height|], null, events)
    if i%2 = 0 then
        kernel..SetMemoryArgument(1, obufs.)
        kernel..SetMemoryArgument(2, obufs.)
    else
        kernel..SetMemoryArgument(1, obufs.)
        kernel..SetMemoryArgument(2, obufs.)
Is this because the GPU executes the enqueued kernels sequentially, so that queue.Execute has to wait? Or are we doing something wrong on our side?
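To make the intended data flow concrete, here is a minimal CPU-only sketch of the ping-pong pattern the loop is meant to implement: two buffers swap the roles of input and output on every iteration. It does not use OpenCL at all. The names `step` and `iterate` and the max-of-neighbours rule are made up for illustration and are not our real kernel, which works on 2D images.

```fsharp
// CPU-only sketch of the ping-pong iteration; no OpenCL involved.
// `step` is a hypothetical stand-in for the real kernel.

/// One "kernel launch": every cell of dst becomes the maximum of the
/// corresponding cell of src and its left and right neighbours.
let step (src: int[]) (dst: int[]) =
    let n = src.Length
    for x in 0 .. n - 1 do
        let left = if x > 0 then src.[x - 1] else src.[x]
        let right = if x < n - 1 then src.[x + 1] else src.[x]
        dst.[x] <- max src.[x] (max left right)

/// Runs `iterations` steps, alternating the roles of the two buffers,
/// and returns the buffer that holds the final result.
let iterate (iterations: int) (input: int[]) =
    let bufs = [| Array.copy input; Array.zeroCreate input.Length |]
    for i in 0 .. iterations - 1 do
        // Even iterations read bufs.[0] and write bufs.[1]; odd ones do the reverse.
        step bufs.[i % 2] bufs.[(i + 1) % 2]
    bufs.[iterations % 2]

let result = iterate 3 [| 0; 0; 5; 0; 0; 0; 0; 1 |]
printfn "%A" result
```

Each call to `step` needs the complete output of the previous call, which is the global synchronization point mentioned above. In the OpenCL version, that dependency is what the events are supposed to express.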
Also, a separate question: is there a known fast OpenCL implementation of connected components labelling for images that we could reuse in our free and open source project?