Iterating the same kernel many times, asynchronously, is slow *on the CPU*. Plus a request for a connected components labelling algorithm

In an experiment using OpenCL for the voxlogica project (search for "voxlogica" on Google if you wish, as I can’t include links), we needed to run the same kernel asynchronously many times, with each iteration depending on the previous one (chained using events). The algorithm works like a cellular automaton, so it requires global synchronization between iterations, and I would not be surprised if it were slow on the GPU for that reason. What surprises me is that it is slow on the CPU side: even though the work is offloaded to the GPU, the CPU still spends a lot of time waiting between iterations.

The code (it’s F#, but read it as pseudocode if you wish!):

    for i = 0 to 650 do
        // Enqueue one pass over the whole image (2D global work size).
        queue.Execute(kernel.[1], null, [|int64 img.BaseImg.Width; int64 img.BaseImg.Height|], null, events)
        // Swap the two buffers (ping-pong) for the next pass.
        if i % 2 = 0 then
            kernel.[1].SetMemoryArgument(1, obufs.[0])
            kernel.[1].SetMemoryArgument(2, obufs.[1])
        else
            kernel.[1].SetMemoryArgument(1, obufs.[1])
            kernel.[1].SetMemoryArgument(2, obufs.[0])

Is this because the GPU is sequential, so queue.Execute needs to wait? Or maybe we’re doing something wrong on our side?

Also, a separate question: is there a known fast OpenCL implementation of connected components labelling for images that we could reuse in our free and open source project?
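
To make both questions concrete, here is a minimal CPU-only sketch in plain F# of the kind of cellular-automaton-style iteration described above. It assumes min-label propagation over 4-connected neighbours and a row-major image, which is only one simple way to do connected components labelling. The names `propagateStep` and `labelComponents` are hypothetical, and this is neither the voxlogica kernel nor OpenCL code. It only shows the ping-pong buffer pattern and the dependence of each pass on the previous one.

```fsharp
// CPU reference sketch (no OpenCL): iterative min-label propagation
// with two ping-pong buffers. All names here are hypothetical.

/// One pass: every foreground pixel takes the minimum label among itself
/// and its 4-connected foreground neighbours. Reads src, writes dst.
/// Returns true if at least one label changed.
let propagateStep (width: int) (height: int) (src: int[]) (dst: int[]) : bool =
    let mutable changed = false
    for y in 0 .. height - 1 do
        for x in 0 .. width - 1 do
            let i = y * width + x
            let own = src.[i]
            if own > 0 then
                // Label 0 marks background, so it never propagates.
                let neighbours =
                    [ if x > 0 then yield src.[i - 1]
                      if x < width - 1 then yield src.[i + 1]
                      if y > 0 then yield src.[i - width]
                      if y < height - 1 then yield src.[i + width] ]
                let best =
                    neighbours
                    |> List.filter (fun label -> label > 0)
                    |> List.fold min own
                dst.[i] <- best
                if best <> own then changed <- true
            else
                dst.[i] <- 0
    changed

/// Repeats propagateStep until no label changes.
/// Returns the final labels and the number of passes executed.
let labelComponents (width: int) (height: int) (mask: bool[]) : int[] * int =
    // Start with a unique positive label per foreground pixel.
    let initial = mask |> Array.mapi (fun i fg -> if fg then i + 1 else 0)
    let mutable src = initial
    let mutable dst = Array.zeroCreate<int> initial.Length
    let mutable iterations = 0
    let mutable changed = true
    while changed do
        changed <- propagateStep width height src dst
        // Ping-pong: swap the buffer roles instead of branching on i % 2.
        let tmp = src
        src <- dst
        dst <- tmp
        iterations <- iterations + 1
    src, iterations

// Usage on a tiny 4x3 image ('#' is foreground).
let rows = [ "##.."; "...#"; "#.##" ]
let mask = rows |> String.concat "" |> Seq.map (fun c -> c = '#') |> Seq.toArray
let labels, iterations = labelComponents 4 3 mask
printfn "labels = %A, passes = %d" labels iterations
```

The number of passes grows with the longest path inside a component, which is why the real loop runs hundreds of times and why the per-iteration host overhead matters so much.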
