Iterating the same kernel many times, asynchronously, is slow *on the CPU*. Plus a request for a connected components labelling algorithm

In an experiment with OpenCL for the voxlogica project (search for voxlogica on Google if you wish, as I can’t include links), we need to run the same kernel many times, asynchronously, with each launch depending on the previous one (chained using events). The algorithm works like a cellular automaton, so it needs global synchronization between iterations, and I would not be surprised if that made it slow on the GPU. What does surprise me is that it is slow on the CPU side: even though the work is offloaded to the GPU, the CPU still spends a lot of time waiting between iterations.

The code (it’s F#, but read it as pseudocode if you wish!):

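    // Launch the same kernel 651 times, swapping the buffers bound to arguments 1 and 2 after each launch.
    for i = 0 to 650 do
        // Enqueue one pass over the whole image (Width x Height work items), chained through `events`.
        queue.Execute(kernel.[1], null, [|int64 img.BaseImg.Width; int64 img.BaseImg.Height|], null, events)
        // Set the arguments used by the next launch.
        if i % 2 = 0 then
            kernel.[1].SetMemoryArgument(1, obufs.[0])
            kernel.[1].SetMemoryArgument(2, obufs.[1])
        else
            kernel.[1].SetMemoryArgument(1, obufs.[1])
            kernel.[1].SetMemoryArgument(2, obufs.[0])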

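To make the pattern concrete for readers who don't know it, here is a minimal CPU-only sketch of the kind of iteration the loop above performs. It is not the actual voxlogica kernel: the names `step` and `propagate` are hypothetical, and the update rule (each foreground pixel takes the maximum label among itself and its 4-neighbours) is an assumption chosen only because it behaves like a cellular automaton and happens to label connected components. Each call to `step` stands in for one kernel launch, and the two arrays swap roles exactly as `obufs.[0]` and `obufs.[1]` do.

```fsharp
// One "kernel launch": read labels from src, write updated labels to dst.
// Returns true if any label changed. 0 means background and is never relabelled.
let step (width: int) (height: int) (src: int[]) (dst: int[]) =
    let mutable changed = false
    for y in 0 .. height - 1 do
        for x in 0 .. width - 1 do
            let i = y * width + x
            let mutable best = src.[i]
            if best > 0 then
                // Assumed rule: take the largest label in the 4-neighbourhood.
                if x > 0 then best <- max best src.[i - 1]
                if x < width - 1 then best <- max best src.[i + 1]
                if y > 0 then best <- max best src.[i - width]
                if y < height - 1 then best <- max best src.[i + width]
            dst.[i] <- best
            if best <> src.[i] then changed <- true
    changed

// Ping-pong between two buffers until a step changes nothing.
// Returns the final labels and the number of steps that were run.
let propagate (width: int) (height: int) (mask: bool[]) =
    // Every foreground pixel starts with a unique positive label.
    let a = Array.init (width * height) (fun i -> if mask.[i] then i + 1 else 0)
    let b = Array.zeroCreate<int> (width * height)
    let bufs = [| a; b |]
    let mutable i = 0
    while step width height bufs.[i % 2] bufs.[(i + 1) % 2] do
        i <- i + 1
    // The last step wrote into bufs.[(i + 1) % 2].
    bufs.[(i + 1) % 2], i + 1

// Usage: a 4x1 image with two components (pixels 0-1 and pixel 3).
let labels, steps = propagate 4 1 [| true; true; false; true |]
// labels = [|2; 2; 0; 4|], steps = 2
```

Every step depends on the complete output of the previous one, which is why the launches in our loop are chained rather than independent. The sketch stops as soon as nothing changes, while the loop above runs a fixed number of iterations.
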
Is this because the GPU is sequential, so queue.Execute needs to wait? Or maybe we’re doing something wrong on our side?

Also, a separate question: is there a known fast OpenCL implementation of connected components labelling for images that we could reuse in our free and open source project?