Iterating the same kernel many times, asynchronously, is slow on the CPU. Plus a request for a connected components labelling algorithm

vincenzoml · August 5, 2020, 8:30am

In an experiment using OpenCL for the voxlogica project (search for voxlogica on Google if you wish, as I can’t include links), we needed to run the same kernel asynchronously many times, with each iteration depending on the previous one (using events). The algorithm works like a cellular automaton, so it requires global synchronization between iterations. I would not be surprised if that made it slow on the GPU, but what surprises me is that it is slow on the CPU side: even though the work is offloaded to the GPU, the CPU spends a lot of time waiting between iterations.

The code (it’s F#, but read it as pseudocode if you wish!):

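    // Enqueue the same kernel repeatedly; `events` makes each run depend on the previous one.
    for i = 0 to 650 do
        queue.Execute(kernel.[1], null, [|int64 img.BaseImg.Width; int64 img.BaseImg.Height|], null, events)
        // Swap the two buffers for the next iteration (ping-pong):
        // the output of this run becomes the input of the next one.
        if i % 2 = 0 then
            kernel.[1].SetMemoryArgument(1, obufs.[0])
            kernel.[1].SetMemoryArgument(2, obufs.[1])
        else
            kernel.[1].SetMemoryArgument(1, obufs.[1])
            kernel.[1].SetMemoryArgument(2, obufs.[0])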

Is this because the GPU is sequential, so queue.Execute needs to wait? Or maybe we’re doing something wrong on our side?

Also, a separate question: is there a known fast OpenCL implementation of connected components labelling for images that is available for reuse in our free and open source project?
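
To make the kind of iteration concrete, here is a minimal CPU-only F# sketch of connected components labelling by iterated label propagation with two ping-pong buffers. It assumes 4-connectivity, a row-major image, and 0 as the background label. The names `step` and `labelComponents` are hypothetical, and this is not the voxlogica kernel or a fast algorithm; it only illustrates why each pass depends on the result of the previous one.

```fsharp
// Illustrative sketch only (hypothetical names, not the voxlogica OpenCL kernel).

/// One propagation pass: every foreground pixel takes the maximum label among
/// itself and its 4-connected neighbours. Background pixels (label 0) stay 0.
/// Returns true if at least one label changed.
let step (width: int) (height: int) (src: int[]) (dst: int[]) : bool =
    let mutable changed = false
    for y in 0 .. height - 1 do
        for x in 0 .. width - 1 do
            let i = y * width + x
            let mutable best = src.[i]
            if best > 0 then
                // Background neighbours hold 0, so `max` ignores them.
                if x > 0 then best <- max best src.[i - 1]
                if x < width - 1 then best <- max best src.[i + 1]
                if y > 0 then best <- max best src.[i - width]
                if y < height - 1 then best <- max best src.[i + width]
            dst.[i] <- best
            if best <> src.[i] then changed <- true
    changed

/// Repeats `step` until no label changes, swapping the two buffers each pass.
let labelComponents (width: int) (height: int) (mask: bool[]) : int[] =
    // Initial labels: a unique positive id per foreground pixel, 0 for background.
    let mutable src = mask |> Array.mapi (fun i fg -> if fg then i + 1 else 0)
    let mutable dst = Array.zeroCreate<int> (width * height)
    let mutable running = true
    while running do
        running <- step width height src dst
        // Ping-pong: the output of this pass is the input of the next one.
        let tmp = src
        src <- dst
        dst <- tmp
    src
```

In this sketch every pass reads the complete output of the previous pass, which is the global synchronization point mentioned above. The number of passes grows with the longest path inside a component, so a fixed iteration count is only safe if it is at least that large.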

system · February 4, 2021, 8:30am

This topic was automatically closed 183 days after the last reply. New replies are no longer allowed.
