What is the best way to return dynamically sized results from a kernel?

Dear *,
I am working on a simple kernel that extracts the contour pixels from a large binary image (e.g. 20,000 x 20,000), and here is the issue:
Because the number of contour pixels to be returned can vary, I cannot give the kernel a fixed-size buffer to fill, so I am looking for the solution with the lowest running time.
The execution time was 200 ms when I could provide a buffer large enough to hold the largest possible result, but this is not possible in general and not efficient. With a divide-and-conquer approach, where each iteration uses the GPU's full capabilities (compute units and work-group size), I reached ~1200 ms, because I had to read back the result of each iteration.
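
To make the variable-size output problem concrete, here is a minimal CPU-side sketch in plain C of one common pattern: count first, turn the counts into offsets, then fill an exactly sized buffer. It is not my actual kernel. The names `is_contour` and `find_contours` are hypothetical, the contour definition (a set pixel with at least one unset 4-neighbour, with pixels outside the image treated as unset) is an assumption, and each image row merely stands in for what one work-group might process.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumed definition: a contour pixel is a set pixel with at least one
   unset 4-neighbour; pixels outside the image count as unset. */
static int is_contour(const uint8_t *img, size_t w, size_t h,
                      size_t x, size_t y)
{
    if (!img[y * w + x]) return 0;
    if (x == 0 || y == 0 || x == w - 1 || y == h - 1) return 1;
    return !img[y * w + x - 1] || !img[y * w + x + 1] ||
           !img[(y - 1) * w + x] || !img[(y + 1) * w + x];
}

/* Returns the number of contour pixels and stores a malloc'ed array of
   their linear indices (y * w + x) in *out; the caller frees it.
   Returns 0 with *out == NULL if an allocation fails.
   A 32-bit index is enough for 20,000 x 20,000 = 4e8 pixels. */
size_t find_contours(const uint8_t *img, size_t w, size_t h, uint32_t **out)
{
    assert(img != NULL && out != NULL && w > 0 && h > 0);
    *out = NULL;

    /* Pass 1: count per row and build a running (exclusive prefix) sum,
       so offsets[y] is where row y starts writing. */
    size_t *offsets = malloc((h + 1) * sizeof *offsets);
    if (!offsets) return 0;
    offsets[0] = 0;
    for (size_t y = 0; y < h; ++y) {
        size_t count = 0;
        for (size_t x = 0; x < w; ++x)
            count += (size_t)is_contour(img, w, h, x, y);
        offsets[y + 1] = offsets[y] + count;
    }

    /* The total is now known, so the output can be sized exactly. */
    size_t total = offsets[h];
    uint32_t *result = malloc((total ? total : 1) * sizeof *result);
    if (!result) { free(offsets); return 0; }

    /* Pass 2: every row writes into its own slice; no two rows overlap. */
    for (size_t y = 0; y < h; ++y) {
        size_t pos = offsets[y];
        for (size_t x = 0; x < w; ++x)
            if (is_contour(img, w, h, x, y))
                result[pos++] = (uint32_t)(y * w + x);
    }

    free(offsets);
    *out = result;
    return total;
}
```

In this sketch only the total has to be known before the output is allocated, so nothing needs to be sized for the worst case. Whether the extra counting pass pays off on a GPU is exactly what I am unsure about.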

I wonder if experienced OpenCL users can help here, as this is my first experience with it. I appreciate your time and help in advance.

Best Regards,