Reading from GPU memory is super slow

I am trying to copy about a megabyte or two of pixel data, but it takes 19 s on modern hardware (i5-12400F, GTX 1060 6GB, and 3200 MHz CL16 RAM).
The full code is here if you need it: GitHub - StiglCZ/RayTracer: Attempt at a raytracer

I have already tried running the read asynchronously and then calling queue.finish(), but it still took the same amount of time. I also tried CL_MEM_ALLOC_HOST_PTR, which made no difference, and CL_MEM_USE_HOST_PTR crashes. I'm not really sure what to do from here.
Is writing 3x to the output buffer the problem?
Should I create a separate kernel just to copy the data or something?
Is it wrong for my output buffer to be in the __global address space?
Please send help, because 19 s for a megabyte or two seems just… wrong