Slow write from a __private double to a __global double (buffer)

I have a kernel that works well. It declares a __private variable, which belongs to the current work-item, and the calculation results are accumulated in it. Afterwards I need to copy that value to a buffer, and this step takes a lot of time. I found this by commenting out the line partial[iJob] = Result;, where partial is a kernel argument (__global double * partial). After the line is commented out, the execution time is less than one millisecond; with the line in place, it is about 9 seconds. At first I assumed the buffer itself was slow, but if I store any other variable instead of Result, the kernel is fast again. How can I deal with this problem, and could it be my own mistake?

After the line is commented out, the execution time is less than one millisecond […]

I suspect that commenting the line out lets the compiler dead-code-eliminate large parts of your kernel, and that this is what changes the execution time so dramatically. Do you have a way to check whether the code size of your kernel is similar before and after this change?

If not, you could probably compare the number of lines of LLVM IR generated by Clang as a reasonable proxy. Could you try performing this experiment using the online Compiler Explorer?
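
To make the dead-code-elimination point concrete, here is a minimal sketch in plain C that stands in for an OpenCL kernel. Everything in it is hypothetical: expensive_result, work_item_store_result, work_item_store_constant and the loop body are illustrative assumptions, not your actual kernel. In the first variant the stored value depends on the loop, so the loop has to run. In the second variant nothing observable depends on the loop, so an optimizing compiler is allowed to remove it, and the "kernel" then appears to run almost instantly.

```c
#include <stddef.h>

/* Hypothetical stand-in for the expensive per-work-item calculation. */
static double expensive_result(size_t i_job, int iterations)
{
    double result = 0.0;
    for (int k = 1; k <= iterations; ++k)
        result += (double)(i_job + 1) / (double)k;
    return result;
}

/* Variant A: the computed value is stored to the output buffer,
 * so the loop above contributes to an observable result and must run. */
void work_item_store_result(double *partial, size_t i_job, int iterations)
{
    double result = expensive_result(i_job, iterations);
    partial[i_job] = result;
}

/* Variant B: a constant is stored instead. The computed value is never
 * used, so an optimizer may delete the whole calculation as dead code. */
void work_item_store_constant(double *partial, size_t i_job, int iterations)
{
    double result = expensive_result(i_job, iterations);
    (void)result;          /* unused: nothing observable depends on it */
    partial[i_job] = 1.0;
}
```

If this is what happens in your kernel, the 9 seconds would be the real cost of the calculation rather than the cost of the store, and the sub-millisecond timings would be measuring a kernel that no longer does the work. Comparing the generated code for the two variants is one way to confirm or rule this out.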

OK, I will try. It is worth noting that if I write any other number or variable to the buffer, the kernel also runs fast.