Same kernels, but wrong results on Mali

Hello Khronos Community,

I have a pipeline of OpenCL kernels that are applied one after another. The pattern is:

output[i] = kernel[i](output[i - 1])
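
To make the pattern concrete, here is a minimal host-side sketch in plain C. It is not the actual OpenCL code: the two stage functions, the buffer size N, and the name run_pipeline are made up purely for illustration, with ordinary C functions standing in for the kernels.

```c
#include <assert.h>
#include <stddef.h>

#define N 4  /* elements per buffer; illustrative value only */

/* A "stage" stands in for one kernel: it reads one buffer and writes another. */
typedef void (*stage_fn)(const float *in, float *out, size_t n);

/* Hypothetical stage: adds 1 to every element. */
static void stage_add_one(const float *in, float *out, size_t n) {
    for (size_t k = 0; k < n; ++k) out[k] = in[k] + 1.0f;
}

/* Hypothetical stage: doubles every element. */
static void stage_double(const float *in, float *out, size_t n) {
    for (size_t k = 0; k < n; ++k) out[k] = in[k] * 2.0f;
}

/* outputs[0] holds the initial input. Stage i (1-based) reads outputs[i - 1]
   and writes outputs[i], so every intermediate result is kept for inspection. */
static void run_pipeline(const stage_fn *stages, size_t num_stages,
                         float outputs[][N]) {
    assert(stages != NULL && outputs != NULL);
    for (size_t i = 1; i <= num_stages; ++i) {
        stages[i - 1](outputs[i - 1], outputs[i], N);
    }
}
```

Keeping one buffer per stage, instead of reusing two buffers in turn, is only a choice made for this sketch so that each intermediate output can be compared later.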

The pipeline consists of many stages. On an Intel HD 530 and an NVIDIA Quadro the pipeline runs perfectly, but on a Mali-T880 (Galaxy S7) the results are correct only up to stage 50. The interesting part is that up to that stage the Mali results are absolutely identical to those of the other two GPUs. It can’t be a memory problem, because all allocated buffers are less than 70 MB, and on the S7 one can allocate up to a quarter of the 4 GB of RAM. It can’t be a synchronization problem either, because I call clFinish(queue) before reading the results. All three devices use OpenCL 1.2.
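
The per-stage comparison described above can be sketched roughly as follows in plain C. It assumes the outputs of every stage have already been read back from a reference device and from the device under test into two flat host arrays, stage after stage. The function name first_divergent_stage, the flat layout, and the tolerance parameter are assumptions made for illustration and are not part of any OpenCL API.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the 0-based index of the first stage whose output on the tested
   device differs from the reference by more than tol in any element, or -1
   if every stage matches. Both arrays hold num_stages blocks of n floats. */
static long first_divergent_stage(const float *ref, const float *dev,
                                  size_t num_stages, size_t n, float tol) {
    assert(ref != NULL && dev != NULL);
    for (size_t s = 0; s < num_stages; ++s) {
        for (size_t k = 0; k < n; ++k) {
            float a = ref[s * n + k];
            float b = dev[s * n + k];
            if (a == b) continue;          /* exact match, including equal infinities */
            float d = a - b;
            if (d < 0.0f) d = -d;          /* absolute difference without math.h */
            if (!(d <= tol)) return (long)s;  /* negated test also flags NaN */
        }
    }
    return -1;
}
```

With a tolerance of 0 this checks for bit-for-bit equal values (NaNs excepted); a small positive tolerance allows for the rounding differences that different GPUs may legitimately produce.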

I suppose it might be a hardware problem, but before I settle on that assumption I would like to know whether anyone else has encountered this.

Any suggestions are very welcome!

Best Wishes,
James.