Matrix multiplication - Different behavior on different devices

I am experimenting with OpenCL using examples from the book OpenCL in Action, and I see different behavior on different devices. Below is the matrix multiplication kernel I am having trouble with.

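__kernel void matrix_mult(__global float4 *a_mat,
      __global float4 *b_mat, __global float *c_mat) {

   float sum;

   /* One work-item per row, so the global size equals the matrix dimension */
   int num_rows = get_global_size(0);
   int vectors_per_row = num_rows/4;

   /* Advance a_mat and c_mat to the start of this work-item's row */
   int start = get_global_id(0) * vectors_per_row;
   a_mat += start;
   c_mat += start*4;

   /* c[row][i] = dot product of this row of a_mat with row i of b_mat */
   for(int i=0; i<num_rows; i++) {
      sum = 0.0f;
      for(int j=0; j<vectors_per_row; j++) {
         sum += dot(a_mat[j], b_mat[i * vectors_per_row + j]);
      }
      c_mat[i] = sum;
   }
}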

I tested with an AMD GPU, an Intel GPU and an Intel CPU. All three devices complete the kernel when the matrix size is 1024x1024 (the kernel is executed with the global size set to 1024). However, when I increase the matrix size to 2048x2048 (the kernel is then executed with the global size set to 2048), the AMD GPU and the Intel CPU still complete successfully, but the kernel on the Intel GPU hangs and never returns.
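For the devices that do finish, a plain C reference on the host is one way to check that they all produce the same result. The sketch below is not from the book; matrix_mult_ref and max_abs_diff are names I made up for illustration. It mirrors the kernel as written, which takes dot products of rows of a_mat with rows of b_mat, so it assumes b already holds the transposed right-hand matrix.

```c
#include <assert.h>
#include <stddef.h>

/* CPU reference for the kernel: c[row][i] = dot(row `row` of a, row i of b).
   Assumption: b is stored the same way the kernel reads it (row by row),
   i.e. it holds the transposed right-hand matrix.
   n must be a multiple of 4, as the float4 kernel requires. */
void matrix_mult_ref(const float *a, const float *b, float *c, size_t n)
{
    assert(n % 4 == 0);
    for (size_t row = 0; row < n; row++) {   /* one kernel work-item per row */
        for (size_t i = 0; i < n; i++) {
            float sum = 0.0f;
            for (size_t j = 0; j < n; j++)
                sum += a[row * n + j] * b[i * n + j];
            c[row * n + i] = sum;
        }
    }
}

/* Largest absolute element-wise difference between two buffers. */
float max_abs_diff(const float *x, const float *y, size_t count)
{
    float worst = 0.0f;
    for (size_t k = 0; k < count; k++) {
        float d = x[k] - y[k];
        if (d < 0.0f)
            d = -d;
        if (d > worst)
            worst = d;
    }
    return worst;
}
```

After reading c_mat back from a device, max_abs_diff(device_c, ref_c, n * n) should be small. It will usually not be exactly zero, because the device may add the products in a different order than the host loop does.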

The issue seems device-specific. If I comment out the line inside the inner for loop (i.e. the line with sum += dot…), the Intel GPU completes the kernel execution.

I wonder whether the issue is related to conflicting global memory accesses to a_mat and b_mat across different processing elements.

Could anyone offer advice on how to track down a solution?

Intel advised me that the GPU hardware I am using is no longer supported by Intel Studio 2019, so my issue is most likely a hardware compatibility matter. This case can be closed.