fatal: si_isa_DS_WRITE_B32_impl: invalid address

Hi everybody,

I'm doing matrix multiplication with OpenCL.
I split the multiplication across several work-groups,
then add their partial results together in global memory.

The code below is the final step, which sums the partial results into the final result:

for (k = 0; k < group_num; ++k)
{
    /* region is rotated by group_id, so each work-group starts at a different block of 8 rows */
    region = (group_id + k) % group_num;
    c_mat[(8*region+0)*matrix_size+l] += local_output_matrix[(8*region+0)*matrix_size+l];
    c_mat[(8*region+1)*matrix_size+l] += local_output_matrix[(8*region+1)*matrix_size+l];
    c_mat[(8*region+2)*matrix_size+l] += local_output_matrix[(8*region+2)*matrix_size+l];
    c_mat[(8*region+3)*matrix_size+l] += local_output_matrix[(8*region+3)*matrix_size+l];
    c_mat[(8*region+4)*matrix_size+l] += local_output_matrix[(8*region+4)*matrix_size+l];
    c_mat[(8*region+5)*matrix_size+l] += local_output_matrix[(8*region+5)*matrix_size+l];
    c_mat[(8*region+6)*matrix_size+l] += local_output_matrix[(8*region+6)*matrix_size+l];
    c_mat[(8*region+7)*matrix_size+l] += local_output_matrix[(8*region+7)*matrix_size+l];
}


When the matrix size is 64, this code works,
but when the size is increased to 128,
the kernel fails with the message: fatal: si_isa_DS_WRITE_B32_impl: invalid address.

But if I write either

c_mat[(8*region+0)*matrix_size+l] += const;

or

temp += local_output_matrix[(8*region+7)*matrix_size+l];

the kernel runs, although the answer is obviously wrong.

Has anybody run into this fatal error before?

Thanks for your help.

Without seeing the full kernel I can only guess, but it sounds like you are exceeding the available private or local (shared) memory.