Coalesced read taking more time than expected


I have a matrix with 316 rows and 1936 columns, and the global size of the kernel is set to the same dimensions. The local range for the kernel is 128 rows by 8 columns. I also use an arbitrarily sized local memory buffer to hold the column data from the matrix. The kernel only reads from global memory and writes to the 8 local buffers, yet it takes almost 400 microseconds. I am confused about why this takes so much time. My kernel code is pasted below.

__kernel void SparseMatrixMulNew( __write_only image2d_t out_view,
__global float8* SparseMatrix,
__global float* in_view,
int in_offset,
int out_offset,
uint size,
uint in_rows,
int rows,
int outputRows,
uint sigma_left,
int upsampling_factor)
{
    int y = get_global_id(1);
    int x = get_global_id(0);
    int local_id_x = get_local_id(0);
    int local_id_y = get_local_id(1);
    int local_size_x = get_local_size(0);
    int local_size_y = get_local_size(1);
    int group_id_y = get_group_id(1);
    int local_index;

    int channel_loop;
    int mult_loop;
    int copy_loop;
    float4 col;
    __local float local_buf[4][1024];
    float sum = 0.0f;

    /* chans is not declared in this excerpt. */
    /* The body of this bounds check was lost in the post; an early return is assumed. */
    if( x >= chans || y >= rows )
        return;

    /* Each work-item copies every local_size_y-th row of column x into local memory. */
    for(copy_loop = local_id_y; copy_loop < in_rows ; copy_loop += local_size_y)
    {
        local_index = min(copy_loop * chans + x, (int)size);
        local_buf[local_id_x][copy_loop] = in_view[local_index];
    }
}

The logic is that the work-items of a work-group along the y direction each read different rows of the matrix and place them in the same array. In the same way, all the buffers are filled by the 8 sets of work-items, one set per column. As far as I understand, these reads are coalesced, so the kernel should not take 400 microseconds to execute. Please let me know why this happens.
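
To put the 400 microseconds in perspective, the launch geometry and the implied bandwidth can be worked out by hand. The C sketch below is only an illustration with hypothetical helper names. It assumes 4-byte floats, that every matrix element is read exactly once, and that the number of work-groups is the global extent divided by the local extent, rounded up.

```c
#include <assert.h>

/* Bytes occupied by a rows x cols matrix of 4-byte floats. */
double matrix_bytes(int rows, int cols)
{
    assert(rows > 0 && cols > 0);
    return (double)rows * (double)cols * 4.0;
}

/* Effective bandwidth in GB/s (1 GB = 1e9 bytes) for moving
 * `bytes` bytes in `microseconds` microseconds. */
double effective_gb_per_s(double bytes, double microseconds)
{
    assert(microseconds > 0.0);
    return bytes / (microseconds * 1e-6) / 1e9;
}

/* Number of work-groups needed to cover global_extent items with
 * groups of local_extent items (ceiling division). */
int groups_along(int global_extent, int local_extent)
{
    assert(global_extent >= 0 && local_extent > 0);
    return (global_extent + local_extent - 1) / local_extent;
}
```

For the numbers in this post, matrix_bytes(316, 1936) is 2447104 bytes, and effective_gb_per_s(2447104.0, 400.0) is roughly 6.1 GB/s under the read-once assumption. groups_along(316, 128) gives 3 work-groups along y and groups_along(1936, 8) gives 242 along x.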