Hi,

I am writing some test code to compare various matrix-vector multiplication routines.

So far my code works, but the transposed multiplication is far slower than the normal one. The only thing I changed is the indexing, which I suspect is the problem. How else could I loop through the matrix to make the transposed routine faster? The matrix itself is not stored transposed.

Matrix A (m rows, n columns) is stored in column-major order.
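To pin down the layout and the two products, here is a minimal host-side C sketch I could check the kernels against. It assumes `scalar_t` is `float` (in the kernels it comes from a typedef set elsewhere), and the names `idx`, `ref_gemv` and `ref_gemvt` are made up for illustration; they are not part of any library.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: scalar_t is float here. */
typedef float scalar_t;

/* Column-major: element (i, j) of the m-by-n matrix lives at a[i + m*j]. */
static size_t idx(int i, int j, int m)
{
    return (size_t)i + (size_t)m * (size_t)j;
}

/* y = A * x, with x of length n and y of length m. */
void ref_gemv(const scalar_t *a, const scalar_t *x, scalar_t *y, int m, int n)
{
    assert(m > 0 && n > 0);
    for (int i = 0; i < m; i++) {
        scalar_t sum = 0.0f;
        for (int j = 0; j < n; j++)
            sum += a[idx(i, j, m)] * x[j];
        y[i] = sum;
    }
}

/* y = A^T * x, with x of length m and y of length n.
   A stays in its original column-major storage. */
void ref_gemvt(const scalar_t *a, const scalar_t *x, scalar_t *y, int m, int n)
{
    assert(m > 0 && n > 0);
    for (int i = 0; i < n; i++) {
        scalar_t sum = 0.0f;
        for (int j = 0; j < m; j++)
            sum += a[idx(j, i, m)] * x[j];
        y[i] = sum;
    }
}
```

For a 2-by-3 matrix with rows (1, 2, 3) and (4, 5, 6), the column-major array is `{1, 4, 2, 5, 3, 6}`.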

This is the normal routine:

```
__kernel void gemv1(__global const scalar_t * a, __global const scalar_t * x,
                    __global scalar_t * y, int m, int n)
{
    scalar_t sum = 0.0f;
    int i = get_global_id(0); // row index
    for (int j = 0; j < n; j++)
    {
        sum += a[i + m*j] * x[j]; // element (i, j) of A
    }
    y[i] = sum;
}
```

This is the slow transposed routine:

```
__kernel void gemvt1(__global const scalar_t * a, __global const scalar_t * x,
                     __global scalar_t * y, int m, int n)
{
    scalar_t sum = 0.0f;
    int i = get_global_id(0); // column index of A (row index of A^T)
    for (int j = 0; j < m; j++)
    {
        sum += a[j + m*i] * x[j]; // element (j, i) of A
    }
    y[i] = sum;
}
```
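To make my suspicion about the indexing concrete, here is a small host-side C sketch of which offsets into `a` neighbouring work-items read at the same loop iteration. The helper names (`gemv1_offset`, `gemvt1_offset`, `neighbour_stride`) are hypothetical and exist only for this illustration. If I understand memory coalescing correctly, a stride of 1 between neighbours is what GPUs typically handle well, while a stride of m is not, but I have not profiled this.

```c
#include <assert.h>
#include <stddef.h>

/* Offset into a[] read by work-item i at loop iteration j in gemv1. */
size_t gemv1_offset(int i, int j, int m)
{
    assert(m > 0);
    return (size_t)i + (size_t)m * (size_t)j;
}

/* Offset into a[] read by work-item i at loop iteration j in gemvt1. */
size_t gemvt1_offset(int i, int j, int m)
{
    assert(m > 0);
    return (size_t)j + (size_t)m * (size_t)i;
}

/* Distance, in elements, between the reads of work-items i and i+1
   at the same loop iteration j. */
size_t neighbour_stride(size_t (*offset)(int, int, int), int i, int j, int m)
{
    return offset(i + 1, j, m) - offset(i, j, m);
}
```

With m = 1024, `neighbour_stride(gemv1_offset, 0, 0, 1024)` gives 1, while `neighbour_stride(gemvt1_offset, 0, 0, 1024)` gives 1024: in the transposed kernel each work-item walks its own column contiguously, but neighbouring work-items are a full column apart on every read.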

I also have more complex versions that block the matrices and vectors, but I would like to get the simple code running well first.

Thanks in advance!