Does anyone know of a fast GPU matrix multiplication algorithm or code that works for arbitrary matrix sizes?

The matrix multiplication sample from the NVIDIA SDK seems to work only when the input matrix dimensions are multiples of 16. For example, with a 127x127 input matrix it returns wrong results.

I assume that each thread or work-item calculates one element of the resulting matrix. Most of my GPU coding experience is with CUDA and I am still getting used to the OpenCL terminology, so the terms I use in my answer may reflect that.

There are two things you can do to adapt the example to matrices of other sizes:

Have each thread check whether it is within the bounds of the resulting matrix; if it is not, have it exit.

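Pad your matrices with zeros so that each dimension is a multiple of the block size. The zero entries contribute nothing to the dot products, so you can simply ignore the extra rows and columns of the result.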
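Here is a rough sketch of both options in CUDA. It is not the SDK sample's code: it assumes row-major float matrices and a plain one-thread-per-element kernel without shared memory, and all names in it (TILE, round_up, matmul_guarded, launch_guarded, pad_matrix) are made up for illustration.

```cuda
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

// Assumed block edge; 16 matches the multiple mentioned in the question.
constexpr int TILE = 16;

// Round n up to the next multiple of TILE (e.g. 127 -> 128).
inline int round_up(int n) { return (n + TILE - 1) / TILE * TILE; }

// Option 1: bounds check. Computes C (M x N) = A (M x K) * B (K x N),
// one thread per element of C, all matrices row-major.
__global__ void matmul_guarded(const float* A, const float* B, float* C,
                               int M, int N, int K)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    // Threads in the partial blocks at the right and bottom edges fall
    // outside C; they exit without reading or writing anything.
    if (row >= M || col >= N) return;

    float acc = 0.0f;
    for (int k = 0; k < K; ++k)
        acc += A[row * K + k] * B[k * N + col];
    C[row * N + col] = acc;
}

// Launch enough TILE x TILE blocks to cover all of C, rounding up so the
// last partial row/column of blocks is included.
void launch_guarded(const float* dA, const float* dB, float* dC,
                    int M, int N, int K)
{
    dim3 block(TILE, TILE);
    dim3 grid(round_up(N) / TILE, round_up(M) / TILE);
    matmul_guarded<<<grid, block>>>(dA, dB, dC, M, N, K);
}

// Option 2: padding. Copies a rows x cols row-major matrix into a
// zero-filled buffer whose dimensions are rounded up to multiples of TILE.
// The padded row stride is round_up(cols).
std::vector<float> pad_matrix(const std::vector<float>& src, int rows, int cols)
{
    int prows = round_up(rows);
    int pcols = round_up(cols);
    std::vector<float> dst(static_cast<std::size_t>(prows) * pcols, 0.0f);
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c)
            dst[static_cast<std::size_t>(r) * pcols + c] =
                src[static_cast<std::size_t>(r) * cols + c];
    return dst;
}
```

With padding, a 127x127 input becomes 128x128, the unmodified kernel runs on the padded buffers, and you copy back only the top-left 127x127 part of the result. One caveat on the bounds check: if your kernel uses shared-memory tiles with __syncthreads(), an early return is risky because every thread of the block is expected to reach the barrier. In that case it is probably safer to keep all threads alive, load zeros for out-of-range elements, and only guard the final store.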