Small matrix 4x4 multiplication in OpenCL kernel

I have to do frequent 4x4 matrix multiplications in a kernel, and my matrices are stored as float16. Is there any way to do this efficiently using vectorization inside the kernel?

Constant-sized loops over data in the private memory space should be a breeze for an optimizing compiler to handle in the most efficient way. You can try to manually unroll your function and shuffle the operations around, but I doubt it will be more efficient.
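To illustrate what "constant-sized loops" means here, the following is a minimal sketch in plain C so that it can be checked on the host. The function name `mat4_mul` and the flat 16-element array layout (mirroring the s0..sF component order of a 16-wide vector) are assumptions made for the example, not something from the question. Whether a given OpenCL compiler fully unrolls such loops depends on the implementation.

```c
/* Multiply two row-major 4x4 matrices stored as 16 consecutive floats.
   Element (row, col) lives at index row * 4 + col, which mirrors the
   s0..sF component order assumed for a 16-wide vector. */
static void mat4_mul(const float m[16], const float n[16], float out[16])
{
    /* All three trip counts are compile-time constants (4), so an
       optimizing compiler is free to unroll the loops completely. */
    for (int row = 0; row < 4; ++row) {
        for (int col = 0; col < 4; ++col) {
            float acc = 0.0f;
            for (int k = 0; k < 4; ++k) {
                acc += m[row * 4 + k] * n[k * 4 + col];
            }
            out[row * 4 + col] = acc;
        }
    }
}
```

The output array must not alias either input, since results are written while the inputs are still being read.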

I did the multiplication manually, as the dot function doesn't work on the Nvidia GPU. I have implemented it as shown below; is it possible to optimize the code?

    int16 matrixMult4x4f(int16 M, int16 N)
    {
        //return M.sA*N.s1;
        // Row-major layout: each group below is one row of the result.
        int16 tmp = (int16)(
            M.s0*N.s0+M.s1*N.s4+M.s2*N.s8+M.s3*N.sC , M.s0*N.s1+M.s1*N.s5+M.s2*N.s9+M.s3*N.sD , M.s0*N.s2+M.s1*N.s6+M.s2*N.sA+M.s3*N.sE , M.s0*N.s3+M.s1*N.s7+M.s2*N.sB+M.s3*N.sF ,

            M.s4*N.s0+M.s5*N.s4+M.s6*N.s8+M.s7*N.sC , M.s4*N.s1+M.s5*N.s5+M.s6*N.s9+M.s7*N.sD , M.s4*N.s2+M.s5*N.s6+M.s6*N.sA+M.s7*N.sE , M.s4*N.s3+M.s5*N.s7+M.s6*N.sB+M.s7*N.sF ,

            M.s8*N.s0+M.s9*N.s4+M.sA*N.s8+M.sB*N.sC , M.s8*N.s1+M.s9*N.s5+M.sA*N.s9+M.sB*N.sD , M.s8*N.s2+M.s9*N.s6+M.sA*N.sA+M.sB*N.sE , M.s8*N.s3+M.s9*N.s7+M.sA*N.sB+M.sB*N.sF ,

            M.sC*N.s0+M.sD*N.s4+M.sE*N.s8+M.sF*N.sC , M.sC*N.s1+M.sD*N.s5+M.sE*N.s9+M.sF*N.sD , M.sC*N.s2+M.sD*N.s6+M.sE*N.sA+M.sF*N.sE , M.sC*N.s3+M.sD*N.s7+M.sE*N.sB+M.sF*N.sF);

        return tmp;
    }

If there is a way, the compiler already does that for you.
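For reference, the row-wise form that such vectorization would typically amount to can be sketched as follows: each row of the result is a sum of the rows of N, each scaled by one element of the matching row of M. This is plain C for host-side checking; `vec4`, `vec4_madd` and `mat4_mul_rows` are hypothetical names standing in for a 4-wide vector type and are not part of any OpenCL API. In OpenCL C the same idea would presumably be written with swizzles such as `M.s0 * N.s0123 + M.s1 * N.s4567 + ...`, and whether that beats the scalar expansion is something only a measurement on the target device can tell.

```c
/* Hypothetical 4-lane vector type standing in for a 4-wide OpenCL vector. */
typedef struct { float lane[4]; } vec4;

/* Returns acc + s * v, computed lane by lane. */
static vec4 vec4_madd(vec4 acc, float s, vec4 v)
{
    for (int i = 0; i < 4; ++i) {
        acc.lane[i] += s * v.lane[i];
    }
    return acc;
}

/* Row-wise 4x4 product: row i of the result is
   m[i][0]*n[0] + m[i][1]*n[1] + m[i][2]*n[2] + m[i][3]*n[3],
   where n[k] denotes a whole row of the right-hand matrix. */
static void mat4_mul_rows(const vec4 m[4], const vec4 n[4], vec4 out[4])
{
    for (int i = 0; i < 4; ++i) {
        vec4 acc = {{0.0f, 0.0f, 0.0f, 0.0f}};
        for (int k = 0; k < 4; ++k) {
            acc = vec4_madd(acc, m[i].lane[k], n[k]);
        }
        out[i] = acc;
    }
}
```

This performs the same 64 multiplications as the fully expanded version; only the grouping changes, so that every step operates on a whole row at once.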