Small matrix 4x4 multiplication in OpenCL kernel

rajatkoner · July 17, 2016, 4:02pm

Hi,
I have to do frequent 4x4 matrix multiplications in a kernel. My matrices are stored as float16. Is there any way to do this efficiently using vectorization inside the kernel?

Salabar · July 18, 2016, 2:44am

Constant-sized loops in the private memory space should be a breeze for an optimizing compiler to handle in the most efficient way. You can try to manually unroll your function and shuffle the operations around, but I doubt it will be more efficient.
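For illustration, a minimal sketch of the kind of constant-sized loop meant here. It is written as plain C99 so it can be compiled and checked on the host; the function name `mat4_mul`, the row-major layout, and the use of `float` arrays instead of `float16` are assumptions made for this example and do not come from the thread. The same function body should also be acceptable as OpenCL C, where unqualified pointer parameters refer to private memory.

```c
/* Hypothetical helper (name and layout are assumptions, not from the thread).
 * Multiplies two row-major 4x4 matrices: out = a * b.
 * All three loops have a fixed trip count of 4, so an optimizing compiler
 * is free to unroll them completely. `out` must not alias `a` or `b`. */
void mat4_mul(const float a[16], const float b[16], float out[16])
{
    for (int r = 0; r < 4; ++r) {          /* row of a   */
        for (int c = 0; c < 4; ++c) {      /* column of b */
            float acc = 0.0f;
            for (int k = 0; k < 4; ++k)
                acc += a[4 * r + k] * b[4 * k + c];
            out[4 * r + c] = acc;
        }
    }
}
```

Whether this ends up as fast as a hand-unrolled version depends on the compiler and device, so it is worth measuring rather than assuming.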

rajatkoner · July 21, 2016, 12:47pm

I did the multiplication manually, as the dot function doesn't work on my Nvidia GPU. I implemented it as shown below. Is it possible to optimize the code?

    int16 matrixMult4x4f(int16 M, int16 N)
    {
        //return M.sA*N.s1;
        int16 tmp = (int16){
            // row 0
            M.s0*N.s0+M.s1*N.s4+M.s2*N.s8+M.s3*N.sC , M.s0*N.s1+M.s1*N.s5+M.s2*N.s9+M.s3*N.sD , M.s0*N.s2+M.s1*N.s6+M.s2*N.sA+M.s3*N.sE , M.s0*N.s3+M.s1*N.s7+M.s2*N.sB+M.s3*N.sF ,
            // row 1
            M.s4*N.s0+M.s5*N.s4+M.s6*N.s8+M.s7*N.sC , M.s4*N.s1+M.s5*N.s5+M.s6*N.s9+M.s7*N.sD , M.s4*N.s2+M.s5*N.s6+M.s6*N.sA+M.s7*N.sE , M.s4*N.s3+M.s5*N.s7+M.s6*N.sB+M.s7*N.sF ,
            // row 2
            M.s8*N.s0+M.s9*N.s4+M.sA*N.s8+M.sB*N.sC , M.s8*N.s1+M.s9*N.s5+M.sA*N.s9+M.sB*N.sD , M.s8*N.s2+M.s9*N.s6+M.sA*N.sA+M.sB*N.sE , M.s8*N.s3+M.s9*N.s7+M.sA*N.sB+M.sB*N.sF ,
            // row 3
            M.sC*N.s0+M.sD*N.s4+M.sE*N.s8+M.sF*N.sC , M.sC*N.s1+M.sD*N.s5+M.sE*N.s9+M.sF*N.sD , M.sC*N.s2+M.sD*N.s6+M.sE*N.sA+M.sF*N.sE , M.sC*N.s3+M.sD*N.s7+M.sE*N.sB+M.sF*N.sF};

        return tmp;
    }
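One way to read the unrolled expression above: each output row is a sum of the four rows of N, each scaled by one entry of the matching row of M. The sketch below spells that out in plain C99 so it can be checked on the host. The name `mat4_mul_rows` and the `float` element type are assumptions for illustration (the posted code uses int16, the original question mentions float16). In OpenCL C, the innermost loop over `c` would presumably map to one 4-wide multiply-add on a row swizzle such as `N.s0123`, but that mapping is a suggestion, not something confirmed in this thread.

```c
/* Hypothetical helper (name and element type are assumptions).
 * Row-wise 4x4 product, row-major: out_row[r] = sum_k m[r][k] * n_row[k].
 * `out` must not alias `m` or `n`. */
void mat4_mul_rows(const float m[16], const float n[16], float out[16])
{
    for (int r = 0; r < 4; ++r) {
        /* start the output row at zero */
        for (int c = 0; c < 4; ++c)
            out[4 * r + c] = 0.0f;

        for (int k = 0; k < 4; ++k) {
            float s = m[4 * r + k];        /* scalar weight for row k of n */
            for (int c = 0; c < 4; ++c)    /* 4-wide scale-and-add */
                out[4 * r + c] += s * n[4 * k + c];
        }
    }
}
```

This performs the same multiplications and additions as the fully unrolled version, only grouped by row, so any speed difference would come from how the compiler schedules it.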

Salabar · July 22, 2016, 1:43am

If there is, the compiler already does that for you.