OpenCL Optimised SGEMM implementation

I am trying to implement SGEMM for integrated GPUs such as ARM Mali Midgard GPUs or Intel GPUs.
The issue is that my GPU implementations are considerably slower than the CPU implementation. My arrays are stored in row-major order and I don't want to reshape or repack them in memory. What's the best way to do this?
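
To make the layout concrete, here is a minimal C sketch of the row-major computation, assuming tightly packed matrices (no row padding) and the usual SGEMM definition C = alpha * A * B + beta * C. The name `sgemm_row_major_ref` is illustrative and not taken from any library; it is only meant as a correctness baseline to compare GPU kernel output against, not as a tuned CPU implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Reference row-major SGEMM: C = alpha * A * B + beta * C.
 * A is M x K, B is K x N, C is M x N.
 * Assumption: all matrices are tightly packed, so element (r, c) of a
 * matrix with `cols` columns lives at index r * cols + c. */
void sgemm_row_major_ref(size_t M, size_t N, size_t K,
                         float alpha, const float *A, const float *B,
                         float beta, float *C)
{
    assert(A && B && C);

    for (size_t i = 0; i < M; ++i) {
        /* Scale the existing output row by beta first. */
        for (size_t j = 0; j < N; ++j)
            C[i * N + j] *= beta;

        /* i-k-j loop order: the inner loop walks one row of B and one
         * row of C contiguously, which suits row-major storage. */
        for (size_t k = 0; k < K; ++k) {
            const float a = alpha * A[i * K + k];
            for (size_t j = 0; j < N; ++j)
                C[i * N + j] += a * B[k * N + j];
        }
    }
}
```

The same index arithmetic (`i * K + k` for A, `k * N + j` for B, `i * N + j` for C) is what a kernel would need to use to read these buffers without any reshaping.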