OpenCL Optimised SGEMM implementation

I am trying to implement SGEMM in OpenCL for integrated GPUs such as ARM Mali Midgard GPUs or Intel GPUs.
The issue is that my GPU implementations are considerably slower than the CPU implementation. My arrays are in row-major form and I don’t want to do any memory reshaping (no transposing or repacking of the inputs). What’s the best way to do this?
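To make the question concrete, here is a minimal plain-C sketch of the operation, assuming the usual SGEMM definition C = alpha * A * B + beta * C with all three matrices row-major and untransposed. The function names `sgemm_ref` and `sgemm_tiled` and the `tile` parameter are illustrative assumptions, not part of any library. The tiled variant only shows the kind of blocking a GPU kernel might apply to the same layout without reshaping the inputs; it is not an OpenCL kernel and says nothing about which tile size suits a given device.

```c
#include <assert.h>
#include <stddef.h>

/* Reference SGEMM: C = alpha * A * B + beta * C.
   A is M x K, B is K x N, C is M x N, all row-major, no transposes. */
void sgemm_ref(size_t M, size_t N, size_t K, float alpha,
               const float *A, const float *B, float beta, float *C)
{
    for (size_t i = 0; i < M; ++i) {
        for (size_t j = 0; j < N; ++j) {
            float acc = 0.0f;
            for (size_t k = 0; k < K; ++k)
                acc += A[i * K + k] * B[k * N + j];
            C[i * N + j] = alpha * acc + beta * C[i * N + j];
        }
    }
}

/* Blocked variant over the same row-major data (hypothetical helper).
   Each (i0, j0) block of C is updated from a tile of A and a tile of B.
   The innermost loop walks along rows of B and C, so every access in it
   is unit-stride and the inputs never need to be repacked. */
void sgemm_tiled(size_t M, size_t N, size_t K, float alpha,
                 const float *A, const float *B, float beta, float *C,
                 size_t tile)
{
    assert(tile > 0);

    /* Apply beta once up front, then accumulate alpha * A * B into C. */
    for (size_t idx = 0; idx < M * N; ++idx)
        C[idx] *= beta;

    for (size_t i0 = 0; i0 < M; i0 += tile) {
        size_t i_end = (i0 + tile < M) ? i0 + tile : M;
        for (size_t k0 = 0; k0 < K; k0 += tile) {
            size_t k_end = (k0 + tile < K) ? k0 + tile : K;
            for (size_t j0 = 0; j0 < N; j0 += tile) {
                size_t j_end = (j0 + tile < N) ? j0 + tile : N;
                for (size_t i = i0; i < i_end; ++i) {
                    for (size_t k = k0; k < k_end; ++k) {
                        float a = alpha * A[i * K + k];
                        for (size_t j = j0; j < j_end; ++j)
                            C[i * N + j] += a * B[k * N + j];
                    }
                }
            }
        }
    }
}
```

Both functions take the same row-major buffers, so the reference version can be used to check the output of a blocked CPU or GPU version on small inputs.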
