OpenGL Compute vs OpenCL Performance


I have an OpenCL kernel that I ported to an OpenGL compute shader. The logically identical code runs 4x slower in OpenGL compute than in OpenCL. The kernel essentially unpacks a 1D array into a 2D image. The main difference is the input: OpenCL reads from a global 1D buffer, while OpenGL compute reads from an SSBO. The local work group size was the same in both versions for the test that showed the 4x difference.
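To make the operation concrete, here is a rough CPU-side sketch in Python of what "unpacking a 1D array into a 2D image" means. It is not the actual kernel: the function name `unpack_1d_to_2d` and the row-major layout (element `i` goes to pixel `(i % width, i // width)`) are assumptions for illustration only.

```python
# Hypothetical CPU reference for the unpack step; not the actual kernel.
# Assumes a row-major layout: element i of the 1D input lands at
# pixel (x, y) = (i % width, i // width).

def unpack_1d_to_2d(flat, width, height):
    """Unpack a flat sequence into a list of `height` rows of `width` items."""
    if len(flat) != width * height:
        raise ValueError("input length must equal width * height")
    image = [[None] * width for _ in range(height)]
    # Each loop iteration plays the role of one GPU invocation / work-item.
    for i, value in enumerate(flat):
        x, y = i % width, i // width
        image[y][x] = value
    return image


if __name__ == "__main__":
    print(unpack_1d_to_2d(list(range(6)), width=3, height=2))
    # prints [[0, 1, 2], [3, 4, 5]]
```

A reference like this can also be used to check that both GPU versions produce identical output before comparing their timings.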

If I adjust the OpenGL compute work group size for better cache locality, I can make it faster, but it’s still 2x slower than the OpenCL implementation. Any thoughts on why this might be? I’d expect the exact same logic on the same hardware to perform comparably…
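One way to reason about why the work group shape matters is to count how many distinct cache lines of the 1D input a single work group reads. The Python sketch below is a simplified model, not a measurement: it assumes 4-byte elements, 64-byte cache lines, a row-major mapping from the 2D invocation ID to the 1D index, and an image width of 1024. All of these are illustrative assumptions; real hardware differs.

```python
# Simplified model (assumptions: 4-byte elements, 64-byte cache lines,
# row-major index = y * width + x). Real GPUs may behave differently.

def cache_lines_touched(local_x, local_y, width, elem_bytes=4, line_bytes=64):
    """Count distinct cache lines of a linear buffer read by one work group
    of shape (local_x, local_y) placed at the image origin."""
    lines = set()
    for y in range(local_y):
        for x in range(local_x):
            byte_offset = (y * width + x) * elem_bytes
            lines.add(byte_offset // line_bytes)
    return len(lines)


if __name__ == "__main__":
    # Three work group shapes, each with 64 invocations.
    for shape in [(64, 1), (8, 8), (1, 64)]:
        print(shape, cache_lines_touched(*shape, width=1024))
    # prints:
    # (64, 1) 4
    # (8, 8) 8
    # (1, 64) 64
```

Under these assumptions, a wide and flat work group reads the fewest cache lines from the linear input. The 2D image side may prefer a different shape, since image memory is often tiled, so the best shape is a trade-off to be found by measurement.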

I’ve also tried reading the input through a GL_TEXTURE_BUFFER, and the performance numbers are the same. Are memory reads in OpenGL just slower than in OpenCL? Both should be going through the L2 cache only, and not through any texture or L1 caches, right?


That’s strange. I have converted OpenGL compute code to CUDA before and got exactly the same timings…