I use minimal precision in OpenCL too, and I don't use any native stuff or vendor extensions yet. The results are identical for the CL and VK branches, while the CPU version produces slightly different results. (I don't use transcendental math functions, just sqrt.)
VK has the advantage that we set memory barriers only where they're needed, while OpenCL has to figure that out itself. But in my case I almost always depend on previous results, so this probably doesn't really affect my numbers.
But it's worth mentioning that for CL I only measure time on the CPU, after calling clFinish. For VK I use proper profiling with GPU timestamps.
That's not ideal, but even if we assume I waste 0.3 ms on the finish, we'd still be talking about 1.4 vs. 2 ms.
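To make the two measurements comparable, the arithmetic looks roughly like the sketch below. The function names are made up for illustration and are not from my actual code. It assumes the raw GPU timestamps have already been read back from a query pool and that the tick period is the value Vulkan reports as `VkPhysicalDeviceLimits::timestampPeriod` (nanoseconds per tick). The 0.3 ms sync overhead is only a guess, not something I measured.

```c
#include <assert.h>
#include <stdint.h>

/* Convert two raw GPU timestamp values (e.g. read back from a Vulkan
 * timestamp query pool) into elapsed milliseconds.
 * period_ns is the number of nanoseconds per timestamp tick
 * (VkPhysicalDeviceLimits::timestampPeriod in Vulkan). */
double gpu_ticks_to_ms(uint64_t start_ticks, uint64_t end_ticks, float period_ns)
{
    assert(period_ns > 0.0f);
    assert(end_ticks >= start_ticks);
    return (double)(end_ticks - start_ticks) * (double)period_ns / 1.0e6;
}

/* Subtract an ASSUMED fixed synchronization overhead from a CPU
 * wall-clock time measured around clFinish. The overhead is a guess,
 * so the result is only an estimate; it is clamped at zero. */
double cpu_ms_minus_overhead(double measured_ms, double assumed_overhead_ms)
{
    double corrected = measured_ms - assumed_overhead_ms;
    return corrected > 0.0 ? corrected : 0.0;
}
```

For example, with a period of 1 ns per tick, a difference of 1,400,000 ticks comes out as 1.4 ms. The CPU-side number stays an estimate either way, since the real cost of the finish is unknown.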
In game dev there's the opinion that VK / DX12 make OpenCL / CUDA unnecessary, and I agree as long as CL 2.0 with data sharing isn't an option.
What's still missing is fine-grained async compute. Memory barriers are the main performance issue for me. It would be great if we could do some other work while a barrier is being processed.