I’d like to see a more accurate ULP accuracy profile and some way to detect or request it.
If you compare the minimum ULP accuracy figures in the CUDA documentation with those in the OpenCL spec, you will find that CUDA requires a higher minimum accuracy.
I.e., if you write to the CUDA API, you are guaranteed higher accuracy.
Also, OpenCL has no equivalents of the CUDA compiler options -prec-sqrt=true and -prec-div=true.
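In the meantime, delivered accuracy can at least be measured on the host by comparing device results with a double-precision reference. Below is a minimal Python sketch, assuming the single-precision results have already been read back from the device. The helper names (`float32_bits`, `ordered_key`, `ulp_distance`) are made up for illustration. It counts representable float32 steps to the rounded reference, which only approximates the spec's definition of ULP error against the infinitely precise result.

```python
import math
import struct


def float32_bits(x: float) -> int:
    """Round x to float32 and return its bit pattern as an unsigned 32-bit int."""
    # struct raises OverflowError if a finite x is too large for float32.
    return struct.unpack("<I", struct.pack("<f", x))[0]


def ordered_key(x: float) -> int:
    """Map a float32 value to an integer that grows monotonically with the float."""
    bits = float32_bits(x)
    # Negative floats have the sign bit set; mirror them below zero so that
    # adjacent floats always map to adjacent integers (+0.0 and -0.0 both map to 0).
    return -(bits & 0x7FFFFFFF) if bits & 0x80000000 else bits


def ulp_distance(result: float, reference: float) -> int:
    """Count float32 steps between result and the reference rounded to float32."""
    if math.isnan(result) or math.isnan(reference):
        raise ValueError("ULP distance is undefined for NaN")
    return abs(ordered_key(result) - ordered_key(reference))
```

Typical use would be something like `ulp_distance(device_value, math.sqrt(x))` over a large set of inputs, keeping the maximum as the observed error bound for that device and those build options.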
How about an ALU and memory-transfer instruction cycle-count profile? One could build it by benchmarking, but if a vendor already publishes that information in a manual, they might also provide it through OpenCL. This kind of information could be useful when deciding where to run kernels: on the CPU, the GPU, or some other device.
I’ve forwarded this thread to the spec editor.
Personal comment: DRAM latency is not a deterministic value. Also, ALU latencies are (a) something that hardware vendors would probably rather not disclose and (b) again, possibly non-deterministic.
The only way to know which device will run a kernel faster is to actually run it on each one.
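A rough sketch of what that can look like in practice: time the same workload on each candidate device and keep the fastest. This is Python with hypothetical names (`best_time`, `pick_fastest`). Each callable is assumed to launch the kernel on one device and block until it has finished; the OpenCL calls themselves are not shown.

```python
import time
from typing import Callable, Dict


def best_time(run: Callable[[], None], repeats: int = 5) -> float:
    """Return the shortest wall-clock time in seconds over several timed runs."""
    run()  # untimed warm-up, so one-off costs (compilation, caching) are excluded
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        run()
        best = min(best, time.perf_counter() - start)
    return best


def pick_fastest(candidates: Dict[str, Callable[[], None]], repeats: int = 5) -> str:
    """Time every candidate and return the name of the one with the lowest best time."""
    timings = {name: best_time(run, repeats) for name, run in candidates.items()}
    return min(timings, key=timings.get)
```

Taking the minimum over a few repeats is a design choice to dampen the run-to-run noise mentioned above; it does not remove it, so the result only holds for the problem size and system load that were actually measured.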