I recently bought a Vega 64. According to the specs, it offers 23 TFLOPS of fp16 throughput compared to 12 TFLOPS for fp32, so I converted a portion of my Monte Carlo code to half precision, expecting a noticeable speedup. Disappointingly, instead of gaining speed, I got a 5% slowdown.
The changes were made to a core function that I believe is the bottleneck of the code (it may account for about 1/4 of the run-time), see the key
In comparison, here is the float counterpart:
My kernel is compute-bound.
I don't know in which common scenarios converting to half brings a speedup. In my case, were the conversions or the extra registers responsible for the drop? Are there any dos and don'ts when using half?
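For a rough sense of how much speedup could be expected at best, here is a small back-of-the-envelope sketch using Amdahl's law. It assumes that the converted function really is about 1/4 of the run-time and that the fp16 part could run at the full 23/12 peak ratio from the spec sheet; both numbers are estimates, not measurements, and the function name is just for illustration.

```python
def amdahl_speedup(fraction, local_speedup):
    """Upper bound on overall speedup when only `fraction` of the
    run-time is accelerated by a factor of `local_speedup`."""
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)


# Assumed inputs taken from the post: ~1/4 of run-time converted to half,
# and the spec-sheet peak ratio of 23 TFLOPS (fp16) to 12 TFLOPS (fp32).
fraction_converted = 0.25
peak_ratio = 23.0 / 12.0

best_case = amdahl_speedup(fraction_converted, peak_ratio)
print(f"best-case overall speedup: {best_case:.3f}x")
```

Under these assumptions the ceiling is only around 1.14x, so even in the ideal case the gain would be modest, and any conversion overhead between half and float eats into it.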
PS: the code can be tested with the following commands:

git clone https://github.com/fangq/mcxcl.git
cd mcxcl
git checkout nvidiaomp
cd src
make clean all
cd ../example/benchmark
./run_benchmark1.sh -G 1 -J "-DUSE_HALF"
Removing the -J "-DUSE_HALF" option runs the original fp32 code.