disappointing half-precision performance - any advice?

fangqq · November 19, 2017, 1:58pm

I bought a Vega 64 recently. According to the specs, it has 23 TFLOPS of fp16 throughput compared to 12 TFLOPS of fp32, so I converted a portion of my Monte Carlo code to half, expecting a noticeable speedup. Disappointingly, instead of gaining speed, I got a 5% slowdown.

The changes were made to a core function, which I believe is the bottleneck of the code (it may account for about 1/4 of the run time), see the key

In comparison, here is the float counterpart:

https://github.com/fangq/mcxcl/blob/master/src/mcx_core.cl#L252-L288

static float rand_uniform01(__private RandType t[RAND_BUF_LEN]) {
    return xorshift128p_nextf(t);
}

static void xorshift128p_seed(__global uint* seed, __private RandType t[RAND_BUF_LEN]) {
    t[0] = (ulong)seed[0] << 32 | seed[1];
    t[1] = (ulong)seed[2] << 32 | seed[3];
}

static void gpu_rng_init(__private RandType t[RAND_BUF_LEN], __global uint* n_seed, int idx) {
    xorshift128p_seed((n_seed + idx * RAND_SEED_LEN), t);
}

#endif

float rand_next_scatlen(__private RandType t[RAND_BUF_LEN]);

float rand_next_scatlen(__private RandType t[RAND_BUF_LEN]) {
    return -MCX_MATHFUN(log)(rand_uniform01(t) + EPS);

This file has been truncated.

My kernel is compute-bound.

I don't know in which common scenarios converting to half brings a speedup. In my case, were the conversions or the extra registers responsible for the drop? Are there any dos and don'ts when using half?

thanks

PS: the code can be tested by

git clone https://github.com/fangq/mcxcl.git
cd mcxcl
git checkout nvidiaomp
cd src
make clean all
cd ../example/benchmark
./run_benchmark1.sh -G 1 -J "-DUSE_HALF"

Removing the -J "-DUSE_HALF" option will enable the original fp32 code.

Salabar · November 19, 2017, 2:48pm

Vega improves half-precision performance by introducing packed arithmetic similar to SSE on x86.

This means that code such as

half2 a = array[0];

that used to translate to

load reg1 array
load reg2 array + 2

can now be replaced with a single read into a single register. And things like

half2 a, b;
half2 c = a + b;

will require 3 registers and 1 operation instead of the 6 registers and 2 operations needed previously.
But casting vector data types to pointers to scalars and setting a vector's value on a per-component basis, as your code does, probably confuses the compiler and prevents the optimization.
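
To make the register counting above concrete, here is a minimal host-side C sketch (plain CPU code, not OpenCL and not Vega's actual hardware behaviour) of two fp16 bit patterns sharing one 32-bit word. The names packed_half2, pack_half2, half2_x, half2_y and load_half2 are made up for this illustration, and placing the x component in the low 16 bits is an assumption that matches a little-endian memory layout.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for one 32-bit register holding a half2:
 * two IEEE 754 binary16 bit patterns side by side. */
typedef uint32_t packed_half2;

/* Assumption: x sits in the low 16 bits, y in the high 16 bits. */
packed_half2 pack_half2(uint16_t x, uint16_t y) {
    return (uint32_t)x | ((uint32_t)y << 16);
}

uint16_t half2_x(packed_half2 v) { return (uint16_t)(v & 0xFFFFu); }
uint16_t half2_y(packed_half2 v) { return (uint16_t)(v >> 16); }

/* One 32-bit read fetches both components of element i, instead of
 * two separate 16-bit reads. Assumes a little-endian host, so that
 * array[2*i] lands in the low half of the word. */
packed_half2 load_half2(const uint16_t *array, size_t i) {
    packed_half2 v;
    memcpy(&v, array + 2 * i, sizeof v);
    return v;
}
```

With this layout, a, b and c from the half2 example each fit in one 32-bit word (3 in total), whereas keeping every component in its own register would take 6. The per-component accessors half2_x and half2_y are the kind of access that a scalar pointer into a vector forces, which is why working on whole half2 values is the safer pattern.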

fangqq · November 19, 2017, 4:31pm

I want to make a correction to the commands for testing this issue: the git checkout nvidiaomp line needs to be removed. The corrected commands are

git clone https://github.com/fangq/mcxcl.git
cd mcxcl/src
make clean all
cd ../example/benchmark
./run_benchmark1.sh -G 1 -J "-DUSE_HALF"

fangqq · November 19, 2017, 4:50pm

[QUOTE=Salabar;42862]Vega improves half-precision performance by introducing packed arithmetic similar to SSE on x86.

…
But casting vector data types to pointers to scalars and setting a vector's value on a per-component basis, as your code does, probably confuses the compiler and prevents the optimization.[/QUOTE]

Thanks for the reply. I am wondering whether I need to add any special extension or compilation flags to enable the Rapid Packed Math (RPM) instructions. I did enable the cl_khr_fp16 extension. Is that sufficient, given that I installed the amdgpu-pro Linux drivers?

PS: I found that someone had difficulty with this on Arch Linux: "Feature request: Half Float (FP16) support", Issue #29 in plaidml/plaidml on GitHub.

Salabar · November 20, 2017, 8:32am

It shouldn't need any special extension, though it may indeed turn out that their compiler does not support the feature yet. Use the CodeXL analyzer on this simple kernel to find out whether your compiler generates any packed instructions, i.e. ones with "pk_" in their names (for example v_pk_add_f16):


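#pragma OPENCL EXTENSION cl_khr_fp16 : enable

__kernel void test(__global half2* a, __global half2* b){
    int id = get_global_id(0);
    // whole-vector add: the candidate for a single packed fp16 instruction
    a[id] = a[id] + b[id];
}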