Roadmap for atomic floating point support?

#1

atomicAdd for floats has been supported in CUDA since Fermi (circa 2010), but this feature still does not appear in the latest OpenCL specification. Curious: is this on the roadmap?

Right now I am using the only hack I know of to work around this problem (via atomic_xchg):

https://github.com/fangq/mcxcl/blob/master/src/mcx_core.cl#L253-L257

but it adds too much overhead on some processors. For example, on an Intel CPU, this while(atomic_xchg()) line costs me about 25% of the run time (by comparison, the CUDA equivalent of this kernel shows less than 1% overhead for atomicAdd).

If there is no known timetable for this feature, are there more evolved solutions that do this more efficiently?

#2

The classic variant by Igor Suhorukov should be faster, as it involves only one atomic_cmpxchg operation per loop iteration instead of two atomic_xchg calls. The folks at StreamHPC have their own variant and suggest it is even faster; however, in my code on AMD GPUs it performs the same. Sorry, I cannot include links normally.


http://streamhpc.com/blog/2016-02-09/atomic-operations-for-floats-in-opencl-improved

#3

Hi Melirius,

thanks for the links. I just tested the two variants you suggested (the two versions in the StreamHPC link, AtomicAdd_g_f and atomicAdd_g_f); unfortunately, they performed quite poorly.

In fact, on the AMD GPU (Vega II, ROCm 2.4 driver), the 2nd variant (atomicAdd_g_f) ran at about 1/7 of the speed of the version currently in my code [1] with 2x atomic_xchg.

The first variant (AtomicAdd_g_f) is faster, reaching about 1/3 of the speed on the Vega II. The performance on an NVIDIA GPU is even worse: on my Titan V, the speed dropped from 35803.80 photons/ms to 681.49 photons/ms, a 52x slowdown!

Both tests were run on a Linux host (Ubuntu 16.04) with up-to-date drivers. I was a bit surprised that this single function can change the performance so dramatically.

If anyone wants to reproduce this, please check the code and run the benchmarks using this GitHub post:

Qianqian

[1] http://github.com/fangq/mcxcl/blob/0fe7813fe789489278075965988ae623d984ad6d/src/mcx_core.cl#L251-L260

#4

I see exactly the opposite effect. My code does multidimensional integration, so it involves quite a lot of atomic floating-point additions, although the integrand evaluation is still the most time-consuming part. On my fast notebook's Oland GPU, the result is 438 s for your variant vs. 362 s for the GROMACS variant, on AMD 15.200.1065.0 drivers for Win10 x64. I cannot test on my main Tahiti GPUs yet; they are at work for another week or so. (Yes, I need double precision, so Tahiti cards are my best friends.)

Have you checked that the new function does not increase the code size enough to blow it out of the instruction cache, and that the register pressure does not reduce the number of waves in flight?

#5

Extending your variant to doubles is not as simple as it may seem, because there is no atom_xchg for double, only for long and ulong. My variant again uses unions:

// requires 64-bit atomics
#pragma OPENCL EXTENSION cl_khr_int64_base_atomics : enable

typedef union {
	ulong intVal;
	double floatVal;
} uni;

inline void atomic_add_local(local double * source, const double operand) {
	uni old;
	old.floatVal = operand;
	uni t, t1;
	t1.floatVal = 0.0; // to ensure correct double bit representation of 0
	do {
		// take the current value out of the slot, leaving 0 behind
		t.intVal = atom_xchg((local ulong *)source, t1.intVal);
		t.floatVal += old.floatVal;
		// swap our sum back in; if another thread deposited a partial
		// sum in the meantime, fold it in on the next iteration
	} while ((old.intVal = atom_xchg((local ulong *)source, t.intVal)) != t1.intVal);
}

The results for doubles on the same workload are 3707 s vs. 4215 s, again in favour of the atom_cmpxchg variant.