Roadmap for atomic floating point support?

#1

atomicAdd for floats has been supported in CUDA since Fermi (circa 2010), but this feature still does not appear in the latest OpenCL specification. Curious: is this on the roadmap?

Right now I am using the only hack I know of to work around this problem (via atomic_xchg):

https://github.com/fangq/mcxcl/blob/master/src/mcx_core.cl#L253-L257

but it adds too much overhead on some processors. For example, on an Intel CPU, this while(atomic_xchg()) line costs me about 25% of the run time (by comparison, the CUDA equivalent of this kernel shows less than 1% overhead for atomicAdd).

If there is no known timetable for this feature, are there more evolved solutions that do this more efficiently?

#2

The classic variant by Igor Suhorukov should be faster, as it involves only one atomic_cmpxchg operation per loop iteration instead of two atomic_xchg calls. The folks at StreamHPC have their own variant and suggest it is even faster; however, in my code on AMD GPUs it performs the same. Sorry, I cannot include links normally.


http://streamhpc.com/blog/2016-02-09/atomic-operations-for-floats-in-opencl-improved

#3

Hi Melirius,

thanks for the links. I just tested the two variants you suggested (the two versions in the StreamHPC link, AtomicAdd_g_f and atomicAdd_g_f); unfortunately, they performed quite poorly.

In fact, on the AMD GPU (Vega II, ROCm 2.4 driver), the 2nd variant (atomicAdd_g_f) ran at about 1/7 of the speed of the version currently in my code [1] with 2x atomic_xchg.

The first variant (AtomicAdd_g_f) is faster, reaching about 1/3 of the speed on the Vega II. The performance on an NVIDIA GPU is even worse: on my Titan V, the speed dropped from 35803.80 photons/ms to 681.49 photons/ms, a 52x slowdown!

Both tests were run on a Linux host (Ubuntu 16.04) with up-to-date drivers. I was a bit surprised that this single function can change the performance so dramatically.

If anyone wants to reproduce this, please check the code and run the benchmarks using this GitHub post:

Qianqian

[1] http://github.com/fangq/mcxcl/blob/0fe7813fe789489278075965988ae623d984ad6d/src/mcx_core.cl#L251-L260

#4

I see exactly the opposite effect. My code does multidimensional integration, so it involves quite a lot of atomic floating-point additions, although the integrand evaluation is still the most time-consuming part. On my fast notebook's Oland GPU, the result is 438 s for your variant vs. 362 s for the GROMACS variant, on AMD 15.200.1065.0 drivers for Win10 x64. I cannot test on my main Tahiti GPUs yet; they are at work for another week or so. (Yes, I need double precision, so Tahiti cards are my best friends.)

Have you checked that the new function does not increase the code size enough to blow it out of the instruction cache, and that the register pressure does not reduce the number of waves in flight?

#5

Extending your variant to doubles is not as simple as it may seem, because there is no atom_xchg for double, only for long and ulong. My variant again uses unions:

// requires 64-bit atomics
#pragma OPENCL EXTENSION cl_khr_int64_base_atomics : enable

typedef union {
	ulong intVal;
	double floatVal;
} uni;

inline void atomic_add_local(local double * source, const double operand) {
	uni old;
	old.floatVal = operand;
	uni t, t1;
	t1.floatVal = 0.0; // to ensure correct double bit representation of 0
	do {
		// take the current value out of the slot, leaving 0 behind
		t.intVal = atom_xchg((local ulong *)source, t1.intVal);
		t.floatVal += old.floatVal;
		// swap our sum back in; if another thread deposited a partial
		// sum in the meantime, fold it in on the next iteration
	} while ((old.intVal = atom_xchg((local ulong *)source, t.intVal)) != t1.intVal);
}

The results for doubles on the same workload are 3707 s vs. 4215 s, again in favour of the atom_cmpxchg variant.