Not too long ago I found out that gpus don’t like it when execution of code is (too) branching and it made me feel like there might be some trick to make efficient conditional processing of data.
I thing I made one strange snippet.
code
#define store_if_true(Ty, addr, val, flag) \
*((__global Ty*)(((unsigned long)addr) + (((unsigned long)(flag == 0)) * ((1LU << 48) - ((unsigned long)addr))))) = val;
__kernel void kern(__global unsigned int* buffer) {
unsigned long ix = get_global_id(0);
buffer[ix] = ~(0U); // to not get garbage in the buffer
unsigned long even = (ix & 1) == 0;
store_if_true(unsigned int , &buffer[ix], 0, even); // nullify the value! (maybe)
}
This kernel writes a value only if invocation index is even without ifs *_*
The idea here is that when the flag is true, the store address gets computed to be beyond (1 << 48) byte location and apparently stores beyond valid address space on gpu… just become noops!
This looks useful, but I have a concern that this code might be exploiting some undefined behaviour. Is it safe to use it?
I would not expect that a wild write to arbitrary GPU memory “just become noops!”. This behavior could be well-defined or undefined, depending on whether or not you are using robust memory accesses. I don’t know if that’s an option in OpenCL or if it’s mandatory.
But regardless, wild writes are not something you should routinely do.
The reason to think they may be noops came to me from reading reference on amd gcn isa, where something like “stores to addresses not located in range of physical memory get discarded” was written there.
This is what’s puzzling me: apparently, they do and after some experimentation I can say memory location need not even be 1 << 48!
I have run the following code on two machines, one with intel hd graphics and another with radeon gpu.
The behaviour on intel machine seems like it does care if I write out of buffer bounds to any address, and on radeon any reads or writes to addresses below 1 << 40 segfault the programm.
__kernel void kern3(__global unsigned int* buffer) {
unsigned long ix = get_global_id(0);
// guaranteed to fill buffer with zeroes.
// attempt to read anything bellow it crushes the programm
buffer[ix] = *((__global unsigned int*)(1ul << 40)) ;
// same here. writes to location above 1 << 40 are ok, below crushes the programm
*((__global unsigned int*)(1ul << 40)) = 0 ;
// *((__global unsigned int*)(1ul << 39)) = 0 ; this segfaults on amd
}
I run a couple of iterations of one kernel with ifs and one kernel with this cursed addressing and I see about 1 second difference on 1GB buffer that cannot be dismissed due to random variation, I think.
__kernel void kern2(__global unsigned int* buffer) {
unsigned long ix = get_global_id(0);
unsigned long even = (ix & 1) == 0;
if (even) {
buffer[ix] = 0;
} else {
buffer[ix] = -1;
}
}