How much waste can warp divergence cause?


I know that if the threads within one warp take different branches or run different numbers of loop iterations, all of the branches, or the maximum number of iterations, are executed for the whole warp.

However, I am confused by the “execution” of useless operations imposed on one thread (A) by another thread (B) that really needs to execute them. If the operation is an addition, does thread A also need to add two numbers? If it is a memory read, does thread A also need to read from somewhere in global memory?

If such an operation is just a dummy, how much could it cost in overall performance?



According to my previous experiments, it all depends on your algorithm.
I was doing a tree search, and with only 2 threads the speed could drop by 40-50 percent.
For the whole warp, the code with divergence was approximately 6-8 times slower than the code without it. But of course it all depends on your application and data structures.

Just run mini tests with 2 threads, and work out both the perfect case and the average actual one.
Then you might be able to extrapolate to the whole warp.
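
As a rough way to do that calculation on paper before benchmarking, here is a minimal sketch in plain Python (not GPU code). It assumes a simple lockstep model in which a warp keeps issuing until its slowest thread finishes, and it takes the "perfect case" to be the same total work spread evenly over all threads. The function names and the trip counts are made up for illustration; real hardware has other costs (memory latency, scheduling) that this ignores.

```python
def warp_slowdown(trip_counts):
    """Lockstep cost divided by the perfectly balanced cost for one warp."""
    slowest = max(trip_counts)
    if slowest == 0:
        return 1.0  # no work at all, so nothing is wasted
    # Perfect case: the same total work spread evenly over every lane.
    ideal_steps = sum(trip_counts) / len(trip_counts)
    # Actual case: every lane waits for the slowest one.
    return slowest / ideal_steps


def lane_utilisation(trip_counts):
    """Fraction of lane-slots that do useful work under the same model."""
    slowest = max(trip_counts)
    if slowest == 0:
        return 1.0
    return sum(trip_counts) / (len(trip_counts) * slowest)


# Mini test with 2 threads: one runs 10 iterations, the other 30.
two_threads = [10, 30]
print(warp_slowdown(two_threads))     # 30 / 20 = 1.5

# The same pattern across a 32-wide warp: 31 short threads, 1 long one.
full_warp = [10] * 31 + [30]
print(warp_slowdown(full_warp))
print(lane_utilisation(full_warp))
```

Feeding in the trip counts you actually observe per thread gives an upper-level estimate to compare against the measured slowdown.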

Thanks krocki!

Do you think those imposed operations are actually executed, or are they just skipped over quickly? For example, if it is a read operation, does a memory read really take place?

Thanks again!

In my testing, the memory reads are not executed; I presume the whole load/store pipe is not working at all for those threads.
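
To make that concrete, here is a small conceptual model in plain Python (not GPU code) of how a divergent if/else is commonly described: the warp issues each path in turn with a mask, and lanes that are masked off neither compute a result nor touch memory. The function name, the `-1` fallback value and the counters are all hypothetical, and this is only a sketch of the masking idea, not a measurement of any particular hardware.

```python
def run_divergent_branch(cond, memory):
    """Model `out = memory[lane] if cond[lane] else -1` for one warp."""
    lanes = len(cond)
    out = [None] * lanes
    loads = 0        # memory reads actually performed
    issue_slots = 0  # passes the whole warp has to pay for

    # Pass 1: the "then" path, enabled only where cond is true.
    if any(cond):
        issue_slots += 1
        for lane in range(lanes):
            if cond[lane]:
                out[lane] = memory[lane]
                loads += 1  # only active lanes read memory

    # Pass 2: the "else" path, enabled only where cond is false.
    if not all(cond):
        issue_slots += 1
        for lane in range(lanes):
            if not cond[lane]:
                out[lane] = -1

    return out, loads, issue_slots


# Divergent case: half the lanes take each path.
print(run_divergent_branch([True, False, True, False], [10, 20, 30, 40]))
# -> ([10, -1, 30, -1], 2, 2)

# Uniform case: every lane takes the same path, so only one pass is issued.
print(run_divergent_branch([True] * 4, [10, 20, 30, 40]))
# -> ([10, 20, 30, 40], 4, 1)
```

In this model the cost of divergence shows up as extra passes (`issue_slots`), not as extra memory reads.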

It depends on the hardware a bit, but I also believe that if the low-numbered work items are the only ones occupied, only the wavefronts they occupy will be executed, and the higher wavefronts will not be.

The only ‘waste’ is that those wavefronts are not active, i.e. there are potential flops/cycles which are not being utilised. It doesn’t slow down the others that are active. It really comes down to how many work items in a given wavefront are active, and that depends on the hardware.

Thanks notzed!