Thanks Jan, yeah, it certainly helps. From my understanding, the latency from larger branches on modern GPUs comes from the pipeline: a block of fragments is executed simultaneously and the results are synced, so the whole block runs at the speed of its slowest fragment.
On older cards that did not have “real” branching, just conditional memory ops, the card would execute both branches and then discard a result during a mov.
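To make the two cases concrete, here is a rough toy cost model in Python. It is only a sketch of the behaviour as I described it, not a measurement of any real hardware: the function names, the block size of 16 and the cycle counts are all made up for illustration, and real GPUs may serialize both sides of a divergent branch, which this simple model does not capture.

```python
# Toy cost model for the two behaviours described above.
# All names and numbers are hypothetical; this is not real GPU timing.

def lockstep_cost(takes_branch, cost_taken, cost_skipped):
    """'Real' branching: the block finishes when its slowest fragment does."""
    return max(cost_taken if taken else cost_skipped for taken in takes_branch)

def predicated_cost(cost_taken, cost_skipped):
    """No real branching: every fragment runs both sides, one result is discarded."""
    return cost_taken + cost_skipped

# Assumed costs in cycles: an expensive branch body vs. a cheap skip path.
COST_TAKEN, COST_SKIPPED = 40, 4

coherent_block = [False] * 16            # no fragment takes the branch
divergent_block = [False] * 15 + [True]  # a single fragment takes it

print(lockstep_cost(coherent_block, COST_TAKEN, COST_SKIPPED))
print(lockstep_cost(divergent_block, COST_TAKEN, COST_SKIPPED))
print(predicated_cost(COST_TAKEN, COST_SKIPPED))
```

In this model a single fragment taking the expensive path is enough to make the whole block pay for it, while the older predicated style pays for both sides no matter what.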
That’s my understanding of things at the moment, which may be flawed. For now I’ve decided to do the 1x1 texture lookup. This is mostly because the potential latency is easily hidden by other samples in most of the shaders, so I’m really just paying for one extra 1-cycle mul.
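For anyone following along, the general idea of trading a branch for a multiply looks roughly like this. It is a plain Python sketch rather than shader code, and it assumes the 1x1 texture simply holds a 0.0 or 1.0 factor that scales the optional term; the function and parameter names are hypothetical and not taken from the actual shaders.

```python
# Sketch of replacing a branch with a multiply.
# Assumption: `factor` stands in for the value fetched from the 1x1 texture
# (0.0 = feature off, 1.0 = feature on). Names are hypothetical.

def shade_branching(base, extra, enabled):
    """Branching version: the optional term is only added when enabled."""
    if enabled:
        return base + extra
    return base

def shade_multiply(base, extra, factor):
    """Branchless version: always compute the term, scale it by the factor."""
    return base + extra * factor

# Both versions agree for the on and off cases.
print(shade_branching(0.5, 0.25, True), shade_multiply(0.5, 0.25, 1.0))
print(shade_branching(0.5, 0.25, False), shade_multiply(0.5, 0.25, 0.0))
```

The branchless version always does the work for the optional term, so it only makes sense when that work is cheap or its latency is hidden anyway, which is the situation described above.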
Towards the end of the project I may clean this up, but I suspect I won’t really NEED to.