I know most GPUs are very sensitive to memory access patterns. The most common examples of fast patterns are coalesced memory access (consecutively increasing work items in a wavefront access consecutively increasing memory locations) and broadcasts (all work items in a wavefront access the same memory location). If the consecutive access is changed to a strided one, performance decreases as the stride length grows.
However, what would be the expected performance if, instead of consecutively increasing memory locations, the work items accessed consecutively decreasing memory locations? I saw this topic in an NVIDIA forum for CUDA, with two bewildered posters and no other responses: http://forums.nvidia.com/index.php?showtopic=186657.
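To make the patterns concrete, here is a rough sketch of how I picture them. It is only a toy model, not a measurement: it assumes 32 work items per wavefront, 4-byte elements, and 128-byte aligned memory segments, and it simply counts how many distinct segments one wavefront touches. The names `segments_touched` and `pattern` are made up for this illustration. The model deliberately ignores the order of lanes within a segment, which is exactly the part I am unsure about on real hardware.

```python
WAVEFRONT = 32      # assumed work items per wavefront (hypothetical value)
ELEM_BYTES = 4      # assumed element size, e.g. a 32-bit float
SEGMENT_BYTES = 128 # assumed aligned memory segment size


def pattern(kind, stride=1, base=0):
    """Return the byte address accessed by each work item in one wavefront."""
    lanes = range(WAVEFRONT)
    if kind == "increasing":      # lane i reads element i
        idx = [i for i in lanes]
    elif kind == "decreasing":    # lane i reads element (WAVEFRONT - 1 - i)
        idx = [WAVEFRONT - 1 - i for i in lanes]
    elif kind == "broadcast":     # every lane reads element 0
        idx = [0 for _ in lanes]
    elif kind == "strided":       # lane i reads element i * stride
        idx = [i * stride for i in lanes]
    else:
        raise ValueError(f"unknown pattern: {kind}")
    return [base + ELEM_BYTES * i for i in idx]


def segments_touched(addresses, segment_bytes=SEGMENT_BYTES):
    """Count distinct aligned segments covered by the given byte addresses."""
    return len({addr // segment_bytes for addr in addresses})


if __name__ == "__main__":
    cases = [
        ("increasing", 1),
        ("decreasing", 1),
        ("broadcast", 1),
        ("strided", 2),
        ("strided", 32),
    ]
    for kind, stride in cases:
        n = segments_touched(pattern(kind, stride))
        print(f"{kind:10s} stride={stride:2d} -> {n} segment(s)")
```

In this toy model the decreasing pattern touches the same single segment as the increasing one, while the strided patterns touch more segments as the stride grows. Whether a given GPU actually treats the reversed lane order as coalesced is not something the model can tell me, so that part of the question stands.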