My kernel threads perform better when they execute in small, square-like blocks (such as 4x4), because they are then more likely to take the same branch. However, simply shrinking the work group size to 4x4 is suboptimal on current NVIDIA hardware (my tests confirm this), probably because occupancy is too low to hide memory latency.
With larger (16x16) work groups, I believe threads are executed row by row (current NVIDIA hardware won’t execute 256 threads at once). Given that the current half-warp size is 16, I’m afraid this costs performance, because a 16x1 block is going to diverge more than a 4x4 block.
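To make the row-by-row assumption concrete, here is a small host-side Python model (not kernel code). It assumes threads are linearized as `y * width + x` and that each run of 16 consecutive linear IDs forms one half-warp. Both are assumptions about the hardware mapping, not something the OpenCL API guarantees, and the function names are made up for illustration.

```python
def linear_id(x, y, width):
    # ASSUMPTION: row-major linearization of the local (x, y) ID.
    return y * width + x


def half_warp_footprints(width, height, half_warp=16):
    """Return {half_warp_index: (extent_x, extent_y)} for a work group.

    The extent is the bounding box of the local IDs that land in the
    same half-warp under the row-major assumption above.
    """
    members = {}
    for y in range(height):
        for x in range(width):
            hw = linear_id(x, y, width) // half_warp
            xs, ys = members.setdefault(hw, (set(), set()))
            xs.add(x)
            ys.add(y)
    return {
        hw: (max(xs) - min(xs) + 1, max(ys) - min(ys) + 1)
        for hw, (xs, ys) in members.items()
    }


if __name__ == "__main__":
    # Under the assumption, every half-warp of a 16x16 group is a 16x1 strip,
    # while a 4x4 group is exactly one 4x4 half-warp.
    print(set(half_warp_footprints(16, 16).values()))
    print(half_warp_footprints(4, 4))
```

If the real mapping differs from this model, the footprints will differ too, which is exactly the uncertainty in question.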
I tried remapping the threads myself, but that requires additional shifts/ANDs (or, even worse, multiplies/divides if I allow a variable work group size), and the effect on performance is not really noticeable.
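For reference, one possible shift/AND remapping for a fixed 16x16 work group can be sketched as follows. This is a host-side Python model of the index math only, not the kernel code from the post. It assumes that 16 consecutive row-major linear IDs end up in the same half-warp, and `remap_16x16` is a hypothetical name.

```python
def remap_16x16(lx, ly):
    """Map a local ID (lx, ly) in a 16x16 group to a new (x, y) so that
    each run of 16 consecutive linear IDs covers one 4x4 tile.

    ASSUMPTION: linear ID = ly * 16 + lx, 16 threads per half-warp.
    """
    lin = (ly << 4) | lx        # row-major linear ID, 0..255
    tile = lin >> 4             # which 4x4 tile (0..15)
    within = lin & 15           # position inside the tile (0..15)
    new_x = ((tile & 3) << 2) | (within & 3)    # tile column * 4 + x in tile
    new_y = ((tile >> 2) << 2) | (within >> 2)  # tile row * 4 + y in tile
    return new_x, new_y


if __name__ == "__main__":
    # First assumed half-warp (row ly = 0) now covers the 4x4 tile at the origin.
    for lx in range(16):
        print(lx, remap_16x16(lx, 0))
```

The mapping is a bijection on the 16x16 group, so every pixel is still processed exactly once. In a kernel, the same four shift/AND operations would be applied to the local ID before computing the global coordinate. A variable work group size would replace them with divides and modulos, as noted above.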
So now I’m confused: without knowing how threads are actually arranged into warps, I can’t explain the numbers I’m seeing. It may be that the shifts/ANDs add overhead that hides the win (which is unlikely). It may be that my row-by-row execution assumption is wrong and threads are already mapped in a way similar to the second “picture” in my previous post, so remapping changes nothing and only slows things down. Or my basic assumption is wrong and 16x1 warps are more efficient for this particular data than 4x4 warps, but I doubt that, too.