Arrangement (order) of threads inside a 2D work unit


If the underlying hardware can only operate on N threads simultaneously (say, the warp or half-warp sizes on NVIDIA's current cards, which are 32 and 16 threads respectively), how do the threads in a single 2D work unit map to these units?

To illustrate, imagine that the work unit is 8x4 and the warp size is 4 threads.

Is it like this (numbers denote the warp index):


or like this?


Are there any guarantees or any non-vendor specific ways I can influence that (short of converting to 1D work units and doing the mapping myself)?
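To make the two candidates concrete, here is a minimal sketch in plain C (host side, not kernel code) that computes the warp index of a work-item under each layout. Both functions and their parameter names are hypothetical, and both layouts are only guesses at what the hardware might do: one linearizes the work unit row by row and cuts it into consecutive warps, the other gives each warp a small rectangular tile.

```c
#include <assert.h>

/* Hypothetical layout A: work-items are linearized row by row
   (x varies fastest) and then cut into consecutive warps. */
static int warp_row_major(int x, int y, int group_w, int warp_size)
{
    assert(warp_size > 0 && x >= 0 && x < group_w && y >= 0);
    return (y * group_w + x) / warp_size;
}

/* Hypothetical layout B: the work unit is cut into tile_w x tile_h tiles,
   one warp per tile, with tiles numbered row by row. */
static int warp_tiled(int x, int y, int group_w, int tile_w, int tile_h)
{
    assert(tile_w > 0 && tile_h > 0 && group_w % tile_w == 0);
    assert(x >= 0 && x < group_w && y >= 0);
    int tiles_per_row = group_w / tile_w;
    return (y / tile_h) * tiles_per_row + (x / tile_w);
}
```

For the 8x4 example with 4-thread warps (taking 8 as the width), `warp_row_major(x, 0, 8, 4)` gives 0 0 0 0 1 1 1 1 along the first row, while `warp_tiled(x, 0, 8, 2, 2)` gives 0 0 1 1 2 2 3 3.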


This will depend on the driver and hardware thread schedulers. There’s no way to influence it, but I’m having a hard time thinking of why you would want to. From your kernel’s point of view they all execute together (within one work-group) and the hardware will try to schedule them intelligently to maximally hide memory latency. If the card has decent memory coalescing you should see little difference between different execution orders.

I’m curious to know why you would care about this order. :)

My kernel threads perform better when they are executed in smaller, square-like blocks (like 4x4), because then they are more likely to take the same branch. However, just decreasing the work group size to 4x4 is suboptimal on NVIDIA’s current hardware (and tests confirm that), probably because occupancy is too low to hide memory latency.

With larger (16x16) work units I believe that threads are executed row by row (current NVIDIA hardware won’t execute 256 threads at once) and, given that the current half-warp size is 16, I’m afraid that is a performance loss, because a 16x1 block is going to diverge more than a 4x4 block.

I tried remapping the threads myself, but that requires additional shifts/ands (or, even worse, multiplies/divides if I allow for a variable work group size), and the effect on performance is not really noticeable.
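For reference, a remap of this kind can be sketched as follows, in plain C rather than kernel code. It rests on an assumption that is not confirmed anywhere in this thread: that half-warps are formed from 16 consecutive row-major local ids in a fixed 16x16 work group. The `Coord` type and the function name are made up for illustration.

```c
#include <assert.h>

typedef struct { int x; int y; } Coord;

/* Remap a position inside a 16x16 work group so that every run of 16
   consecutive row-major ids covers one 4x4 tile instead of one 16x1 row.
   Assumption (unverified): the hardware builds half-warps from
   consecutive row-major local ids. */
static Coord remap_16x16_to_4x4_tiles(int local_x, int local_y)
{
    assert(local_x >= 0 && local_x < 16 && local_y >= 0 && local_y < 16);

    int lid    = (local_y << 4) | local_x; /* row-major id, 0..255      */
    int tile   = lid >> 4;                 /* which 4x4 tile, 0..15     */
    int within = lid & 15;                 /* position in tile, 0..15   */

    Coord c;
    c.x = ((tile & 3) << 2) | (within & 3);   /* tile column * 4 + inner x */
    c.y = ((tile >> 2) << 2) | (within >> 2); /* tile row * 4 + inner y    */
    return c;
}
```

In an OpenCL kernel, `local_x` and `local_y` would come from `get_local_id(0)` and `get_local_id(1)`, and the remapped coordinates would replace them when computing the data position. With this mapping the first row of ids, (0,0) through (15,0), lands on the 4x4 tile in the top-left corner.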

So now I’m confused: without really knowing the arrangement of warps, I can’t explain the numbers I’m seeing. It may be that the shifts/ands add overhead that hides the win (which is unlikely), or it may be that my row-by-row execution assumption is wrong and threads are already mapped in a way similar to the second “picture” in my previous post, so remapping changes nothing and only slows things down. Or my basic assumption is wrong and 16x1 warps are more efficient for this particular data than 4x4 warps, but I doubt that, too.

UPDATE: about the last possibility: it’s basically ruled out because 16x1 or 1x16 work groups are significantly slower than 4x4 work groups, so the data I’m processing definitely exhibits 2D locality.

Still, 16x16 work groups are faster than 4x4 ones, but I’m not sure whether they can be made even faster (i.e. whether they are executed as sixteen 16x1 half-warps or sixteen 4x4 half-warps).