Extension for partial execution fairness?

As far as I am aware, compute shader specification gives no guarrantee about execution order and/or fairness whatsoever. However I’ve read that some GPUs actually execute threads semi-fairly. For example any thread that executed at least one instruction would be guarranteed to proceed further. It makes some sense - it would be just too expensive to offload entire blocks of threads to cache, or RAM, and interleaving instructions is a simple way to hide latency. However i’ve seen no API to expose this kind of property. Even partial execution fairness guarrantees are useful for achieving synchronisation, such as mutexes and barriers, in this case it would allow synchronisation between blocks of threads in a grid. Does this sort of extension exist, and if not, why?

There are guarantees about execution order for CS’s. Or more to the point, there are mechanisms to control the execution order of invocations within a work group (ie: barrier).

While your broader point is still made, I do have a point to bringing that up.

You forget about divergent conditionals.

Let’s say your GPU has 4 execution units; this means the GPU can have at most 4 invocation groups running independently, simultaneously. And you dispatch something that takes up at least 4 invocation groups. OK, so you start groups 0-3 and once one of them are done, you can start the next groups if any. Fine.

But let’s say that group 0 has a conditional branch that requires divergent execution. That is, some of its invocations take branch 0 and some take branch 1. And there’s no static way to determine how many will take which, and the code along branches 0 and 1 are too big/complex to just have both execute the same code and toss out the untaken branch.

To deal with this, GPUs will effectively create a new invocation group (but not a new work group) by copying the data from the invocations that would execute branch 1 into GPU accessible memory. Let’s call that group 0.1 (group 0, branch 1). Group 0.0 will continue executing branch 0 in that group. Once an execution unit finishes execution, group 0.1 can be executed.

So… what do you suppose happens if groups 0.0 and 1-3 all need to wait on something that is provided by one of the invocations in group 0.1? Well, group 0.1 is not executing and cannot execute until one of the execution units becomes available. But groups 0.0 and 1-3 are hogging all of the execution units. And they’re all just busy-waiting on group 0.1, an invocation group that cannot execute until one of them is finished.

Deadlock.

This is why there are no forward progress or fairness guarantees. You cannot have those in a system that isn’t preemptively multithreaded. And GPUs (to my knowledge) aren’t.

And this is where barrier comes in. barrier is a thing that allows invocations in the same work-group to synchronize execution. This works because the execution manager logic understands what invocation groups are in which work groups and can make sure that they execute. So if instead of busy-waiting, group 0.0 issued a barrier, if group 0.1 isn’t currently executing, the system would suspend group 0.0 and start group 0.1 (even if other groups could execute). The system knows to do this because it understands the relationship between these two invocation groups.

Having that understanding is vital to providing such guarantees. barrier is essentially cooperative multithreading; it’s a direct statement that the thread management system understands and knows how to deal with within a narrow case.

There’s no mechanism to create that understanding, that mechanism, through busy-waits on atomic or the like. If you want some kind of mutex or barrier that functions across work groups, it’s not something you could ever create purely in the language. It has to be an explicit mechanism of the system.

If branching would affect the behavior such that the exposed guarantees are no longer true, that much could be said in specification, as well as to which extents it affects it(e.g. only invocations within subgroup must not diverge). In fact, for barrier() glsl spec says For compute shaders, the barrier() function may be placed within control flow, but that control flow must be uniform control flow. If it is required that there is only as many invocation groups as there is compute units (the hardware to which thread blocks are presumably bound), so that there is only enough threads to always execute without any need for preemption, the count could be exposed; At least CUDA tells you the multiprocessor count, and I’m not aware of any Vulkan extension that would let you query that kind of information.

I have found an nvidia whitepaper that claims “For CUDA compute tasks, Pascal is also capable of preempting at the finest granularity possible— instruction level”.

A RDNA whitepaper says that it has a tool that forces a group of instructions to be uninterruptable, but it porbably means some scheduling guarrantees. It also says “Earlier generations tended to issue instructions from the oldest wavefront and sometimes different wavefronts within a work-group would make uneven progress. The compute unit scheduler has a new oldest work-group scheduling policy that can be activated by software.”. To my understanding, “issue instructions from the oldest wavefront” practically means that every existing wave will execute, as the waves that have not executed for a long time will get older, until its the oldest and then execute. Although RDNA instruction set also contains a instruction for usable priority, where The overall wave priority is {SPIPrio[1:0] + UserPrio[1:0], WaveAge[3:0]}. (SPIPrio - “Wavefront priority set by the shader processor interpolator (SPI) when the wavefront is created”).

I was also able to implement a inter-workgroup barrier(the idea is not mine, but it uses a bit different algorithm) on my 1050ti(768 cores, 6 SMs, each with 4 warp schedulers) gpu. It works with groups sizes from 32(where maximum discovered threads is 6144, total dispatched is much higher, checked with renderdoc) to 1024(about 12288 threads found):

  1. “discovers” active threads
    1. atomicAdd to a uint threadBeginDiscover counter, save to a local uint myPosition
    2. some shared work (it gets distibuted via atomicAdd until its all taken, not necceserally executed; the counter used is separate and is safe to increment beyond the job count) for delay; Experimentation shows this step is not neccesary, but it shouldn’t hurt - to ensure high discovery rates.
    3. barrier(), to ensure that all threads in a workgroup have done the job
    4. “elect a thread” within workgroup via if(gl_LocalInvocationIndex == 0).
      1. atomicAdd to a uint threadEndDiscover counter, get the value and modify to what it would be after the atomic; compare it to threadBeginDiscover. Local threadBeginDiscover value was updated at barrier(), and should not have been after, if i understand memory barriers correctly. if equal:
        atomicCompSwap(threadsDiscovered, 0, threadBeginDiscover);, which ensures that the value seen is same for any thread, when it is non-zero
      2. while(threadsDiscovered == 0)memoryBarrier(); It is there, with unfair scheduler the deadlocks would occur.
    5. barrier(); Synchronises the threads with the elected thread and updates the memory
    6. if(myPosition >= threadsDiscovered)return; Discards the threads that are too late
  2. The simplier barrier where we know how many threads exist:
    1. Have a secondary global variable for signaling, uint signal
    2. barrier()
    3. Elect a thread
      4. uint mySignal = signal+1;
      5. compare totalThreads to end result of atomicAdd(currentThreads, )
      6. if equal:
      1. currentThreads = 0, initialise other values that should be
      2. memoryBarrier()
      3. signal+=1
      4. while(signal != mySignal)memoryBarrier();
    4. barrier()

The barrier() and electing threads is done to decrease amount of atomics required, and does not accidentally make the amount of executable threads below the amount of executable units; Rearranging threads depending on branching was introduced in Volta, a successor to Pascal(my gpu architecture). Even with size 32(the SIMD group size on nVidia) workgroup, 6144 of them were discovered, meaning 32 were running on each SM(CU), or 8 per warp scheduler, and still executed fairly enough for the barrier of this type to work.

This barrier takes about 1 microsecond, or ~2k cycles, with clock speed ~1800mhz (checked with AIDA). I used renderdoc to check execution time, and a loop of said “barriers”, 2^20 iterations - about 1 second total. Changing workgroup size did not change time significantly: size 1024 was only about 1.3 times faster than size 32. I think that this is because the main bottleneck being the waits for atomic variables to move between SMs, where they could be quickly modified by all threads waiting to do this.

This empirically shows that at least some gpus are already capable of somewhat fair scheduling.
The problem, again, is that it seems this kind of fairness is not exposed anywhere (docs, extensions, etc)

Found this, basically saying that you can do locks (implies some fairness guarantees) so long only one thread in a warp is active after acquiring the lock (intra-warp scheduling is not fair). Not quite a specification or similiar of what I want, not sure how relates to AMD (probably similiar if there is enough who used those), and quite old as well, but is more than nothing…