subgroup functions imply sub_group_barrier?

Hi all,

I’ve been looking at the OpenCL subgroup features lately and have a question about the subgroup functions (e.g. sub_group_all and sub_group_any) defined in Section 9.17.3.4 of the OpenCL Extension Specification:
https://www.khronos.org/registry/cl/specs/opencl-2.0-extensions.pdf#page=138

Do these functions imply a sub_group_barrier? That is, when a work-item reaches one of these functions must it wait for all other work-items in the subgroup to reach the function before continuing? I do not see this explicitly stated in the documentation, although I’m having a hard time imagining the semantics of these instructions without an implicit barrier.

From my understanding, subgroups are similar to CUDA warps, which execute in lock-step; thus the CUDA __all and __any functions naturally have an implicit intra-warp barrier. Is this also the case for OpenCL subgroups?
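For concreteness, the CUDA idiom I have in mind looks roughly like this (a sketch; note that on current CUDA the _sync warp-vote variants with an explicit participation mask are the recommended forms, while the mask-free __all/__any are the legacy ones):

__global__ void vote_example(const int *input, int *output) {
    int pred = input[threadIdx.x] > 0;
    // Every participating lane in the warp supplies its predicate, and the
    // result is only defined once all lanes named in the mask have reached
    // the vote, so the vote itself acts as an intra-warp rendezvous.
    output[threadIdx.x] = __any_sync(0xffffffffu, pred);
}

My question is whether the OpenCL subgroup functions carry an analogous rendezvous guarantee.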

Maybe I missed something in the documentation?

Thanks in advance!

Each work-item in a subgroup will reach a subgroup function, but cache coherency is not guaranteed by the spec. For example:

int x = input[idx];
array[idx] = x;
int max = sub_group_reduce_max(array[idx]); // May expand into an ordinary reduction using local memory, or into a dedicated reduction op in the GPU’s ISA.
output[idx] = max - array[get_sub_group_size() - 1 - idx]; // No explicit barrier specified by the programmer, so from the specification’s standpoint the contents of “array” are undefined for each particular work-item in the subgroup.

It may well be that your device is fine with you abusing undefined behavior, but here is how an optimizing compiler might reason: the value of array[get_sub_group_size() - 1 - idx] is undefined, which means it might as well be zero, with the exception of thread 15 (for a subgroup of size 32). So it turns your code into

int max = sub_group_reduce_max(x);
output[idx] = max;
if (idx == 15)
    output[idx] -= x;

ruining your code while feeling very smart.

This can be prevented by adding a barrier.

int x = input[idx];
array[idx] = x;
int max = sub_group_reduce_max(array[idx]);
sub_group_barrier(…); // Does not have an actual mapping to the GPU’s ISA, but ensures the compiler won’t do any funny business with the line below.
output[idx] = max - array[get_sub_group_size() - 1 - idx];

Thanks for the reply!

You mention:

Each work-item in a subgroup will reach a subgroup function

Does this mean each work-item will wait at the subgroup function for the rest of the work-items, even if this waiting doesn’t provide memory ordering guarantees? I understand your example, as the undefined behaviour arises due to a data-race on the ‘array’ memory locations. This means that sub-group functions do not disallow data-races (this is very useful to know!). However, I think my question is probably better captured by this piece of code (which has no potential data-races). Assume the output array is initialised to 0 and is the size of a subgroup. Additionally, assume there is only one subgroup executing the piece of code.

a: int x = 0;
b: if (get_sub_group_local_id() == 0) { x = 1; }

c: while (sub_group_any(x)) {
d:     output[get_sub_group_local_id()] = 1;
e:     x = 0;
   }

Is this piece of code well-defined? And is it guaranteed that ‘output’ will now contain all 1’s? The execution we are worried about is this:

Say subgroup work-item 0 gets priority in executing. It executes statement b and then reaches statement c. It knows that locally x == 1, so locally it knows that sub_group_any will return true. If there is no implied barrier, then subgroup work-item 0 could continue executing (without waiting) based on this local knowledge. It continues through statements d and e. When subgroup work-item 0 returns to c, its local x is now 0, and it cannot continue until more information is acquired (i.e. from the execution of the other subgroup work-items). Now the other subgroup work-items start executing; they reach statement c, and sub_group_any(x) now evaluates to false (based on the current values of x in the subgroup). This means the other subgroup work-items never execute statement d, and ‘output’ contains only a single 1.

If an implicit execution barrier is provided, then the above execution is disallowed, because subgroup work-item 0 will have to wait at the first instance of sub_group_any(), even though locally it knows that it can continue. Likewise, in SIMT execution (e.g. CUDA warps), the above execution is disallowed because work-items will execute in lock-step, disallowing the interleaving described above.

The other option I see is that the above code is undefined, violating this line of the specification:

These built-in functions must be encountered by all work-items in a subgroup executing the kernel

If this is the case, how would we make the above code defined? I imagine maybe by placing a sub_group_barrier immediately after statement c inside the while-loop? But this isn’t clear to me from reading the specification.
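For what it’s worth, here is a sketch of how I imagine that fix would look (assuming the cl_khr_subgroups built-ins, with get_sub_group_local_id() as the per-work-item id within the subgroup; the barrier placement is only my guess, not something the specification spells out):

a: int x = 0;
b: if (get_sub_group_local_id() == 0) { x = 1; }

c: while (sub_group_any(x)) {
       sub_group_barrier(CLK_LOCAL_MEM_FENCE); // guessed placement: forces the whole subgroup to rendezvous each iteration before any work-item runs d
d:     output[get_sub_group_local_id()] = 1;
e:     x = 0;
   }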

Thanks again!

I guess the description of sub_group_any

Evaluates predicate for all work-items in the sub-group and returns a non-zero value if predicate evaluates to non-zero for any work-item in the sub-group.

implies that the function cannot return until every work-item in the subgroup reaches it. This way, a work-item missing a subgroup function results in a deadlock. I had confused the execution-barrier / memory-barrier distinction.

Ok, that makes sense. Many thanks for the discussion!