parallel_for_work_group call without workGroupSize specified

Hello,

While reading the SYCL 1.2.1 specification, I came across a piece of example code that calls parallel_for_work_group without specifying workGroupSize (p. 175):

1 myQueue.submit([&](handler & cgh) {
2     // Issue 8 work-groups. The work-group size is chosen by the runtime because unspecified
3     cgh.parallel_for_work_group<class example_kernel>(
4         range<3>(2, 2, 2), [=](group<3> myGroup) {
5
6         // Launch a set of work-items for each work-group. The number of work-items is chosen
7         // by the runtime because the work-group size was not specified to parallel_for_work_group
8         // and a logical range is not specified to parallel_for_work_item.
9         myGroup.parallel_for_work_item([=](h_item<3> myItem) {
10           //[work-item code]
11       });
12
13      // Implicit work-group barrier
14 ...

As I understand it, range<3>(2, 2, 2) on line 4 is the number of work-groups to be executed. My question is about line 9: the parallel_for_work_item call does not specify a work-group size either. In this case, how many work-items will be executed globally in total?

Thanks in advance to anyone who can help me understand this, or point out anything I may have misunderstood in the spec.

Regards,
Amon

Hello,

There are two forms of parallel_for_work_group. One just takes the number of work-groups, as you’ve pointed out, and the other takes both the number of work-groups and the size of each work-group.

The first form:

void parallel_for_work_group(
   range<dimensions> numWorkGroups,
   WorkgroupFunctionType kernelFunc);

and the second form:

void parallel_for_work_group(
   range<dimensions> numWorkGroups,
   range<dimensions> workGroupSize,
   WorkgroupFunctionType kernelFunc);

Within parallel_for_work_group, you typically embed a parallel_for_work_item construct (as your example has). One form of parallel_for_work_item allows you to specify the number of work-items in the work-group (SYCL has the concept of a logical work-group size, but I don’t think that’s important here).

First form:

void parallel_for_work_item(workItemFunctionT func) const;

and the second form:

void parallel_for_work_item(range<dimensions> logicalRange,
   workItemFunctionT func) const;

If you combine the forms of these constructs that do not define the work-group size, then you’re correct that it isn’t clear what the work-group size (and total number of work-items globally) should be. The spec actually says that doing this is illegal because it’s ambiguous. In the definition of the parallel_for_work_item form that does not take a logical work-group size:

It is undefined behavior for this member function to be invoked from within the parallel_for_work_group form that does not define work-group size, because then the number of work-items that should execute the code is not defined. It is expected that this form of parallel_for_work_item is invoked within the parallel_for_work_group form that specifies the size of a workgroup.

So I think the answer to your question is that you should define a work-group size either within parallel_for_work_group, or within the nested parallel_for_work_item calls. It is illegal to not specify the work-group size anywhere because then the global amount of work to do is undefined, which I think is what motivated you to post.
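To make the arithmetic concrete, here is a minimal plain C++ sketch (no SYCL headers needed) of how the global work-item count follows once a work-group size is given to the second form of parallel_for_work_group. The Range3 alias and both helper functions are hypothetical names made up for illustration, not part of the SYCL API.

```cpp
#include <array>
#include <cstddef>

// Plain C++ stand-in for a 3-dimensional SYCL range (hypothetical alias,
// not a SYCL type).
using Range3 = std::array<std::size_t, 3>;

// Number of elements covered by a 3-D range: the product of its extents.
std::size_t range_size(const Range3& r) {
    return r[0] * r[1] * r[2];
}

// Total number of work-items launched globally when both the number of
// work-groups and the size of each work-group are known, as in
// parallel_for_work_group(numWorkGroups, workGroupSize, kernelFunc).
std::size_t total_work_items(const Range3& numWorkGroups,
                             const Range3& workGroupSize) {
    return range_size(numWorkGroups) * range_size(workGroupSize);
}
```

With range<3>(2, 2, 2) work-groups from your example and an arbitrarily chosen work-group size of (4, 4, 4), total_work_items(Range3{2, 2, 2}, Range3{4, 4, 4}) gives 8 * 64 = 512 work-items. Without a work-group size there is nothing to put in the second argument, which is exactly the ambiguity.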

You can specify the work-group size in both parallel_for_work_group and parallel_for_work_item, and those sizes can be different. That’s where the SYCL concept of “physical” versus “logical” work-group sizes starts to matter, and can be quite powerful.
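As a rough illustration of the physical versus logical idea, the plain C++ model below shows one way a work-group with a given physical size could cover a differently sized logical range. The strided mapping and the run_logical_range helper are assumptions made for this sketch; how an actual SYCL implementation maps logical work-items onto physical ones is up to the implementation.

```cpp
#include <cstddef>
#include <vector>

// Models one work-group of `physicalSize` work-items executing a logical
// range of `logicalSize` iterations (1-D for simplicity). Returns how many
// times each logical index was visited.
// Hypothetical helper for illustration; not part of the SYCL API.
std::vector<int> run_logical_range(std::size_t physicalSize,
                                   std::size_t logicalSize) {
    std::vector<int> visits(logicalSize, 0);
    // Assumed mapping: physical work-item p handles logical indices
    // p, p + physicalSize, p + 2 * physicalSize, ...
    for (std::size_t physicalId = 0; physicalId < physicalSize; ++physicalId) {
        for (std::size_t logicalId = physicalId; logicalId < logicalSize;
             logicalId += physicalSize) {
            ++visits[logicalId];
        }
    }
    return visits;
}
```

Under this model every logical index is visited exactly once whether the logical range is larger than the physical size (some work-items loop more than once) or smaller (some work-items do nothing), as long as the physical size is at least one.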