Starting with Compute shader after working with OpenCL

hterrolle · December 14, 2020, 1:59pm

Hi,

I am starting to learn compute shaders after working with OpenCL, and I am not sure I understand the differences or how to use them.

With OpenCL you give the global range, for example (1024,1024), and then the local (x,y) range, for example (2,2). This means that (1024,1024) is divided into work groups of (2,2), so (1024*1024)/4 = 262144 work groups of 4 threads each.

If I want to do the same with a compute shader, I need to pass the number of work groups to glDispatchCompute(x,y,z) and also give the work group size in the shader:
layout(local_size_x = X, local_size_y = Y, local_size_z = Z) in;

According to the ARM documentation, the layout values give the number of threads that each work group uses, in my case at most 128 threads per work group. The maximum number of work groups is 65535.

I can clearly see that 65535 work groups in a compute shader is far fewer than the 262144 I get with OpenCL.

Does that mean that to do the same as OpenCL I should do:

glDispatchCompute(512,512,1) and in the shader layout(local_size_x = 2,local_size_y = 2) in;
to get work groups of 4 threads and (512*512) work groups

glDispatchCompute(256,256,1) and in the shader layout(local_size_x = 4,local_size_y = 4) in;
to get work groups of 16 threads and (256*256) work groups

glDispatchCompute(128,128,1) and in the shader layout(local_size_x = 8,local_size_y = 8) in;
to get work groups of 64 threads and (128*128) work groups
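The three pairings above all come from dividing the global size by the local size in each dimension. A minimal C sketch of that arithmetic, for checking numbers on the CPU; the helper names groups_for and invocations_per_group are made up for illustration and are not part of OpenGL or OpenCL:

```c
#include <assert.h>

/* Number of work groups needed along one dimension so that
 * groups * local_size covers `global` items (rounds up when the
 * global size is not a multiple of the local size). */
unsigned groups_for(unsigned global, unsigned local_size)
{
    assert(local_size > 0);
    return (global + local_size - 1) / local_size;
}

/* Invocations ("threads") in one work group: the product of the
 * three local sizes given in the shader layout. */
unsigned invocations_per_group(unsigned lx, unsigned ly, unsigned lz)
{
    return lx * ly * lz;
}
```

For a 1024 wide range, groups_for(1024, 2), groups_for(1024, 4) and groups_for(1024, 8) give the 512, 256 and 128 used in the glDispatchCompute calls above.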

I need to use shared memory, which is why I am asking. I want to be sure that I have understood ;))

So if I want to implement my OpenCL kernel as an OpenGL compute shader, do I need to think differently, or have I misunderstood something?

Thanks for any explanation.

GClements · December 14, 2020, 5:20pm

The limits are implementation dependent; you need to query the values of GL_MAX_COMPUTE_WORK_GROUP_COUNT, GL_MAX_COMPUTE_WORK_GROUP_SIZE and GL_MAX_COMPUTE_WORK_GROUP_INVOCATIONS to get the limits for the implementation being used. Note that the first two are indexed queries which return the limit for a specific dimension; use glGetIntegeri_v to query them.

The minimum values are 65535×65535×65535, 1024×1024×64 and 1024 respectively. IOW, glDispatchCompute(512,512,1) should work on any implementation but whether glDispatchCompute(262144,1,1) works is implementation-dependent.
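As a rough sketch of how the queried limits could be applied before dispatching: the struct compute_limits and the function dispatch_fits below are hypothetical names, and the limit values are assumed to have been filled in by the caller from the queries mentioned above (that part needs a live GL context and is only described in the comment).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical container for the three limits. In a real program,
 * max_count[i] and max_size[i] would come from glGetIntegeri_v with
 * GL_MAX_COMPUTE_WORK_GROUP_COUNT and GL_MAX_COMPUTE_WORK_GROUP_SIZE
 * for i = 0..2, and max_invocations from glGetIntegerv with
 * GL_MAX_COMPUTE_WORK_GROUP_INVOCATIONS. */
typedef struct {
    unsigned max_count[3];
    unsigned max_size[3];
    unsigned max_invocations;
} compute_limits;

/* True if a dispatch of count[0] x count[1] x count[2] groups, each of
 * local[0] x local[1] x local[2] invocations, stays within the limits. */
bool dispatch_fits(const compute_limits *lim,
                   const unsigned count[3], const unsigned local[3])
{
    assert(lim != 0);
    unsigned long long invocations = 1;
    for (int i = 0; i < 3; ++i) {
        if (count[i] > lim->max_count[i]) return false;
        if (local[i] > lim->max_size[i]) return false;
        invocations *= local[i];
    }
    return invocations <= lim->max_invocations;
}
```

With the minimum values quoted above, a 512×512×1 dispatch of 2×2×1 groups passes this check, while 262144×1×1 does not.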

hterrolle · December 14, 2020, 9:07pm

Thanks for the answer,

But I do not understand why:

whether glDispatchCompute(262144,1,1) works is implementation-dependent.

How can you do that if your GL_MAX_COMPUTE_WORK_GROUP_COUNT is (65535,65535,65535)?

Do you mean that the split across X, Y and Z does not matter, and you can do:
glDispatchCompute(262144,1,1)
and
layout(local_size_x = 2,local_size_y = 2) in;

I am sorry, but I need to understand.

So glDispatchCompute does not care how the work groups are arranged as long as you stay within the limits:
(512*512) = 262144 and (262144,1,1) = 262144 work groups.

What about the shader layout(local_size_x = ?,local_size_y = ?) in; ?

It is nice to follow an explanation through to the end ;))

GClements · December 15, 2020, 12:40am

You can’t. But those values are the minimum requirement; an implementation is free to provide higher limits.

Not unless the implementation allows such a large limit. But the “shape” of the compute operation doesn’t really matter. 512×512×1 and 262144×1×1 both create 262144 work groups; the only difference is the values given to gl_WorkGroupID and gl_GlobalInvocationID. Similarly, the shape of an individual work group doesn’t really matter either, just the number of invocations within the group. The only reason for using a particular shape for either is that if you’re implementing an algorithm which operates on multi-dimensional arrays you can use the members of gl_WorkGroupID, gl_LocalInvocationID and/or gl_GlobalInvocationID directly rather than having to decompose a single index using division and modulo.
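To illustrate the "decompose a single index using division and modulo" step, here is a small C sketch that runs on the CPU. The uvec2 struct and the flatten/unflatten helpers are illustrative names only; in a shader the flat value would be gl_WorkGroupID.x from a one-dimensional dispatch.

```c
#include <assert.h>

/* Stand-in for a two-component unsigned vector. */
typedef struct { unsigned x, y; } uvec2;

/* Recover 2D coordinates from a flat index, given the row width. */
uvec2 unflatten(unsigned flat, unsigned width)
{
    assert(width > 0);
    uvec2 p;
    p.x = flat % width;   /* position within the row */
    p.y = flat / width;   /* which row */
    return p;
}

/* Inverse mapping: 2D coordinates back to the flat index. */
unsigned flatten(uvec2 p, unsigned width)
{
    return p.y * width + p.x;
}
```

With width = 512, the flat indices 0..262143 of a 262144×1×1 dispatch map one-to-one onto the (x, y) pairs of a 512×512×1 dispatch.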

hterrolle · December 15, 2020, 4:20pm

Thanks again for the answer,

I did some testing with compute shaders today and found that with a single array of data [] you cannot use a Y work group count or a Y layout. You must use the X work group count and the X layout.

X work groups * X layout must equal the number of data items you want to process. That part I understood ;))

OpenCL treats the input as array[][] automatically. That is not the case for an OpenGL compute shader, which has to match the input array format.

So if I have a (6*6) texture and I want to process it 2 by 2, without optimization ;)), I need the following: 9 work groups of 4 threads, so I will process 36 input values.

GLES31.glDispatchCompute(9,1,1);
layout(local_size_x = 4) in;

If I want to use shared memory I need to declare:
shared uint temp_storage[4];

But as the input data is only a single array [], how do I feed my shared memory?

At this point it looks very different from OpenCL to me. I think I need an example ;))

I do not understand how to do it, especially when I have to jump to the next row.

Thanks in advance.

GClements · December 15, 2020, 6:20pm

uint x1 = gl_WorkGroupID.x % 3; // 0 1 2 0 1 2 0 1 2
uint y1 = gl_WorkGroupID.x / 3; // 0 0 0 1 1 1 2 2 2
uint x0 = gl_LocalInvocationID.x % 2; // 0 1 0 1
uint y0 = gl_LocalInvocationID.x / 2; // 0 0 1 1
uint x = x1*2+x0;
uint y = y1*2+y0;

Avoiding the need for this is the only reason that compute shaders have x,y,z dimensions. If you’re dealing with 2D or 3D arrays and their shape fits within the implementation’s limits, you may as well use that shape for the local_size_* values and/or the glDispatchCompute parameters. If individual dimensions exceed limits, then you can just “reshape” the computation and compute the correct indices yourself.

The only aspect that corresponds to a physical limitation is the partitioning of the computation into work groups, as shared variables are only shared within a work group (similarly, the barrier function and shader invocation group functions only operate within a work group).

The shape of the local and global sizes only affects the gl_WorkGroupID, gl_LocalInvocationID and gl_GlobalInvocationID variables. For everything else, only the total size (the product x*y*z) matters.
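The index computation above can be checked on the CPU. The following C sketch mirrors it for the 6×6 array, 9 work groups and 4 invocations per group discussed in this thread, and emulates one work group copying its 2×2 box into a 4-element tile, the way a shader might fill shared uint temp_storage[4]. The names element_index and load_tile are invented for this sketch; in GLSL each invocation would write only its own element, followed by a barrier before other invocations read the tile.

```c
#include <assert.h>

enum {
    COLS = 6,                    /* width of the 6x6 array */
    BOX = 2,                     /* each work group handles a 2x2 box */
    BOXES_PER_ROW = COLS / BOX,  /* 3 boxes across */
    GROUP_SIZE = BOX * BOX       /* 4 invocations per work group */
};

/* Flat index into the 6x6 array for a given work group and local
 * invocation, using the same modulo/division scheme as the post above. */
unsigned element_index(unsigned group_id, unsigned local_id)
{
    assert(group_id < BOXES_PER_ROW * BOXES_PER_ROW);
    assert(local_id < GROUP_SIZE);
    unsigned x1 = group_id % BOXES_PER_ROW;  /* box column */
    unsigned y1 = group_id / BOXES_PER_ROW;  /* box row */
    unsigned x0 = local_id % BOX;            /* column inside the box */
    unsigned y0 = local_id / BOX;            /* row inside the box */
    return (y1 * BOX + y0) * COLS + (x1 * BOX + x0);
}

/* CPU stand-in for one work group filling its shared tile: each loop
 * iteration plays the role of one invocation. */
void load_tile(const unsigned *input, unsigned group_id,
               unsigned tile[GROUP_SIZE])
{
    for (unsigned local_id = 0; local_id < GROUP_SIZE; ++local_id)
        tile[local_id] = input[element_index(group_id, local_id)];
}
```

The "jump one row" case is handled by the y0 term: local invocations 2 and 3 land one full row (COLS elements) after invocations 0 and 1.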

hterrolle · December 16, 2020, 3:45pm

Thank you GClements,

Exactly what I was looking for. It is the perfect answer, even if I spent a few hours understanding it ;)).

So I made a few changes ;))

this example is for (66) 1D array with (22) box. So 9 work groups and 4 threads in the X layout.
Then you can try modifying the box size and of course the work group count and the X layout ;))

      const uint boxX = 2u;            // box width (X)
      const uint boxY = 2u;            // box height (Y)
      const uint row = 6u;             // number of rows (Y size)
      const uint col = 6u;             // number of columns (X size)
      const uint modWG = col / boxX;   // boxes per row: modulo for gl_WorkGroupID
      const uint modLO = boxX;         // box width: modulo for gl_LocalInvocationID

      uint x1 = gl_WorkGroupID.x % modWG;        // box column
      uint y1 = gl_WorkGroupID.x / modWG;        // box row
      uint x0 = gl_LocalInvocationID.x % modLO;  // column inside the box
      uint y0 = gl_LocalInvocationID.x / modLO;  // row inside the box
      uint x   = x1 * boxX + x0;
      uint y   = y1 * boxY + y0;

For debugging (the row stride is the number of columns):

output_data.elements[gl_GlobalInvocationID.x] = input_data.elements[x + (y * col)];

This is good for a 1D array. I will come back later for the 2D array if I do not understand ;))

Anyway, thanks a lot for the perfect answer. For me it was perfect ;))

PS:
So if I use a 2D array I would not need the modulo, as I understand it.
Is it possible to have an example so we can complete the study ;))
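A possible sketch of the 2D case, written in C so it can be run on the CPU. It assumes a dispatch of glDispatchCompute(3,3,1) with layout(local_size_x = 2, local_size_y = 2) in; for the same 6×6 data. The uvec2 struct and the function names are illustrative; in a shader these values are available directly as built-in variables, so no modulo or division is needed.

```c
#include <assert.h>

/* Stand-in for the x and y components of a GLSL uvec3. */
typedef struct { unsigned x, y; } uvec2;

/* Mirrors the built-in relation
 *   gl_GlobalInvocationID = gl_WorkGroupID * gl_WorkGroupSize + gl_LocalInvocationID
 * for the x and y components. */
uvec2 global_id(uvec2 group, uvec2 local, uvec2 size)
{
    assert(local.x < size.x && local.y < size.y);
    uvec2 g;
    g.x = group.x * size.x + local.x;
    g.y = group.y * size.y + local.y;
    return g;
}

/* Mirrors gl_LocalInvocationIndex for a 2D group (z size of 1); this is
 * a natural index into a per-group shared array. */
unsigned local_index(uvec2 local, uvec2 size)
{
    return local.y * size.x + local.x;
}
```

With this shape, the element read from a flat 6×6 buffer would be at g.y * 6 + g.x, and the shared array slot would be local_index(local, size).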

hterrolle · December 17, 2020, 7:09pm

Why do I see “this example is for (66) 1D array with (22)”?
The multiplication signs (*) disappeared.

GClements · December 18, 2020, 8:10am

The asterisk is used for formatting: *italic* → italic, **bold** → bold. That’s why the characters between the two asterisks are italicised. Use a backslash to have formatting characters treated literally.

More information on supported markup can be found here.
