Why is the work group size specified inside the shader (local_size_x)?

laobrasuca · August 12, 2012, 6:28pm

hi all,

why?! Especially given that in OpenCL it is specified at application run time (when calling clEnqueueNDRangeKernel()). In OpenGL, only the number of work groups can be specified at run time (with glDispatchCompute()). We lose flexibility by fixing the work group size at compute shader compile time. Is any change expected on this matter?

another question: is there any restriction on the values of local_size_x (layout qualifier) and num_groups_x (argument of glDispatchCompute())? For example, in OpenCL global_work_size must be evenly divisible by local_work_size.

cheers,

lao

laobrasuca · August 13, 2012, 9:34am

or, is it legal to do the following:

uniform int local_size_x_from_app;
layout (local_size_x = local_size_x_from_app) in;

?

Alfonse_Reinheart · August 13, 2012, 9:59am

The grammar clearly defines it as “local_size_x= integer-constant”. Uniforms aren’t constants, so no.
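Since the qualifier has to be an integer constant, one common workaround is to let the application write the value into the shader source as a literal before compiling it. The following is only a sketch of that idea: `make_compute_source` is a hypothetical helper name, the shader body is a placeholder, and compiling the resulting string is left to the usual GL calls.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not part of OpenGL): returns GLSL compute shader
// source with the requested local_size_x written in as an integer literal,
// which satisfies the "integer-constant" requirement of the layout qualifier.
std::string make_compute_source(unsigned local_size_x) {
    assert(local_size_x > 0);  // a work group needs at least one invocation
    return "#version 430\n"
           "layout (local_size_x = " + std::to_string(local_size_x) + ") in;\n"
           "void main() {\n"
           "    // kernel body goes here\n"
           "}\n";
}
```

The size is still fixed at compile time; the application just chooses it before the compile instead of at dispatch.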

laobrasuca · August 13, 2012, 4:17pm

yes, true, I had read it but didn’t realize.

concerning my first post, any thoughts?

thx,

lao

Dark_Photon · August 13, 2012, 5:16pm

Here’s a total guess. When you’re writing optimized OpenCL kernels, sometimes you want these to be constants defined in the shader source. For example, you might unroll some or all of the iterations and stop doing barrier/synchronization when the problem size is reduced to <= 1 thread (1 warp). You can encase these unrolls inside “if” checks based on the local size (items/workgroup, threads/block, etc.). If the local size is constant, then the compiler can remove these “if” checks altogether and only pull in the iterations (and memory barriers) that are absolutely needed, with no run-time conditional evaluation on the GPU.
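To make the unrolling argument concrete, here is a CPU-side analogy in C++17, not GPU code. The `reduce` function and its 8-element limit are assumptions made up for illustration: it runs serially and has no barriers, and `LocalSize` merely plays the role of the compile-time local size. Because `LocalSize` is a template constant, each `if constexpr` is resolved during compilation, so steps that do not apply to a given size never reach the generated code.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Sums LocalSize values with an unrolled tree reduction, mimicking the
// pattern one work group would apply to its shared memory.
// Illustrative only: serial CPU code, no barriers, sizes 1, 2, 4 or 8.
template <std::size_t LocalSize>
int reduce(std::array<int, LocalSize> data) {
    static_assert(LocalSize >= 1 && LocalSize <= 8 &&
                      (LocalSize & (LocalSize - 1)) == 0,
                  "this sketch handles power-of-two sizes up to 8");
    // Each check depends only on the constant LocalSize, so the compiler
    // keeps exactly the steps that apply and discards the rest.
    if constexpr (LocalSize >= 8) {
        for (std::size_t i = 0; i < 4; ++i) data[i] += data[i + 4];
    }
    if constexpr (LocalSize >= 4) {
        for (std::size_t i = 0; i < 2; ++i) data[i] += data[i + 2];
    }
    if constexpr (LocalSize >= 2) {
        data[0] += data[1];
    }
    return data[0];
}
```

If the size were only known at run time, each of those checks would have to be evaluated while the kernel runs.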

That said, I haven’t digested the compute shader extension yet, and am definitely no OpenCL/CUDA expert.

laobrasuca · August 14, 2012, 1:18pm

thanks for that. Optimized code must indeed be one of the main motivations. It may be linked to the fact that they trade floating-point precision for performance. All things considered, it is still OpenGL, so graphics oriented, and it makes sense that visual effects will not be too sensitive to some loss of generality or precision.

it is a good thing that the ARB finally decided to have a shader dedicated to general-purpose computing. Sadly, it arrived a year too late for me; I’ve already done all my stuff on CL/GL xD It will take me some time to get my hands on it and change all my libraries.

thx again,

lao

aqnuep · August 15, 2012, 8:01am

The reason is that the local work group size actually affects the shader code. Think about it: the local work group size affects the thread scheduling scheme and the shared memory usage pattern. The driver could hide this and let the developer supply the size at dispatch time, but that would probably still require a shader recompile, so the cost of a compute dispatch would not be predictable, even if the driver caches the shaders.

However, you can easily manage multiple local work group sizes yourself by creating multiple shaders with the local work group sizes of your choice and selecting the appropriate one when needed. This way there are no hidden costs and you can expect optimum dispatch speed.
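A rough sketch of the "select the appropriate one" step might look like this in C++. Everything here is an assumption for illustration: `Dispatch`, `plan_dispatch`, and the selection rule (largest precompiled size that does not exceed the item count, otherwise the smallest) are hypothetical, and a real application would pick by profiling. The group count is rounded up because glDispatchCompute() takes a number of work groups, so the total invocation count is always a multiple of the local size.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Which precompiled shader variant to bind and how many groups to dispatch.
struct Dispatch {
    std::uint32_t local_size_x;  // local size baked into the chosen variant
    std::uint32_t num_groups_x;  // value to pass to glDispatchCompute()
};

// Hypothetical selection rule: use the largest precompiled local size that
// does not exceed item_count, falling back to the smallest one available.
Dispatch plan_dispatch(const std::vector<std::uint32_t>& compiled_sizes,
                       std::uint32_t item_count) {
    assert(!compiled_sizes.empty());
    std::uint32_t smallest = compiled_sizes.front();
    std::uint32_t best = 0;  // 0 means "no size fits yet"
    for (std::uint32_t size : compiled_sizes) {
        assert(size > 0);
        if (size < smallest) smallest = size;
        if (size <= item_count && size > best) best = size;
    }
    const std::uint32_t local = (best != 0) ? best : smallest;
    // Round up so every item is covered; written to avoid integer overflow.
    const std::uint32_t groups =
        item_count / local + (item_count % local != 0 ? 1u : 0u);
    return {local, groups};
}
```

When the item count is not a multiple of the local size, the last group contains surplus invocations, so the shader should compare gl_GlobalInvocationID.x against the item count before touching memory.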

Regarding GL compute shaders over CL-GL interop, well, you should definitely be able to gain some performance by using GL compute shaders as no matter how nice is CL-GL interop, developers often complain about its performance hit due to synchronization between the contexts. GL compute shaders are not affected by such cross-context synchronization issues.

Dark_Photon · August 15, 2012, 6:30pm

Re “no matter how nice is CL-GL interop”, au contraire! It’s not nice at all. GL_ARB_cl_event / cl_khr_gl_event have never shipped in the drivers I care about. And that leaves you with the glFinish()/clFinish() sledgehammer approach and massive pipeline bubbles to show for it.

Forget even running CL/GL tasks in parallel on the GPU at once. You can’t even run them in series on the GPU at once.

I, for one, heartily welcome our new Compute Shader overlords!
