Let’s say I’ve got a compute shader that does a whole lot of matrix and vector math, and at the end of this math it performs a dot product between two vectors, producing a single scalar value. This shader is executed a potentially large number of times on different data, and since it’s a shader it does so essentially in parallel. The net result is that each “thread” of the shader produces its own scalar value, so I end up with an array of these values, one element per “thread”. Is there any way to have the shader merge the individual values into a single number? An example would be taking the average of all the values, or producing a single boolean based on the signs of all the scalar values.

I realize that I could just use the CPU to iterate through all the results from the shader and compute such things that way, but I thought it might be more efficient to have the shader, which is already doing lots of computation on these values, simply merge the results of multiple threads into the single value I’m interested in.

One possibility I thought could make sense is some kind of “global” variable location in GLSL. For the averaging example it could simply start with a value of 0, and each shader “thread” would add its result to this variable. In the end I’d have a single value that just needs to be divided by the number of threads, and I’d have the average. However, I suspect that doing such a thing would force the threads to take turns adding their results to the global variable, which is the same as iteration in the end, and I’d be better off using my CPU.
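To make the idea concrete, here is a rough sketch of what I have in mind. (I gather GLSL 4.3 does offer atomic operations on buffer variables, but core `atomicAdd` only works on `int`/`uint`; a floating-point atomic add needs a vendor extension such as `GL_NV_shader_atomic_float`, so this sketch accumulates a scaled fixed-point integer instead. The buffer name and binding point are made up.)

```glsl
#version 430

layout (local_size_x = 128) in;

uniform vec3 A;
uniform vec3 B;

// Hypothetical accumulator buffer; the CPU clears 'total' to 0 before dispatch.
layout (std430, binding = 0) buffer Accumulator
{
    int total; // fixed-point sum (value * 1000), since core atomicAdd is integer-only
};

void main (void)
{
    float result = dot(A, B);
    // Every invocation adds its contribution; the hardware serializes the adds.
    atomicAdd(total, int(result * 1000.0));
    // CPU side afterwards: average = (total / 1000.0) / invocationCount
}
```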

Anyway, I’m still learning about graphics cards, and I was mostly just curious to know whether (and how) separate GL threads can effectively share and condense results. I’ve heard of people using GLSL to implement sorting algorithms in parallel, which seems like a related challenge to me.

There is a problem: not all “threads” can share data. Only invocations within the same work group can share data (through `shared` variables).

Another important limitation: a work group can have at most `GL_MAX_COMPUTE_WORK_GROUP_INVOCATIONS` invocations (the local size); the spec only guarantees that this limit is at least 1024.

Okay, so how does this example look? It is untested and not useful as-is, since A and B are uniforms (so every thread computes the same dot product), but I’m just curious whether the logic behind it makes sense.

```glsl
#version 430

layout (local_size_x = 128) in; // small number just for this example

uniform vec3 A;
uniform vec3 B;

shared float outArray[128];

void main (void)
{
    outArray[gl_LocalInvocationIndex] = dot(A, B);

    uint stride;
    for (stride = 2u; stride <= gl_WorkGroupSize.x; stride *= 2u)
    {
        memoryBarrierShared(); // ensure writes to shared variables are visible to other invocations
        barrier();             // stall until every invocation has finished writing

        if (gl_LocalInvocationIndex % stride == 0u) // keep only half of the previously active threads
        {
            uint dist = stride / 2u; // the distance to the partner element
            outArray[gl_LocalInvocationIndex] += outArray[gl_LocalInvocationIndex + dist]; // accumulate the sum
        }
        // The inactive threads cannot simply 'return' here: barrier() must be
        // reached by every invocation in the work group, so they keep looping.
    }

    memoryBarrierShared();
    barrier(); // make the final sum visible to every invocation

    // Only outArray[0] holds the complete sum at this point.
    float Average = outArray[0] / float(gl_WorkGroupSize.x);
}
```
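One thing I realize this example never does is make the result visible outside the shader: `Average` is a local variable, so it is discarded. I assume the fix is an output buffer with one slot per work group, written by a single invocation, something like this (buffer name and binding point made up; the CPU would have to allocate one float per dispatched work group):

```glsl
// Added to the shader above:
layout (std430, binding = 0) buffer Results
{
    float groupAverages[];
};

// ...and at the end of main(), after the reduction loop:
if (gl_LocalInvocationIndex == 0u)
{
    // Only one invocation per group writes, avoiding a race.
    groupAverages[gl_WorkGroupID.x] = outArray[0] / float(gl_WorkGroupSize.x);
}
```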

Thanks btw.