How does the geometry shader process instances in parallel?

#1

Hi guys, I read the doc on the wiki, which says “Layered rendering can be more efficient with GS instancing, as different GS invocations can process instances in parallel”. I want to know how the GS does this in parallel: does it need hardware support, or does it process in parallel by nature? Is there any doc that explains GS parallelism?

I know this may be a simple and stupid question, but I hope you guys can help me. Thanks!

#2

What that means is “in parallel compared to not using GS instancing”.

To perform layered rendering for the purpose suggested by the Wiki means to take the same primitive and send it to multiple different layers, probably using different transformation matrices.

If you’re not using GS instancing, then your GS looks something like this:

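for (int layer_ix = 0; layer_ix < number_of_layers; ++layer_ix)
{
  for (int i = 0; i < gl_in.length(); ++i)
  {
    // transform_vertex stands for whatever per-layer transform you apply
    gl_Position = transform_vertex(layer_ix, gl_in[i].gl_Position);
    gl_Layer = layer_ix;
    EmitVertex();
  }
  EndPrimitive();
}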

So each geometry shader invocation will generate number_of_layers primitives. This generation happens sequentially: each GS invocation computes the primitives for all of the layers, one after the other. Multiple GS invocations can be running at the same time, of course, but each one will be outputting its own set of number_of_layers primitives.

This also means that each GS invocation needs a much larger buffer to store output primitive data into, since each invocation is writing a lot of primitive data.
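To make the loop version concrete, here is a minimal sketch of a complete shader. The layer count of 4, the triangle input, and the uniform name layerViewProj are assumptions for illustration, not anything the Wiki prescribes:

```glsl
#version 330 core

// Assumed for illustration: 4 layers, one view-projection matrix per layer.
const int NUM_LAYERS = 4;

layout(triangles) in;
// One invocation emits every layer's triangle: 3 vertices * 4 layers = 12.
layout(triangle_strip, max_vertices = 12) out;

uniform mat4 layerViewProj[NUM_LAYERS]; // hypothetical uniform name

void main()
{
    for (int layer = 0; layer < NUM_LAYERS; ++layer)
    {
        for (int i = 0; i < gl_in.length(); ++i)
        {
            gl_Position = layerViewProj[layer] * gl_in[i].gl_Position;
            gl_Layer = layer; // route this primitive to its layer
            EmitVertex();
        }
        EndPrimitive();
    }
}
```

Note that max_vertices has to cover the output for all layers, which is where the larger output buffer comes from.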

By using GS instancing, your GS now can look like this:

layout(invocations = number_of_layers) in; // must be a constant, not a uniform
...
for (int i = 0; i < gl_in.length(); ++i)
{
  gl_Position = transform_vertex(gl_InvocationID, gl_in[i].gl_Position);
  gl_Layer = gl_InvocationID;
  EmitVertex();
}
EndPrimitive();

Each GS invocation only emits a single primitive. So multiple GS invocations can be working with the same input primitive and emit data for different output primitives for different layers in parallel.

This also means that each GS invocation is much shorter and doesn’t need nearly as much storage for its outputs (since it is only outputting a single primitive).
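As a complete shader, the instanced version might look like the sketch below. Again, the layer count of 4, the triangle input, and the uniform name layerViewProj are assumptions for illustration. As for hardware support: the invocations qualifier needs GLSL 4.00 (OpenGL 4.0) or the ARB_gpu_shader5 extension, and whether the invocations actually execute concurrently is up to the implementation.

```glsl
#version 400 core

// Assumed for illustration: 4 layers, one view-projection matrix per layer.
const int NUM_LAYERS = 4;

// Run the shader 4 times per input triangle; gl_InvocationID is 0..3.
layout(triangles, invocations = 4) in;
// Each invocation emits only its own layer's triangle.
layout(triangle_strip, max_vertices = 3) out;

uniform mat4 layerViewProj[NUM_LAYERS]; // hypothetical uniform name

void main()
{
    for (int i = 0; i < gl_in.length(); ++i)
    {
        gl_Position = layerViewProj[gl_InvocationID] * gl_in[i].gl_Position;
        gl_Layer = gl_InvocationID; // one layer per invocation
        EmitVertex();
    }
    EndPrimitive();
}
```

Here max_vertices drops to 3, since each invocation only has to hold one triangle's worth of output.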