Doing some reading, I see indications (but not outright statements) in GPU vendor documentation that GPUs do not pack vertices across different instances in an instanced draw, nor vertices across different draws in a MultiDrawIndirect draw call, into shared thread groups (warps or wavefronts). I also find devs on the net saying that GPUs do not do this, but without a pointer to vendor docs as the source.

My question is, does anyone know whether this is (or is not) the case?

(If so, I guess this could help explain why the performance of geometry instancing and/or MDI can be low with trivially-simple instances/draws (i.e. few vertices per instance or draw) due to low occupancy doing vertex shader transforms.)
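To put rough numbers on that guess, here is a minimal back-of-the-envelope sketch. It assumes a thread group of 32 lanes, one vertex shader invocation per vertex (no vertex reuse), and the unconfirmed "no packing across instances" behaviour asked about above. The function name and parameters are made up for illustration, and "utilization" here just means the fraction of lanes doing useful work.

```python
from math import ceil


def lane_utilization(vertices_per_instance, instances,
                     group_size=32, pack_across_instances=False):
    """Fraction of thread-group lanes that carry a real vertex.

    Assumption (not confirmed by vendor docs): when pack_across_instances
    is False, every instance starts in a fresh thread group.
    """
    total_vertices = vertices_per_instance * instances
    if total_vertices == 0:
        return 0.0
    if pack_across_instances:
        # Vertices from all instances fill groups back to back.
        groups = ceil(total_vertices / group_size)
    else:
        # Each instance rounds up to whole groups on its own.
        groups = instances * ceil(vertices_per_instance / group_size)
    return total_vertices / (groups * group_size)


# A 4-vertex quad per instance, 1000 instances:
print(lane_utilization(4, 1000))                              # unpacked
print(lane_utilization(4, 1000, pack_across_instances=True))  # packed
```

Under these assumptions a 4-vertex instance would fill only 4 of 32 lanes, which is the kind of waste the question is about.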


Well, there is one thing that is known for certain: distinct draws represent distinct “invocation groups”. We know this because, while SPIR-V says that the invocation group arrangement for shaders is “implementation dependent”, the Vulkan specification explicitly requires that distinct draws in a multidraw indirect command represent different invocation groups.

The most recent OpenGL/GLSL specifications have been updated with similar language, with GLSL adopting the “invocation group” wording and section 7.9 of the OpenGL spec being added that says “For MultiDraw* commands with drawcount greater than one, invocations from separate draws are in distinct invocation groups.”

Now to the question: is “invocation group” necessarily equivalent to “warp/wavefront”? Invocation groups, in SPIR-V and GLSL, speak directly towards dynamically uniform control flow and expressions. Given what we know about how GPU hardware implements dynamic uniform constructs (namely, that breaking dynamic uniformity may cause warp/wavefront divergence), it stands to reason that invocations from different invocation groups cannot go into the same warp/wavefront.

So I would take that as a strong sign that GPU hardware does not execute invocations from distinct draw operations in the same warp/wavefront. And more specifically, they are explicitly forbidden from doing so by both Vulkan and OpenGL.

I can’t speak to the rest. gl_DrawID is explicitly stated to be dynamically uniform, but gl_InstanceID is not. Combined with the fact that SPIR-V/GLSL specifically allow invocation groups to contain multiple instances, that strongly suggests that at least some GPU vendors do allow instances to run in the same warp/wavefront.

As for GS instancing, I think it would rather defeat the purpose of the whole idea if separate instances couldn’t execute in the same warp/wavefront.


Thanks Alfonse! Great info, and a pretty strong argument that MDI draws aren’t packed into shared thread groups.

Doing more reading up on this today, this perf problem with small instances or MDI draws seems to be pretty well known by some. It also sounds like NVIDIA Task and Mesh Shaders allow you to get around this by re-packing the input work as you see fit (unfortunately, those aren’t an option for us yet). In the absence of that feature, I suspect that pseudo-instancing might be the best option for obtaining good GPU utilization on vertex work with simple meshes.

A few links mentioning the “simple mesh” perf problems associated with “instancing” and “multi-draw indirect” (MDI) along with task and mesh shaders:


  • Use Draw Indirect to generate variable amount of data
  • Preferably avoid low primitive counts (risk of being FrontEnd-limited)

DrawArrays {
GLuint count;


The optional expansion via task shaders allows early culling of a group of primitives or making LOD decisions upfront.
The mechanism scales across the GPU and is therefore superseding instancing or multi draw indirect for small meshes.

Mesh Shaders in Turing (10/2018)

Slide 4:

  • Instancing of very basic shapes

Slide 6:

  • Instancing, Multi-Draw-Indirect etc. already in use

Slide 32:

Some scenes suffer from low-complexity drawcalls (< 512 triangles)

Task shaders can serve as faster alternative to Multi Draw Indirect (MDI)

  • MDI or instanced drawing can still be bottlenecked by GPU
  • Task shaders provide distributed draw call generation across chip
  • Also more flexible than classic instancing (change LOD etc.)

Task Shader Overhead

The other option (not yet used in this sample) is to batch drawcalls with few meshlets into bigger drawcalls, so that the task shader stage becomes more effective again. Task shaders can serve as alternative to instancing/multi-draw-indirect as they can dispatch mesh shaders in a distributed manner.

Especially in models with many small objects, such a technique is highly recommended (e.g. low-complexity furniture/properties in architectural visualization, nuts and bolts, guardrails etc.)

We can easily batch 32 small drawcalls into a single drawcall by summing the task counts over all batched drawcalls.

Inside the first shader stage we use warp (subgroup) intrinsics to find which actual sub-drawcall we are in.

At the cost of some additional latency you can extend this to a total of 32 * 32 batched drawcalls…
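To make the batching idea above concrete, here is a rough CPU-side model of the lookup. It is only a sketch: the quoted text does this in the first shader stage with warp (subgroup) intrinsics, whereas this version uses a prefix sum plus a binary search to get the same answer. All function names are hypothetical.

```python
from bisect import bisect_right
from itertools import accumulate


def build_offsets(task_counts):
    """Inclusive prefix sums of the per-sub-drawcall task counts.

    The last entry is the task count of the single batched drawcall.
    """
    return list(accumulate(task_counts))


def locate_sub_drawcall(offsets, global_task):
    """Map a task index in the batched drawcall to (sub_drawcall, local_task)."""
    if not offsets or not 0 <= global_task < offsets[-1]:
        raise IndexError("task index outside the batched drawcall")
    # First sub-drawcall whose inclusive end lies beyond global_task.
    sub = bisect_right(offsets, global_task)
    start = offsets[sub - 1] if sub > 0 else 0
    return sub, global_task - start


offsets = build_offsets([3, 1, 4])  # three small drawcalls batched into one
print(offsets[-1])                  # total tasks to launch
print(locate_sub_drawcall(offsets, 3))
```

Sub-drawcalls with a task count of zero are skipped naturally, since no task index ever falls inside their empty range.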

Instancing, and later multi-draw, allowed certain sets of draw calls to be combined together; indirect draws could be generated on the GPU itself. …

Instancing can only draw copies of a single mesh at a time; multi-draw is still inefficient for large numbers of small draws.

An Upgrade Path

The other really neat thing about mesh shaders is that they don’t require you to drastically rework how your game engine handles geometry to take advantage of them. …

Instanced draws are straightforward: multiply the meshlet count and put in a bit of shader logic to hook up instance parameters.
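As I read it, the "bit of shader logic" for the instanced case could be as small as a divide and a remainder. The sketch below models that on the CPU. It assumes mesh shader workgroups are numbered linearly and laid out instance-major, which is my own choice and not something the quoted article states. The names are hypothetical.

```python
def mesh_workgroup_count(meshlets_per_mesh, instance_count):
    """Total mesh shader workgroups: one per (instance, meshlet) pair."""
    return meshlets_per_mesh * instance_count


def decode_workgroup(workgroup_index, meshlets_per_mesh):
    """Split a linear workgroup index into (instance, meshlet).

    Assumes all meshlets of instance 0 come first, then instance 1, etc.
    """
    instance, meshlet = divmod(workgroup_index, meshlets_per_mesh)
    return instance, meshlet


print(mesh_workgroup_count(5, 3))  # workgroups to launch
print(decode_workgroup(7, 5))      # which instance and meshlet group 7 handles
```

The decoded instance index is what would be used to fetch per-instance parameters such as a transform.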

A more interesting case is multi-draw, where we want to draw a lot of meshes that aren’t all copies of the same thing. For this, we can employ task shaders – a secondary feature of the mesh shader pipeline. Task shaders add an extra layer of compute-style work groups, running before the mesh shader, and they control how many mesh shader work groups to launch. They can also write output variables to be consumed by the mesh shader. A very efficient multi-draw should be possible by launching task shaders with a thread per draw, which in turn launch the mesh shaders for all the individual draws.

If we need to draw a lot of very small meshes, such as quads for particles/imposters/text/point-based rendering, or boxes for occlusion tests / projected decals and whatnot, then we can pack a bunch of them into each mesh shader workgroup. The geometry can be generated entirely in-shader rather than relying on a pre-initialized index buffer from the CPU. (This was one of the original use cases that, it was hoped, could be done with geometry shaders – e.g. submitting point primitives, and having the GS expand them into quads.) There’s also a lot of flexibility to do stuff with variable topology, like particle beams/strips/ribbons, which would otherwise need to be generated either on the CPU or in a separate compute pre-pass.
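For the small-mesh packing case, the arithmetic is simple enough to sketch. The per-workgroup limits below (64 vertices, 126 primitives) are illustrative defaults that I am assuming, not numbers from the quoted article; substitute whatever your device and shader actually declare. The function names and the triangle winding are likewise hypothetical.

```python
from math import ceil

VERTS_PER_QUAD = 4
TRIS_PER_QUAD = 2


def quads_per_workgroup(max_vertices=64, max_primitives=126):
    """How many quads fit in one mesh shader workgroup's output budget."""
    return min(max_vertices // VERTS_PER_QUAD, max_primitives // TRIS_PER_QUAD)


def workgroups_for_quads(quad_count, max_vertices=64, max_primitives=126):
    """Workgroups needed to emit quad_count quads."""
    per_group = quads_per_workgroup(max_vertices, max_primitives)
    if per_group == 0:
        raise ValueError("limits too small to hold a single quad")
    return ceil(quad_count / per_group)


def quad_indices(quad_in_group):
    """Two triangles for one quad, generated in-shader style (no index buffer)."""
    base = quad_in_group * VERTS_PER_QUAD
    return [(base, base + 1, base + 2), (base + 2, base + 1, base + 3)]


print(quads_per_workgroup())       # quads packed per workgroup
print(workgroups_for_quads(1000))  # workgroups for 1000 particles
```

With these assumed limits the vertex budget, not the primitive budget, is what caps the number of quads per workgroup.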