Doing some reading, I see indications (but not outright statements) in GPU vendor documentation that GPUs do not pack vertices across different instances in an instanced draw, nor vertices across different draws in a MultiDrawIndirect draw call, into shared thread groups (warps or wavefronts). I also find devs on the net saying that GPUs do not do this, but without a pointer to vendor docs as the source.
My question is, does anyone know whether this is (or is not) the case?
(If so, I guess this could help explain why the performance of geometry instancing and/or MDI can be low with trivially-simple instances/draws (i.e. few vertices per instance or draw) due to low occupancy doing vertex shader transforms.)
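To put rough numbers on that occupancy concern, here is a minimal sketch. It assumes a 32-wide thread group, one vertex shader invocation per vertex (no vertex reuse), and that the hardware really does not share a thread group between instances or draws, which is exactly the open question. The function name `lane_utilization` is made up for illustration.

```python
import math

def lane_utilization(vertices_per_group: int, warp_size: int = 32) -> float:
    """Fraction of lanes doing useful vertex work when each instance/draw
    must start on a fresh warp (ASSUMPTION: no packing across instances/draws)."""
    if vertices_per_group <= 0 or warp_size <= 0:
        raise ValueError("counts must be positive")
    # Warps needed to cover this instance/draw on its own.
    warps = math.ceil(vertices_per_group / warp_size)
    return vertices_per_group / (warps * warp_size)

if __name__ == "__main__":
    for n in (4, 24, 32, 36):
        print(f"{n:3d} vertices -> {lane_utilization(n):.4f} of lanes busy")
```

Under those assumptions a 4-vertex quad per instance keeps only 4 of 32 lanes busy, and a 36-vertex mesh spills into a second, mostly empty warp.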
Well, there is one thing that is known for certain: distinct draws represent distinct “invocation groups”. We know this because, while SPIR-V says that the invocation group arrangement for shaders is “implementation dependent”, the Vulkan specification explicitly requires that distinct draws in a multidraw indirect command represent different invocation groups.
The most recent OpenGL/GLSL specifications have been updated with similar language: GLSL has adopted the “invocation group” wording, and the OpenGL spec has gained a section 7.9 that says “For MultiDraw* commands with drawcount greater than one, invocations from separate draws are in distinct invocation groups.”
Now to the question: is “invocation group” necessarily equivalent to “warp/wavefront”? Invocation groups, in SPIR-V and GLSL, speak directly towards dynamically uniform control flow and expressions. Given what we know about how GPU hardware implements dynamic uniform constructs (namely, that breaking dynamic uniformity may cause warp/wavefront divergence), it stands to reason that invocations from different invocation groups cannot go into the same warp/wavefront.
So I would take that as a strong sign that GPU hardware does not execute invocations from distinct draw operations in the same warp/wavefront. And more specifically, they are explicitly forbidden from doing so by both Vulkan and OpenGL.
I can’t speak to the rest. gl_DrawID is explicitly stated to be dynamically uniform, but gl_InstanceID is not. Combined with the fact that SPIR-V/GLSL specifically allow invocation groups to contain multiple instances, that strongly suggests that at least some GPU vendors do allow instances to run in the same warp/wavefront.
As for GS instancing, I think it would rather defeat the purpose of the whole idea if separate instances couldn’t execute in the same warp/wavefront.
Thanks Alfonse! Great info, and a pretty strong argument that MDI draws aren’t packed into shared thread groups.
Doing more reading up on this today, this perf problem with small instances or MDI draws seems to be pretty well known by some. It also sounds like NVIDIA Task and Mesh Shaders allow you to get around this by re-packing the input work as you see fit (unfortunately, those aren’t an option for us yet). In the absence of that feature, I suspect that pseudo-instancing might be the best option for obtaining good GPU utilization on vertex work with simple meshes.
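For what it's worth, here is a rough sketch of the index math I have in mind. It assumes "pseudo-instancing" means issuing one plain, non-instanced draw whose vertex count covers many copies of the mesh, then recovering the copy index and the local vertex index from the vertex ID in the shader. That reading, the helper names, and the 32-wide warp are all assumptions for illustration, and this is Python standing in for shader code.

```python
from typing import Tuple

def unpack_vertex_id(vertex_id: int, verts_per_mesh: int) -> Tuple[int, int]:
    """Split a flat vertex ID into (copy index, vertex index within the mesh).
    In a shader this would be an integer divide and modulo on the vertex ID."""
    if verts_per_mesh <= 0:
        raise ValueError("verts_per_mesh must be positive")
    return divmod(vertex_id, verts_per_mesh)

def warps_needed(total_invocations: int, warp_size: int = 32) -> int:
    """Warps needed if the invocations can be packed back to back."""
    return -(-total_invocations // warp_size)  # ceiling division

if __name__ == "__main__":
    copies, verts = 100, 4  # e.g. 100 quads of 4 vertices each
    print(unpack_vertex_id(9, verts))            # vertex 9 -> copy 2, local vertex 1
    print(warps_needed(copies * verts))          # one big draw, packed
    print(copies * warps_needed(verts))          # one warp (or more) per copy
```

With 100 four-vertex quads, the single packed draw needs 13 warps in this model, versus 100 if every instance has to start its own warp. The per-copy data then has to be fetched by copy index from a buffer rather than arriving as instanced attributes.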
A few links mentioning the “simple mesh” perf problems associated with “instancing” and “multi-draw indirect” (MDI) along with task and mesh shaders: