Thanks Alfonse! Great info, and a pretty strong argument that MDI draws aren’t packed into shared thread groups.
After reading more about this today, it seems this perf problem with small instanced or MDI draws is already pretty well known to some. It also sounds like NVIDIA Task and Mesh Shaders let you get around it by re-packing the input work as you see fit (unfortunately, those aren’t an option for us yet). In the absence of that feature, I suspect that pseudo-instancing might be the best option for obtaining good GPU utilization on vertex work with simple meshes.
A few links mentioning the “simple mesh” perf problems associated with “instancing” and “multi-draw indirect” (MDI) along with task and mesh shaders:
Turing introduces a new programmable geometric shading pipeline, mesh shaders, enabling threads to cooperatively generate compact meshes on the chip.
Mesh Shaders in Turing (10/2018)
Slide 4:
MOTIVATION
AUXILIARY MESHES
Instancing of very basic shapes
Slide 6:
GENERAL DIRECTIONS
IMPROVE RAW RENDERING
Instancing, Multi-Draw-Indirect etc. already in use
Slide 32:
TINY DRAW CALLS
Some scenes suffer from low-complexity drawcalls (< 512 triangles)
Task shaders can serve as faster alternative to Multi Draw Indirect (MDI)
MDI or instanced drawing can still be bottlenecked by GPU
Task shaders provide distributed draw call generation across chip
Also more flexible than classic instancing (change LOD etc.)
Task Shader Overhead
…
The other option (not yet used in this sample) is to batch drawcalls with few meshlets into bigger drawcalls, so that the task shader stage becomes more effective again. Task shaders can serve as an alternative to instancing/multi-draw-indirect as they can dispatch mesh shaders in a distributed manner.
Especially in models with many small objects, such a technique is highly recommended (e.g. low-complexity furniture/properties in architectural visualization, nuts and bolts, guardrails etc.)
We can easily batch 32 small drawcalls into a single drawcall by summing the task counts over all batched drawcalls.
[C++ code snippet]
Inside the first shader stage we use warp (subgroup) intrinsics to find which actual sub-drawcall we are in.
[GL compute shader code snippet]
At the cost of some additional latency you can extend this to a total of 32 * 32 batched drawcalls…
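As a rough illustration of the batching described above (not the sample’s actual code, which isn’t reproduced here), the idea can be emulated on the CPU in C++. Everything named here (`BatchedDraw`, `buildBatch`, `findSubDraw`) is hypothetical. The loop over lanes stands in for a subgroup ballot, assuming a 32-wide subgroup where lane i holds the first task index of sub-draw i.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical record for one combined draw; names are illustrative only.
struct BatchedDraw {
    std::vector<uint32_t> firstTask; // exclusive prefix sum of task counts, one entry per sub-draw
    uint32_t totalTasks = 0;         // task count submitted for the combined draw
};

// Sum the task counts of up to 32 small draws into a single combined draw.
BatchedDraw buildBatch(const std::vector<uint32_t>& taskCounts) {
    assert(taskCounts.size() <= 32 && "assumes one sub-draw per lane of a 32-wide subgroup");
    BatchedDraw batch;
    for (uint32_t count : taskCounts) {
        batch.firstTask.push_back(batch.totalTasks);
        batch.totalTasks += count;
    }
    return batch;
}

// CPU stand-in for the subgroup lookup: every lane votes on whether its
// sub-draw starts at or before taskIndex, and the number of votes identifies
// the sub-draw that owns taskIndex.
uint32_t findSubDraw(const BatchedDraw& batch, uint32_t taskIndex) {
    assert(taskIndex < batch.totalTasks);
    uint32_t ballot = 0;
    for (uint32_t lane = 0; lane < batch.firstTask.size(); ++lane) {
        if (taskIndex >= batch.firstTask[lane]) {
            ballot |= (1u << lane);
        }
    }
    uint32_t votes = 0; // population count of the ballot mask
    for (uint32_t bits = ballot; bits != 0; bits &= bits - 1) {
        ++votes;
    }
    return votes - 1;
}
```

Because the prefix sums never decrease, the votes always form a contiguous run starting at lane 0, so the vote count minus one is the index of the owning sub-draw, and sub-draws with zero tasks are skipped naturally.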
Instancing, and later multi-draw, allowed certain sets of draw calls to be combined together; indirect draws could be generated on the GPU itself. …
Instancing can only draw copies of a single mesh at a time; multi-draw is still inefficient for large numbers of small draws.
An Upgrade Path
The other really neat thing about mesh shaders is that they don’t require you to drastically rework how your game engine handles geometry to take advantage of them. …
Instanced draws are straightforward: multiply the meshlet count and put in a bit of shader logic to hook up instance parameters.
A more interesting case is multi-draw, where we want to draw a lot of meshes that aren’t all copies of the same thing. For this, we can employ task shaders – a secondary feature of the mesh shader pipeline. Task shaders add an extra layer of compute-style work groups, running before the mesh shader, and they control how many mesh shader work groups to launch. They can also write output variables to be consumed by the mesh shader. A very efficient multi-draw should be possible by launching task shaders with a thread per draw, which in turn launch the mesh shaders for all the individual draws.
If we need to draw a lot of very small meshes, such as quads for particles/imposters/text/point-based rendering, or boxes for occlusion tests / projected decals and whatnot, then we can pack a bunch of them into each mesh shader workgroup. The geometry can be generated entirely in-shader rather than relying on a pre-initialized index buffer from the CPU. (This was one of the original use cases that, it was hoped, could be done with geometry shaders – e.g. submitting point primitives, and having the GS expand them into quads.) There’s also a lot of flexibility to do stuff with variable topology, like particle beams/strips/ribbons, which would otherwise need to be generated either on the CPU or in a separate compute pre-pass.
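To make the “pack many tiny meshes into one workgroup” idea a bit more concrete, here is a minimal CPU-side sketch in C++ of the arithmetic a mesh shader workgroup might perform when expanding points into quads. The names (`Float2`, `QuadBatch`, `expandPointsToQuads`) and the per-group cap `kMaxQuadsPerGroup` are assumptions made for illustration, not values taken from any spec or from the quoted article.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

struct Float2 { float x, y; };

// Hypothetical output of one workgroup: flat vertex and index arrays.
struct QuadBatch {
    std::vector<Float2> positions;  // 4 vertices per quad
    std::vector<uint32_t> indices;  // 6 indices (2 triangles) per quad
};

// Assumed cap on quads per workgroup; real limits depend on the hardware and API.
constexpr uint32_t kMaxQuadsPerGroup = 16;

// Expand each point into an axis-aligned quad, generating both vertices and
// indices procedurally instead of reading a pre-built index buffer.
QuadBatch expandPointsToQuads(const std::vector<Float2>& centers, float halfSize) {
    assert(centers.size() <= kMaxQuadsPerGroup);
    QuadBatch out;
    out.positions.reserve(centers.size() * 4);
    out.indices.reserve(centers.size() * 6);
    const std::array<uint32_t, 6> corner = {0, 1, 2, 0, 2, 3};
    for (uint32_t q = 0; q < centers.size(); ++q) {
        const Float2 c = centers[q];
        const uint32_t base = q * 4; // first vertex of this quad
        out.positions.push_back({c.x - halfSize, c.y - halfSize});
        out.positions.push_back({c.x + halfSize, c.y - halfSize});
        out.positions.push_back({c.x + halfSize, c.y + halfSize});
        out.positions.push_back({c.x - halfSize, c.y + halfSize});
        for (uint32_t k : corner) {
            out.indices.push_back(base + k);
        }
    }
    return out;
}
```

In an actual mesh shader each quad would presumably be handled by its own thread (or a few threads) rather than a serial loop, but the index arithmetic would be the same.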