GPU vertex dispatch for MultiDrawIndirect and/or Instanced draw calls

Dark_Photon · January 22, 2019, 2:18pm

Doing some reading, I see indications (but not outright statements) in GPU vendor documentation that GPUs do not pack:

vertices across different instances in an instanced draw, nor
vertices across different draws in a MultiDrawIndirect draw call

into shared thread groups (warps or wavefronts). I also find devs on the net saying that GPUs do not do this, but without a pointer to vendor docs as the source.

My question is, does anyone know whether this is (or is not) the case?

(If so, I guess this could help explain why the performance of geometry instancing and/or MDI can be low with trivially-simple instances/draws (i.e. few vertices per instance or draw) due to low occupancy doing vertex shader transforms.)

Alfonse_Reinheart · January 22, 2019, 5:04pm

Well, there is one thing that is known for certain: distinct draws represent distinct “invocation groups”. We know this because, while SPIR-V says that the invocation group arrangement for shaders is “implementation dependent”, the Vulkan specification explicitly requires that distinct draws in a multidraw indirect command represent different invocation groups.

The most recent OpenGL/GLSL specifications have been updated with similar language, with GLSL adopting the “invocation group” wording and section 7.9 of the OpenGL spec being added that says “For MultiDraw* commands with drawcount greater than one, invocations from separate draws are in distinct invocation groups.”

Now to the question: is “invocation group” necessarily equivalent to “warp/wavefront”? Invocation groups, in SPIR-V and GLSL, speak directly towards dynamically uniform control flow and expressions. Given what we know about how GPU hardware implements dynamic uniform constructs (namely, that breaking dynamic uniformity may cause warp/wavefront divergence), it stands to reason that invocations from different invocation groups cannot go into the same warp/wavefront.

So I would take that as a strong sign that GPU hardware does not execute invocations from distinct draw operations in the same warp/wavefront. More specifically, implementations are explicitly forbidden from doing so by both Vulkan and OpenGL.

I can’t speak to the rest. gl_DrawID is explicitly stated to be dynamically uniform, but gl_InstanceID is not. Combined with the fact that SPIR-V/GLSL specifically allow invocation groups to contain multiple instances, that strongly suggests that at least some GPU vendors do allow instances to run in the same warp/wavefront.

As for GS instancing, I think it would rather defeat the purpose of the whole idea if separate instances couldn’t execute in the same warp/wavefront.

Dark_Photon · January 22, 2019, 6:38pm

Thanks Alfonse! Great info, and a pretty strong argument that MDI draws aren’t packed into shared thread groups.

Doing more reading up on this today, this perf problem with small instances or MDI draws seems to be pretty well known by some. It also sounds like NVidia Task and Mesh Shaders allow you to get around this by re-packing the input work as you see fit (unfortunately, those aren’t an option for us yet). In the absence of that feature, I suspect that pseudo-instancing might be the best option for obtaining good GPU utilization on vertex work with simple meshes.

A few links mentioning the “simple mesh” perf problems associated with “instancing” and “multi-draw indirect” (MDI) along with task and mesh shaders:

Task Shader Overhead
…

The other option (not yet used in this sample) is to batch drawcalls with few meshlets into bigger drawcalls, so that the task shader stage becomes more effective again. Task shaders can serve as an alternative to instancing/multi-draw-indirect as they can dispatch mesh shaders in a distributed manner.

Especially in models with many small objects, such a technique is highly recommended (e.g. low-complexity furniture/properties in architectural visualization, nuts and bolts, guardrails etc.)

We can easily batch 32 small drawcalls into a single drawcall by summing the task counts over all batched drawcalls.

[C++ code snippet]

Inside the first shader stage we use warp (subgroup) intrinsics to find which actual sub-drawcall we are in.

[GL compute shader code snippet]

At the cost of some additional latency you can extend this to a total of 32 * 32 batched drawcalls…
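The snippets from the quoted sample are not reproduced here, so as a rough illustration only: the batching idea boils down to a prefix sum over the per-drawcall task counts, followed by a lookup from a global task index back to its sub-drawcall. The sketch below does that lookup with a CPU binary search as a stand-in for the warp/subgroup intrinsics the sample mentions. The function names are made up for this example and are not taken from the sample.

```python
from bisect import bisect_right
from itertools import accumulate


def inclusive_task_ends(task_counts):
    """Inclusive prefix sums: ends[i] is one past the last global task of sub-draw i."""
    return list(accumulate(task_counts))


def find_sub_draw(ends, task_index):
    """Map a global task index to (sub-draw index, task index local to that sub-draw)."""
    if not 0 <= task_index < (ends[-1] if ends else 0):
        raise IndexError("task index outside the batched drawcall")
    # First sub-draw whose end lies strictly beyond task_index.
    sub = bisect_right(ends, task_index)
    start = ends[sub - 1] if sub else 0
    return sub, task_index - start


# Three small drawcalls batched into one launch of 3 + 1 + 4 = 8 tasks.
ends = inclusive_task_ends([3, 1, 4])
total_tasks = ends[-1]
for t in range(total_tasks):
    print(t, find_sub_draw(ends, t))
```

Sub-draws with a task count of zero are skipped naturally, since no global task index falls inside their empty range.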

Instancing, and later multi-draw, allowed certain sets of draw calls to be combined together; indirect draws could be generated on the GPU itself. …

Instancing can only draw copies of a single mesh at a time; multi-draw is still inefficient for large numbers of small draws.

An Upgrade Path

The other really neat thing about mesh shaders is that they don’t require you to drastically rework how your game engine handles geometry to take advantage of them. …

Instanced draws are straightforward: multiply the meshlet count and put in a bit of shader logic to hook up instance parameters.

A more interesting case is multi-draw, where we want to draw a lot of meshes that aren’t all copies of the same thing. For this, we can employ task shaders – a secondary feature of the mesh shader pipeline. Task shaders add an extra layer of compute-style work groups, running before the mesh shader, and they control how many mesh shader work groups to launch. They can also write output variables to be consumed by the mesh shader. A very efficient multi-draw should be possible by launching task shaders with a thread per draw, which in turn launch the mesh shaders for all the individual draws.

If we need to draw a lot of very small meshes, such as quads for particles/imposters/text/point-based rendering, or boxes for occlusion tests / projected decals and whatnot, then we can pack a bunch of them into each mesh shader workgroup. The geometry can be generated entirely in-shader rather than relying on a pre-initialized index buffer from the CPU. (This was one of the original use cases that, it was hoped, could be done with geometry shaders – e.g. submitting point primitives, and having the GS expand them into quads.) There’s also a lot of flexibility to do stuff with variable topology, like particle beams/strips/ribbons, which would otherwise need to be generated either on the CPU or in a separate compute pre-pass.
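As a small illustration of "generated entirely in-shader", the index pattern for a batch of quads can be derived purely from a running counter, with no index buffer uploaded from the CPU. The Python below is only a model of that arithmetic. The 4-vertices-per-quad layout and the function names are assumptions made for this sketch, not something the quoted article prescribes.

```python
def quad_indices(quad_count):
    """Triangle-list indices for quad_count quads (4 vertices, 2 triangles each)."""
    pattern = (0, 1, 2, 2, 1, 3)  # two triangles sharing the 1-2 diagonal
    return [4 * q + c for q in range(quad_count) for c in pattern]


def quad_corner(vertex_id):
    """Derive (quad index, (x, y) corner offset in {0, 1}) from a flat vertex id."""
    quad, corner = divmod(vertex_id, 4)
    return quad, (corner & 1, corner >> 1)


print(quad_indices(2))
print([quad_corner(v) for v in range(8)])
```

In a real mesh shader the same formulas would be evaluated per thread, so one workgroup could emit many quads at once.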

Dark_Photon · August 17, 2019, 3:45am

For others that hit this MDI “small draws” inefficiency and find themselves researching, I thought I’d add some other references to it that I’ve found, along with some past solutions.

It’s been out there for quite a few years, and I just hadn’t tuned into it before.

2017-11:

Multidraw/ExecuteIndirect have their upsides and downsides. Most GPUs can pack together multiple instances (DrawInstanced) in a single vertex shader wave/warp, but can’t pack together multiple draws. Multidraw thus isn’t any more efficient to the GPU than submitting multiple real draws (with identical state), but the limitation is that you can’t change state between these draws. Nvidia introduced a new multidraw extension recently for Vulkan that allows changing state and bindings between draw calls: https://developer.nvidia.com/device-generated-commands-vulkan.

Of course there are many ways to implement multidraw-style rendering without actually using multidraw. This is one of them: LINK

Our Siggraph presentation presented one other way (strip clustering). If you need to select between these two, select the one described in the B3D thread. Strip clustering isn’t very good (especially on older AMD GPUs).

If you have multidraw, I would use a hybrid tech with compute-shader-written 16-bit indices (saves 50% of index read & write bandwidth). Pack a small number of clusters into each draw to avoid overloading the CP and to avoid partial wave problems.

2016-03:

https://frostbite-wp-prd.s3.amazonaws.com/wp-content/uploads/2016/03/29204330/GDC_2016_Compute.pdf

2016-02:

Disclaimer: I am mainly talking about how (old and new) AMD GPUs work. Nvidia and Intel GPUs work similarly. …

GPUs do not pack multiple draw calls to the same wave. The same is true for MultiDraw (or ExecuteIndirect). These are also counted as multiple draws. If the draws are small, there will be partially empty waves.

Instanced draw (DrawInstanced) will count as a single draw. The GPU will pack vertex shader invocations of multiple instances to the same wave (and the same is true for pixel shaders and other shader stages). …

I remember an OpenGL 4.3 AZDO (multidraw) benchmark comparing multiple AMD, Nvidia and Intel GPUs … IIRC, AMD performance started to plummet when draws were less than 300 triangles each. Nvidia’s performance plummeted in draws less than 100 triangles. … (The minimum number of triangles per draw call)

Also it’s worth noting that Nvidia’s vertex shader wave size is 32, while AMD is twice as big (64 vertices). GPUs do not pack vertices from separate draw calls (including multidraw) to the same wave. This means that draw call vertex shader cost is always rounded up to the next 32/64. Small draws thus always waste cycles.
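To put rough numbers on that rounding, here is a minimal sketch that counts how many vertex shader lanes do useful work when a wave may not span two draws. It is a back-of-the-envelope model only: it assumes one lane per vertex and ignores vertex reuse, caching and everything else real hardware does. The function names are invented for this example.

```python
import math


def waves_needed(vertex_count, wave_size):
    """Waves required for one draw when a wave never spans two draws."""
    return math.ceil(vertex_count / wave_size)


def lane_utilization(draw_vertex_counts, wave_size):
    """Fraction of launched vertex shader lanes that process a real vertex."""
    launched = sum(waves_needed(n, wave_size) for n in draw_vertex_counts) * wave_size
    useful = sum(draw_vertex_counts)
    return useful / launched if launched else 1.0


quads = [4] * 100  # 100 separate draws of a 4-vertex quad

print(lane_utilization(quads, 32))         # 0.125  (wave size 32)
print(lane_utilization(quads, 64))         # 0.0625 (wave size 64)
print(lane_utilization([sum(quads)], 64))  # same 400 vertices as one draw: 400 / 448
```

Under this model the 100 tiny draws keep only 1 lane in 8 (or 1 in 16) busy, while the same vertices submitted as one draw fill all but the last wave.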

However my idea (above) [see below] packs all draws (of a single multidraw) to a single huge standard indexed draw. This way multiple small objects will be packed to the same wave, and there are no empty vertex shader lanes. And no command processor bottleneck either.
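The core bookkeeping behind "one huge indexed draw" is rebasing each object's indices so they point into a shared vertex buffer. The sketch below shows that step on the CPU in Python purely for illustration. It is not the poster's implementation, and the `merge_draws` name and the (vertices, indices) tuple layout are assumptions of this example. Per-object parameters such as transforms would have to reach the vertex shader some other way, which this sketch leaves out.

```python
def merge_draws(meshes):
    """Concatenate (vertices, indices) pairs into one vertex list and one index list.

    Each mesh's indices are rebased by the number of vertices already emitted,
    so the result can be submitted as a single indexed draw.
    """
    merged_vertices = []
    merged_indices = []
    for vertices, indices in meshes:
        base = len(merged_vertices)
        merged_vertices.extend(vertices)
        merged_indices.extend(base + i for i in indices)
    return merged_vertices, merged_indices


triangle = ([(0, 0), (1, 0), (0, 1)], [0, 1, 2])
quad = ([(0, 0), (1, 0), (0, 1), (1, 1)], [0, 1, 2, 2, 1, 3])

vertices, indices = merge_draws([triangle, quad])
print(len(vertices), indices)
```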

2015-08:

Advances in Real-Time Rendering- SIGGRAPH 2015
http://advances.realtimerendering.com/s2015/aaltonenhaar_siggraph2015_combined_final_footer_220dpi.pdf
2015-08 - Mesh Cluster Rendering (GPU-Driven Rendering Pipelines)

Geometry clustering history and MDI

In my presenter notes there was a quick reference to Avalanche Studios merge-instancing technique (page 19->):

http://www.humus.name/Articles/Persson_GraphicsGemsForGames.pptx.

Merge-instancing was a big inspiration for us 3 years ago. Our fixed size (64 vertex strip) clustering is similar to it. However merge-instancing emulates the index buffering inside the vertex shader. This unfortunately means that merge-instancing needs to always execute 3 vertex shader invocations per triangle. Our strip based method needs to only execute one vertex shader invocation per triangle in the best case (64 vertex cluster is a perfect strip). However in reality, you need to insert degenerate triangles, causing some extra vertex shader work. In moderately high poly meshes we can achieve around 1.5 transformed vertices per triangle. This is 2x better than merge-instancing, but still lower than the best cache optimized index buffered methods.
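The ratios quoted above can be checked with a little arithmetic. A triangle strip of V vertices yields V - 2 triangles, and any degenerate triangles inserted to stitch the strip together reduce the number of useful ones. The helper below is a hypothetical illustration of that relationship, not code from the presentation.

```python
MERGE_INSTANCING_INVOCATIONS_PER_TRIANGLE = 3.0  # figure stated in the quoted post


def strip_invocations_per_triangle(strip_vertices, degenerate_triangles=0):
    """Vertex shader invocations per useful triangle for one triangle strip."""
    useful_triangles = strip_vertices - 2 - degenerate_triangles
    if useful_triangles <= 0:
        raise ValueError("strip produces no useful triangles")
    return strip_vertices / useful_triangles


# A perfect 64-vertex strip: 62 triangles, so just over one invocation each.
print(strip_invocations_per_triangle(64))

# The more degenerates are needed, the further the ratio drifts from 1.
print(strip_invocations_per_triangle(64, degenerate_triangles=20))

# 3 invocations per triangle versus the reported ~1.5 gives the quoted 2x.
print(MERGE_INSTANCING_INVOCATIONS_PER_TRIANGLE / 1.5)
```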

Multi-draw-indirect (and ExecuteIndirect in DX12) solve this issue, as each draw instance can have its own index start offset (meaning that each instance can have unique topology). However MDI stresses the GPU command processor much more than the vertex shader based custom technique. Also GPUs cannot pack multiple actual draws to a single vertex shader wave/warp. We have measured that MDI starts to lose efficiency (on current GPUs) when the cluster size becomes smaller than 256.

Conclusion: MDI gets you higher performance for high polygon objects, while the custom vertex shader clustering is better (on current GPUs) for small low poly objects (such as background).