GPU vertex dispatch for MultiDrawIndirect and/or Instanced draw calls

Dark_Photon · August 17, 2019, 3:45am

For others that hit this MDI “small draws” inefficiency and find themselves researching, I thought I’d add some other references to it that I’ve found, along with some past solutions.

It’s been out there for quite a few years, and I just hadn’t tuned into it before.

2017-11:

Multidraw/ExecuteIndirect have their upsides and downsides. Most GPUs can pack together multiple instances (DrawInstanced) in a single vertex shader wave/warp, but can’t pack together multiple draws. Multidraw thus isn’t any more efficient for the GPU than submitting multiple real draws (with identical state), and the limitation is that you can’t change state between these draws. Nvidia recently introduced a new multidraw extension for Vulkan that allows changing state and bindings between draw calls: https://developer.nvidia.com/device-generated-commands-vulkan.

Of course there are many ways to implement multidraw-style rendering without actually using multidraw. This is one of them: LINK

Our Siggraph presentation presented one other way (strip clustering). If you need to select between these two, select the one described in the B3D thread. Strip clustering isn’t very good (especially on older AMD GPUs).

If you have multidraw, I would use a hybrid technique with compute-shader-written 16-bit indices (saves 50% of index read & write bandwidth). Pack a small number of clusters into each draw to avoid overloading the CP (command processor) and to avoid partial wave problems.
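A rough sketch of the two ideas in that paragraph, in Python rather than shader code. The function names, the cluster list, and the choice of 4 clusters per draw are all made up for illustration; the quote does not say what "a small number" is. Note also that 16-bit indices only work if every index in a draw fits in 16 bits relative to that draw's base vertex.

```python
def group_clusters(cluster_ids, clusters_per_draw):
    """Split a visible-cluster list into fixed-size groups, one draw per group."""
    return [cluster_ids[i:i + clusters_per_draw]
            for i in range(0, len(cluster_ids), clusters_per_draw)]

def index_bytes(index_count, bytes_per_index):
    """Bytes moved when an index buffer is written once (or read once)."""
    return index_count * bytes_per_index

visible = list(range(10))           # 10 hypothetical visible clusters
draws = group_clusters(visible, 4)  # 4 per draw is an arbitrary pick
print(draws)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]

indices = 1_000_000                 # hypothetical index count
print(index_bytes(indices, 2) / index_bytes(indices, 4))  # 0.5
```

Fewer, larger draws mean fewer commands for the command processor and fewer partially filled waves at draw boundaries, at the cost of coarser culling granularity per draw.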

2016-03:

https://frostbite-wp-prd.s3.amazonaws.com/wp-content/uploads/2016/03/29204330/GDC_2016_Compute.pdf

2016-02:

Disclaimer: I am mainly talking about how (old and new) AMD GPUs work. Nvidia and Intel GPUs work similarly. …

GPUs do not pack multiple draw calls to the same wave. The same is true for MultiDraw (or ExecuteIndirect). These are also counted as multiple draws. If the draws are small, there will be partially empty waves.

Instanced draw (DrawInstanced) will count as a single draw. The GPU will pack vertex shader invocations of multiple instances to the same wave (and the same is true for pixel shaders and other shader stages). …

I remember an OpenGL 4.3 AZDO (multidraw) benchmark comparing multiple AMD, Nvidia and Intel GPUs … IIRC, AMD performance started to plummet when draws were less than 300 triangles each. Nvidia’s performance plummeted in draws of less than 100 triangles. … (The minimum number of triangles per draw call)

Also it’s worth noting that Nvidia’s vertex shader wave size is 32, while AMD’s is twice as big (64 vertices). GPUs do not pack vertices from separate draw calls (including multidraw) into the same wave. This means that draw call vertex shader cost is always rounded up to the next multiple of 32/64. Small draws thus always waste cycles.
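To put numbers on that rounding, here is a small Python sketch. It assumes a deliberately simplified model in which each draw's vertex shader invocations fill whole waves and nothing else matters (no vertex reuse, no other bottlenecks). The function names and the workload are mine, invented for illustration.

```python
import math

def waves_for_draw(vertex_count, wave_size):
    """Waves needed for one draw: vertex work is rounded up to whole waves."""
    return math.ceil(vertex_count / wave_size)

def lane_utilization(draw_vertex_counts, wave_size):
    """Fraction of launched vertex shader lanes doing useful work when
    every draw starts in a fresh wave (no packing across draws)."""
    launched = sum(waves_for_draw(n, wave_size)
                   for n in draw_vertex_counts) * wave_size
    useful = sum(draw_vertex_counts)
    return useful / launched

# 100 tiny draws of 24 vertices each (hypothetical workload)
draws = [24] * 100
print(lane_utilization(draws, 32))  # 0.75  (wave size 32)
print(lane_utilization(draws, 64))  # 0.375 (wave size 64)
```

Under this model the same tiny draws leave a quarter of the lanes idle with 32-wide waves and more than half idle with 64-wide waves.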

However my idea (above) [see below] packs all draws (of a single multidraw) into a single huge standard indexed draw. This way multiple small objects will be packed into the same wave, and there are no empty vertex shader lanes. And no command processor bottleneck either.
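A CPU-side illustration of what "one huge indexed draw" means for the index data, again in Python. This is only a sketch of the buffer layout, not the GPU implementation described in the quote: the helper names and the three quads are hypothetical, and how each vertex finds its object's transform after merging is not shown.

```python
import math

def merge_draws(meshes):
    """Concatenate (vertices, indices) meshes into one vertex buffer and one
    index buffer, rebasing each mesh's indices by its vertex offset."""
    merged_vertices, merged_indices = [], []
    for vertices, indices in meshes:
        base = len(merged_vertices)  # where this mesh's vertices start
        merged_vertices.extend(vertices)
        merged_indices.extend(base + i for i in indices)
    return merged_vertices, merged_indices

def waves(invocations, wave_size):
    """Whole waves needed for a number of vertex shader invocations."""
    return math.ceil(invocations / wave_size)

# Three hypothetical small meshes: one quad (4 vertices, 2 triangles) each.
quad = (["v0", "v1", "v2", "v3"], [0, 1, 2, 2, 1, 3])
meshes = [quad, quad, quad]
verts, idx = merge_draws(meshes)
print(idx)  # [0, 1, 2, 2, 1, 3, 4, 5, 6, 6, 5, 7, 8, 9, 10, 10, 9, 11]

# Simplified model: one vertex shader invocation per unique vertex.
separate = sum(waves(len(v), 64) for v, _ in meshes)  # 3 waves, mostly empty
merged = waves(len(verts), 64)                        # 1 wave
print(separate, merged)  # 3 1
```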

2015-08:

Advances in Real-Time Rendering - SIGGRAPH 2015
http://advances.realtimerendering.com/s2015/aaltonenhaar_siggraph2015_combined_final_footer_220dpi.pdf
2015-08 - Mesh Cluster Rendering (GPU-Driven Rendering Pipelines)

Geometry clustering history and MDI

In my presenter notes there was a quick reference to Avalanche Studios’ merge-instancing technique (page 19->):

http://www.humus.name/Articles/Persson_GraphicsGemsForGames.pptx

Merge-instancing was a big inspiration for us 3 years ago. Our fixed-size (64-vertex strip) clustering is similar to it. However, merge-instancing emulates the index buffering inside the vertex shader. This unfortunately means that merge-instancing always needs to execute 3 vertex shader invocations per triangle. Our strip-based method only needs to execute one vertex shader invocation per triangle in the best case (the 64-vertex cluster is a perfect strip). However, in reality you need to insert degenerate triangles, causing some extra vertex shader work. In moderately high-poly meshes we can achieve around 1.5 transformed vertices per triangle. This is 2x better than merge-instancing, but still worse than the best cache-optimized index-buffered methods.
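The ratios in that paragraph can be checked with a few lines of Python. The only facts used are that a triangle strip of n vertices yields n - 2 triangles and that degenerate triangles cost invocations without producing output. The function name and the count of 19 degenerate triangles are hypothetical, chosen only because they land close to the quoted 1.5.

```python
def invocations_per_triangle(strip_vertices, degenerate_triangles=0):
    """Vertex shader invocations per visible triangle for one strip cluster.
    A strip of n vertices has n - 2 triangles; degenerates produce nothing."""
    real_triangles = strip_vertices - 2 - degenerate_triangles
    return strip_vertices / real_triangles

# Merge-instancing emulates index fetching in the vertex shader,
# so every triangle costs 3 invocations.
merge_instancing = 3.0

best = invocations_per_triangle(64)         # perfect 64-vertex strip
typical = invocations_per_triangle(64, 19)  # 19 degenerates: hypothetical

print(round(best, 3))          # 1.032
print(round(typical, 2))       # 1.49
print(merge_instancing / 1.5)  # 2.0, the "2x better" in the quote
```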

Multi-draw-indirect (and ExecuteIndirect in DX12) solves this issue, as each draw instance can have its own index start offset (meaning that each instance can have unique topology). However, MDI stresses the GPU command processor much more than the vertex-shader-based custom technique. Also, GPUs cannot pack multiple actual draws into a single vertex shader wave/warp. We have measured that MDI starts to lose efficiency (on current GPUs) when the cluster size becomes smaller than 256.
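For reference, the "own index start offset" per draw is the `firstIndex` field of the indirect command. Below is a minimal Python sketch that packs a buffer using the layout of OpenGL's `DrawElementsIndirectCommand` (five 32-bit fields: `count`, `instanceCount`, `firstIndex`, `baseVertex`, `baseInstance`). The cluster list is hypothetical, little-endian byte order is assumed, and storing a per-draw id in `baseInstance` is just one option I picked for the example.

```python
import struct

# count, instanceCount, firstIndex (unsigned), baseVertex (signed),
# baseInstance (unsigned): 20 bytes per command, little-endian assumed.
COMMAND = struct.Struct("<IIIiI")

def build_indirect_buffer(clusters):
    """clusters: iterable of (index_count, first_index, base_vertex) tuples.
    Returns the bytes of a tightly packed indirect buffer, one command each."""
    buf = bytearray()
    for draw_id, (index_count, first_index, base_vertex) in enumerate(clusters):
        # instanceCount = 1; baseInstance carries a per-draw id (an assumption)
        buf += COMMAND.pack(index_count, 1, first_index, base_vertex, draw_id)
    return bytes(buf)

# Hypothetical clusters: (index_count, first_index, base_vertex)
clusters = [(192, 0, 0), (96, 192, 64), (384, 288, 128)]
buf = build_indirect_buffer(clusters)
print(len(buf))                      # 60 = 3 commands * 20 bytes
print(COMMAND.unpack_from(buf, 20))  # (96, 1, 192, 64, 1)
```

Each 20-byte command is a separate draw as far as the GPU is concerned, which is why the command processor load and the no-packing-across-draws limitation both scale with the number of commands.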

Conclusion: MDI gets you higher performance for high-polygon objects, while the custom vertex shader clustering is better (on current GPUs) for small low-poly objects (such as background geometry).