For others who hit this MDI “small draws” inefficiency and find themselves researching it, I thought I’d add some other references to it that I’ve found, along with some past solutions.
It’s been out there for quite a few years, and I just hadn’t tuned into it before.
2017-11:
Multidraw/ExecuteIndirect have their upsides and downsides. Most GPUs can pack together multiple instances (DrawInstanced) in a single vertex shader wave/warp, but can’t pack together multiple draws. Multidraw thus isn’t any more efficient for the GPU than submitting multiple real draws (with identical state), but the limitation is that you can’t change state between these draws. Nvidia introduced a new multidraw extension recently for Vulkan that allows changing state and bindings between draw calls: https://developer.nvidia.com/device-generated-commands-vulkan .
Of course there are many ways to implement multidraw-style rendering without actually using multidraw. This is one of them: LINK
Our Siggraph presentation presented one other way (strip clustering). If you need to select between these two, select the one described in the B3D thread. Strip clustering isn’t very good (especially on older AMD GPUs).
If you have multidraw, I would use a hybrid tech with compute-shader-written 16-bit indices (saves 50% of index read & write bandwidth). Pack a small number of clusters into each draw to avoid overloading the CP and to avoid partial wave problems.
2016-03:
https://frostbite-wp-prd.s3.amazonaws.com/wp-content/uploads/2016/03/29204330/GDC_2016_Compute.pdf
(Motivation – Death By 1000 Draws)
However, the GPU still chokes on tiny draws; it is quite common to see the 2nd half of the base pass barely utilizing the GPU. Typically there are lots of tiny details or distant objects, of which most are Hi-Z culled. The efficiency loss comes from the GPU still having to run mostly empty vertex wavefronts.
In this GPU capture, you can see on the left that we start out alright, but very quickly on the right we end up spinning on vertex shader wavefronts that don’t result in any pixels. …
2016-02:
Disclaimer: I am mainly talking about how (old and new) AMD GPUs work. Nvidia and Intel GPUs work similarly. …
GPUs do not pack multiple draw calls to the same wave. The same is true for MultiDraw (or ExecuteIndirect). These are also counted as multiple draws. If the draws are small, there will be partially empty waves.
Instanced draw (DrawInstanced) will count as a single draw. The GPU will pack vertex shader invocations of multiple instances to the same wave (and the same is true for pixel shaders and other shader stages). …
I remember an OpenGL 4.3 AZDO (multidraw) benchmark comparing multiple AMD, Nvidia and Intel GPUs … IIRC, AMD performance started to plummet when draws were less than 300 triangles each. Nvidia’s performance plummeted in draws of less than 100 triangles. … ( The minimum number of triangles per draw call )
Also it’s worth noting that Nvidia’s vertex shader wave size is 32, while AMD’s is twice as big (64 vertices). GPUs do not pack vertices from separate draw calls (including multidraw) to the same wave. This means that draw call vertex shader cost is always rounded up to the next multiple of 32/64. Small draws thus always waste cycles.
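To put rough numbers on that rounding cost, here is a small back-of-the-envelope model. It is only an illustration and not from the quoted posts: it assumes each draw's vertex shader work is rounded up to whole waves, and it ignores index reuse and every other hardware detail. The function names are made up for the example.

```python
import math

def wave_lanes(vertex_counts, wave_size):
    """Return (useful_lanes, allocated_lanes), rounding every draw up to whole waves."""
    useful = sum(vertex_counts)
    allocated = sum(math.ceil(n / wave_size) * wave_size for n in vertex_counts)
    return useful, allocated

def utilization(vertex_counts, wave_size):
    """Fraction of allocated vertex shader lanes that do useful work."""
    useful, allocated = wave_lanes(vertex_counts, wave_size)
    return useful / allocated if allocated else 1.0

if __name__ == "__main__":
    small_draws = [24] * 100              # 100 tiny draws, 24 vertices each
    print(utilization(small_draws, 64))   # 0.375 with 64-wide waves
    print(utilization(small_draws, 32))   # 0.75 with 32-wide waves
    # The same 2400 vertices submitted as one draw only round up once.
    print(round(utilization([sum(small_draws)], 64), 3))  # 0.987
```

Under this simplified model the tiny draws leave well over half of the lanes idle on a 64-wide wave, while a single draw with the same vertex total wastes almost nothing.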
However my idea (above) [see below] packs all draws (of a single multidraw) into a single huge standard indexed draw. This way multiple small objects will be packed into the same wave, and there are no empty vertex shader lanes. And no command processor bottleneck either.
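As a CPU-side sketch of what "packing all draws into one indexed draw" means: each small draw's indices are copied into one merged index buffer, with that draw's base vertex added so they still point at the right vertices in a shared vertex buffer. This is an illustration only. In the technique described above the merged index buffer would be written on the GPU by a compute shader, and the `Draw` record and `merge_draws` function here are hypothetical names for the example.

```python
from dataclasses import dataclass

@dataclass
class Draw:
    first_index: int   # offset of the draw's first index in the shared index buffer
    index_count: int   # number of indices the draw consumes
    base_vertex: int   # value added to every index of this draw

def merge_draws(index_buffer, draws):
    """Flatten many small indexed draws into one index list for a single draw."""
    merged = []
    for d in draws:
        for i in range(d.first_index, d.first_index + d.index_count):
            merged.append(index_buffer[i] + d.base_vertex)
    return merged

if __name__ == "__main__":
    # Two meshes stored back to back: one triangle, then a quad made of two triangles.
    indices = [0, 1, 2,  0, 1, 2, 2, 1, 3]
    draws = [Draw(0, 3, 0), Draw(3, 6, 3)]
    print(merge_draws(indices, draws))  # [0, 1, 2, 3, 4, 5, 5, 4, 6]
```

The result can be submitted as one ordinary indexed draw, so vertices from different objects are free to share a wave.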
2015-08:
Advances in Real-Time Rendering- SIGGRAPH 2015
http://advances.realtimerendering.com/s2015/aaltonenhaar_siggraph2015_combined_final_footer_220dpi.pdf
2015-08 - Mesh Cluster Rendering (GPU-Driven Rendering Pipelines)
Geometry clustering history and MDI
In my presenter notes there was a quick reference to Avalanche Studios’ merge-instancing technique (page 19 onward):
http://www.humus.name/Articles/Persson_GraphicsGemsForGames.pptx .
Merge-instancing was a big inspiration for us 3 years ago. Our fixed size (64 vertex strip) clustering is similar to it. However merge-instancing emulates the index buffering inside the vertex shader. This unfortunately means that merge-instancing always needs to execute 3 vertex shader invocations per triangle. Our strip based method needs to execute only one vertex shader invocation per triangle in the best case (64 vertex cluster is a perfect strip). However in reality, you need to insert degenerate triangles, causing some extra vertex shader work. In moderately high poly meshes we can achieve around 1.5 transformed vertices per triangle. This is 2x better than merge-instancing, but still lower than the best cache optimized index buffered methods.
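The arithmetic behind those figures can be checked quickly. The sketch below only relies on the fact that a triangle strip of N vertices yields N - 2 triangles. The function name and the degenerate-triangle count in the usage lines are illustrative assumptions, not measurements from the presentation.

```python
MERGE_INSTANCING_VERTS_PER_TRI = 3.0  # three vertex shader invocations per triangle

def strip_vertices_per_triangle(strip_vertices, degenerate_triangles=0):
    """Transformed vertices per visible triangle for one strip cluster."""
    visible = (strip_vertices - 2) - degenerate_triangles
    if visible <= 0:
        raise ValueError("strip has no visible triangles")
    return strip_vertices / visible

if __name__ == "__main__":
    # A perfect 64-vertex strip: 62 triangles, so close to one vertex per triangle.
    print(round(strip_vertices_per_triangle(64), 3))      # 1.032
    # With a hypothetical 20 degenerate triangles the ratio is near the quoted 1.5.
    print(round(strip_vertices_per_triangle(64, 20), 3))  # 1.524
    # Ratio between merge-instancing and the quoted 1.5 vertices per triangle.
    print(MERGE_INSTANCING_VERTS_PER_TRI / 1.5)           # 2.0
```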
Multi-draw-indirect (and ExecuteIndirect in DX12) solves this issue, as each draw instance can have its own index start offset (meaning that each instance can have unique topology). However MDI stresses the GPU command processor much more than the vertex shader based custom technique. Also GPUs cannot pack multiple actual draws to a single vertex shader wave/warp. We have measured that MDI starts to lose efficiency (on current GPUs) when the cluster size becomes smaller than 256.
Conclusion: MDI gets you higher performance for high polygon objects, while the custom vertex shader clustering is better (on current GPUs) for small low poly objects (such as background).