Mesh shader performance

Dear friends,

I would like to start a new topic on mesh (and task) shader performance where we could share our experiences and suggest proper ways of using the new pipeline.

Namely, some time ago I tried to improve the performance of my renderer by adopting a feature offered by the last two generations of NVIDIA GPUs – mesh shaders. Enthusiastically, I read the several available blogs and specifications, rolled up my sleeves and modified the engine. The first results were a little worse than I expected. The mesh shader rendered a single vertex and defined up to two triangles per thread. Since the warp size is 32 and the quad patch generated by a workgroup has 25 vertices (5x5 vertices, 32 triangles), thread utilization was “only” 78%. Compared to instance-based rendering with the same patch size, the mesh shader showed about 3.5% lower performance.
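For concreteness, a minimal sketch of that first attempt (assuming GL_NV_mesh_shader; patchVertex() is a hypothetical placeholder for my heavyweight vertex transform):

```glsl
#version 450
#extension GL_NV_mesh_shader : require

layout(local_size_x = 32) in;                                   // one warp
layout(triangles, max_vertices = 25, max_primitives = 32) out;

// Hypothetical helper standing in for the real (heavyweight) transform.
vec4 patchVertex(uint x, uint y)
{
    return vec4(float(x), 0.0, float(y), 1.0);                  // placeholder
}

void main()
{
    uint t = gl_LocalInvocationID.x;

    // 25 of the 32 threads emit one vertex each (5x5 patch): ~78% utilization.
    if (t < 25u)
        gl_MeshVerticesNV[t].gl_Position = patchVertex(t % 5u, t / 5u);

    // Threads 0..15 each emit the two triangles of one quad cell
    // (4x4 cells, 32 triangles total), i.e. up to two triangles per thread.
    if (t < 16u) {
        uint v00  = (t / 4u) * 5u + (t % 4u);   // lower-left vertex of the cell
        uint base = 6u * t;                     // 6 indices per cell
        gl_PrimitiveIndicesNV[base + 0u] = v00;
        gl_PrimitiveIndicesNV[base + 1u] = v00 + 1u;
        gl_PrimitiveIndicesNV[base + 2u] = v00 + 5u;
        gl_PrimitiveIndicesNV[base + 3u] = v00 + 1u;
        gl_PrimitiveIndicesNV[base + 4u] = v00 + 6u;
        gl_PrimitiveIndicesNV[base + 5u] = v00 + 5u;
    }

    if (t == 0u)
        gl_PrimitiveCountNV = 32u;
}
```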

Guided by the thought that increasing thread utilization, along with better vertex post-transform cache hits (because of a bigger meshlet), would help, I “packed” several vertices into a single thread (i.e. mesh shader invocation). Only the 2-vertex-per-thread implementation reached approximately the same performance as instanced rendering (about 0.2% better). All other implementations (from 1 up to 8 vertices per thread) performed worse. The largest meshlet (256 vertices and 450 triangles) showed almost 18% lower performance than the instanced renderer with the same instance size.
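The packing itself is straightforward; a sketch of the vertex loop for the 2-vertex-per-thread case (WIDTH and VERTEX_COUNT are per-configuration constants of mine, patchVertex() as above):

```glsl
// Each of the 32 invocations processes V consecutive vertices.
const uint V = 2u;                        // vertices per thread (1..8 tested)
for (uint i = 0u; i < V; ++i) {
    uint v = gl_LocalInvocationID.x * V + i;
    if (v < VERTEX_COUNT)                 // e.g. 64 for an 8x8 block
        gl_MeshVerticesNV[v].gl_Position = patchVertex(v % WIDTH, v / WIDTH);
}
```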

I would like to hear about your experiences with mesh shaders. Does anyone know how to efficiently render a grid with mesh shaders? It is expected that a well-optimized classical vertex-shader-based renderer is hard to beat, but the new pipeline should offer at least some improvement. What am I doing wrong?

Mesh shaders are a tool that deals with fairly specific problems in the standard pipeline. Not every workload is best suited to them. As such, you have to think about whether what you’re trying to render fits into their paradigm effectively.

You say you’re trying to “render a grid”. That doesn’t sound like the kind of scenario where mesh shaders would be helpful.

Alfonse, thank you for the response. You are right: for efficient work, one should have a proper tool. However, mesh shaders are presented as a new (alternative, fast, modern, etc.) way to do almost anything. The task/mesh shader geometry pipeline comes with a lot of claimed benefits, like:

  • Higher scalability and flexibility – less impact from the fixed-function parts of the pipeline (no primitive fetch, no predefined patterns, and use of thousands of general-purpose cores instead of fewer specialized ones).
  • Efficient (parallel) cluster culling and level-of-detail control in task shaders – the greatest improvement actually comes from reducing the number of triangles being rendered.
  • A single-pass solution – the previously mentioned benefits could also be achieved with compute shaders, but at the price of two passes, the need to serialize output (i.e. index buffers for the primitives to render) and the use of indirect drawing.
  • Lower memory utilization – attributes can be packed/unpacked/interpolated in shaders (with the ability to fetch uninterpolated data in the fragment shader via the NV_fragment_shader_barycentric extension).
  • Avoiding the index buffer bottleneck – both the bandwidth cost (large index values) and the sequential rescanning for primitive fetch.
  • A distributed dispatcher – a task shader can spawn thousands of mesh shader workgroups, which should be more efficient than indirect drawing or instancing (at least that is what can be read in the existing “literature”; see the sketch after this list).
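To illustrate that last point, a minimal task shader that fans out into many mesh workgroups might look like this (a sketch, assuming GL_NV_mesh_shader; MESHLETS_PER_TASK is an arbitrary constant chosen for illustration):

```glsl
#version 450
#extension GL_NV_mesh_shader : require

layout(local_size_x = 1) in;

// Payload visible to every mesh workgroup spawned by this task workgroup.
taskNV out Task {
    uint firstMeshlet;
} OUT;

// Arbitrary fan-out factor, purely for illustration.
const uint MESHLETS_PER_TASK = 64u;

void main()
{
    OUT.firstMeshlet = gl_WorkGroupID.x * MESHLETS_PER_TASK;
    // One task workgroup fans out into many mesh workgroups; a real task
    // shader would cull meshlets here and emit only the visible ones.
    gl_TaskCountNV = MESHLETS_PER_TASK;
}
```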

Everything mentioned above is theoretically true and looks fine and shiny. But in reality, it is somewhat different – mostly because of the high level of responsibility transferred to programmers, who sometimes have no clue what’s going on under the hood. On the other hand, it is very challenging to compete with highly optimized drivers developed by people who know exactly what happens in their hardware and have spent years squeezing out performance. Of course, they have to optimize for the general case, and some clever implementation may squeeze a bit more performance out of a particular one. In what follows, I will try to shed some light on these “open questions” and reveal some implementation specifics.

Let’s start with the vertex post-transform cache…

A temporary memory block used for output and shared variables is limited to 16 KB. Its actual size depends on the maximum values defined in the layout qualifier (max_vertices, max_primitives), the type of primitives being generated, additional per-vertex and per-primitive output variables, as well as shared variables. That is quite clear and acceptable. It seems that max_vertices is limited to 256 only so that the count fits in a single byte (uint8). The triangle count for a 256-vertex quad mesh is 450, so the next suitable value above it for the primitive count is 512. (And, yes, there is the story of 126-primitive blocks for indices, but that is currently out of scope and would limit the primitive count to 504.) Apart from this, I cannot imagine the reason for these limits. The real reason could be the size of the vertex post-transform cache, with the meshlet size limited to fit in that cache. On the other hand, presentations and blogs mention proper vertex packing for better cache utilization, which implies that the vertex post-transform cache (and the primitive assembly stage) is decoupled from the mesh shaders.
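To make the accounting concrete, the largest configuration I tested declares something like this (position data only; the exact allocation granularity is implementation-defined):

```glsl
layout(local_size_x = 32) in;
layout(triangles, max_vertices = 256, max_primitives = 450) out;

// Rough output-memory budget implied by this declaration:
//   positions:  256 vertices  * 16 B (vec4)        = 4096 B
//   indices:    450 triangles * 3 * ~1 B per index = 1350 B
// Every additional per-vertex/per-primitive output and every shared
// variable counts against the same ~16 KB block.
```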

I needed the whole previous paragraph just to ask one question: what happens when a cache miss occurs? The answer may reveal what causes the performance loss. Vertex shaders are also batched into warp-sized groups (which a meshlet should correspond to), but perhaps they can be reorganized in the case of a cache miss, with only the needed vertices re-batched. A meshlet is a much heavier vehicle to restart.

The other question is related to the maximum number of vertices per meshlet: why is it not confined to the cache size, if the mesh shader’s temporary memory is not used as a cache?

One of the benefits, according to the existing blogs and presentations, is the distributed dispatcher – the task shader. Both instancing and indirect drawing serialize primitive dispatch. Each task shader may spawn multiple mesh shader workgroups, removing one possible bottleneck. That leads to the third question: does glDrawMeshTasksNV() suffer from the same “serialized dispatcher” syndrome if there is no task shader, since dispatching is done through a single point in the API?
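For reference, the entry point in question takes only a first index and a workgroup count (meshletCount below is a hypothetical variable of mine):

```c
/* GL_NV_mesh_shader entry point:
 *   void glDrawMeshTasksNV(GLuint first, GLuint count);
 * With a task shader bound, `count` task workgroups are launched; without
 * one, `count` mesh workgroups are launched directly. */
glDrawMeshTasksNV(0, meshletCount);
```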

The answer to the previous question may lead to the answer to the following one: does “tree expanding” (spawning more than a single mesh workgroup per task workgroup) boost performance? It probably depends on many factors, and task shaders come at a certain price. This question assumes that their only purpose is to spawn mesh shaders.

The last question is not related to mesh shaders directly, but serves to estimate the impact of SSBO access on mesh shader performance: is there any extension that enables glBindBufferRange() usage with offset granularity finer than 16 bytes? The std430 layout allows direct access to all data, but the 16-byte offset alignment prevents me from emulating the full functionality of the mesh shader (in my application) with instancing and an SSBO (instead of attributes and a divisor). Using a UBO might improve performance, since an SSBO is a more general concept, but the limited size and the inability to support arrays of arbitrary length make UBOs inappropriate.
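For completeness, the limit I am referring to is the implementation-defined query below; 16 is simply the value my driver reports:

```c
/* Minimum offset alignment for glBindBufferRange() on an SSBO. */
GLint ssboAlign = 0;
glGetIntegerv(GL_SHADER_STORAGE_BUFFER_OFFSET_ALIGNMENT, &ssboAlign);

/* `offset` must be a multiple of ssboAlign (16 on my GPU): */
glBindBufferRange(GL_SHADER_STORAGE_BUFFER, 0, buf, offset, size);
```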

These are certainly not all the questions that bother me these days, but the current post is already long enough. To improve the odds of getting an answer (and to allow skipping all the previous text), I’ll repeat the questions at the end. :)

  • What happens with mesh shaders when a vertex post-transform cache miss occurs?
  • Why is the max_vertices value not confined to the cache size, if the mesh shader’s temporary memory is not used as a cache?
  • Does glDrawMeshTasksNV() suffer from the “serialized dispatcher” syndrome if there is no task shader, since dispatching is done through a single point in the API?
  • Does “tree expanding” (spawning more than a single mesh workgroup per task workgroup) boost performance?
  • Is there any extension that enables glBindBufferRange() usage with offset granularity finer than 16 bytes? (Not a mesh-shader-related question.)

Just because something is new does not make it fast. Your list of advantages reads like advertisements, not answers to the question of whether you should use a thing for a particular purpose.

Consider a drill. With the right head, a drill is perfectly capable of screwing in screws, as well as doing other things. But… we still have and use manual screwdrivers. There’s a reason for that.

Yes, the mesh shading pipeline is good at doing fine-grained LOD. Yes, index buffers can be a bottleneck in some applications. Yes, task shaders can be quite good at generating a large number of disparate rendering requests, or culling them outright.

But you’re rendering a grid. You’re not doing fine-grained LOD-ing. A regular grid doesn’t have a bottleneck with index buffers; it gets great vertex reuse, and you can limit yourself to 16-bit indices very easily (using multi-draw indirect). And the act of rendering a regular grid is exceedingly straightforward.

Let’s start with the first question. You’re using a mesh shader; the post-T&L cache does not apply to you. As such, the question “what happens when a cache miss occurs?” is meaningless: it requires a circumstance predicated on the existence of a thing that does not exist.

You can think of a mesh shader’s output variables as a cache, where the indices are fetching particular elements from the cache of vertex data. And maybe that’s exactly how a particular post-T&L cache implementation works.

But since it’s not happening automatically and requires explicit user intervention, I don’t think it’s reasonable to call it a “cache”. It’s just indexing an array.

As for why max_vertices is capped at 256: because the index list has limitations all its own. The vertex arrays you’re writing to have to be indexed, and whatever index list lives behind gl_PrimitiveIndicesNV may not have 32-bit precision.

Performance is rarely as simple a matter as “do this, that makes things faster”. That being said, I cannot think of a reason why you would ever have a fixed 1:1 ratio of tasks to meshes. That would mean each task is doing very little.

Remember what the mesh shader paradigm is meant to accomplish. Each task shader workgroup generates a series of meshes. That only makes sense if there is parallelism to exploit in the generation of those meshes, and that only makes sense if a single task workgroup routinely generates multiple meshes.

No. Implementations are what expose the limitation, not OpenGL. If the SSBO offset alignment is set to 16, that means the hardware cannot accommodate a lower alignment.

Thank you very much for the answers and comments, Alfonse! They are useful and inspiring, as they always have been.

What confused me was the mention (in the “Introduction to Turing Mesh Shaders” blog by Christoph Kubisch) of applying “a vertex cache optimizer on the index-buffer prior to the generation of the meshlet data”. It was said in the context of the original index buffer, before splitting into meshlets, and that makes perfect sense. While reading the article, I was thinking about my application, where meshlets are created procedurally, and hence the misunderstanding arose.

Yes, they have 8-bit precision. I still think there are no special hardware restrictions behind the 256-vertex-per-meshlet limit besides confining index values to single bytes. But that is perfectly valid, since raising the limit would double the index size.

I’m actually rendering a terrain and, frankly, extremely efficiently. Blocks (which are grids) are small enough that the index buffer size is completely irrelevant. A few days ago, I carried out an experiment with 8-bit vs. 32-bit index buffers. The rendering speed is exactly the same (up to 3 or 4 decimal places).

[Chart: rendering time for 8-bit vs. 32-bit index buffers across block sizes]

In the previous chart, the X-axis represents rendering time in ms, while the Y-axis gives the block size in vertices (5x5, 8x8, etc.). The range of block sizes corresponds to meshlets that can be generated by a single warp.
What bothers me, and the primary reason I started this topic, is the peculiar behavior of the mesh shader. Namely, one would expect the rendering rate to increase with the meshlet size (vertex reuse increases) and with the number of threads executed per warp (core utilization increases). However, that is not what happens.

[Chart: rendering time (ms) across block sizes for InstAttr, Mesh, and Task+Mesh]

The previous chart shows the rendering speed of three different methods (Y-axis is rendering time in ms, X-axis is the block size). InstAttr is classical instancing with attributes (later we will see the impact of the SSBO on performance). The Mesh method uses only mesh shaders invoked from the API (without task shaders), while the Task+Mesh method uses a single task shader workgroup invocation to generate thousands of mesh shader workgroups. Obviously, the classical instancing method proves to be the most efficient one (lower is better).
Maybe the next chart better shows what is happening, by comparing rendering efficiency (the number of triangles rendered per millisecond). Packing two vertices into a single mesh shader invocation (64 per workgroup) makes the mesh shader as efficient as instancing (in the case of 32 threads per workgroup/warp). The 11x11 block (31 threads per workgroup) also performs well. However, the 16x16 block (also 32 threads per workgroup, but 8 vertices per thread and 450 triangles per warp) has significantly lower performance than its instancing counterpart.

[Chart: rendering efficiency (triangles/ms) across block sizes for InstAttr, Mesh, and Task+Mesh]

According to the previous chart, invocation from a task shader performs even worse, except for the largest blocks (with fewer mesh shader workgroups). This should be investigated thoroughly. However, the question from the beginning still stands: how can the performance of mesh shaders be improved, given that they supposedly have a higher performance potential than instancing?
SSBO access certainly has some impact on performance. Let’s see how the instancing-based algorithm changes its performance when per-instance data is read from the SSBO instead of from attributes. The following chart illustrates this case.

[Chart: rendering time for instancing with attributes vs. instancing with SSBO]

Obviously, there is a constant shift in performance. The primitive fetching mechanism still works efficiently for the regular geometry pipeline; on the other hand, an SSBO is too large to be cached as efficiently as the attributes of a small instance. With that in mind, the 8x8-block mesh-shader-based renderer is actually more efficient in vertex processing than instancing, although the overall performance is about the same.
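For clarity, the only difference between the two instancing variants is how per-instance data reaches the vertex shader (a sketch; the names are mine):

```glsl
#version 450

// Variant A: per-instance data as an instanced attribute
// (application side: glVertexAttribDivisor(3, 1)).
layout(location = 3) in vec4 perInstanceData;

// Variant B: the same data read manually from an SSBO (std430),
// indexed with the built-in instance ID.
layout(std430, binding = 0) readonly buffer InstanceBuf {
    vec4 instanceData[];
};

void main()
{
    vec4 d = instanceData[gl_InstanceID];   // Variant B fetch
    // ... heavyweight vertex transform using d (or perInstanceData) ...
    gl_Position = d;                        // placeholder
}
```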

More details about the experiments:

  • Application: large terrain rendering
  • GPU: GeForce RTX 2060 SUPER
  • Triangle count: 4.2 – 4.7 million
  • Heavyweight vertex transformations (rendering time is proportional to the number of vertices sent down the pipeline)
  • GPU utilization: 7-9% (there is no CPU bottleneck; the algorithm is extremely efficient, and the GPU’s power management mode must be set to “Prefer maximum performance” to prevent significant clock reduction)