Mesh shader performance

Dear friends,

I would like to start a new topic on mesh (and task) shader performance where we could share our experiences and suggest proper ways of using the new pipeline.

Namely, some time ago I tried to improve the performance of my renderer by implementing a new feature offered by the last two generations of NVIDIA GPUs – mesh shaders. I enthusiastically read the several available blogs and specifications, rolled up my sleeves, and modified the engine. The first results were a bit worse than I expected. The mesh shader rendered a single vertex and defined up to two triangles per thread. Since the warp size is 32 and the quadratic patch generated by the workgroup has 25 vertices (5x5 vertices and 32 triangles), the thread utilization was “only” 78%. Compared to instance-based rendering with the same patch size, the mesh shader showed about 3.5% lower performance.
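For reference, here is a minimal sketch of that first implementation as I understand it (NV_mesh_shader GLSL; the Heights buffer and the flat-grid vertex math are illustrative stand-ins, not my actual engine code):

    #version 450
    #extension GL_NV_mesh_shader : require

    // One 5x5-vertex patch per workgroup: 25 vertices, 16 quads, 32 triangles.
    layout(local_size_x = 32) in;
    layout(triangles) out;
    layout(max_vertices = 25, max_primitives = 32) out;

    // Hypothetical height data; stands in for whatever the engine fetches.
    layout(binding = 0, std430) readonly buffer Heights { float height[]; };

    void main()
    {
        uint t = gl_LocalInvocationID.x;

        // 25 of the 32 threads emit one vertex each (~78% utilization).
        if (t < 25u) {
            uint x = t % 5u, y = t / 5u;
            gl_MeshVerticesNV[t].gl_Position =
                vec4(float(x), height[gl_WorkGroupID.x * 25u + t], float(y), 1.0);
        }

        // 16 of the 32 threads emit one quad (two triangles) each.
        if (t < 16u) {
            uint v0   = (t / 4u) * 5u + (t % 4u);  // top-left vertex of the quad
            uint base = t * 6u;                    // 6 indices per quad
            gl_PrimitiveIndicesNV[base + 0u] = v0;
            gl_PrimitiveIndicesNV[base + 1u] = v0 + 5u;
            gl_PrimitiveIndicesNV[base + 2u] = v0 + 1u;
            gl_PrimitiveIndicesNV[base + 3u] = v0 + 1u;
            gl_PrimitiveIndicesNV[base + 4u] = v0 + 5u;
            gl_PrimitiveIndicesNV[base + 5u] = v0 + 6u;
        }

        if (t == 0u)
            gl_PrimitiveCountNV = 32u;
    }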

Guided by the thought that increasing thread utilization, along with a higher vertex post-transform cache hit rate (because of the bigger meshlet), would help, I “packed” several vertices into a single thread (i.e., mesh shader invocation). Only the 2-vertex-per-thread implementation had approximately the same performance as instanced rendering (about 0.2% better). All other implementations (from 1 up to 8 vertices per thread) performed worse. The largest meshlet (256 vertices and 450 triangles) showed almost 18% lower performance than the instanced renderer with the same instance size.

I would like to hear about your experiences with mesh shaders. Does anyone know how to efficiently render a grid with mesh shaders? It is expected that a well-optimized classical vertex-shader-based renderer is hard to beat, but the new pipeline should show at least some improvement. What am I doing wrong?

Mesh shaders are a tool that deals with fairly specific problems in the standard pipeline. Not every workload is best suited to them. As such, you have to think about whether what you’re trying to render fits into their paradigm effectively.

You say you’re trying to “render a grid”. That doesn’t sound like the kind of scenario where mesh shaders would be helpful.

Alfonse, thank you for the response. You are right: for efficient work, one should have the proper tool. However, mesh shaders are presented as a new (alternative, fast, modern, etc.) way to do almost anything. The task/mesh shader geometric pipeline comes with lots of benefits, like:

  • Higher scalability and flexibility – less impact from the fixed-function parts of the pipeline (no primitive fetch, no predefined patterns, and the use of thousands of general-purpose cores instead of fewer specialized ones).
  • Efficient (parallel) cluster culling and level-of-detail control in task shaders – the greatest improvement actually comes from reducing the number of triangles being rendered (see the sketch after this list).
  • A single-pass solution – the previously mentioned benefits could also be achieved with compute shaders, but at the price of two passes, the need to serialize output (i.e., index buffers for the primitives to render), and the use of indirect drawing.
  • Lower memory utilization – attributes can be packed/unpacked/interpolated in shaders (with the ability to fetch uninterpolated data in the fragment shader through the NV_fragment_shader_barycentric extension).
  • Avoiding the index buffer bottleneck – both the bandwidth cost of large index values and the sequential scanning for primitive fetch.
  • A distributed dispatcher – a task shader can spawn thousands of mesh shader workgroups, which should be more efficient than indirect drawing or instancing (at least that is what the existing “literature” says).
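For example, here is a hedged sketch of the second and last bullets combined – per-cluster frustum culling in a task shader that then spawns only the surviving mesh workgroups (ClusterBounds, u_FrustumPlanes, and u_ClusterCount are hypothetical names, and the sketch assumes NVIDIA’s 32-wide subgroups):

    #version 450
    #extension GL_NV_mesh_shader : require
    #extension GL_KHR_shader_subgroup_ballot : require

    // One thread tests one cluster's bounding sphere; survivors are
    // compacted, and one mesh workgroup is launched per visible cluster.
    layout(local_size_x = 32) in;

    struct Sphere { vec3 center; float radius; };
    layout(binding = 1, std430) readonly buffer ClusterBounds { Sphere bounds[]; };
    uniform vec4 u_FrustumPlanes[6];
    uniform uint u_ClusterCount;

    taskNV out Task { uint clusterIDs[32]; } OUT;

    bool visible(Sphere s)
    {
        for (int i = 0; i < 6; ++i)
            if (dot(u_FrustumPlanes[i].xyz, s.center) + u_FrustumPlanes[i].w < -s.radius)
                return false;
        return true;
    }

    void main()
    {
        uint cluster = gl_GlobalInvocationID.x;
        bool keep = cluster < u_ClusterCount && visible(bounds[cluster]);

        // Compact the IDs of surviving clusters into the task output.
        uvec4 vote = subgroupBallot(keep);
        if (keep)
            OUT.clusterIDs[subgroupBallotExclusiveBitCount(vote)] = cluster;
        if (gl_LocalInvocationID.x == 0u)
            gl_TaskCountNV = subgroupBallotBitCount(vote); // meshes to spawn
    }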

Everything mentioned above is theoretically true and looks fine and shiny. But in reality, it is somewhat different, mostly because of the high level of responsibility transferred to programmers, who sometimes have no clue what’s going on under the hood. On the other hand, it is very challenging to compete with highly optimized drivers developed by people who know exactly what happens in their hardware and have spent years squeezing out performance. Of course, they have to do it for the general case, and some clever implementation may squeeze out a bit more performance for a particular one. So, in order to shed some light on the open questions and reveal some implementation specifics, let me go through them one by one.

Let’s start with the vertex post-transform cache…

A temporary memory block used for output and shared variables is limited to 16 KB. The required size actually depends on the maximum values defined in the layout qualifiers (max_vertices, max_primitives), the type of primitive being generated, additional per-vertex and per-primitive output variables, as well as shared variables. That much is clear and acceptable. It seems that max_vertices is limited to 256 only so that vertex indices can be stored in a single 8-bit value. The triangle count for a 256-vertex quadratic mesh is 450, so the next higher suitable value for the primitive count is 512. (And, yes, there is the story of 126-primitive blocks for indices, but that is currently out of scope and would limit the primitive count to 504.) Apart from this, I cannot imagine a reason for these limits. The real reason could be the size of the vertex post-transform cache, with the meshlet size limited to fit in the cache. On the other hand, presentations and blogs mention proper vertex packing for better cache utilization, which implies that the vertex post-transform cache (and the primitive assembly stage) is decoupled from the mesh shaders.
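For what it’s worth, those limits surface directly in the shader’s layout qualifiers; a minimal sketch (the per-implementation maxima can be queried via GL_MAX_MESH_OUTPUT_VERTICES_NV and GL_MAX_MESH_OUTPUT_PRIMITIVES_NV):

    // The limits in question, as declared in an NV mesh shader; the driver
    // sizes the ~16 KB per-workgroup output block from these values.
    layout(local_size_x = 32) in;
    layout(triangles) out;
    layout(max_vertices = 256, max_primitives = 512) out;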

I needed all of the above just to ask a question: what happens when a cache miss occurs? The answer may reveal what causes the performance loss. Vertex shader invocations are also batched into warp-sized groups (which a meshlet should correspond to), but perhaps they can be reorganized in the case of a cache miss, rebatching just the vertices that are needed. A meshlet is a much heavier vehicle to restart.

The other question is related to the maximum number of vertices per meshlet: why is it not confined to the cache size if the mesh shader’s temporary memory is not used as a cache?

One of the benefits, according to the existing blogs and presentations, is the distributed dispatcher – the task shader. Both instancing and indirect drawing serialize primitive dispatch. Each task shader may spawn multiple mesh shader workgroups, removing one possible bottleneck. That leads to the third question: does glDrawMeshTasksNV() suffer from the same syndrome (a “serialized dispatcher”) if there is no task shader, since dispatching is done through a single point in the API?
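To make the question concrete, here is a sketch of the two dispatch paths being compared (the program names and counts are illustrative, not from my code):

    /* Mesh-only pipeline: all workgroups come from the single API call. */
    glUseProgram(meshOnlyProgram);
    glDrawMeshTasksNV(0, 40000);      /* 40000 mesh workgroups */

    /* Task+mesh pipeline: the API dispatches far fewer task workgroups,
       and each one may set gl_TaskCountNV to amplify on the GPU. */
    glUseProgram(taskMeshProgram);
    glDrawMeshTasksNV(0, 1250);       /* 1250 tasks x up to 32 meshes each */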

The answer to the previous question may lead to the answer to the following one: does “tree expanding” (spawning more than a single mesh workgroup per task workgroup) boost performance? It probably depends on many factors. Task shaders come at a certain price. This question assumes that their purpose is only to spawn mesh shaders.

The last question is not related to mesh shaders directly, but serves to estimate the impact of SSBO access on mesh shader performance: is there any extension that enables glBindBufferRange() usage with an offset granularity finer than 16 bytes? The std430 layout allows direct access to all data, but the 16-byte offset alignment prevents me from emulating the full functionality of the mesh shader (in my application) with instancing and an SSBO (instead of attributes and a divisor). Using a UBO might improve performance, since an SSBO is a more general concept, but the limited size and the inability to support arrays of arbitrary length make UBOs inappropriate.
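In the meantime, the only workaround I know of is to sidestep the alignment entirely: bind the whole buffer once and pass the fine offset as an element index instead (a sketch; vertexSSBO, baseVertexLoc, and firstVertex are hypothetical names):

    GLint align = 0;
    glGetIntegerv(GL_SHADER_STORAGE_BUFFER_OFFSET_ALIGNMENT, &align);
    /* 16 on my hardware; the spec allows implementations up to 256 */

    glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, vertexSSBO);  /* whole buffer */
    glUniform1ui(baseVertexLoc, firstVertex);  /* the finer "offset", in elements */
    /* GLSL side (std430): pos = positions[u_BaseVertex + gl_InstanceID]; */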

These are certainly not all the questions that bother me these days, but the current post is already long enough. In order to give some chance of getting an answer (and to allow skipping all the previous text), I’ll repeat the questions at the end. :slight_smile:

  • What happens with mesh shaders when a vertex post-transform cache miss occurs?
  • Why is the max_vertices value not confined to the cache size if the mesh shader’s temporary memory is not used as a cache?
  • Does glDrawMeshTasksNV() suffer from the “serialized dispatcher” syndrome if there is no task shader, since dispatching is done through a single point in the API?
  • Does “tree expanding” (spawning more than a single mesh workgroup per task workgroup) boost performance?
  • Is there any extension that enables glBindBufferRange() usage with an offset granularity finer than 16 bytes? (Not a mesh-shader-related question.)

Just because something is new does not make it fast. Your list of advantages reads like advertisements, not answers to the question of whether you should use a thing for a particular purpose.

Consider a drill. With the right head, a drill is perfectly capable of screwing in screws, as well as doing other things. But… we still have and use manual screwdrivers. There’s a reason for that.

Yes, the mesh shading pipeline is good at doing fine-grained LOD. Yes, index buffers can be a bottleneck in some applications. Yes, task shaders can be quite good at generating large numbers of disparate rendering requests, or culling them outright.

But you’re rendering a grid. You’re not doing fine-grained LOD-ing. A regular grid doesn’t have an index buffer bottleneck; it gets great vertex reuse, and you can limit yourself to 16-bit indices very easily (using multi-draw indirect, as sketched below). And the act of rendering a regular grid is exceedingly straightforward.
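To be concrete, something like this (the chunking scheme is just an illustration): split the grid into chunks of at most 2^16 vertices and submit them all in one call, using baseVertex so each chunk keeps 16-bit indices.

    /* Indirect command layout as defined by OpenGL 4.3+ */
    typedef struct {
        GLuint count;          /* indices per chunk */
        GLuint instanceCount;  /* 1 */
        GLuint firstIndex;
        GLint  baseVertex;     /* chunk origin; keeps indices within 16 bits */
        GLuint baseInstance;
    } DrawElementsIndirectCommand;

    /* One command per grid chunk, pre-filled in 'indirectBuf'. */
    glBindBuffer(GL_DRAW_INDIRECT_BUFFER, indirectBuf);
    glMultiDrawElementsIndirect(GL_TRIANGLES, GL_UNSIGNED_SHORT,
                                (const void*)0, chunkCount, 0);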

Let’s end there. You’re using a mesh shader; the post-T&L cache does not apply to you. As such, your question – “What happens with mesh shaders when a vertex post-transform cache miss occurs?” – is meaningless. It requires a circumstance predicated on the existence of a thing that does not exist.

You can think of a mesh shader’s output variables as a cache, where the indices are fetching particular elements from the cache of vertex data. And maybe that’s exactly how a particular post-T&L cache implementation works.

But since it’s not happening automatically and requires explicit user intervention, I don’t think it’s reasonable to call it a “cache”. It’s just indexing an array.

As for why max_vertices is limited the way it is: because the index list has limitations all its own. The vertex arrays you’re writing to have to be indexed. And whatever index list lives behind gl_PrimitiveIndicesNV may not have 32-bit precision.

As for whether “tree expanding” boosts performance: performance is rarely as simple a matter as “do this; that makes things faster”. That being said, I cannot think of a reason why you would ever have a fixed 1:1 ratio of tasks to meshes. That would mean each task is doing very little.

Remember what the mesh shader paradigm is meant to accomplish. Each task shader workgroup generates a series of meshes. That only makes sense if there is parallelism to exploit in the generation of those meshes, and that only makes sense if a single task workgroup routinely generates multiple meshes.

As for glBindBufferRange(): no. Implementations are what expose the limitation, not OpenGL. If the SSBO offset alignment is set to 16, that means the hardware cannot accommodate a finer alignment.