Alfonse, thank you for the response. You are right: for efficient work, one should have a proper tool. However, mesh shaders are a new (alternative, fast, modern, etc.) way of doing things. The task/mesh shader geometry pipeline comes with a lot of benefits, such as:
- Higher scalability and flexibility – less impact from the fixed-function parts of the pipeline (no primitive fetch, no predefined patterns, and the use of thousands of general-purpose cores instead of fewer specialized ones).
- Efficient (parallel) cluster culling and level-of-detail control in task shaders – the greatest improvement is actually the reduction in the number of triangles being rendered.
- A single-pass solution – the previously mentioned benefits could also be achieved with compute shaders, but at the price of two passes, the need to serialize output (i.e. index buffers for the primitives to render), and the use of indirect drawing.
- Lower memory utilization – attributes can be packed/unpacked/interpolated in shaders (with the ability to fetch uninterpolated data in the fragment shader through the NV_fragment_shader_barycentric extension).
- Avoiding the index buffer bottleneck – saving both bandwidth (large index values) and the sequential rescanning required for primitive fetch.
- A distributed dispatcher – a task shader can spawn thousands of mesh shader workgroups, which should be more efficient than indirect drawing or instancing (at least, that is what the existing “literature” claims).
Everything mentioned above is theoretically true and looks fine and shiny. In reality, though, it is somewhat different, mostly because of the high level of responsibility transferred to programmers, who sometimes have no clue what’s going on under the hood. On the other hand, it is very challenging to compete with highly optimized drivers developed by people who know exactly what happens in their hardware and have spent years squeezing out performance. Of course, they have to do it for the general case, and a clever implementation may squeeze a bit more performance out of a particular one. This post is an attempt to shed some light on these open questions and to reveal some implementation specifics.
Let’s start with the vertex post transform cache…
A temporary memory block used for output and shared variables is limited to 16 kB. Its actual size depends on the maximum values defined in the layout qualifier (max_vertices, max_primitives), the type of primitives being generated, additional per-vertex and per-primitive output variables, as well as shared variables. That is quite clear and acceptable. It seems that max_vertices is limited to 256 only so that the count can be stored in a single uint8 value. The triangle count for a 256-vertex quadratic mesh is 450, so the next suitable value up for the primitive count is 512. (And, yes, there is a story about 126-primitive-entry blocks for indices, but that is currently out of scope and should limit the primitive count to 504.) Apart from this, I cannot imagine the reason for these limits. The real reason could be the size of the vertex post-transform cache and the need to limit the size of a meshlet to fit in that cache. On the other hand, presentations and blogs mention proper vertex packing for better cache utilization, which implies that the vertex post-transform cache (and the primitive assembly stage) is decoupled from the mesh shaders.
I needed to write the whole previous paragraph just to ask one question: what happens when a cache miss occurs? The answer may reveal what causes a performance loss. Vertex shaders are also batched into warp-sized groups (which a meshlet should correspond to), but perhaps, in the case of a cache miss, they can be reorganized and just the needed vertices rebatched. A meshlet is a much heavier vehicle to restart.
The other question is related to the maximum number of vertices per meshlet: why is it not confined to the cache size if the mesh shader’s temporary memory is not used as a cache?
One of the benefits, according to the existing blogs and presentations, is the distributed dispatcher – the task shader. Both instancing and indirect drawing serialize primitive dispatch, while each task shader may spawn multiple mesh shader workgroups, removing one possible bottleneck. That leads to the third question: does glDrawMeshTasksNV() suffer from the same “serialized dispatcher” syndrome if there is no task shader, since dispatching is done through a single point in the API?
The answer to the previous question may lead to the answer to the following one: does “tree expanding” (spawning more than a single mesh workgroup per task workgroup) boost performance? It probably depends on many factors, and task shaders come at a certain price. This question assumes that their only purpose is to spawn mesh shaders.
The last question is not related to mesh shaders directly, but serves to estimate the impact of SSBO access on mesh shader performance: is there any extension that enables glBindBufferRange() usage with an offset granularity finer than 16 bytes? The std430 layout allows direct access to all data, but the 16-byte offset alignment prevents me from emulating the full functionality of the mesh shader (in my application) with instancing and an SSBO (instead of attributes and a divisor). Using a UBO might improve performance, since an SSBO is a more general concept, but the limited size and the inability to support arrays of arbitrary length make UBOs inappropriate.
These are certainly not all the questions that bother me these days, but the current post is already long enough. In order to improve the chance of getting an answer (and to allow skipping all the previous text), I’ll repeat the questions at the end.
- What happens with mesh shaders when a vertex post-transform cache miss occurs?
- Why is the max_vertices value not confined to the cache size if the mesh shader’s temporary memory is not used as a cache?
- Does glDrawMeshTasksNV() suffer from the “serialized dispatcher” syndrome if there is no task shader, since dispatching is done through a single point in the API?
- Does “tree expanding” (spawning more than a single mesh workgroup per task workgroup) boost performance?
- Is there any extension that enables glBindBufferRange() usage with an offset granularity finer than 16 bytes? (Not a mesh-shader-related question.)