GPU Instancing Performance Question

Hi all,
A question about GPU instancing came up recently while I was studying game performance optimization:

In general, does GPU instancing always give a performance boost over issuing one draw call per mesh, for a given number of instances?

I am not sure this is a well-posed question. Maybe there is no "general" case in which GPU instancing and per-mesh draw calls can be compared. What I really want to know is what GPU instancing actually speeds up, and what its tradeoffs are. I know that we typically use GPU instancing when an application is CPU-bound while rendering. In such a case we would otherwise draw, for example, 10000 mesh instances with 10000 separate draw calls, whereas with GPU instancing we need only one draw call.

Say there are N instances in my scene with the same mesh and material, but each with a different object-to-world transformation matrix and different per-instance rendering data. As I understand it, if I use N separate draw calls, the CPU needs to upload the matrix and per-instance data to the GPU for each instance and also issue a draw call for each one. With GPU instancing, I can put all the matrices and instance data in a vertex buffer, upload them to the GPU once, and then issue a single draw call. I would expect that, for the same amount of data, uploading it in one chunk is faster than uploading it in several smaller chunks, and that one draw call is better than a bunch of draw calls. If my understanding is right, it seems GPU instancing should always be used when rendering several instances, provided they meet the requirements for instancing. If not, what drawbacks does GPU instancing have?

It reduces memory consumption (and thus bandwidth). It allows you to render m*n primitives with only O(m+n) data. The tradeoff is that all instances must have the same topology, and the vertex shader outputs must be capable of being formed from either the per-vertex data (which is the same for all instances) or the per-instance data (which is the same for all vertices). That last restriction can always be circumvented by using uniform arrays, SSBOs, or textures, but dependent fetches have a performance cost relative to attribute data: the GPU can’t fetch the data until the vertex shader has determined the indices, so you’re adding memory latency.
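
To put rough numbers on the m*n versus O(m+n) point, here is a small sketch in Python. The function names and the byte sizes (32 bytes per vertex, 64 bytes per instance for a 4x4 float matrix) are assumptions for illustration only, not taken from any particular engine or API.

```python
def instanced_bytes(m: int, n: int, vertex_bytes: int, instance_bytes: int) -> int:
    """Storage when the mesh is kept once and per-instance data once per instance."""
    return m * vertex_bytes + n * instance_bytes


def flattened_bytes(m: int, n: int, vertex_bytes: int, instance_bytes: int) -> int:
    """Storage when every vertex of every object is written out separately
    and carries its own copy of the per-object data."""
    return m * n * (vertex_bytes + instance_bytes)


if __name__ == "__main__":
    m, n = 1000, 10000          # vertices per mesh, number of instances (assumed)
    vb, ib = 32, 64             # assumed sizes in bytes
    print(instanced_bytes(m, n, vb, ib))   # 672000
    print(flattened_bytes(m, n, vb, ib))   # 960000000
```

With these assumed sizes the instanced layout needs well under a megabyte, while the fully flattened layout approaches a gigabyte.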

Additionally, small instances are inefficient as (AFAIK) the GPU typically won’t process multiple instances within a single work group. So if you have e.g. 80 vertices per instance and the GPU performs 64 invocations per work group, you’re actually getting 128 invocations (two work groups) per instance, for 62.5% utilisation.
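
The utilisation arithmetic above can be checked with a few lines of Python. This is only a model of the assumption stated in the paragraph (one instance never shares a work group with another); the function name is made up for the example, and real hardware may behave differently.

```python
def workgroup_utilisation(vertices_per_instance: int, invocations_per_group: int):
    """Return (groups, launched_invocations, utilisation) for one instance,
    assuming work groups are never shared between instances."""
    # Round up: a partially filled work group still costs a full one.
    groups = (vertices_per_instance + invocations_per_group - 1) // invocations_per_group
    launched = groups * invocations_per_group
    return groups, launched, vertices_per_instance / launched


if __name__ == "__main__":
    groups, launched, util = workgroup_utilisation(80, 64)
    print(groups, launched, f"{util:.1%}")   # 2 128 62.5%
```

Under the same assumption, an instance with exactly 64 or 128 vertices would reach 100% utilisation, which is why very small or awkwardly sized instances are the problematic case.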

Instancing is almost always going to beat multiple draw calls. The question is whether it’s better to use instancing or to just render everything with distinct vertices then figure out how to avoid specifying per-object data for each vertex. Typically, you’d either add an integer attribute which can be used to index into a uniform array (or, to avoid size restrictions, a texture or SSBO), or if all objects are the same size use gl_VertexID/vertices_per_object.
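
As a sketch of the gl_VertexID/vertices_per_object idea, the following Python simulates on the CPU what a vertex shader would compute: the integer division gives the object index, which is then used to look up per-object data. The table OFFSETS and the function name are hypothetical; in a real shader the lookup would be into a uniform array, texture, or SSBO.

```python
# Hypothetical per-object data: one translation per object.
OFFSETS = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (4.0, 0.0, 0.0)]


def object_and_local_index(vertex_id: int, vertices_per_object: int):
    """Split a flat vertex index into (object index, vertex index within the object).
    Only valid when every object has the same number of vertices."""
    return divmod(vertex_id, vertices_per_object)


if __name__ == "__main__":
    vertices_per_object = 36    # e.g. a cube drawn as 12 triangles (assumed)
    obj, local = object_and_local_index(100, vertices_per_object)
    print(obj, local)           # 2 28
    print(OFFSETS[obj])         # (4.0, 0.0, 0.0)
```

The lookup by object index is exactly the kind of dependent fetch mentioned earlier, so it trades the instancing restrictions for some extra memory latency.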