Instancing with EXT_draw_instanced


We recently tried this extension and, unfortunately, couldn’t get any benefit from using it.

We wrote a simple test app that reproduces our current pseudo-instanced rendering scheme: a rectangular grid of N*N cells, with M object instances inside each cell. Each instance was a simple torus with V vertices.
So we had N*N draw calls; pseudo-instancing was done with vertex attributes.

Then we used this new extension along with EXT_texture_buffer_object, NV_parameter_buffer_object and uniform arrays (just to see which works better). Because of instancing we had far fewer draw calls (1 call with TBO, N*N*M/4096 calls with param_buf_obj and N*N*M/1024 calls with uniform arrays).
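To make the draw-call counts above concrete, here is a minimal sketch of the arithmetic. The per-call limits of 4096 and 1024 instances are the ones quoted in the post; the function names are hypothetical, and partial batches are rounded up to a whole call.

```cpp
#include <cassert>
#include <cstdint>

// Total instance count for an N*N grid with M instances per cell.
std::uint64_t totalInstances(std::uint64_t n, std::uint64_t m) {
    return n * n * m;
}

// Number of instanced draw calls needed when one call can address at most
// maxPerCall instances (ceiling division; maxPerCall must be > 0).
std::uint64_t drawCallsNeeded(std::uint64_t instances, std::uint64_t maxPerCall) {
    assert(maxPerCall > 0);
    return (instances + maxPerCall - 1) / maxPerCall;
}
```

For example, with N = 32 and M = 16 there are 16384 instances: the pseudo-instanced path issues 32*32 = 1024 calls, while drawCallsNeeded(16384, 4096) and drawCallsNeeded(16384, 1024) give the counts for the parameter-buffer and uniform-array paths.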

And we were disappointed with its performance: we got an FPS boost ONLY when the number of instances per cell was very low (1-5 objects per cell), which is close to the case where each instance is rendered with a separate draw call. In all other cases performance dropped by 1%-15%.

So we concluded that this kind of instancing makes no sense if you already have a stable pseudo-instanced scheme. It makes sense only if you currently draw each object with its own draw call.

Does anyone have experience with this extension? Maybe we missed something crucial?

Appreciating your replies,

I used pseudo-instancing and switched to proper instancing. I have not seen any noticeable performance difference between the two. At startup I check how many uniforms are available and then decide how many instances I can render in one batch, since every instance needs its own 4x4 matrix plus one vec3 or so for additional data. I set the uniforms in two uploads (the 4x4 matrix array in one go and the other data in one go). I can then render up to 200 instances per batch. I usually render several thousand instances, so only the very last batch contains fewer than 200 instances.

With this use-case it works really well.
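The startup check described above could look roughly like the sketch below. It is only an illustration under stated assumptions: the limit would come from the driver (for example via glGetIntegerv with GL_MAX_VERTEX_UNIFORM_COMPONENTS), the vec3 is assumed to be padded to 4 components, and reservedComponents is a hypothetical budget for every other uniform the shader uses. The function names are made up for this example.

```cpp
#include <algorithm>

// Assumption: a mat4 costs 16 float components and the extra vec3 is padded
// to 4 components, so one instance costs 20 components in total.
constexpr int kComponentsPerInstance = 16 + 4;

// maxUniformComponents: limit reported by the driver at startup.
// reservedComponents: hypothetical budget for all non-instance uniforms.
int maxInstancesPerBatch(int maxUniformComponents, int reservedComponents) {
    int available = std::max(0, maxUniformComponents - reservedComponents);
    return available / kComponentsPerInstance;
}

// Number of batches for `instances` objects; only the last one may be partial.
int batchCount(int instances, int perBatch) {
    if (perBatch <= 0) return 0;  // nothing can be drawn with this budget
    return (instances + perBatch - 1) / perBatch;
}
```

With an illustrative limit of 4096 components and 96 reserved, this gives 200 instances per batch, and 20000 instances would need 100 batches. Real drivers may pack uniforms differently, so the result is best treated as an upper bound to verify.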

Apart from performance, proper instancing has several benefits:

For pseudo-instancing you need to replicate your mesh several times. Depending on how big your mesh is and how much memory you are willing to sacrifice, you might be limited to rendering only 5 to 20 instances in one batch.

Every LOD-level needs to be replicated as well, increasing your memory consumption even more.

EVERY to-be-instanced mesh needs to be replicated. This doesn’t scale well at all. Rendering ONE type of object this way works well, but allowing users to plug in additional object types can easily bring your system down.

Depending on exactly how many objects you need to render, you might want to replicate some object only 5 times, or 100 times. E.g. in my program I have to render > 20000 trees, so I actually replicated the low-res LODs a hundred times. With proper instancing you do not need to make this decision at all.
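The memory cost of replicating meshes can be estimated with a small sketch like the one below. It counts vertex data only (index buffers are ignored), and the function name, LOD sizes and vertex stride are hypothetical values chosen for illustration.

```cpp
#include <cstddef>
#include <vector>

// Vertex-buffer bytes needed to hold `copies` replicas of every LOD level.
// verticesPerLod: vertex count of each LOD; bytesPerVertex: vertex stride.
std::size_t replicatedVertexBytes(const std::vector<std::size_t>& verticesPerLod,
                                  std::size_t bytesPerVertex,
                                  std::size_t copies) {
    std::size_t total = 0;
    for (std::size_t vertices : verticesPerLod) {
        total += vertices * bytesPerVertex * copies;
    }
    return total;
}
```

For three LODs of 1000, 250 and 50 vertices at 32 bytes per vertex, one copy takes 41600 bytes and a hundred copies take 4160000 bytes, and that cost repeats for every mesh type. With proper instancing only the single copy is stored.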

Did you actually render all objects in one cell in one batch? That means n*n cells = n*n instanced draw calls? If so, you should try to batch the instances of all cells together and then render them in as few batches as possible. E.g. try not to render 10 instances per batch, but rather 100 or so. Also try to reduce the per-instance data that is needed. I use “instance-variations”, where I have something like 20 different variations of instances (e.g. colors), and every instance that is rendered only needs its position/rotation matrix plus one variation ID. The shader then uses this ID to look up additional information about an instance (like color and other parameters). This way I can upload the instance-variation data ONCE instead of per instance. This frees up uniforms, which allows more instances to be rendered in one batch, and it reduces the amount of data that I need to upload per instance.
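The difference between batching per cell and merging the instances of all cells first can be sketched as follows. The function names are hypothetical, and batchSize stands for whatever per-call instance limit the chosen upload path allows (it must be greater than zero).

```cpp
#include <cstdint>

// Draw calls when every cell is submitted on its own: each cell needs
// ceil(perCell / batchSize) calls, so small cells waste most of a batch.
std::uint64_t callsPerCellBatching(std::uint64_t cells, std::uint64_t perCell,
                                   std::uint64_t batchSize) {
    return cells * ((perCell + batchSize - 1) / batchSize);
}

// Draw calls when the instances of all cells are merged into one list and
// then split into full batches; only the last batch may be partial.
std::uint64_t callsMergedBatching(std::uint64_t cells, std::uint64_t perCell,
                                  std::uint64_t batchSize) {
    std::uint64_t total = cells * perCell;
    return (total + batchSize - 1) / batchSize;
}
```

With 100 cells of 10 instances each and a batch size of 200, per-cell submission needs 100 calls while merged submission needs 5. Merging does give up per-cell culling granularity, so this is a trade-off rather than a free win.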



Sure, I render everything with only 1 render call. If I have N*N cells, then I have 1 instanced render call or N*N pseudo-instanced render calls.
The per-instance data is already reduced to the minimum: one float4.
I understand all the benefits I would get from instanced drawing, but the performance really annoys me.

I can upload this test program with sources if anyone is interested.

When I originally tried EXT_draw_instanced, I found that it was very slow compared to pseudo-instancing unless my vertex count per instance was small. I believe around 150 vertices was the break-even point; above that there is a sharp drop in performance. I was hoping EXT_draw_instanced would at least match pseudo-instancing in these cases, because it is such a nice API (with gl_InstanceID).

I tried re-running my test and it seems that glDrawElementsInstancedEXT crashes now when combined with GLSL shaders. (GeForce GTX 260, XP x64 178.13 drivers)

Michael Gold at NVIDIA might be interested in hearing more about this:

I never heard back, so I don’t know if what I am seeing is the intended behavior or not.

Take Michael up on his offer and send them a test app that demonstrates the problem.