Low performance model with instance rendering. Too many glDraw calls

rbasniak · July 1, 2015, 5:37am

Hi,

I’ve just started learning OpenGL, and this is my first project besides tutorials. I’m trying to load a huge engineering model:

[Attachment: image of the model]

The data is structured in a way that I thought I could use instancing for very good performance, because many items are simple primitives (cylinder, torus, cones, etc) with different transformation matrices. So I structured my data this way:

Primitive 1
  VAO
    VBO -> vertices
    VBO -> indices

    Instance 1.1
      VBO -> object color
      VBO -> transformation matrix

    Instance 1.2
      VBO -> object color
      VBO -> transformation matrix

    Instance 1.n
      VBO -> object color
      VBO -> transformation matrix

Primitive n
  VAO
    VBO -> vertices
    VBO -> indices

    Instance n.1
      VBO -> object color
      VBO -> transformation matrix

    Instance n.n
      VBO -> object color
      VBO -> transformation matrix

The model has ~50k unique primitives and ~120k instances, so on average each primitive has about 2 instances. In practice, some have 10 and some have only 1. I end up with 50k VAOs and then call glDrawElementsInstanced for each VAO.

The model is being drawn at only 3 fps. I guess it’s because of the number of glDraw calls (50k for the triangles and 50k for the outlines). The shaders are very simple; not even lighting is applied.

To be sure of that, I changed the way I organize the buffers: I put everything in a single buffer, but to do that I had to pass the color and transformation matrix as vertex attributes for every vertex. I know this isn’t right, but I had to test whether the problem was the number of glDraw calls. I couldn’t even load the full model this way because the buffers get way too big and I run out of memory. So I tested this theory on a smaller model that was taking 50 ms to render the instanced way. With a single VAO for everything and only a single glDrawElements call, the model takes less than 1 ms to render.

I know that putting everything in a single VAO and passing the transformations for every vertex is not the correct way of doing this. Now I also know that instancing isn’t the right tool here because there are so few instances of each mesh. So the question is: what would be the correct way to set up the buffers to minimize the number of glDraw calls?

I think it would be perfect to store everything in a single VAO (so I could call glDrawElements a single time) but use something like the glVertexAttribDivisor used in instancing to tell the shaders when to use the next matrix. It would not be exactly that, though, because I would have to manually indicate when to switch to the next matrix, in a way similar to how glPrimitiveRestartIndex works.

Alfonse_Reinheart · July 1, 2015, 7:53am

When you say that you have “~50k unique primitives,” do you mean that you have 50K triangles or 50K draw calls? Because when you say “I guess it’s because of the number of glDraw calls (50k for the triangles and 50k for the outlines),” that sounds like you mean draw calls, not triangles.

It isn’t technically wrong to use “primitive” to mean “draw call,” but the OpenGL specification uses “primitive” in multiple ways, depending on what part of the pipeline it’s talking about. So it’s best to be more explicit about what you mean.

In any case, the way you’re using instanced rendering is not helping you. Instancing of this form (that is, sending the same mesh data with different instance data) is generally only useful performance-wise if all of the following are true:

The mesh you want to render instanced is relatively small, in terms of number of vertices, but not too small (at least ~100 vertices, up to around ~5000 or so)
The number of instances of this specific mesh being rendered is large (>1000)

And when I say “mesh”, I mean “thing that can be rendered with one draw call”.

In your case, you’re only rendering a couple of instances per draw call, so you’re basically degrading your performance compared to just rendering each mesh individually.

Probably the best thing for your particular case (lots of small meshes, all rendered by the same shader, where each has some per-mesh data) is to do these things:

1) Put all of the mesh vertex data in the same buffer object, with the meshes directly adjacent to one another. So if your first mesh has 24 vertices in it, then the very next element of the buffer will be the first vertex of the second mesh.

Note: this discussion assumes that you are using indexed rendering for your meshes. Your indices should not be offset from the beginning of the buffer. That is, even though the first vertex of the second mesh is at index 24 in the array, the first index for the second mesh should still say 0 (if the first index of the second mesh refers to the first vertex of the second mesh). This lets you use GL_UNSIGNED_SHORT for your indices, since none of your meshes are particularly large.

2) Put all of the per-instance data for these meshes in one buffer object. The order of instance data relative to mesh data is irrelevant.
3) Use a single VAO for rendering all of this mesh data. Do not change VAOs between draw calls. The VAO should set up instancing on the per-instance data.
4) To render each mesh+instance, you need two things: a base-vertex offset from the beginning of the vertex data to the first vertex for that particular mesh, and an index into the per-instance data that references that particular instance’s data.

So if the first mesh takes up 24 vertices, then the second mesh in the array would have a base vertex of 24. If the second mesh takes up 16 vertices, then the third mesh has a base vertex of 40. Obviously, the first mesh has a base vertex of 0.
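The base-vertex bookkeeping described above amounts to a running sum over the vertex counts. A minimal C++ sketch, assuming each mesh is described by a vertex count and an index count (the `MeshSizes` and `MeshRange` structs and the `packMeshes` function are hypothetical names, not part of OpenGL):

```cpp
#include <cstdint>
#include <vector>

// Hypothetical description of one mesh before it is packed into the shared buffers.
struct MeshSizes {
    std::uint32_t vertexCount;
    std::uint32_t indexCount;
};

// Where one mesh ends up inside the shared vertex and index buffers.
struct MeshRange {
    std::int32_t  baseVertex;  // first vertex of this mesh in the shared vertex buffer
    std::uint32_t firstIndex;  // first index of this mesh in the shared index buffer
    std::uint32_t indexCount;  // number of indices to draw for this mesh
};

// Assigns each mesh a contiguous range. The index values themselves stay
// mesh-local (starting at 0); baseVertex is added by the draw call.
std::vector<MeshRange> packMeshes(const std::vector<MeshSizes>& meshes) {
    std::vector<MeshRange> ranges;
    ranges.reserve(meshes.size());
    std::int32_t  nextVertex = 0;
    std::uint32_t nextIndex  = 0;
    for (const MeshSizes& m : meshes) {
        ranges.push_back({nextVertex, nextIndex, m.indexCount});
        nextVertex += static_cast<std::int32_t>(m.vertexCount);
        nextIndex  += m.indexCount;
    }
    return ranges;
}
```

With vertex counts of 24, 16, and 8, this yields base vertices 0, 24, and 40, matching the example in the post.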

Now given these data, how best to transmit them to OpenGL is version dependent:

a) If you have OpenGL 4.3/ARB_multi_draw_indirect, the solution is simple. Build a buffer containing indirect rendering commands, which apply the base-vertex offset as the baseVertex member, with the per-instance index as the baseInstance. The instance count should be 1. When it comes time to render this series of meshes, you send a single glMultiDrawElementsIndirect call.

b) If you have OpenGL 4.2/ARB_base_instance, the solution is slightly more complex (and slightly slower). Instead of a single call, you have to make one call per mesh+instance. However, you do not change any state between calls. The call in question is the gigantically named glDrawElementsInstancedBaseVertexBaseInstance. The baseVertex, baseInstance, and instanceCount fields are exactly as in the prior case.
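Both options use the same five values per draw. A rough C++ sketch of filling the command array on the CPU follows. The struct mirrors the five 32-bit fields OpenGL reads for an indexed indirect draw, while `PlacedMesh`, `instanceSlot`, and `buildCommands` are hypothetical names used only for illustration:

```cpp
#include <cstdint>
#include <vector>

// Field layout OpenGL reads for each indexed indirect draw command:
// five tightly packed 32-bit values.
struct DrawElementsIndirectCommand {
    std::uint32_t count;          // number of indices to draw
    std::uint32_t instanceCount;  // 1 per mesh+instance pair, as suggested above
    std::uint32_t firstIndex;     // offset into the index buffer, in indices (not bytes)
    std::int32_t  baseVertex;     // added to every index value of this mesh
    std::uint32_t baseInstance;   // selects the entry in the per-instance buffer
};
static_assert(sizeof(DrawElementsIndirectCommand) == 20, "commands must be tightly packed");

// Hypothetical record: one mesh drawn with one entry of the per-instance buffer.
struct PlacedMesh {
    std::uint32_t indexCount;
    std::uint32_t firstIndex;
    std::int32_t  baseVertex;
    std::uint32_t instanceSlot;
};

std::vector<DrawElementsIndirectCommand> buildCommands(const std::vector<PlacedMesh>& draws) {
    std::vector<DrawElementsIndirectCommand> commands;
    commands.reserve(draws.size());
    for (const PlacedMesh& d : draws) {
        commands.push_back({d.indexCount, 1u, d.firstIndex, d.baseVertex, d.instanceSlot});
    }
    return commands;
}
```

For option (a), the array would be uploaded to a buffer bound to GL_DRAW_INDIRECT_BUFFER and drawn with something like `glMultiDrawElementsIndirect(GL_TRIANGLES, GL_UNSIGNED_SHORT, nullptr, drawCount, 0)`. For option (b), the same fields would be passed one command at a time to glDrawElementsInstancedBaseVertexBaseInstance, with firstIndex converted to a byte offset (`firstIndex * sizeof(GLushort)`).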

If you don’t have access to any of those features (and do note that base_instance is supported on plenty of non-4.x hardware, so you really should have it), then things get less performance-friendly for you. The goal of all of this was to avoid state changes between draw calls. Without the use of baseInstance here to allow us to change per-instance data with just the draw call, that’s not really possible.

Given that, your best bet for performance is to put your instance data in UBOs. This means that, between each draw call, you have to change UBO state. But you still don’t change VAO state, and you still use baseVertex to pick the mesh. You just have a glBindBufferRange call to pick the per-instance data from a UBO. All of your instance data should be in the same UBO, so you’re just selecting sections from it.

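Also, remember that UBO binding has an alignment requirement: the offset you pass to glBindBufferRange must be a multiple of the implementation’s GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT. So you may need to pad your data structure to match that alignment.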
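The padding is a round-up of the per-instance size to the next multiple of the alignment. A small C++ sketch, assuming the per-instance data is a 4x4 float matrix plus an RGBA color (the `InstanceData` struct and the helper names are hypothetical; the real alignment would be queried with glGetIntegerv and GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT):

```cpp
#include <cstddef>

// Rounds size up to the next multiple of alignment (alignment must be non-zero).
constexpr std::size_t alignUp(std::size_t size, std::size_t alignment) {
    return (size + alignment - 1) / alignment * alignment;
}

// Hypothetical per-instance block: transformation matrix plus object color.
struct InstanceData {
    float transform[16];  // 4x4 matrix
    float color[4];       // RGBA
};

// Distance in bytes between consecutive instances inside the UBO.
constexpr std::size_t instanceStride(std::size_t alignment) {
    return alignUp(sizeof(InstanceData), alignment);
}

// Byte offset of instance i, suitable as the offset argument of glBindBufferRange.
constexpr std::size_t instanceOffset(std::size_t i, std::size_t alignment) {
    return i * instanceStride(alignment);
}
```

Between draws, the call would then look something like `glBindBufferRange(GL_UNIFORM_BUFFER, bindingIndex, ubo, instanceOffset(i, alignment), sizeof(InstanceData))`, where `bindingIndex` and `ubo` are whatever the application set up.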

Asmodeus · July 1, 2015, 8:12am

Well, with my batch renderer I have rendered about 10,000 trees per draw call (each tree has about 1,100 vertices), which is roughly 11,000,000 vertices. The trees pass through a dynamic lighting and texturing shader. That runs at ~50 FPS; I forget how many milliseconds it took to render the frame.

Grognard · July 1, 2015, 12:52pm

If I understand what you are trying to say, there is no way to fix this. It looks like you are trying to render a model with tons and tons of triangles. An instance would be something where there are many things exactly the same that you can render boom boom boom all at once. In this crazy-looking scene, that does not seem to be the case.

With a scene like this, you are best off putting as many of its triangles as you can in one batch. But it’s still going to be slow.