glDrawElementsInstanced and UBOs


I’m currently upgrading my draw calls to use glDrawElementsInstanced instead of the typical glDrawElements called N times. While investigating, I came across different ways of doing it, mostly by storing the matrices and other instance data in the VAO as per-instance attributes set up with glVertexAttribDivisor.

The issue here is that I have a preloaded set of VAOs, and at runtime, depending on some logical conditions, they will be rendered using either glDrawElements or glDrawElementsInstanced. There is no way to know beforehand which one will be used, so I can’t build the meshes with per-instance matrix attributes already in the VAO.

So what I’m asking (and leaving open for discussion) is whether I can use glDrawElementsInstanced with UBOs to pass the per-object data to the shader, for instance:

  • Creating the object meshes with VAOs (vertices, normals and uvs)
  • Creating UBO and UniformBlock
  • Having 20K logical objects
  • Updating the UBO with ModelViewMat, ModelViewProjMat and NormalMat for each object (20K times)
  • Calling glDrawElementsInstanced

The main idea is to have a way to render repeated objects with instancing while keeping the old way of rendering the same VAOs individually when needed, decided on the fly.


  • Would this flow be a good starting approach?
  • Isn’t updating the UBO 20,000 times as bad as calling glDrawElements 20,000 times?
  • Can a UBO hold that much data stored that way?


When calling glDrawElementsInstanced you keep all per-instance data in a buffer that you index with gl_InstanceID. There is no way to overwrite uniform data with new values before each instance is drawn, since all instances are drawn as part of the one draw call.
You can map that buffer once per frame, write the per-instance data, unmap it and then issue your draw call(s). Depending on which GL version you are targeting you could also persistently map the buffer (probably ping-ponging between multiple buffers to avoid clobbering data the GPU is still using).
Note that the UBO size limits probably mean you’ll need multiple UBOs to fit all the matrices for your 20k objects, or you can use a shader storage buffer (SSBO) instead.
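To put rough numbers on the size limit, here is a minimal CPU-side sketch in C++. The InstanceData layout and both function names are assumptions made for illustration, not something from this thread. It assumes three mat4s per instance, with the normal matrix stored as a mat4 to avoid std140 mat3 padding.

```cpp
#include <cassert>
#include <cstddef>

// Assumed per-instance layout: three 4x4 float matrices. Storing the normal
// matrix as a mat4 sidesteps the column padding std140 applies to mat3.
struct InstanceData {
    float modelView[16];
    float modelViewProj[16];
    float normal[16];
};
static_assert(sizeof(InstanceData) == 192, "three tightly packed mat4s");

// Number of instances that fit in one uniform block of maxBlockBytes.
// In a real program maxBlockBytes comes from
// glGetIntegerv(GL_MAX_UNIFORM_BLOCK_SIZE, ...); the spec only guarantees 16384.
std::size_t instancesPerBlock(std::size_t maxBlockBytes) {
    assert(maxBlockBytes >= sizeof(InstanceData));
    return maxBlockBytes / sizeof(InstanceData);
}

// Number of instanced draw calls needed when each call reads one block.
std::size_t batchCount(std::size_t totalInstances, std::size_t perBlock) {
    assert(perBlock > 0);
    return (totalInstances + perBlock - 1) / perBlock;
}
```

With the guaranteed minimum of 16384 bytes, this layout gives 85 instances per block, so 20,000 objects would need 236 instanced draw calls. Drivers often report a larger limit, so query it rather than hard-coding the minimum.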

Yes, I already have a working implementation of the Instancing process, using only the gl_InstanceID to affect the rendering position in the vertex shader. (like gl_Position.x = vertex.x + gl_InstanceID)

When you say “write the per instance data, unmap it and then issue your draw call(s)”, do you mean updating some uniform structure and sending it to the shader?

My main concern was whether updating 20k matrices every frame would end up choking the pipeline and/or become too heavy to process, or is that the usual way of doing it?

About the SSBO, thank you for sharing it, I’ll look into it. I still have doubts, though, about whether this is the best approach for a large number of objects that could be changing constantly every frame (e.g. particles, voxels, etc.).

Well, assuming you can’t move the calculation of those matrices to the GPU (e.g. in a compute shader) and they actually change (almost) every frame I’m not sure there are many other options.
Otherwise, calculating the UBO/SSBO contents on the GPU can save you PCIe bandwidth, or you can try to partition your objects into ones where the matrices remain unchanged for a while and ones with frequent changes (not sure if that is possible for your application).
It’s probably most (developer time) efficient to first try the simple “update all 20k objects each frame” approach, measure whether it is fast enough and if not determine the bottleneck.
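If the partitioning idea is worth trying later, it could look roughly like the C++ sketch below. The Object record, its changesEveryFrame flag and the function name are hypothetical. How an application decides which objects are "static" is up to it.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical application-side object record.
struct Object {
    int id;
    bool changesEveryFrame;
};

// Moves frequently changing objects to the front, keeping relative order,
// and returns how many there are. Instance slots [0, n) then need a
// per-frame upload; slots [n, size) can be written once and left alone.
std::size_t partitionByUpdateRate(std::vector<Object>& objects) {
    auto isDynamic = [](const Object& o) { return o.changesEveryFrame; };
    auto firstStatic =
        std::stable_partition(objects.begin(), objects.end(), isDynamic);
    assert(std::is_partitioned(objects.begin(), objects.end(), isDynamic));
    return static_cast<std::size_t>(firstStatic - objects.begin());
}
```

The returned count tells you how large a prefix of the instance buffer has to be rewritten each frame. Whether that saves anything measurable is something only profiling will show.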

That was exactly what I had in mind, but like you said, that is a concern for the application rather than the graphics part. And yes, organizing the data so it is clustered into different blocks is a must-have in this case.

Once more, thank you for suggesting the SSBO.

Just adding because it might not be clear - you don’t update the buffer object 20k times per frame. That will kill your performance. The optimal approach is to update it once per frame only, and there are several approaches that can enable you to do this.

One is to map the buffer, write your 20k matrices to the returned pointer, unmap when you’ve written them all, then draw. Be careful to not read from the pointer, as that may not work and even if it does it will break driver optimizations anyway. Experiment with and without orphaning, try different flags to glMapBufferRange, or just use plain old glMapBuffer - there is no single approach that is best in all cases and what works well in one use case may not in another.
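As a rough illustration of the write step, here is a C++ sketch of the loop that would sit between the map and unmap calls. Mat4, Object and writeInstanceMatrices are hypothetical names. The GL calls appear only in comments because they need a live context. The destination is treated as write-only and filled front to back.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

struct Mat4 { float m[16]; };

// Hypothetical per-object state kept by the application.
struct Object { Mat4 modelViewProj; };

// Copies one matrix per object into dst, sequentially, without ever reading
// from dst. In a real renderer dst would be the pointer returned by e.g.
//   glMapBufferRange(GL_UNIFORM_BUFFER, 0, dstBytes,
//                    GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_BUFFER_BIT);
// followed by glUnmapBuffer(GL_UNIFORM_BUFFER) and the draw call.
// Returns the number of bytes written.
std::size_t writeInstanceMatrices(void* dst, std::size_t dstBytes,
                                  const std::vector<Object>& objects) {
    const std::size_t needed = objects.size() * sizeof(Mat4);
    assert(dst != nullptr && needed <= dstBytes);
    unsigned char* out = static_cast<unsigned char*>(dst);
    for (const Object& o : objects) {
        std::memcpy(out, &o.modelViewProj, sizeof(Mat4));
        out += sizeof(Mat4);
    }
    return needed;
}
```

The flags in the comment are just one combination to try. As said above, it is worth measuring a few variants.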

If you have persistent mapping available you can also try with that.
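With a persistently mapped buffer, one common pattern is to split it into a few per-frame slices and rotate through them. The C++ sketch below only does the offset arithmetic. Region, alignUp and regionForFrame are hypothetical names, and the fence synchronisation a real renderer needs is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

struct Region {
    std::size_t offset;  // byte offset into the persistently mapped buffer
    std::size_t size;    // bytes reserved for one frame
};

// Rounds value up to the next multiple of alignment.
std::size_t alignUp(std::size_t value, std::size_t alignment) {
    assert(alignment > 0);
    return (value + alignment - 1) / alignment * alignment;
}

// Picks the slice that a given frame writes to. regionCount = 2 is plain
// ping-pong. When slices are bound with glBindBufferRange, alignment should
// be the value of GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT.
// Not shown: waiting on a fence (glFenceSync / glClientWaitSync) so the GPU
// has finished reading a slice before the CPU overwrites it.
Region regionForFrame(std::uint64_t frame, std::size_t bytesPerFrame,
                      std::size_t regionCount, std::size_t alignment) {
    assert(regionCount > 0);
    const std::size_t stride = alignUp(bytesPerFrame, alignment);
    const std::size_t index = static_cast<std::size_t>(frame % regionCount);
    return Region{index * stride, stride};
}
```

The total buffer size to allocate would then be regionCount times the returned size.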

Another valid approach is to write the 20k matrices to a system memory array, then glBuffer(Sub)Data them to your GL buffer object. This incurs an extra memory copy but can often be a very good trade-off for performance and code simplicity (you might even already have them in such an array - even better if so). Sometimes the driver can put this on a “good enough” fast path that’s better than what you might get otherwise, without significant effort. Again, experiment with usage flags and glBufferData vs glBufferSubData to find which works best in your use case.
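A minimal C++ sketch of the system-memory variant follows. InstanceStaging and UploadFn are hypothetical. The upload is passed in as a callback so the sketch compiles without GL headers. In a real program that callback would wrap glBufferSubData (or glBufferData) on the bound buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

struct Mat4 { float m[16]; };

// Stand-in for a call such as
//   glBufferSubData(GL_UNIFORM_BUFFER, offsetBytes, sizeBytes, data);
using UploadFn = std::function<void(std::size_t offsetBytes,
                                    std::size_t sizeBytes,
                                    const void* data)>;

// Keeps all instance matrices in one contiguous system-memory array.
class InstanceStaging {
public:
    explicit InstanceStaging(std::size_t count) : matrices_(count) {}

    // Per-object writes go to system memory only; no GL call happens here.
    Mat4& at(std::size_t i) {
        assert(i < matrices_.size());
        return matrices_[i];
    }

    std::size_t byteSize() const { return matrices_.size() * sizeof(Mat4); }

    // Called once per frame: a single upload covering the whole array.
    void upload(const UploadFn& uploadFn) const {
        uploadFn(0, byteSize(), matrices_.data());
    }

private:
    std::vector<Mat4> matrices_;
};
```

For 20,000 mat4s that single upload is 1,280,000 bytes per frame, however many objects were modified through at().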

Related to this discussion, this wiki page is a good read:

Yes, understood: the buffer must be updated once per frame with all 20K elements at once, by mapping it, writing everything, and unmapping it in one go.