Setting up an efficient renderer in OpenGL

SpiderPig · August 29, 2025, 6:27am

Hello,

I’m creating my own game engine and need to improve the performance of my render loop.

I am testing with a 16x16x16 grid of cubes with unique vertices and indices - so no instancing - totaling 4096 cubes.

Currently, all my vertices and indices are in the same buffer. (not sure if this is a good idea or not)

I bind the following before drawing.

glBindVertexArray(m_vao);
glBindBuffer(GL_ARRAY_BUFFER, m_buffer);
glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, m_buffer);

Each mesh has its own draw command held in GL_DRAW_INDIRECT_BUFFER, which is also bound before drawing.

I then loop through each mesh and draw using the appropriate command buffer index. (mesh->buffer_id)

glMultiDrawElementsIndirect(GL_TRIANGLES, GL_UNSIGNED_INT, (const void*)(mesh->buffer_id* sizeof(DrawBufferCommand)), 1, 0);

Which totals 4096 draw calls.
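
For reference, the offset arithmetic in that call can be sketched on the CPU side. The struct below mirrors the five-field layout that OpenGL documents for DrawElementsIndirectCommand. The name DrawBufferCommand comes from the post, but its definition is not shown there, so the field layout and the helper command_offset are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed layout: five tightly packed 32-bit fields, matching
// OpenGL's DrawElementsIndirectCommand.
struct DrawBufferCommand {
    uint32_t count;          // number of indices to draw
    uint32_t instanceCount;  // number of instances (1 for a plain draw)
    uint32_t firstIndex;     // offset into the index data, in indices (not bytes)
    int32_t  baseVertex;     // value added to each index before vertex fetch
    uint32_t baseInstance;   // first instance ID seen by instanced attributes
};

// A stride of 0 in glMultiDrawElementsIndirect means "tightly packed",
// which only works if the struct really is 20 bytes.
static_assert(sizeof(DrawBufferCommand) == 20, "unexpected padding");

// Hypothetical helper: byte offset of command `id` in the indirect buffer.
inline std::size_t command_offset(uint32_t id) {
    return static_cast<std::size_t>(id) * sizeof(DrawBufferCommand);
}
```

The result of command_offset(mesh->buffer_id) is what gets cast to const void* and passed as the indirect argument.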

I’d like to speed up this approach, and the first thing I tried was to store the meshes in order in the GL_DRAW_INDIRECT_BUFFER so that I could pass in the count as 4096 and call glMultiDrawElementsIndirect() once rather than 4k times. Like this:

uint32_t start = 0;
uint32_t count = 4096;
glMultiDrawElementsIndirect(GL_TRIANGLES, GL_UNSIGNED_INT, (const void*)(start * sizeof(DrawBufferCommand)), count, 0);

However, it provided no speed improvement. The FPS was exactly the same, which made me think that under the hood that function might be issuing the draw command to the GPU the same number of times as specified in “count”.
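
A minimal sketch of how such a contiguous command list might be built on the CPU before uploading it to the indirect buffer. It assumes each mesh stores mesh-local indices (starting at 0) and that the index data begins at the start of the bound element buffer; MeshRange and build_commands are hypothetical names, and the struct layout is assumed to match OpenGL's DrawElementsIndirectCommand.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Assumed to match OpenGL's DrawElementsIndirectCommand layout.
struct DrawBufferCommand {
    uint32_t count;
    uint32_t instanceCount;
    uint32_t firstIndex;
    int32_t  baseVertex;
    uint32_t baseInstance;
};

// Hypothetical per-mesh description: how many indices and vertices it owns.
struct MeshRange {
    uint32_t indexCount;
    uint32_t vertexCount;
};

// Builds one command per mesh, laid out back to back so that a single
// glMultiDrawElementsIndirect call with drawcount = meshes.size() covers all.
std::vector<DrawBufferCommand> build_commands(const std::vector<MeshRange>& meshes) {
    std::vector<DrawBufferCommand> cmds;
    cmds.reserve(meshes.size());
    uint32_t firstIndex = 0;  // running offset into the shared index data
    int32_t  baseVertex = 0;  // running offset into the shared vertex data
    for (uint32_t i = 0; i < meshes.size(); ++i) {
        // baseInstance = i lets an instanced attribute or a shader lookup
        // identify the mesh; that use is an assumption, not from the post.
        cmds.push_back({meshes[i].indexCount, 1u, firstIndex, baseVertex, i});
        firstIndex += meshes[i].indexCount;
        baseVertex += static_cast<int32_t>(meshes[i].vertexCount);
    }
    return cmds;
}
```

Batching like this mainly reduces CPU-side call overhead, so it would only raise the frame rate if that overhead was the limiting factor.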

Anyway, I’m unsure on what the best approach to take here is.

Since all my vertex and index data is in one array, I could simply create one draw command with the relevant vertex and index offsets and counts to treat all 4k cubes as one mesh. It’ll draw okay; however, entity IDs for transforms would then have to be a per-vertex attribute. That might be okay, but further down the track I see no way of culling unwanted meshes, apart from using a compute shader to edit the indices so that offscreen geometry is not in the index list. I don’t know about performance, but it might be worth testing.
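
The single-mesh idea could be sketched roughly like this: every mesh is appended to one vertex array and one index array, indices are rebased, and each vertex is stamped with the ID of the entity it came from. The Vertex, Mesh and MergedMesh types and the merge_meshes function are hypothetical and only illustrate the bookkeeping; the actual vertex format in the engine is not shown in the post.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical vertex format: position plus the owning entity's ID.
struct Vertex {
    float x, y, z;
    uint32_t entityId;
};

// Hypothetical source mesh with mesh-local indices (starting at 0).
struct Mesh {
    std::vector<Vertex> vertices;
    std::vector<uint32_t> indices;
};

struct MergedMesh {
    std::vector<Vertex> vertices;
    std::vector<uint32_t> indices;
};

// Concatenates all meshes into one drawable mesh.
MergedMesh merge_meshes(const std::vector<Mesh>& meshes) {
    MergedMesh out;
    for (uint32_t id = 0; id < meshes.size(); ++id) {
        // Indices of this mesh must be shifted by the vertices already stored.
        const uint32_t base = static_cast<uint32_t>(out.vertices.size());
        for (Vertex v : meshes[id].vertices) {
            v.entityId = id;  // per-vertex entity ID used for the transform lookup
            out.vertices.push_back(v);
        }
        for (uint32_t idx : meshes[id].indices) {
            out.indices.push_back(base + idx);
        }
    }
    return out;
}
```

The trade-off described above is visible here: once merged, there is no per-mesh command left to skip, so culling would have to rewrite the index list itself.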

I am hoping someone smarter than I am can point me in the right direction!

Am I using the draw command correctly?
Have I got the right idea?
How do modern game engines render large amounts of geometry?
I know a compute shader can fill the GL_DRAW_INDIRECT_BUFFER, but can a shader dispatch the draw? I mean, can it basically call its own equivalent of DrawElementsIndirect()?

Any help here is appreciated.

Dark_Photon · August 29, 2025, 12:54pm

SpiderPig:

I am testing with a 16x16x16 grid of cubes with unique vertices and indices - so no instancing - totaling 4096 cubes. …

Each mesh has its own draw command held in GL_DRAW_INDIRECT_BUFFER, which is also bound before drawing.

I then loop through each mesh and draw using the appropriate command buffer index. (mesh->buffer_id)
   glMultiDrawElementsIndirect(GL_TRIANGLES, GL_UNSIGNED_INT, (const void*)(mesh->buffer_id* sizeof(DrawBufferCommand)), 1, 0);
Which totals 4096 draw calls.

I’d like to speed up this approach …

OK. The first thing, then, is to profile and determine what your biggest bottleneck is.

… the first thing I tried was to store the meshes in order in the GL_DRAW_INDIRECT_BUFFER so that I could pass in the count as 4096 and call glMultiDrawElementsIndirect() once rather than 4k times. Like this:
uint32_t start = 0;
uint32_t count = 4096;
glMultiDrawElementsIndirect(GL_TRIANGLES, GL_UNSIGNED_INT, (const void*)(start * sizeof(DrawBufferCommand)), count, 0);
However it provided no speed improvement.

I’d guess that wasn’t your biggest bottleneck. Or even a bottleneck at all.

I’d grab a profiler so you can see how this work is queued and executed in the GL driver. That’ll give you some insights as to what you can do to speed this up.

Here’s one possibility you might look into. How big (in number of vertices) is each subdraw in your glMultiDrawElementsIndirect (MDI) draw call? It sounds very small.

Be aware that many GPUs are extremely inefficient at rendering small MDI subdraws (few vertices per subdraw). What happens is that the GPU can’t pack vertex shader executions for different subdraws in the same warps/wavefronts. And so you end up with very low thread occupancy (many possible vertex shader thread slots left unused). This results in very low vertex transform throughput.
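
A back-of-the-envelope model of this effect, assuming 32-wide warps and assuming that vertices from different subdraws never share a warp. This is a deliberate simplification: real hardware also batches and caches vertices in ways the model ignores. The function name lane_utilization and the vertex counts used with it are illustrative only.

```cpp
#include <cassert>
#include <cstdint>

// Fraction of vertex-shader lanes doing useful work when one subdraw's
// vertices are packed into warps on their own (simplified model).
inline double lane_utilization(uint32_t verticesPerSubdraw, uint32_t warpSize = 32) {
    assert(verticesPerSubdraw > 0 && warpSize > 0);
    // Number of warps needed, rounding up: a partly filled warp still
    // occupies all of its lanes.
    const uint32_t warps = (verticesPerSubdraw + warpSize - 1) / warpSize;
    return static_cast<double>(verticesPerSubdraw) /
           (static_cast<double>(warps) * warpSize);
}
```

Under this model, a cube with 24 unique vertices drawn as its own subdraw would use 24 of 32 lanes (0.75), and an 8-vertex cube only 8 of 32 (0.25), whereas 4096 instances of 24 vertices packed together could in principle fill every lane.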

To fix this, pack those repeated primitives (e.g. cubes) into the same subdraw using geometry instancing. For example, see the instanceCount and baseInstance fields in the DrawElementsIndirectCommand subdraw struct for glMultiDrawElementsIndirect(). In my experience, at least on NVIDIA, the driver+GPU does pack vertex shader executions for separate instances within the same MDI subdraw into shared warps. The result is much higher vertex throughput (less GPU time spent running vertex shaders).

Another solution is to pack more instances in each subdraw, pseudo-instancing style. However, if the geometry is really repeated (literally instanced), it’s unclear why you’d want to do that.

Another solution is to use mesh shaders, which permit you to repack your vertex work into whatever thread group sizes you’d like. That said, these are more work and not available everywhere.

In your case, I’d just try using geometry instancing within one subdraw of an MDI draw call. Given what you have working, this should be trivial to try. Or you could just ditch the MDI and use a non-MDI instancing draw call (like glDrawElementsInstanced). Up to you! Either one is a simple mod from where you are.
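
A rough sketch of what that change could look like on the CPU side, assuming the cubes really do share one set of geometry: a single command with instanceCount set to 4096, plus a per-instance offset array for the 16x16x16 grid. The InstanceOffset type and both helper functions are hypothetical, and how the offsets reach the shader (instanced vertex attribute, SSBO, etc.) is left open.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Field layout documented by OpenGL for indexed indirect draws.
struct DrawElementsIndirectCommand {
    uint32_t count;
    uint32_t instanceCount;
    uint32_t firstIndex;
    int32_t  baseVertex;
    uint32_t baseInstance;
};

// Hypothetical per-instance data: a translation for each cube.
struct InstanceOffset {
    float x, y, z;
};

// One subdraw that renders `instances` copies of a single cube mesh.
DrawElementsIndirectCommand make_instanced_command(uint32_t indexCount, uint32_t instances) {
    return {indexCount, instances, 0u, 0, 0u};
}

// Offsets for an n*n*n grid; instance i maps to x + y*n + z*n*n.
std::vector<InstanceOffset> make_grid_offsets(uint32_t n, float spacing) {
    std::vector<InstanceOffset> offsets;
    offsets.reserve(static_cast<std::size_t>(n) * n * n);
    for (uint32_t z = 0; z < n; ++z)
        for (uint32_t y = 0; y < n; ++y)
            for (uint32_t x = 0; x < n; ++x)
                offsets.push_back({x * spacing, y * spacing, z * spacing});
    return offsets;
}
```

With make_instanced_command(36, 4096) the indirect buffer holds a single 20-byte command instead of 4096 of them, and the same counts could equally be passed to glDrawElementsInstanced directly.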

SpiderPig · August 29, 2025, 9:17pm

Thanks for the info! I’ve started profiling, but Nsight seems to fix the issue. I’ve posted a separate topic about that here: Nsight gives better performance for app

You’ve given me a lot of things to try, so I think once Nsight is working correctly I can work something out.

SpiderPig · September 1, 2025, 9:35pm

Solved the Nsight issue here.
