Reduce draw calls single VAO

pogrom · October 16, 2018, 2:49am

Hi community !

I am new here but I believe I have a correct knowledge of OpenGL in general.
In my current application I can have multiple VAOs. Let's focus on a single one for now. I have a spatial partitioning system which splits up the indices contained in a buffer attached to this VAO.
Currently I issue one draw call (glDrawElementsBaseVertex) for each traversed node of this spatial partitioning, but this can lead to a great many calls, which surely impacts the overall performance.
I am now considering glMultiDrawElementsBaseVertex as a first improvement.

Now, what I would like to know is: what is the current state of the art for this in OpenGL? It will surely depend on the version supported, but I will consider that later. Let's say OpenGL 3.3 and later versions.

I have seen that instancing could help but I am not sure I understand how.
There are also the indirect draw calls, but unless I misunderstood what I read, they require the use of another shader.
I also read that it is possible to feed the Element Buffer Object dynamically. Will this be as efficient as the other candidates?
glMultiDrawElements looks to be a good candidate but it seems to be a simple loop over glDrawElements (depending on the implementations I guess).

What do you suggest about this ?
Thank you in advance and forgive me if that is not a good question.

GClements · October 16, 2018, 3:24am

[QUOTE=pogrom;1292768]
I am now considering glMultiDrawElementsBaseVertex as a first improvement.

Now, what I would like to know is: what is the current state of the art for this in OpenGL? It will surely depend on the version supported, but I will consider that later. Let's say OpenGL 3.3 and later versions.[/QUOTE]
glMultiDrawElementsBaseVertex() requires OpenGL 3.2.

Instancing is used if you want to draw multiple copies of the same object. The copies don’t need to be identical in every regard (there wouldn’t be much point in that), but they do need to have identical topology, and their attributes are a combination of per-vertex attributes (which are identical for all instances) and per-instance attributes (which are identical for all vertices). Essentially, instancing generates vertex attributes as a Cartesian product. Instancing was added in 3.1 but really needs 3.3 to be useful (that’s when per-instance attributes were added; prior to that, you had to “fake” per-instance attributes using uniform arrays or textures).
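To make the "Cartesian product" remark concrete, here is a minimal CPU-side sketch of which attribute pairs the vertex shader ends up seeing in an instanced draw. The struct and function names (`PerVertex`, `PerInstance`, `emulateInstancedFetch`) are made up for illustration, and the sketch assumes a per-instance attribute with a divisor of 1; it does not call OpenGL at all.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical attribute layouts, chosen only for this illustration.
struct PerVertex   { float x, y; };             // identical for all instances
struct PerInstance { float offsetX, offsetY; }; // identical for all vertices of one instance

// What one vertex shader invocation receives.
struct ShaderInput { PerVertex vertex; PerInstance instance; };

// Emulates the attribute fetch of
//   glDrawArraysInstanced(mode, 0, vertices.size(), instances.size())
// assuming glVertexAttribDivisor(instanceAttrib, 1) on the per-instance data.
// A GPU does not process the pairs in this order; only the set of pairs matters.
std::vector<ShaderInput> emulateInstancedFetch(
    const std::vector<PerVertex>& vertices,
    const std::vector<PerInstance>& instances)
{
    std::vector<ShaderInput> inputs;
    inputs.reserve(vertices.size() * instances.size());
    for (std::size_t i = 0; i < instances.size(); ++i)    // plays the role of gl_InstanceID
        for (std::size_t v = 0; v < vertices.size(); ++v) // plays the role of gl_VertexID
            inputs.push_back({vertices[v], instances[i]});
    return inputs;
}
```

With 3 vertices and 2 instances this yields 6 shader inputs, which is why instancing only pays off when the same topology is drawn repeatedly.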

The *Indirect functions take most of their parameters from memory, which can be a buffer object bound to GL_DRAW_INDIRECT_BUFFER. The advantage of this is that it allows the parameters to be generated by a shader, without the need to copy those parameters out to client memory (which requires CPU-GPU synchronisation). If you aren’t generating the parameters via a shader, there isn’t much point in using the *Indirect functions.
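For reference, the parameters that the indexed *Indirect functions read from memory have a fixed layout of five 32-bit values per draw. The sketch below fills such records on the CPU from a list of visible nodes; `NodeRange` and `makeIndirectCommands` are hypothetical names, and the GL calls in the trailing comment assume a 4.3 context with an element buffer of 32-bit indices already bound.

```cpp
#include <cstdint>
#include <vector>

// Record layout read from the buffer bound to GL_DRAW_INDIRECT_BUFFER by
// glDrawElementsIndirect (GL 4.0) and glMultiDrawElementsIndirect (GL 4.3).
struct DrawElementsIndirectCommand {
    std::uint32_t count;         // number of indices to draw
    std::uint32_t instanceCount; // 1 for a non-instanced draw
    std::uint32_t firstIndex;    // offset into the element buffer, in indices (not bytes)
    std::int32_t  baseVertex;    // added to every fetched index
    std::uint32_t baseInstance;  // first instance for per-instance attributes
};
static_assert(sizeof(DrawElementsIndirectCommand) == 20,
              "records must be tightly packed when stride is 0");

// Hypothetical description of the index range owned by one partition node.
struct NodeRange {
    std::uint32_t indexCount;
    std::uint32_t firstIndex;
    std::int32_t  baseVertex;
};

std::vector<DrawElementsIndirectCommand> makeIndirectCommands(
    const std::vector<NodeRange>& visibleNodes)
{
    std::vector<DrawElementsIndirectCommand> cmds;
    cmds.reserve(visibleNodes.size());
    for (const NodeRange& n : visibleNodes)
        cmds.push_back({n.indexCount, 1u, n.firstIndex, n.baseVertex, 0u});
    return cmds;
}

// Submission, assuming the commands were uploaded to the indirect buffer:
//   glBindBuffer(GL_DRAW_INDIRECT_BUFFER, indirectBuffer);
//   glMultiDrawElementsIndirect(GL_TRIANGLES, GL_UNSIGNED_INT,
//                               nullptr, drawCount, 0);
```

Filling the records on the CPU like this works, but it is a compute or other shader writing them directly into the buffer that avoids the round trip described above.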

Whether feeding the element buffer dynamically is as efficient depends upon the amount of data involved and whether you can update the buffer without requiring synchronisation.

As for glMultiDrawElements: well, being a single call reduces the overhead to some extent, however minor. It also avoids the need for the driver to check for state changes between draw calls.
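As a rough sketch of how the per-node glDrawElementsBaseVertex calls could collapse into one glMultiDrawElementsBaseVertex call, the code below packs the three parallel arrays that function expects. `NodeRange`, `MultiDrawParams` and `packMultiDraw` are hypothetical names, and 32-bit indices (GL_UNSIGNED_INT) are assumed.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical description of the index range owned by one partition node.
struct NodeRange {
    std::int32_t indexCount; // number of indices in the node
    std::size_t  firstIndex; // position of the node's first index in the element buffer
    std::int32_t baseVertex; // base vertex of the node
};

// The three parallel arrays consumed by glMultiDrawElementsBaseVertex.
struct MultiDrawParams {
    std::vector<std::int32_t> counts;       // passed as const GLsizei*
    std::vector<const void*>  indexOffsets; // byte offsets into the bound element buffer
    std::vector<std::int32_t> baseVertices; // passed as const GLint*
};

MultiDrawParams packMultiDraw(const std::vector<NodeRange>& visibleNodes)
{
    MultiDrawParams p;
    for (const NodeRange& n : visibleNodes) {
        p.counts.push_back(n.indexCount);
        // With an element buffer bound, "indices" is a byte offset disguised as a pointer.
        p.indexOffsets.push_back(
            reinterpret_cast<const void*>(n.firstIndex * sizeof(std::uint32_t)));
        p.baseVertices.push_back(n.baseVertex);
    }
    return p;
}

// Submission with the VAO bound (requires GL 3.2):
//   glMultiDrawElementsBaseVertex(GL_TRIANGLES, p.counts.data(), GL_UNSIGNED_INT,
//                                 p.indexOffsets.data(),
//                                 (GLsizei)p.counts.size(), p.baseVertices.data());
```

All draws in the batch share the same primitive type, index type and bound state, which is what lets the driver skip the per-call state checks.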

It’s impossible to make recommendations from such generalised information; performance will depend upon the details. Ultimately, the most reliable way to gauge performance is to profile the actual code on the target hardware.

pogrom · October 16, 2018, 5:42am

Yes, I know this is quite general. I was first trying to check whether there are common approaches for avoiding the pitfalls of many draw calls from within the same VAO, mainly in order to avoid building support for instancing or indirect rendering.

From your answers glMultiDrawElements functions will help a bit at least. And the solution to feed the index buffer on the fly is also an option to consider.
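For the "feed the index buffer on the fly" option, one possible shape is to gather the indices of all visible nodes into a single contiguous array each frame and draw it with one glDrawElements call. This is only a sketch: `NodeRange` and `gatherVisibleIndices` are hypothetical names, 32-bit indices are assumed, and each node's base vertex is folded into the copied indices so that no per-node base vertex is needed any more.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical description of the index range owned by one partition node.
struct NodeRange {
    std::size_t   indexCount; // number of indices in the node
    std::size_t   firstIndex; // position of the node's first index in allIndices
    std::uint32_t baseVertex; // base vertex the node was previously drawn with
};

// Builds the per-frame index array from the static, CPU-side copy of all indices.
std::vector<std::uint32_t> gatherVisibleIndices(
    const std::vector<std::uint32_t>& allIndices,
    const std::vector<NodeRange>& visibleNodes)
{
    std::vector<std::uint32_t> frameIndices;
    for (const NodeRange& n : visibleNodes)
        for (std::size_t i = 0; i < n.indexCount; ++i)
            frameIndices.push_back(allIndices[n.firstIndex + i] + n.baseVertex);
    return frameIndices;
}

// Upload and draw, assuming the element buffer is large enough:
//   glBufferSubData(GL_ELEMENT_ARRAY_BUFFER, 0,
//                   frameIndices.size() * sizeof(std::uint32_t), frameIndices.data());
//   glDrawElements(GL_TRIANGLES, (GLsizei)frameIndices.size(), GL_UNSIGNED_INT, nullptr);
```

The cost moves from draw-call overhead to copying and uploading indices every frame, so whether it wins depends on the amount of data and on updating the buffer without stalling, as noted earlier in the thread.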

I also understand that instancing is useful only if there are repetitions in what I want to draw, which is not the case. I also understand that indirect is useful if I can generate the parameters on the GPU, which would imply doing the spatial partitioning on the GPU. That is currently not the case either, but it is certainly another option for the more distant future.

Thank you for your answers which were very helpful.

Alfonse_Reinheart · October 16, 2018, 6:37am

Draw calls are not a performance problem (not really). The performance problem is the state changes between draw calls, as detailed in this presentation. If you don’t actually have any such state changes, then you shouldn’t worry about it.

Use the MultiDraw* functions if you can, but your program probably has other inefficiencies that you should be more concerned about than a few back-to-back glDraw calls.

pogrom · October 16, 2018, 7:37am

Thank you for your answer.

Indeed, I am aware of this. This is why my draw calls are sorted so that the minimum number of state changes is made per frame. For some time I have also been considering bindless textures to remove more state changes, as the next step of this improvement. And from both your answers it seems that indirect rendering is where I should go, because if I understand it correctly, it can reduce state changes even further (VAO/VBO bindings, if I’m not wrong at this point of my understanding).

Regards.

Dark_Photon · October 16, 2018, 8:40am

In general, it is not true that the *Indirect functions are pointless unless a shader generates the parameters. Yes, they allow you to generate draw call parameters on the GPU, but that’s not their only use.

Consider the old-style glMultiDraw calls. These couldn’t be handled very efficiently, partly because the params used by separate draws couldn’t be placed on the GPU. With glMultiDraw*Indirect (MDI), they can. This permits submitting multiple draw calls worth of draws efficiently. And not just instanced draw calls which use the same geometry. Any draw calls which use the same primitive type.

So to Alfonse’s comment, yes. Minimize your state changes as a top priority, being conscious of the relative cost of state changes when you do. Then, use MDI to batch the heck out of your geometry between state changes, being sensitive to frustum culling efficiency (e.g. no sense in throwing millions more verts at the GPU than necessary only to have the GPU culler throw them out just before fragment shading).
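One common way to act on "minimize state changes, being conscious of their relative cost" is to sort draws by a key that puts the most expensive state in the most significant bits. The sketch below assumes, purely for illustration, that a program change costs more than a texture change, which costs more than a VAO change; the `DrawItem` fields and the function names are hypothetical, and the actual ordering should come from profiling.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-draw record: small IDs standing in for GL object names.
struct DrawItem {
    std::uint16_t program;
    std::uint16_t texture;
    std::uint16_t vao;
    std::uint32_t nodeId; // which partition node this draw belongs to
};

// Most expensive state (assumed: program) goes in the highest bits.
std::uint64_t sortKey(const DrawItem& d)
{
    return (std::uint64_t(d.program) << 32) |
           (std::uint64_t(d.texture) << 16) |
            std::uint64_t(d.vao);
}

void sortByState(std::vector<DrawItem>& items)
{
    std::stable_sort(items.begin(), items.end(),
                     [](const DrawItem& a, const DrawItem& b) {
                         return sortKey(a) < sortKey(b);
                     });
}

// Counts how often the program would have to be rebound when submitting in order.
std::size_t countProgramChanges(const std::vector<DrawItem>& items)
{
    std::size_t changes = 0;
    for (std::size_t i = 1; i < items.size(); ++i)
        if (items[i].program != items[i - 1].program)
            ++changes;
    return changes;
}
```

After sorting, every run of items with an identical key is a candidate for a single MultiDraw or MDI batch.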

pogrom · October 16, 2018, 12:00pm

I am curious. Why old-style? This seems relevant for giving me a better understanding of what the new way to render is. Is glMultiDraw old because glMultiInstanced and MDI should now be the preferred choice? Or is this just related to the fact that glMultiDraw exists since GL 3.2 whereas the others exist since GL 4+ only?

Yep for the state changes. However, getting the whole thing from what it currently is to what I’d like it to be is a slower process.

Thank you all a lot !

Alfonse_Reinheart · October 16, 2018, 12:21pm

[QUOTE=Dark_Photon]Consider the old-style glMultiDraw calls. These couldn’t be handled very efficiently, partly because the params used by separate draws couldn’t be placed on the GPU.[/QUOTE]

Um, why wouldn’t they be able to be placed on the GPU? What you’re talking about is no different from writing data to a persistent buffer and then telling OpenGL to read from it. I can’t imagine the driver would implement direct glMultiDraw commands in a less efficient way than that. Indeed, a direct glMultiDraw command could conceivably just copy your client data directly into the GPU FIFO, rather than having to use buffer object storage.

Naturally, that assumes that the individual draws are dynamic, so you’re not just picking data from static buffers. In the static case, you could possibly get some advantage from it, depending on driver implementation. But even then, it seems rather unlikely, so long as the number of sequential draws is not particularly large.

Dark_Photon · October 16, 2018, 7:25pm

Hearken back to client arrays. Pretty fast, right? But not as fast as rendering from VBOs with bindless. Why? Partially because the arrays had to be continuously re-uploaded to the GPU (or GPU-accessible memory such as pinned/AGP mem). However, with VBOs the contents are already at least in driver memory, and with bindless you can ensure the contents are GPU read-ready (either on the GPU or in pinned memory; your pick).

So now, with glMultiDrawArrays(), where do the first and count arrays come from? And with glMultiDrawElements(), the count and indices arrays? Yep, same thing. They have to be re-streamed each time, staged to GPU-accessible memory with proper synchronization, before the batches can be rendered. That’s different with MDI (e.g. glMultiDrawArraysIndirect, glMultiDrawElementsIndirect), where everything the GPU needs is at least in the driver, and with bindless either on the GPU or in GPU-accessible memory.

So there is greater potential for MDI to require less overhead in order to queue and execute MultiDraw calls.

Caveat: As always, individual drivers determine what is most efficient, so (to any reader of this thread) profile carefully on the drivers and GPUs you care about. There’s nothing stopping vendor drivers from having a fast path for MultiDraw* but a dog-slow MultiDraw*Indirect implementation on certain hardware. …Or vice versa.

[QUOTE=Alfonse_Reinheart]Naturally, that assumes that the individual draws are dynamic, so you’re not just picking data from static buffers.[/QUOTE]

But they need not be. In fact, I was considering the static case (where you upload once, and rerender many).