Problem with offsetting gl_DrawID

Hello,
I’m currently refactoring my rendering pipeline to allow batching multiple draw calls into as few MultiDrawIndirect calls as possible. This resulted in pretty big changes in how I manage things, especially vertex buffers, uniform buffers & textures. I run a “typical” render loop where I submit “render tasks”, each of which consists of all the information for a single draw call. Then I order them by a sort key which emphasises the shader & vertex buffers used, because these can’t be changed mid-indirect draw, so I end up with geometry ordered in a way that allows me to put ranges of it into indirect buffers (theoretically). I thought it would be good to put all my per-object uniforms into a single buffer, even if these items come from different “queues” (my geometry is split between a few buckets: opaque, alpha_tested, 2d_overlay etc., which allow me to fetch specific lists of tasks to process in separate passes). So I have several queues containing RenderItems, and each RenderItem has an entry in the ObjectUniforms[] array, which has data like the model matrix, normal matrix and other per-object info.

Having one big ObjectUniforms array allows me to easily upload it to a persistently mapped buffer, and each RenderItem keeps an index into this array. This works nicely in my initial implementation, which is not using indirect draw yet: I just provide this index via glUniform1i. Even if I sort my render tasks to minimise state changes, the indices are still valid (I do not sort the ObjectUniforms array). But with indirect draw, I won’t be able to provide these indices via glUniform1i. What I’ve read so far is that gl_DrawID is available, but it starts at 0 for every call to glMultiDrawIndirect, which means I need to somehow offset it so I can index into the global ObjectUniforms array (which will also have to be sorted together with the RenderTask array, because I won’t be able to provide an arbitrary index anymore, only BASE + gl_DrawID). How is this usually done? Should I provide a single integer uniform as some baseDrawId for each glMultiDrawIndirect and then resolve the index into the global buffer as baseDrawId + gl_DrawID? I’ll now obviously also have to sort the per-object uniform array so it has the same order as the render tasks, because it will have to be contiguous (I can provide only a single offset per indirect draw, so each draw call inside a given indirect draw has to index into the global uniform array as base + 0, base + 1, base + 2; it can’t use an arbitrary index anymore).
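A minimal CPU-side sketch of the baseDrawId + gl_DrawID scheme described above. The `RenderTask`, `ObjectUniforms`, `Batch` and `SortedFrame` types and the `buildFrame` function are hypothetical names made up for illustration, and the sketch assumes that two tasks can share one multi-draw exactly when their sort keys are equal. It sorts the tasks, copies the per-object data into the same order, and records one base index per batch; the shader would then presumably read its data at index `baseDrawId + gl_DrawID`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical types for illustration only.
struct ObjectUniforms { float modelMatrix[16]; };

struct RenderTask {
    std::uint64_t sortKey;     // shader / vertex-buffer state packed into one key
    std::uint32_t objectIndex; // index into the unsorted ObjectUniforms array
};

struct Batch {
    std::uint64_t sortKey;
    std::uint32_t baseDrawId; // uniform value to set before this multi-draw
    std::uint32_t drawCount;  // number of draws inside this multi-draw
};

struct SortedFrame {
    std::vector<ObjectUniforms> uniforms; // same order as the sorted tasks
    std::vector<Batch> batches;
};

// Assumption: equal sort keys mean "can live in the same multi-draw".
SortedFrame buildFrame(std::vector<RenderTask> tasks,
                       const std::vector<ObjectUniforms>& objects) {
    std::stable_sort(tasks.begin(), tasks.end(),
                     [](const RenderTask& a, const RenderTask& b) {
                         return a.sortKey < b.sortKey;
                     });
    SortedFrame frame;
    frame.uniforms.reserve(tasks.size());
    for (const RenderTask& t : tasks) {
        assert(t.objectIndex < objects.size());
        const std::uint32_t slot =
            static_cast<std::uint32_t>(frame.uniforms.size());
        // Copy per-object data so that it is contiguous per batch.
        frame.uniforms.push_back(objects[t.objectIndex]);
        if (frame.batches.empty() || frame.batches.back().sortKey != t.sortKey) {
            frame.batches.push_back({t.sortKey, slot, 0});
        }
        ++frame.batches.back().drawCount;
    }
    return frame;
}
```

Each `Batch` then maps to one multi-draw call with `drawCount` commands, preceded by a single uniform update carrying `baseDrawId`.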

I saw some reference that this can be achieved by providing this offset as baseInstance, so I basically abuse this property to provide my index offset. But I’m not sure how this affects vertex divisors, as they probably use the baseInstance value and this will break them, since they won’t follow the same offset as the global uniform buffer. They’ll want to start from 0…

Do any other options exist? I never saw this mentioned. Everyone says “upload per-object uniforms to a single buffer”, but this would mean I need to do it separately for every indirect call group: I can’t just upload ONE single buffer, as I have no reliable way of indexing into it when gl_DrawID always starts from 0.

[QUOTE=noizex;1293368]What I read so far is that gl_DrawID is available, but it starts with 0 for every call to glMultiDrawIndirect, which means I need to somehow offset it so I can index in the global ObjectUniforms array …
How is this usually done?
Should I provide single integer uniform as some baseDrawId per each glMultiDrawIndirect and then resolve the index into global buffer as baseDrawId + gl_DrawID?[/QUOTE]

Sure, if you don’t handle the offset some other way.
…such as in the way that you bind or address a buffer object.

For instance:

[ul]
[li]If you’re binding buffer objects, see glBindBufferRange with UBOs or SSBOs.[/li]
[li]If you’re using bindless buffer objects (NVidia), just pass the correct GPU address of the specified offset into the buffer object into the shader directly.[/li]
[/ul]
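One detail worth keeping in mind with the glBindBufferRange route: the offset passed to it has to be a multiple of the implementation's alignment (queried via glGetIntegerv with GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT or GL_SHADER_STORAGE_BUFFER_OFFSET_ALIGNMENT), so each batch's slice of the per-object buffer may need padding in front of it. A small sketch of that arithmetic, with hypothetical helper names `alignUp` and `batchByteOffsets` and example numbers that are assumptions, not values from the thread:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Round `value` up to the next multiple of `alignment`.
std::size_t alignUp(std::size_t value, std::size_t alignment) {
    assert(alignment > 0);
    return (value + alignment - 1) / alignment * alignment;
}

// For each batch, compute the byte offset at which its per-object data
// starts, so that the offset is legal for glBindBufferRange.
// drawCounts[i]   : number of draws (objects) in batch i
// objectStride    : size in bytes of one per-object record
// offsetAlignment : value queried from the GL implementation
std::vector<std::size_t> batchByteOffsets(
        const std::vector<std::size_t>& drawCounts,
        std::size_t objectStride,
        std::size_t offsetAlignment) {
    std::vector<std::size_t> offsets;
    offsets.reserve(drawCounts.size());
    std::size_t cursor = 0;
    for (std::size_t count : drawCounts) {
        cursor = alignUp(cursor, offsetAlignment); // pad up to a legal offset
        offsets.push_back(cursor);
        cursor += count * objectStride;            // reserve this batch's slice
    }
    return offsets;
}
```

With a range bound at such an offset before each multi-draw, gl_DrawID can index the bound range directly, starting at 0.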

[QUOTE]I’ll now obviously also have to sort the per-object uniform array, so it has the same order as render tasks, because it will have to be contiguous[/QUOTE]

Or you need some sort of lookup table in the shader to map between the orders, but I’d avoid that if possible.
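For comparison, a sketch of what such a lookup table could look like on the CPU side, leaving the ObjectUniforms array unsorted. `RenderTask`, `buildDrawToObjectTable` and `resolveObjectIndex` are hypothetical names; the second function only emulates on the CPU what the shader would do with `baseDrawId + gl_DrawID`, at the cost of one extra dependent read per vertex shader invocation.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical task record; objectIndex points into the UNSORTED
// ObjectUniforms array.
struct RenderTask {
    std::uint64_t sortKey;
    std::uint32_t objectIndex;
};

// One entry per draw, in the order the draws appear in the indirect
// buffer. This table would be uploaded to the GPU (for example as a
// buffer of uints) next to the unsorted per-object data.
std::vector<std::uint32_t> buildDrawToObjectTable(
        const std::vector<RenderTask>& sortedTasks) {
    std::vector<std::uint32_t> table;
    table.reserve(sortedTasks.size());
    for (const RenderTask& t : sortedTasks) {
        table.push_back(t.objectIndex);
    }
    return table;
}

// CPU emulation of the shader-side lookup: map the position of a draw
// (base of its multi-draw + gl_DrawID) to its per-object record.
std::uint32_t resolveObjectIndex(const std::vector<std::uint32_t>& table,
                                 std::uint32_t baseDrawId,
                                 std::uint32_t drawId) {
    assert(baseDrawId + drawId < table.size());
    return table[baseDrawId + drawId];
}
```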

[QUOTE]I saw some reference that this can be achieved by providing this offset as baseInstance, so I basically abuse this property to provide my index offset. But not sure how this affects vertex divisors as they probably use baseInstance value and this will break them,[/QUOTE]

(UPDATE: Ignore the following! It’s wrong. See below for correction.) No. There are two main types of GPU geometry instancing:

  1. Instanced Draw Calls (glDrawInstanced + indirect versions of those)
  2. Instanced Arrays (uses vertex attribute divisors).

baseInstance is for the former, not the latter.

So unless you’re using an instanced draw call, feel free to use the baseInstance field however you want (…assuming you have OpenGL 4.2 or ARB_base_instance).

Thanks for the reply! I think you made it clear now. Because I’m aiming to pack as many draws as possible into as few indirect calls as possible, I want to avoid changing addresses between them. I could do this, but then I would have to split some indirect calls into smaller ones instead of doing it with fewer. So ideally I’d like to index, but also be able to offset, and I’d like to avoid the additional indirection of a lookup map inside the shader. I think what you said about the two ways to instance made it clear, and I will go for baseInstance as a means to offset my per-object data index.

[QUOTE=Dark Photon;1293383]
No. There are two main types of GPU geometry instancing:

[ol]
[li]Instanced Draw Calls (glDrawInstanced + indirect versions of those)[/li][li]Instanced Arrays (uses vertex attribute divisors).[/li][/ol]
baseInstance is for the former, not the latter.

So unless you’re using an instanced draw call, feel free to use the baseInstance field however you want (…assuming you have OpenGL 4.2 or ARB_base_instance).[/QUOTE]

Ah, that’s something I wasn’t aware of (the difference between arrays which use divisors and instanced draw calls). The way I see it, it will be consistent if I use baseInstance there to offset my data, because that’s how it would work if I were actually using instanceCount > 1: the instance data would be fetched from the very same SSBO, based on the draw number. I plan to mainly use indirect draws, which as far as I know use *Instanced under the hood, so it basically means that whatever I draw, I draw it instanced, but sometimes the instance count will be 1 (I think NVidia often did this in their AZDO etc. presentations, where instead of instancing they just used a lot of indirect draws with an instance count of 1).

This actually leads me to another question - if I create a batching process which iterates over submitted geometry draws and tries to compact them into indirect/instanced calls - what would be a reasonable threshold where something should start to be drawn instanced rather than as several entries in indirect buffer? Like this situation:

A, A, A, A, B, B, C, C, C, C, C

Where A, B & C are separate meshes, but they use the same shader, buffers (just different offsets), textures are bindless too so basically all this can be compacted - the question is by what rules. I see two ways:

  1. Push it all to indirect buffer as they are, so it will be 11 draw calls in the buffer: A { instanceCount=1 }, A { instanceCount=1 } …
  2. Merge them into 3 indirect buffer entries, A { instanceCount=4 }, B { instanceCount=2 }, C { instanceCount = 5 }

Is there some specific rule as to when we should go for instanceCount > 1? Would that negatively impact performance if I draw just a couple of meshes as instanced? (they’re regular “game” meshes, a couple of thousands of tris).
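The two options can be sketched as one routine with a switch. The `DrawElementsIndirectCommand` layout below follows the command structure that glMultiDrawElementsIndirect reads; `MeshRange` and `buildCommands` are hypothetical names, and the sketch assumes the draw list is already sorted so that identical meshes are adjacent and that per-object data is stored in the same order. In both variants baseInstance holds the slot of the first object of the entry, so a shader with access to gl_BaseInstance (GL 4.6 or ARB_shader_draw_parameters) could presumably index per-object data as `gl_BaseInstance + gl_InstanceID`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Field order matches the indirect command read by
// glMultiDrawElementsIndirect.
struct DrawElementsIndirectCommand {
    std::uint32_t count;
    std::uint32_t instanceCount;
    std::uint32_t firstIndex;
    std::int32_t  baseVertex;
    std::uint32_t baseInstance;
};

// Hypothetical description of where a mesh lives in the shared buffers.
struct MeshRange {
    std::uint32_t indexCount;
    std::uint32_t firstIndex;
    std::int32_t  baseVertex;
};

// meshIds: one entry per object to draw, sorted so equal ids are adjacent.
// mergeRuns == false: one command per object, instanceCount = 1 (option 1).
// mergeRuns == true : one command per run of equal ids (option 2).
std::vector<DrawElementsIndirectCommand> buildCommands(
        const std::vector<std::uint32_t>& meshIds,
        const std::vector<MeshRange>& meshes,
        bool mergeRuns) {
    std::vector<DrawElementsIndirectCommand> cmds;
    for (std::size_t i = 0; i < meshIds.size(); ++i) {
        const std::uint32_t id = meshIds[i];
        assert(id < meshes.size());
        const bool extendsRun = mergeRuns && i > 0 && meshIds[i - 1] == id;
        if (extendsRun) {
            ++cmds.back().instanceCount; // same mesh: just add an instance
            continue;
        }
        const MeshRange& m = meshes[id];
        // baseInstance = slot of this object's data in the per-object buffer.
        cmds.push_back({m.indexCount, 1u, m.firstIndex, m.baseVertex,
                        static_cast<std::uint32_t>(i)});
    }
    return cmds;
}
```

For the A, A, A, A, B, B, C, C, C, C, C example this yields 11 commands without merging and 3 commands (instance counts 4, 2, 5) with merging, while the per-object slots stay the same in both cases.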

[QUOTE=Dark Photon;1293383]No. There are two main types of GPU geometry instancing:

[ol]
[li]Instanced Draw Calls (glDrawInstanced + indirect versions of those)[/li][li]Instanced Arrays (uses vertex attribute divisors).[/li][/ol]
baseInstance is for the former, not the latter.[/quote]

That’s backwards; baseInstance works with the latter, not the former. Though with shader draw parameters/GL 4.6, you get access to gl_BaseInstance, which lets you use it with regular instancing.

Also, instance arrays are invoked through instanced draw calls, so it’s not exactly an either-or situation.

Thanks for the correction! Wow – Digging back, I haven’t yet figured out where I got the misinformation that vertex divisors could be used with non-instanced draw calls.

After re-reading things, I think this corrects my misinformation above:

[HR][/HR]
GPU geometry instancing makes use of Instanced Draw Calls (glDraw*Instanced + the indirect versions of those, which set “instanceCount”). With these, there are two main ways to obtain per-instance data in the shader:

[ol]
[li]By using Instanced Arrays (i.e. non-zero vertex attribute divisors) to “push” per-instance data into the shader via vertex attributes, or [/li][li]By using gl_InstanceID to “pull” per-instance data into the shader via shader lookups. [/li][/ol]
The former always makes use of “baseInstance” to perform the per-instance attribute indexing, whereas the latter may or may not make use of it (via optional reference to “gl_BaseInstance” in the shader).

In cases where “baseInstance” is not used for per-instance data addressing, it may be used for other purposes.
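The difference can be written down as plain index arithmetic. The functions below are a CPU-side illustration with made-up names, not GL calls: for an instanced array the fetched element is floor(instance / divisor) + baseInstance, while gl_InstanceID itself does not include baseInstance, so a shader that “pulls” data has to add gl_BaseInstance explicitly if it wants the offset.

```cpp
#include <cassert>
#include <cstdint>

// "Push" path: element of an instanced vertex attribute array
// (divisor > 0) that is fetched for a given instance of a draw.
std::uint32_t instancedAttribElement(std::uint32_t instance,
                                     std::uint32_t divisor,
                                     std::uint32_t baseInstance) {
    assert(divisor > 0);
    return instance / divisor + baseInstance;
}

// Value the shader sees in gl_InstanceID: the instance number only,
// baseInstance is NOT added.
std::uint32_t shaderInstanceId(std::uint32_t instance,
                               std::uint32_t /*baseInstance*/) {
    return instance;
}

// "Pull" path when the shader opts in to the offset, i.e. the
// equivalent of indexing with gl_BaseInstance + gl_InstanceID.
std::uint32_t pulledElement(std::uint32_t instance,
                            std::uint32_t baseInstance) {
    return baseInstance + shaderInstanceId(instance, baseInstance);
}
```

So if baseInstance is repurposed as a per-object data offset, any attribute with a non-zero divisor will be shifted by the same amount, whether that is wanted or not.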

In 3.1 and 3.2, only the latter option was available. glDrawArraysInstanced() and glDrawElementsInstanced() were added in 3.1 but glVertexAttribDivisor() wasn’t added until 3.3.

Instanced rendering with divisors effectively forms a Cartesian product of the two sets of attributes, allowing you to generate M*N vertices with only O(M+N) data. The “base” versions of the function calls allow you to specify a start index for one or both sets of data (the end index can always be set via the count and/or primcount parameters).

Is there a chance someone could address my last question: does it matter in terms of performance if I instance everything that’s drawn more than once? Indirect drawing gives us an interesting opportunity here, because with properly sorted geometry, if we batch draws that share shader, vertex buffers etc., we’re just looking at how many unique meshes we deal with. If I have several meshes A and several meshes B, would it make sense, instead of putting each one as a separate draw entry in the indirect buffer, to just bump the instanceCount of a specific draw (so there would be one draw entry per unique mesh, with N instances)? I heard a lot of contradicting theories regarding instancing, but that was in “ye old days”, and instancing with indirect draws seems a bit different. I mean, it’s basically a given: we fetch per-mesh data from an SSBO or similar and we do not change textures (bindless handles), so all that remains is compacting all draws of a mesh into a single entry with instances, right?

Unless there is something that makes it non-performant with smaller numbers like 1-10? 10-50? up to 100?

Assuming instancing is appropriate to your use case (i.e. you aren’t trying to access different texture objects from different instances in the same draw call), I don’t see that it would matter terribly much. If instancing isn’t an obvious idea for you (that is, you’re not rendering hundreds of the same mesh), I highly doubt that packaging 4 draws into the same draw call is going to give you all that much of a performance boost relative to having 4 separate draws in the indirect call.

Basically, by using multi-draw indirect and reducing state changes as you have done, all of the low-hanging fruit has been picked. When it comes to performance, there are probably more substantive things you can be looking towards.