Setting up Batch Rendering with Scene Graph

Hi.

I am currently constructing a basic 3D game engine using OpenGL, and would like some advice on a somewhat obscure facet for which I found very few resources online.

I’m setting up a batch rendering system that renders groups of meshes in a single draw call based on the material that is applied to them (a material is just an abstraction that maintains a shader program and any texture maps that need to be passed to the shader).

In addition to this, I’ll be setting up a scene graph which represents individual objects in the world with child/parent relationships. Each node in the hierarchy will have its own matrix that represents its transform relative to its parent. Now, I’d like the scene graph to work with the batch renderer to ensure that individual models (that will be associated with some of these nodes in the hierarchy) are rendered in the correct position in the world; bearing in mind that some of these models will probably share a common material, so we’d like to group those into a batch.

Now this is where I hit a brick wall…

Although less efficient, the idea of using individual VAOs to render meshes for each object seems more intuitive, since I’d imagine you could swap in the model matrix for that object, issue a draw call and then swap it out.

With batching involved though, there are a few issues (in my mind):

First of all, the batch renderer knows nothing about how the vertices it is processing are grouped into objects. As far as it’s concerned, they exist only as a linear buffer of numbers.

Secondly, even if it did know this information, there isn’t any way to swap in and out different matrices as it draws, since all vertices are drawn in one draw call.

I’ve been thinking about this for a while, but cannot seem to figure it out. I don’t know whether or not I’m being extremely dumb or something (forgive me if I am).

Anyone who could provide me a better insight into what I’ve discussed here would be much appreciated.

Bear in mind that you don’t need a separate VAO for each object even if you use a separate draw call. A draw call doesn’t have to use all of the vertices (or all of the indices in an element array). You only really need to use different VAOs if the objects have different structure (different number or types of attributes). So long as each object has the same structure, you can store them all in a single VAO (and associated attribute and element arrays) with each object comprising a separate region of the element array. With OpenGL 2.0 and later, you can use glMultiDrawElements() to draw multiple ranges in a single draw call.

Add an integer vertex attribute holding an object ID. This can be used to index into a uniform array holding per-object data such as a transformation matrix. You can do something similar for materials, avoiding the need to use a separate draw call for each material. Obviously, objects using substantially different shaders need to use separate draw calls, but you don’t need to separate objects which differ only in terms of material parameters (colours, specular exponent, textures, etc).

Nice catch on what I said about the VAOs. You’re right. I was meant to say draw calls, but said VAOs instead (oh dear). I’ll be sure to triple check what I’m saying next time. Apologies for that.

Your answer to my main question makes a lot of sense though, and has helped to push my thought process in the right direction.

Thanks

[QUOTE=Acacia Tree;1293324]Now this is where I hit a brick wall…

Although less efficient, the idea of using individual [strike]VAOs[/strike] draw calls to render meshes for each object seems more intuitive, since I’d imagine you could swap in the model matrix for that object, issue a draw call and then swap it out.

With batching involved though…[/QUOTE]

Use indirect rendering. Specifically, glMultiDrawArraysIndirect or glMultiDrawElementsIndirect.

Pack up your individual draw calls into a single multi-draw indirect (MDI) struct array, one struct per draw call, and upload it to the GPU. Also upload your matrix array into some other uniform state accessible to your shader (e.g. ordinary uniform array, buffer object, texture, etc.)

Then in the vertex shader, use gl_DrawID to fetch the correct matrix for that draw and use it to transform the vertices.

This allows you to batch many of those tiny, inefficient draw calls into one big happy draw call that may consume less CPU.

For added performance, pre-group objects with shared state (e.g. materials) in shared scene graph subgraphs, and sub-group into nodes under that by spatial locality. Then there’s less CPU bashing required to frustum cull your scene graph and get it ready to draw on the GPU.

I’ve been thinking about this for a while, but cannot seem to figure it out. I don’t know whether or not I’m being extremely dumb or something (forgive me if I am).

No, you’re not. Don’t worry about it. If you don’t already know that this GL functionality exists, it’s hard to apply it.

Note that gl_DrawID requires GLSL 4.60 or the ARB_shader_draw_parameters extension. I don’t know how widespread support for the extension is, but requiring OpenGL 4.6 just for that seems excessive.

[QUOTE=GClements;1293347]Note that gl_DrawID requires GLSL 4.60 or the ARB_shader_draw_parameters extension.
I don’t know how widespread support for the extension is, but requiring OpenGL 4.6 just for that seems excessive.[/QUOTE]

Up to the OP, but it looks like ARB_shader_draw_parameters has pretty good coverage at this point:

[ul]
[li]gpuinfo.org: Extension Search[/li]
[li]gpuinfo.org: Reports supporting ARB_shader_draw_parameters[/li]
[/ul]
gpuinfo.org reports 53.26% coverage for this extension based on the reports in the DB. Web-searching extension release dates, it appears that it’s been out there for > 4 years.

Also, this could be added as a performance improvement for newer GPUs rather than a requirement (falling back to the old “for i in 1…num_batches: draw()” implementation if it’s not supported).

There’s no need to split draw calls. If you don’t have gl_DrawID (which, incidentally, isn’t listed in the online reference pages yet), you can use an integer attribute instead. That assumes that you aren’t sharing vertices between “objects”, which is probably a reasonable assumption.

Even if you do have gl_DrawID, using an integer attribute for material index may be worthwhile if you have more objects than materials.

Good point. I was just about to post about this. On slides 31-33 here (Approaching zero driver overhead - 3/2014), they suggest a few gl_DrawID alternatives and refer to this G-Truc post (17/11/2012 - Surviving without gl_DrawID).

Thanks for your input. I’ll look into gl_DrawID and these functions for multiple draw calls a bit later on, although I think I’ve grasped the basic idea why you’d want to use them.

While we’re on the topic, I was wondering what would be the most suitable method for efficiently uploading this sort of data to the GPU.

At the very least, each object will be sending one 4x4 matrix, so the number of bytes sent to the GPU will build up considerably as more and more objects are added. Is there a recommended amount of data to stay within when transferring between the CPU and GPU?

I would think that a good idea would be to implement a fixed-size array for these matrices on the GPU side. To my knowledge, uniform arrays are non-dynamic (possibly a newer version of OpenGL has enabled dynamic arrays, though), and buffer objects created with glBufferStorage are immutable, so to keep extending one would mean deleting the storage and reallocating it all over again with one extra element, which probably isn’t something you’d want to do in this situation. In this case, I’m assuming a fixed-size array would be the best option, although it would impose a restriction on the number of objects that can be pushed through per batch.

One optimisation idea I had thought of is only updating an object’s matrix on the GPU if it has changed in some way on the CPU side, but that assumes the multiplication by the projection and view matrices has not already occurred at this point, since I would assume that those matrices should be updated independently of this process. In that case, it’d probably be a good idea to multiply the view and projection matrices together on their own, send the resulting matrix to the shader program via its own uniform variable, and then combine it with the appropriate object matrix in the vertex shader.

Would appreciate any further advice or thoughts on these ideas.

If you have a small number of draws with a small amount of data per draw, consider an ordinary uniform array.

If not, read this: Buffer Object Streaming and consider use of UBOs, SSBOs, TBOs, etc.

The data transfer for the matrices is going to be negligible. The main thing is to ensure that any updates don’t cause synchronisation (see the “buffer object streaming” link in an earlier post). For small amounts of data, a default-block uniform (glUniformMatrix* functions) will be fine. If you exceed the size of the default uniform block, there are UBOs (3.0+) and SSBOs (4.3+); in earlier versions, it was common to use textures for large uniform arrays.

The standard technique for dynamically resizing arrays is to double the size of the memory block whenever it becomes too small. This is how C++ handles std::vector and std::string, for example. But you could probably just allocate enough memory for the worst case; that’s still going to be small compared to the amount of data used for vertex attributes and textures.

You probably want one projection matrix, one view matrix, and an array of model (object) matrices. If you’re using a perspective projection, you need to keep the projection matrix separate because lighting calculations require a space that’s affine to “world” space, and you typically want to use the same space for all objects (otherwise you have to transform the eye and light positions into each object’s space). This is why the fixed-function pipeline has separate model-view and projection matrices.

So the vertex shader would do something like:


uniform mat4 projection;
uniform mat4 view;
uniform mat4 model[MAX_OBJECTS]; // MAX_OBJECTS is a compile-time constant
in vec4 position;
in vec3 normal;
in int id;     // per-vertex object ID (set with glVertexAttribIPointer)
out vec4 posn; // eye space
out vec3 norm; // eye space
void main()
{
    mat4 mv = view * model[id];
    posn = mv * position;
    // mat3(mv) is fine for rotation and uniform scaling; use the
    // inverse transpose if non-uniform scaling is involved
    norm = mat3(mv) * normal;
    gl_Position = projection * posn;
}

You could combine the model matrices with the view matrix in the application, but that becomes a nuisance if you need multiple views (e.g. for shadow maps, environment maps, etc), as you’d need a separate model-view array for each view matrix.

The performance cost of an extra matrix multiply per vertex is likely to be negligible unless the vertex counts are extremely high. Typically, the fragment shader computations are a far more significant factor than the vertex shader computations due to the (usually) far greater number of invocations.