Optimising Many glDrawArraysInstanced Calls

Hi all,

I have a voxel map that is rendered as chunks of 32x32x32 voxels, each of which has its own VAO, base triangle strip VBO, and instance data VBO:

// Create one VAO for this chunk
uint vaoHandle = Gl.GenVertexArray();
Gl.BindVertexArray(vaoHandle);


// Create the base triangle strip VBO
uint vboHandle = Gl.GenBuffer();
Gl.BindBuffer(BufferTarget.ArrayBuffer, vboHandle);
Gl.EnableVertexAttribArray(0);
Gl.VertexAttribIPointer(0, 1, VertexAttribType.Int, 4, IntPtr.Zero);


// Triangle strip data, which represents a 1x1 voxel face
int* data = (int*)Helper.Alloc(4 * sizeof(int));
data[0] = 0;             // 0, 0
data[1] = 1;             // 1, 0
data[2] = 1 << 8;        // 0, 1
data[3] = 1 | (1 << 8);  // 1, 1

// Size is in bytes: four ints
Gl.BufferData(BufferTarget.ArrayBuffer, 4 * sizeof(int), (IntPtr)data, GL_STATIC_DRAW);
Helper.Free(data);

// Advance once per vertex (not once per instance)
Gl.VertexAttribDivisor(0, 0);


// Create the instance VBO
uint instanceVBO = Gl.GenBuffer();
Gl.BindBuffer(BufferTarget.ArrayBuffer, instanceVBO);

int instanceAttrib = 1;
Gl.EnableVertexAttribArray(instanceAttrib);
// One int component per instance (the packed face value)
Gl.VertexAttribIPointer(instanceAttrib, 1, VertexAttribType.Int, 1 * sizeof(int), IntPtr.Zero);
Gl.VertexAttribDivisor(instanceAttrib, 1); // Advance once per instance

int initialInstanceBytes = 4096 * sizeof(int);
Gl.BufferData(BufferTarget.ArrayBuffer, initialInstanceBytes, IntPtr.Zero, GL_STREAM_DRAW);

Gl.BindVertexArray(0);
Gl.BindBuffer(BufferTarget.ArrayBuffer, 0);

For each face in my voxel mesh, I write one value to the instanceVBO, which is a 32 bit integer packed with:

5 bits for X
5 bits for Y
5 bits for Z
3 bits for normal (Y+, Y-, X+, X-, Z+, Z-)
5 bits for length (adjacent voxel faces are combined)
6 bits for textureID
3 bits for health

These XYZ coordinates are chunk-relative (range 0-31) and are rotated to the correct face in the shader using normal.
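
A minimal sketch of the packing described above, written in plain C rather than the C# bindings used in this post. The positions of X, Y, Z, normal and length match the shifts in the vertex shader further down; the positions of textureID and health (bits 23 and 29) are an assumption that simply follows the order of the list, and the names VoxelFace, pack_face and unpack_face are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout, low bits first:
   X 0-4, Y 5-9, Z 10-14, normal 15-17, length 18-22,
   textureID 23-28 (assumed), health 29-31 (assumed). */
typedef struct {
    uint32_t x, y, z;     /* chunk-relative, 0..31 */
    uint32_t normal;      /* 0..5: Y+, Y-, X+, X-, Z+, Z- */
    uint32_t length;      /* 0..31 */
    uint32_t texture_id;  /* 0..63 */
    uint32_t health;      /* 0..7 */
} VoxelFace;

/* Pack one face into the 32-bit value written to the instance VBO. */
static uint32_t pack_face(VoxelFace f)
{
    assert(f.x < 32 && f.y < 32 && f.z < 32);
    assert(f.normal < 6 && f.length < 32);
    assert(f.texture_id < 64 && f.health < 8);

    return  f.x
         | (f.y          << 5)
         | (f.z          << 10)
         | (f.normal     << 15)
         | (f.length     << 18)
         | (f.texture_id << 23)
         | (f.health     << 29);
}

/* Inverse of pack_face, mirroring the masks used in the shader. */
static VoxelFace unpack_face(uint32_t v)
{
    VoxelFace f;
    f.x          =  v        & 31u;
    f.y          = (v >> 5)  & 31u;
    f.z          = (v >> 10) & 31u;
    f.normal     = (v >> 15) & 7u;
    f.length     = (v >> 18) & 31u;
    f.texture_id = (v >> 23) & 63u;
    f.health     = (v >> 29) & 7u;
    return f;
}
```

Round-tripping a face through pack_face and unpack_face is a cheap way to check that the CPU-side layout and the shader masks agree.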

I then frustum cull my chunks and store the vaoHandle and faceCount of each visible chunk in an array, which is then rendered in this function:

void RenderMap(int amount, uint* vao, int* faceCounts)
{
    uint* end = vao + amount;
    while (vao < end)
    {
        Gl.BindVertexArray(*vao++);
        Gl.DrawArraysInstanced(PrimitiveType.TriangleStrip, 0, 4, *faceCounts++);
    }
}

Since each mesh only contains chunk-relative XYZ coordinates, every chunk would otherwise be rendered in the same spot. I used to set a WorldPos uniform before each Gl.DrawArraysInstanced call, but I now attach a third buffer, worldPosVBO, to each chunk's VAO instead:

// World position VBO, 3 floats
uint worldPosVBO = Gl.GenBuffer();
int worldPosAttrib = 2;

Gl.BindBuffer(BufferTarget.ArrayBuffer, worldPosVBO);
Gl.EnableVertexAttribArray(worldPosAttrib);
Gl.VertexAttribPointer(worldPosAttrib, 3, VertexAttribType.Float, false, 3 * sizeof(float), IntPtr.Zero);

Gl.BufferData(BufferTarget.ArrayBuffer, worldPosBytes, (IntPtr)worldPosData, GL_STATIC_DRAW);

I then set the divisor of the worldPos attribute to the number of voxel faces written to instanceVBO, so that every face in the chunk reads the same world position:

Gl.BindVertexArray(vaoHandle);
Gl.VertexAttribDivisor(2, (uint)voxelFaceCount);
Gl.BindVertexArray(0);
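
Why the divisor trick works can be sketched on the CPU. For an attribute with a non-zero divisor, OpenGL reads element floor(instance / divisor) + baseInstance, so with the divisor set to the chunk's face count every instance of the draw reads element 0. The helper name below is hypothetical, and the sketch is in plain C rather than the C# bindings used above.

```c
#include <assert.h>
#include <stdint.h>

/* Index of the array element fetched for an instanced attribute.
   A divisor of 0 means "advance per vertex" and is not handled here. */
static uint32_t instanced_element(uint32_t instance,
                                  uint32_t divisor,
                                  uint32_t base_instance)
{
    assert(divisor != 0);
    return instance / divisor + base_instance;
}
```

With a divisor of 1 (the packed face data) instance i reads element i; with a divisor equal to the face count, instances 0 to faceCount - 1 all read element 0. One caveat: a chunk with zero faces would set the divisor to 0, which switches the attribute back to per-vertex stepping.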

The vertex shader then decodes the packed voxel data, aligns and resizes it based on normal and length, then converts it to world space:

layout (location = 0) in int aBasePosition;
layout (location = 1) in int aData; // 32 bits of packed voxel data
layout (location = 2) in vec3 worldPos;

uniform mat4 mvp;

// Normals: [Y+, Y-, X+, X-, Z+, Z-],
// i.e. `normal == 0` is the equivalent of `normal == Y+`

void main()
{
    // Decode bits
    int normal = int((aData >> 15)&(7));
    float length = float((aData >> 18)&(31));


    // Voxel-relative pos, initially flat (face facing up)
    vec3 netPos = vec3(float(aBasePosition&(255)), 0, float((aBasePosition >> 8)&(255)));


    // Move to the positive side of the voxel cube (i.e. lift top face up one)
    netPos.y += (normal == 0 || normal == 2 || normal == 4) ? 1 : 0;
        
    // Flip winding on YN, XP, ZN so cull face works
    netPos.z = (normal == 1 || normal == 2 || normal == 5) ? (1 - netPos.z) : netPos.z;

    // Increase length
    netPos.x += netPos.x == 1 ? length : 0;
        
    // Align to the x axis
    netPos.xy = (normal == 2 || normal == 3) ? netPos.yx : netPos.xy;    
        
    // Align to the z axis
    netPos.xyz = (normal ==  4 || normal == 5) ? netPos.zxy : netPos.xyz;


    // Voxel-relative to chunk-relative
    netPos.x += float(aData&(31));
    netPos.y += float((aData >> 5)&(31));
    netPos.z += float((aData >> 10)&(31));


    // Chunk-relative to world-relative
    vec3 position = netPos + worldPos;


    // World space to screen space
    gl_Position = mvp * vec4(position, 1.0);
}

This approach turned out to be around 1.9x faster than the old approach, which involved creating a GL_TRIANGLES VBO for each chunk and positioning it using a uniform:

void RenderMap(int amount, uint* vao, Vector3F* worldPos, int* faceCounts)
{
    uint* end = vao + amount;
    while (vao < end)
    {
        Gl.Uniform3(location, *worldPos++); // Set the chunk's world position
        Gl.BindVertexArray(*vao++);
        Gl.DrawArrays(PrimitiveType.Triangles, 0, *faceCounts++ * 6); // count is in vertices: 6 per face
    }
}

My guess on why this runs faster:

  • The vertex shader only runs 4 times per face now with a GL_TRIANGLE_STRIP, versus 6 using GL_TRIANGLES
  • The VBO contains 6x less data - before, I was writing 6 vertices per voxel face to the VBO to construct two triangles. Now I just write one value to the VBO for each instance.
    • Less memory = faster?
  • Uniforms no longer need to be set between each draw call

The Question

Around 30% of frame time on the CPU is spent in these Gl.BindVertexArray and Gl.DrawArraysInstanced calls. I looked at using glDrawArraysIndirect to reduce this; however, it doesn’t support setting a different VAO for each call.

Is there another glDraw*Indirect or glMultiDraw* approach I can take, or should I store the data for all of my chunks in one large VBO and render it using Gl.DrawArraysIndirect?

The latter. This works great, and it ensures no state changes between instances.

If you’re CPU/submission limited, just getting rid of all those buffer binds is likely to net you a nice perf improvement. But for even more…

You probably would anyway, but be sure you use the instanced subdraw feature in MultiDraw Indirect (MDI) – see instanceCount and baseInstance here:

That is, one subdraw record per model (all instances), not one subdraw record per model instance.
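
A rough sketch of what "one subdraw record per model" could look like for this voxel map, in plain C. The struct layout is the one OpenGL defines for glDrawArraysIndirect and glMultiDrawArraysIndirect; the function name and the face_counts / face_offsets arrays are hypothetical, and the sketch assumes every chunk's packed faces live in one shared instance buffer, with face_offsets[i] giving the index of chunk i's first face in that buffer.

```c
#include <stddef.h>
#include <stdint.h>

/* Field order and sizes as required by the indirect draw commands. */
typedef struct {
    uint32_t count;          /* vertices per instance */
    uint32_t instanceCount;  /* instances in this subdraw */
    uint32_t first;          /* first vertex */
    uint32_t baseInstance;   /* offset added when fetching instanced attributes */
} DrawArraysIndirectCommand;

_Static_assert(sizeof(DrawArraysIndirectCommand) == 16,
               "indirect commands must be tightly packed");

/* Write one record per visible chunk: a 4-vertex strip drawn once per face. */
static void build_commands(const uint32_t *face_counts,
                           const uint32_t *face_offsets,
                           size_t chunk_count,
                           DrawArraysIndirectCommand *out)
{
    for (size_t i = 0; i < chunk_count; ++i) {
        out[i].count         = 4;               /* the shared 1x1 face strip */
        out[i].instanceCount = face_counts[i];  /* 0 simply draws nothing */
        out[i].first         = 0;
        out[i].baseInstance  = face_offsets[i];
    }
}
```

The resulting array would be uploaded to a buffer bound to GL_DRAW_INDIRECT_BUFFER and submitted with a single glMultiDrawArraysIndirect(GL_TRIANGLE_STRIP, 0, chunk_count, 0) call, replacing the per-chunk bind-and-draw loop.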

Why? On some GPUs/drivers such as NVIDIA's, the vertex work is packed more efficiently with instanced rendering, particularly for models with few verts (“small models”), yielding higher GPU utilization and faster drawing.

For way more info on this than you’re interested in, see:

To your larger question here, you probably know this, but the key is identifying what your primary bottleneck is. Once you know that, then you know what you can do about it. You can run iterative perf tests to help pin that down. Or use good CPU/GPU profiling tools to help point the way.

Fragment/fill bound? Rework shading to reduce that. Vertex bound? Rework submission to reduce vertex count and/or cost-per-vertex. CPU/submission bound? Rework submission to cut back on the CPU-side “prep” work (e.g. state changes, draw call count, etc.)

To your suspicions:

Perhaps, if you’re vertex bound. Note however that with a typical mesh, you can achieve “fewer” vertex shader executions with Indexed TRIANGLEs than you can with non-indexed TRIANGLE_STRIPs with decent triangle order optimization (re vertex transform cache). So it’s not all about the primitive type without qualification.

Generally, yes. Takes time to access. Consumes more space in the cache. More latency to try and hide, which may not be possible with your workload.

That’s a candidate too. Fewer state changes between draw calls generally saves time. Again, it depends on what you’re bound on.

Thank you very much, this is super useful.

One thing I am not sure of is how my worldPosVBO will translate into this single-VBO approach, since it needs to increment once per model - not once per instance or once per vertex - and each model has a different number of faces in it. At the moment I work around this by setting its divisor to the number of faces (instances) in each model.

Would I have to convert this to a uniform array, and access it in the vertex shader using gl_DrawID? Each map can have up to 32768 chunks, meaning the uniform array would get pretty large.

gl_DrawID would work if your subdraws didn’t use instancing. But since they do, use gl_BaseInstance + gl_InstanceID. Use that to index into a UBO or SSBO.

gl_BaseInstance lets you provide the offset for the first instance within each subdraw.
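
On the size concern, a back-of-the-envelope sketch in plain C may help when choosing between a UBO and an SSBO. It assumes one vec3 world position per chunk stored in an array, where both std140 and std430 pad each vec3 element to 16 bytes, and it compares against 16384 bytes, the smallest GL_MAX_UNIFORM_BLOCK_SIZE the specification allows. The function names are hypothetical, and real drivers often report a larger limit, so the actual value should be queried at runtime.

```c
#include <stddef.h>

enum {
    MIN_GUARANTEED_UBO_BYTES = 16384, /* spec minimum for GL_MAX_UNIFORM_BLOCK_SIZE */
    VEC3_ARRAY_STRIDE        = 16     /* a vec3 array element occupies 16 bytes */
};

/* Bytes needed to store one world position per chunk. */
static size_t chunk_positions_bytes(size_t chunk_count)
{
    return chunk_count * VEC3_ARRAY_STRIDE;
}

/* Non-zero if the array fits in the smallest uniform block any driver must support. */
static int fits_in_min_ubo(size_t chunk_count)
{
    return chunk_positions_bytes(chunk_count) <= MIN_GUARANTEED_UBO_BYTES;
}
```

Under these assumptions a full map of 32 * 32 * 32 chunks needs 512 KiB, well past the guaranteed UBO minimum, while up to 1024 positions fit. Storing only the visible chunks per frame, or using an SSBO, are two ways to stay within limits.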

Combining meshes into one VBO, storing world chunk positions in a UBO, and rendering with glMultiDrawArraysIndirect is unbelievably faster.

When using the lowest-quality rendering preset in my game, I’m achieving 880 FPS up from 510. Total frame time on the CPU reduced from 2.3ms to 1.1ms, and time spent on the GPU rendering the map reduced from 0.55ms to 0.23ms (measured using GL queries).

Thank you @Dark_Photon, you are a legend around here!

Glad that worked for you!
