Hi all,
I have a voxel map that is rendered as chunks of 32x32x32 voxels, each of which has its own VAO, base triangle strip VBO, and instance data VBO:
// Create one VAO for this chunk
uint vaoHandle = Gl.GenVertexArray();
Gl.BindVertexArray(vaoHandle);
// Create the base triangle strip VBO
uint vboHandle = Gl.GenBuffer();
Gl.BindBuffer(BufferTarget.ArrayBuffer, vboHandle);
Gl.EnableVertexAttribArray(0);
Gl.VertexAttribIPointer(0, 1, VertexAttribType.Int, 4, IntPtr.Zero);
// Triangle strip data, which represents a 1x1 voxel face
int* data = (int*)Helper.Alloc(4 * sizeof(int));
data[0] = 0; // 0, 0
data[1] = 1; // 1, 0
data[2] = 1 << 8; // 0, 1
data[3] = 1 | (1 << 8); // 1, 1
Gl.BufferData(BufferTarget.ArrayBuffer, 4 * sizeof(int), (IntPtr)data, GL_STATIC_DRAW);
Helper.Free(data);
// Advance once per vertex (not once per instance)
Gl.VertexAttribDivisor(0, 0);
// Create the instance VBO
uint instanceVBO = Gl.GenBuffer();
Gl.BindBuffer(BufferTarget.ArrayBuffer, instanceVBO);
int instanceAttrib = 1;
Gl.EnableVertexAttribArray(instanceAttrib);
Gl.VertexAttribIPointer(instanceAttrib, 1, VertexAttribType.Int, 1 * sizeof(int), IntPtr.Zero);
Gl.VertexAttribDivisor(instanceAttrib, 1); // Advance once per instance
int initialInstanceBytes = 4096 * sizeof(int);
Gl.BufferData(BufferTarget.ArrayBuffer, initialInstanceBytes, IntPtr.Zero, GL_STREAM_DRAW);
Gl.BindVertexArray(0);
Gl.BindBuffer(BufferTarget.ArrayBuffer, 0);
For each face in my voxel mesh, I write one value to the instanceVBO, which is a 32-bit integer packed with:
- 5 bits for X
- 5 bits for Y
- 5 bits for Z
- 3 bits for normal (Y+, Y-, X+, X-, Z+, Z-)
- 5 bits for length (adjacent voxel faces are combined)
- 6 bits for textureID
- 3 bits for health
These XYZ coordinates are chunk-relative (range 0-31) and are rotated to the correct face in the shader using normal.
I then frustum cull my chunks and store the vaoHandle and faceCount of each chunk in an array, which is then rendered in this function:
void RenderMap(int amount, uint* vao, int* faceCounts)
{
    uint* end = vao + amount;
    while (vao < end)
    {
        Gl.BindVertexArray(*vao++);
        Gl.DrawArraysInstanced(PrimitiveType.TriangleStrip, 0, 4, *faceCounts++);
    }
}
Since each mesh only contains chunk-relative XYZ coordinates, every chunk would be rendered in the same spot. I used to set a WorldPos uniform before each Gl.DrawArraysInstanced call, but I instead attached another instance data VBO, worldPosVBO, to each chunk's VAO:
// World position VBO, 3 floats
uint worldPosVBO = Gl.GenBuffer();
int worldPosAttrib = 2;
Gl.BindBuffer(BufferTarget.ArrayBuffer, worldPosVBO);
Gl.EnableVertexAttribArray(worldPosAttrib);
Gl.VertexAttribPointer(worldPosAttrib, 3, VertexAttribType.Float, false, 3 * sizeof(float), IntPtr.Zero);
// worldPosData points to this chunk's world-space origin (worldPosBytes = 3 * sizeof(float))
Gl.BufferData(BufferTarget.ArrayBuffer, worldPosBytes, (IntPtr)worldPosData, GL_STATIC_DRAW);
I then set the divisor of the worldPosVBO attribute to the number of voxel faces written to the instanceVBO:
Gl.BindVertexArray(vaoHandle);
Gl.VertexAttribDivisor(2, (uint)voxelFaceCount);
Gl.BindVertexArray(0);
The vertex shader then decodes the packed voxel data, aligns and resizes it based on normal and length, then converts it to world space:
layout (location = 0) in int aBasePosition;
layout (location = 1) in int aData; // 32 bits of packed voxel data
layout (location = 2) in vec3 worldPos;

uniform mat4 mvp;

// Normals: [Y+, Y-, X+, X-, Z+, Z-],
// i.e. `normal == 0` is the equivalent of `normal == Y+`
void main()
{
    // Decode bits
    int normal = (aData >> 15) & 7;
    float length = float((aData >> 18) & 31);
    // Voxel-relative pos, initially flat (face facing up)
    vec3 netPos = vec3(float(aBasePosition & 255), 0, float((aBasePosition >> 8) & 255));
    // Move to the positive side of the voxel cube (i.e. lift the top face up one)
    netPos.y += (normal == 0 || normal == 2 || normal == 4) ? 1 : 0;
    // Flip winding on Y-, X+, Z- so face culling works
    netPos.z = (normal == 1 || normal == 2 || normal == 5) ? (1 - netPos.z) : netPos.z;
    // Extend the far edge by the merged run length
    netPos.x += netPos.x == 1 ? length : 0;
    // Align to the x axis
    netPos.xy = (normal == 2 || normal == 3) ? netPos.yx : netPos.xy;
    // Align to the z axis
    netPos.xyz = (normal == 4 || normal == 5) ? netPos.zxy : netPos.xyz;
    // Voxel-relative to chunk-relative
    netPos.x += float(aData & 31);
    netPos.y += float((aData >> 5) & 31);
    netPos.z += float((aData >> 10) & 31);
    // Chunk-relative to world-relative
    vec3 position = netPos + worldPos;
    // World space to screen space
    gl_Position = mvp * vec4(position, 1.0);
}
This approach turned out to be around 1.9x faster than the old approach, which involved creating a GL_TRIANGLES VBO for each chunk and positioning it using a uniform:
void RenderMap(int amount, uint* vao, Vector3F* worldPos, int* faceCounts)
{
    uint* end = vao + amount;
    while (vao < end)
    {
        Gl.Uniform3(location, *worldPos++); // Set the chunk's world position
        Gl.BindVertexArray(*vao++);
        Gl.DrawArrays(PrimitiveType.Triangles, 0, *faceCounts++);
    }
}
My guess on why this runs faster:
- The vertex shader only runs 4 times per face now with a GL_TRIANGLE_STRIP, versus 6 using GL_TRIANGLES
- The VBO contains 6x less data - before, I was writing 6 vertices per voxel face to the VBO to construct two triangles. Now I just write one value to the VBO for each instance.
- Less memory = faster?
- Uniforms no longer need to be set between each draw call
The Question
Around 30% of frame time on the CPU is spent in these Gl.BindVertexArray and Gl.DrawArraysInstanced calls. I looked at using glDrawArraysIndirect to reduce this, but it doesn't support setting a different VAO for each draw. Is there another glDraw*Indirect or glMultiDraw* approach I can take, or should I store the data for all of my chunks in one large VBO and render it with a single Gl.MultiDrawArraysIndirect call?