Performance problem with a small voxel engine

Hi all,

I’m working on a small voxel engine. To check whether performance is reasonable, I currently render a scene somewhat similar to Minecraft. My problem is that my framerate seems lower than it should be, compared with what the game achieves on the same computer.

My scene is divided into 16 * 16 chunks, and each chunk contains 16 * 128 * 16 voxels. AFAIK this is the same setup as Minecraft.

I also use the same technique: I construct a mesh for each chunk, discarding all the quads that are between adjacent non-empty cubes. There is no occlusion culling, nor frustum culling (those are planned, but first I want to get the base of the engine right).

I use one VAO and two VBOs for each chunk; one contains 4 vertices per quad (12 floats), the other contains 4 (identical) one-byte color indices. The actual color is fetched in the fragment shader from a 1D texture.

There is only one big index buffer, used for all VBOs.

All of these buffers are constructed only once.
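
For reference, the shared index buffer is built roughly like this (a simplified sketch; max_quads and shared_ibo are just placeholder names for whatever upper bound and buffer object I actually use):

std::vector<GLuint> indices;                    // shared by every chunk's VAO
indices.reserve(max_quads * 6);
for (GLuint q = 0; q < max_quads; ++q) {
    GLuint base = q * 4;                        // 4 vertices per quad, always in the same order
    indices.push_back(base + 0);                // first triangle
    indices.push_back(base + 1);
    indices.push_back(base + 2);
    indices.push_back(base + 2);                // second triangle
    indices.push_back(base + 3);
    indices.push_back(base + 0);
}
glGenBuffers(1, &shared_ibo);
glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, shared_ibo);
glBufferData(GL_ELEMENT_ARRAY_BUFFER, indices.size() * sizeof(GLuint),
             indices.data(), GL_STATIC_DRAW);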

Here’s my render code:


glUseProgram(program);

for(int i = 0; i < world_width; ++i){
    for(int j = 0; j < world_depth; ++j){
        mat4 mvp = chunks[i][j]->get_model_matrix() * view_projection_matrix;
        glUniformMatrix4fv(uniform_mvp_matrix, 1, GL_FALSE, mvp.c_ptr());

        glBindVertexArray(chunks[i][j]->vertex_array);
        glDrawElements(GL_TRIANGLES, chunks[i][j]->nb_quads * 6, GL_UNSIGNED_INT, 0);
    }
}

As for the content of the scene, it’s generated from simple Perlin noise, so that a large proportion of cubes are adjacent. The size of the scene takes into account the fact that I don’t have frustum culling (in Minecraft on the “far” setting there are 33 * 33 chunks around the player).

I made a short video to give you a better idea of what I’m rendering:

[http://youtu.be/jij3T3rIoDg](http://youtu.be/jij3T3rIoDg)

The framerate is barely 40 fps (on a Radeon 4850), whereas in Minecraft I get at least 60 fps (with occlusion culling deactivated), often more. Also, I think the scenes in the game are more complex (i.e. smaller homogeneous zones). I’ve seen other videos of similar coding attempts, with much better framerates.

So I think there’s a problem with my implementation, but I have a hard time figuring out what. It’s not the shaders, since they are very basic (no lighting), and replacing them with even more basic ones doesn’t improve anything.

Any idea on how I could find out the cause of the low framerate?

You can’t expect fast rendering without frustum culling or occlusion queries. This is where you can get the biggest improvement: frustum culling lets you render only what’s inside the frustum (say, roughly a quarter of what you’d render without it), and occlusion queries let you easily and quickly remove occluded geometry.

Also make sure your algorithm for culling hidden faces is very fast. Since you’re doing Minecraft-like rendering, occlusion queries might not be that important for you, depending on the algorithm you use.

Finally, use VBOs with index arrays.
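
To give you an idea of what chunk-level frustum culling looks like, here is a rough sketch (it assumes you can extract the six frustum planes from your view-projection matrix, and uses a small vec3/vec4 type like the one in your math library):

// returns false if the chunk's axis-aligned bounding box is entirely
// outside one of the six frustum planes (planes stored so that
// a*x + b*y + c*z + d >= 0 means "inside")
bool chunk_visible(const vec4 planes[6], const vec3& box_min, const vec3& box_max)
{
    for (int p = 0; p < 6; ++p) {
        // pick the AABB corner farthest along the plane normal
        float x = planes[p].x >= 0.0f ? box_max.x : box_min.x;
        float y = planes[p].y >= 0.0f ? box_max.y : box_min.y;
        float z = planes[p].z >= 0.0f ? box_max.z : box_min.z;
        if (planes[p].x * x + planes[p].y * y + planes[p].z * z + planes[p].w < 0.0f)
            return false;   // even the farthest corner is behind this plane
    }
    return true;            // possibly visible; draw it
}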

Thanks for the answer.

I will of course add frustum culling, but I don’t think it’s the problem here, since the scene is about the size of what is inside the view frustum in Minecraft (probably smaller, in fact). As for occlusion, as you said the benefit is much less than in usual situations (and in my tests I deactivated occlusion in the game).

My algorithm for discarding hidden (i.e. inside solid) faces is only run once, when the buffers are created, not at rendering time; so it should not affect framerate.

I do use indexed VBOs, with the additional “trick” that it’s the same index buffer for all VBOs (since they all contain only quads, all constructed in the same order).

I plan to try various optimizations, including the use of a geometry shader. But I’d like first to understand why my framerate is slower than Minecraft, when AFAIK the technique is the same.

Do you render quads? If so, it’s a very bad idea: most of the time they are not supported by the hardware. Use triangles instead.

Second: from what you said about your index buffer trick, I think you make a draw call for every face (or every voxel)? Try to reduce the number of calls. You could also have a look at glMultiDrawElements.
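
For example, something along these lines (only a sketch; it assumes all chunks live in one VAO/VBO/index buffer, and nb_quads, start_index and shared_vao are placeholders for wherever you keep that data):

GLsizei       counts[256];       // number of indices per chunk
const GLvoid* offsets[256];      // byte offset of each chunk's indices in the bound IBO
for (int c = 0; c < 256; ++c) {
    counts[c]  = nb_quads[c] * 6;
    offsets[c] = (const GLvoid*)(start_index[c] * sizeof(GLuint));
}
glBindVertexArray(shared_vao);   // one VAO holding every chunk
glMultiDrawElements(GL_TRIANGLES, counts, GL_UNSIGNED_INT, offsets, 256);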

I didn’t understand what you said about the frustum. Do you mean you always see the full ‘world’?

Finally, have a look at instancing. I’m not completely sure it applies here (since you’d only be instancing a few polygons), but it might help too (maybe someone here will disagree).

My primitives are triangles. When I say “quad” I mean a face of a cube voxel, sorry for the confusion.

I have only one draw call for each chunk; this is the core of the technique used by Minecraft: constructing a single mesh for each chunk of 16 * 128 * 16 voxels, discarding the invisible faces.

The “trick” I refer to is that, since all my VBOs contain nothing but long lists of faces, each with 4 vertices in the same order, I can use one global IBO for all of them.

In my test the full world is always rendered, but it’s a quarter the size of the scenes in Minecraft, so that should be about what is left after frustum culling, or even smaller.

As for instancing, others have tried it and for this particular case it seems to hurt performance. The reason is that a lot of faces can be discarded (when they are between adjacent cubes), and with cube instancing you can’t do that (except maybe in the vertex shader). I’m not sure whether instancing quads instead of sending indexed vertices would improve the framerate; maybe that’s worth a try. A geometry shader is probably a better option, though.

You can try to have a single VAO/VBO for all your chunks. This will reduce the number of VAO bindings from 256 (16*16) to 1 (if I understood correctly how you manage them). (edit: this should not be negligible).

Also, I hadn’t noticed it before, but why do you send a uniform for each chunk? That can affect performance too. Can’t you use the same modelview matrix for the whole scene? I’m actually not sure how you manage it; from what I understood you use this modelview matrix to place your voxels in the world, but since they are static that seems a bit pointless.

OK about the trick you refer to. Actually it only saves memory; it has no impact on performance.

For optimization purposes, you can also try using unsigned short instead of unsigned int for the index buffer.

Unfortunately, I can’t have a single VAO/VBO, because at a later stage I will need to frequently load new parts of the landscape and unload old ones. (Plus, voxels may change frequently.)

That’s also why I have a model matrix for each chunk. The landscape will be huge, and I want the render coordinates to always be centered around the camera so that there’s no risk of rounding errors. So when I load a new group of chunks, I can shift the position of old ones just by updating the model matrix.

Although, now that I think about it, I only really need a translation vector, not a full matrix.
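
Something like this, roughly (just a sketch; uniform_vp_matrix and uniform_chunk_offset would be new uniforms in my shaders, and get_offset() is a hypothetical accessor on my chunk class):

glUseProgram(program);
// the view-projection matrix is the same for every chunk, so set it once per frame
glUniformMatrix4fv(uniform_vp_matrix, 1, GL_FALSE, view_projection_matrix.c_ptr());

for(int i = 0; i < world_width; ++i){
    for(int j = 0; j < world_depth; ++j){
        // only a translation changes per chunk
        vec3 offset = chunks[i][j]->get_offset();
        glUniform3f(uniform_chunk_offset, offset.x, offset.y, offset.z);

        glBindVertexArray(chunks[i][j]->vertex_array);
        glDrawElements(GL_TRIANGLES, chunks[i][j]->nb_quads * 6, GL_UNSIGNED_INT, 0);
    }
}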

I changed the indices to unsigned short, with no effect on framerate. Will this save some GPU memory, or are they stored as ints anyway?

I played a little with the Perlin noise, adding harmonics at both higher and lower frequencies, and with some settings I can improve the framerate while still keeping a sufficiently detailed scene. Because this kind of renderer is bound not by the number of solid voxels but by the amount of surface between solid and empty voxels, it’s hard to tell what counts as a complex scene and what doesn’t.

Still, some people seem to achieve much better performance than me, on scenes that can’t be much simpler than mine (for example, here: http://youtu.be/l8w2V3gPC7I ). So I’m still a bit suspicious of my implementation.

Anyway, thanks for the suggestions, I think I will move to the next stage now, and try a geometry shader approach.

I have a new question…

I’ve implemented a basic geometry shader, which takes a list of points as input, and for each one emits a cube.

Obviously I get poor performance: I need to emit only the visible faces, i.e. discard those between adjacent solid voxels.

In order to do that, I need to access my voxel data in the geometry shader, and I wonder what would be the best way.

My voxel data is:

  • one byte per voxel
  • voxels are grouped in chunks of 16 * 128 * 16
  • the scene contains 16 * 16 of these chunks

I can’t have just one big 3D texture, because chunks will be loaded/unloaded on a regular basis.

Texture arrays seem to be limited to 2D textures.

The only option I see is to create one texture object for each chunk, and to bind the correct one before each draw call. That’s 256 texture binds per frame; I’m not sure whether this will hurt performance or not.
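
To make the question concrete, the one-texture-per-chunk option would look roughly like this (a sketch; I’m assuming an integer texture format so the geometry shader could read raw voxel IDs through a usampler3D with texelFetch, and chunk_voxel_data / voxel_texture are placeholder names):

GLuint tex;
glGenTextures(1, &tex);
glBindTexture(GL_TEXTURE_3D, tex);
glTexParameteri(GL_TEXTURE_3D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);  // integer textures need NEAREST
glTexParameteri(GL_TEXTURE_3D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
glTexImage3D(GL_TEXTURE_3D, 0, GL_R8UI,          // one byte per voxel
             16, 128, 16, 0,
             GL_RED_INTEGER, GL_UNSIGNED_BYTE,
             chunk_voxel_data);                   // the chunk's 16*128*16 bytes

// then, per chunk, before the draw call:
glActiveTexture(GL_TEXTURE1);                     // assuming unit 0 already holds the 1D color texture
glBindTexture(GL_TEXTURE_3D, chunks[i][j]->voxel_texture);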

Is there any other way?

Unfortunately, I can’t have a single VAO/VBO

You really should keep only a few VBOs. Some people here would even say to use a single one. You can quite easily have one or a few VBOs and update sub-parts of them.
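
A sketch of what I mean (fixed-size slots are just one possible policy; big_vbo, slot_size and the chunk_* variables are placeholders):

// one big VBO; each chunk owns a fixed-size slot inside it, so loading or
// rebuilding a chunk is just an update of its own sub-range
glBindBuffer(GL_ARRAY_BUFFER, big_vbo);
glBufferSubData(GL_ARRAY_BUFFER,
                chunk_index * slot_size,   // byte offset of this chunk's slot
                chunk_vertex_bytes,        // size of the freshly built mesh
                chunk_vertex_data);        // the new vertex data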

I changed the indices to unsigned short, with no effect on framerate. Will this save some GPU memory, or are they stored as ints anyway?

This saves GPU memory, since shorts are half the size of full integers. It can make rendering slightly faster too, depending on the situation.

The landscape will be huge, and I want the render coordinates to always be centered around the camera so that there’s no risk of rounding errors.

I don’t understand how you see things here. Commonly we place static geometry (I consider it static since it doesn’t move, it can only be destroyed) statically, meaning we give it global positions in world coordinates. This will always be faster than having to transform all the geometry (with T&L or by uploading new vertex positions into the buffer). If you have to transform each voxel, there will inevitably be a performance drop.

I played a little with the Perlin noise […] I can improve the framerate while still keeping a sufficiently detailed scene. Because this kind of renderer is bound not by the number of solid voxels but by the amount of surface between solid and empty voxels, it’s hard to tell what counts as a complex scene and what doesn’t.

How big was the improvement? It seems to me that, depending on the function, you get a scene with more or less overdraw.

The way you do things means that:

  1. you bind VBOs too often;
  2. you send uniforms too often, and add a needless extra transform to each chunk;
  3. the index array is of limited use since your chunks don’t share physical vertices. If you use a global VBO with shared physical vertices and a global index buffer, you can gain a significant improvement here too;
  4. you have more or less occlusion depending on the function you use to create your scene. Managing occlusion is not a useless step; depending on your algorithm, you might be interested in OpenGL occlusion queries. Just removing adjacent faces might not be enough if the scene has lots of holes (parts where some cubes have no adjacent faces but their faces are still hidden by other cubes).

Also remember that when you look at a cube, you can only see at most 3 of its faces. So for each cube you render, at least half of the faces are hidden from the view. You can gain a lot here too.

Without improving all of these points, you can’t expect to have a fast enough renderer.

As for the geometry shader, people generally avoid it because it is not very efficient. But I’m not knowledgeable enough in this area to say more.

The problem is that the VBOs for different chunks have very different sizes, and I need to be able to replace old chunks with new ones when the camera moves. With a single VBO, I’d have to allocate the maximum possible size for each sub-part, which is huge (almost 400,000 vertices) and most of the time wasted (but occasionally needed).

I wanted the GPU to work in camera space, because the world will potentially be huge, and in some cases I think that would introduce rounding errors. I need the world coordinates to be at least 5 digits, so if I’m not mistaken that would leave only 2 digits of precision for the GPU, which is probably not enough?

I could achieve almost 60 fps. The current scene is only a placeholder, which tries to be similar to Minecraft terrain in order to compare performance. But “similar” is difficult to evaluate for this kind of renderer, so when in doubt I try to choose what looks like the worst case.

I don’t understand why I would share more vertices with a global VBO… Currently I only share vertices inside a face, because each voxel needs its own color, and for lighting I’ll need different data for each face. Doing more optimization (sharing between similar voxels) would slow down the chunk updates, which I want to avoid.

In my case this is a double-edged sword, since the scene may be modified at any time, so that what is occluded may suddenly become visible. Also, in this kind of game it’s common to reach a viewpoint from which you can see everything. Minecraft added occlusion culling very late, and mainly for laptops: it runs perfectly fine with occlusion deactivated on desktops. I will add occlusion culling later, but I don’t want to rely on it.

My goal at this point is to have an engine as good as Minecraft, using the same techniques, so that I can see if I can improve on that.

Also remember that when you look at a cube, you can only see at most 3 of its faces. So for each cube you render, at least half of the faces are hidden from the view.

That’s a really good point. I relied on GL_CULL_FACE for that, but doing it myself would reduce the VBO sizes.

Thanks again for all the suggestions!

EDIT: Actually, I spoke too fast. It can’t reduce the VBO size, since I don’t want to update the buffers when the camera moves. So the only place I could discard faces not facing the camera would be in the vertex shader, but GL_CULL_FACE probably does a better job than I would.

Or maybe I should sort the faces in my VBO to group them by orientation, and split my glDrawElements call into six glDrawRangeElements calls (or maybe glDrawElementsBaseVertex)? Would three draw calls over sub-parts of the buffer be faster than one call over the complete buffer?
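
Roughly what I have in mind (a sketch only; face_start / face_count would be stored per chunk after sorting the quads by orientation, and faces_camera() is a hypothetical test against the view direction):

for (int f = 0; f < 6; ++f) {
    if (!faces_camera(f))   // skip the (at least) three orientations pointing away
        continue;
    glDrawRangeElements(GL_TRIANGLES,
                        face_start[f] * 4,                        // lowest vertex index used
                        (face_start[f] + face_count[f]) * 4 - 1,  // highest vertex index used
                        face_count[f] * 6,                        // number of indices to draw
                        GL_UNSIGNED_SHORT,
                        (void*)(face_start[f] * 6 * sizeof(GLushort)));
}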

Well, while there is indeed a risk of rounding errors with a big world, you are rendering only cubes, and for axis-aligned cubes single-precision floating-point numbers should be adequate even for a very large world.

You need to calculate how large the biggest chunk is that can still be rendered accurately with single-precision floats. It is definitely not 16 * 128 * 16 in size but much, much larger. Maybe your whole world would fit that way.

I wanted the GPU to work in camera space, because the world will potentially be huge, and in some cases I think that would introduce rounding errors. I need the world coordinates to be at least 5 digits, so if I’m not mistaken that would leave only 2 digits of precision for the GPU, which is probably not enough?

Each chunk should be in its own model space, relative to a local origin. Details can be found here.

Yes, I think that’s why he’s replacing the modelToWorld matrix uniform between draw calls, though I’m not convinced that the granularity he uses is justified.

Also, you can use other tricks to batch multiple commands and use arrays in uniform blocks in the vertex shader, though such techniques may not be an option on the target hardware.

To be honest though, all he needs is the right vertex shader.

I need the world coordinates to be at least 5 digits

No, you don’t. Your world is made up of blocks, all of which have the same size. Therefore, your world’s granularity only needs to be in sizes of blocks.

Your block coordinates should be integers (either signed unnormalized shorts, or just floats that happen to hold integer values): (0,0), (1,1), etc. You then apply a scale, as part of the initial transformation, to bring them up to the world size you actually need. But because you don’t stop at world space, your full model-to-camera transform retains all of the precision you need: you go from integer coordinates directly to camera space. Since both the integer coordinates and camera space are close to the camera, you don’t have any precision problems.

So you can use all 7 floating-point digits before precision becomes a concern.
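
Roughly (a sketch; I’m assuming the position attribute lives at location 0, and chunk_vbo is a placeholder):

// positions stored as unnormalized shorts holding integer block coordinates;
// the scale to world units happens in the model-to-camera transform,
// never in the stored data
glBindBuffer(GL_ARRAY_BUFFER, chunk_vbo);
glVertexAttribPointer(0, 3, GL_SHORT,
                      GL_FALSE,               // unnormalized: values stay 0, 1, 2, ...
                      3 * sizeof(GLshort),    // tightly packed
                      (void*)0);
glEnableVertexAttribArray(0);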

Yes, that’s what I’m currently doing.

arts suggested I change that to world coordinates in order to avoid setting the model matrix uniform 256 times per frame (if I understood correctly). This would be possible since the chunks never move, but I was afraid of rounding errors. Your link seems to confirm that.

That’s also what I do: my cubes are 1 unit in size. My fear of rounding errors was about the case suggested above, storing the buffers in world space instead of model space.

So the GPU doesn’t need some “room” to do its work? That might be an argument in favor of world space, even if this is unorthodox.

I don’t think I can increase the size of chunks, because they will be modified on a regular basis, so I need to limit the size of VBO updates and the CPU work needed to rebuild the mesh.

EDIT: I did a quick test, and removing the 256 calls to glUniformMatrix for the model matrix only gains me a few fps: I go from 44-45 to 46-47. So I prefer to keep my buffers in model space, where I don’t have to worry about rounding errors.

If you render the opaque chunks roughly sorted from nearest to furthest away you might gain a bit. You could do this by calculating the distance to each chunk, or by splitting the world up into quadrants (or octants) surrounding the viewpoint and rendering each quadrant (or octant) with the loop variables heading away from the viewpoint. So instead of something like this:


for (int x = 0; x < world_width; x++)
  for (int y = 0; y < world_depth; y++)
    render_chunk(x, y);

you could use:


int viewpoint_chunk_x = clamp(calculate_viewpoint_chunk_x(), 0, world_width - 1);
int viewpoint_chunk_y = clamp(calculate_viewpoint_chunk_y(), 0, world_depth - 1);

// quadrant +x / +y
for (int x = viewpoint_chunk_x; x < world_width; x++)
  for (int y = viewpoint_chunk_y; y < world_depth; y++)
    render_chunk(x, y);

// quadrant +x / -y
for (int x = viewpoint_chunk_x; x < world_width; x++)
  for (int y = viewpoint_chunk_y - 1; y >= 0; y--)
    render_chunk(x, y);

// quadrant -x / +y
for (int x = viewpoint_chunk_x - 1; x >= 0; x--)
  for (int y = viewpoint_chunk_y; y < world_depth; y++)
    render_chunk(x, y);

// quadrant -x / -y
for (int x = viewpoint_chunk_x - 1; x >= 0; x--)
  for (int y = viewpoint_chunk_y - 1; y >= 0; y--)
    render_chunk(x, y);

While inside the volume, some form of frustum culling would help a lot too.

Also, try running it through a profiler, perhaps a normal profiler plus an OpenGL-specific one. The fragment shader could be an issue too, since it is the “inner loop”, and anything that can be moved out of it could provide a boost.

I had already tried that; it made no difference in performance.

Yes, I’ll add it at a later stage, but the current scene size is what will be inside the view frustum in the end (I will load chunks centered around the camera, with a view distance of at least 16 chunks). The reason I haven’t implemented frustum culling yet is that I want to try several different formats for the voxels (i.e. various combinations of arrays and octrees).

I have trouble running gDEBugger on my 64-bit Linux, and trouble getting SDL 1.3 + GLEW to work on the Windows side… :(

I’ll probably switch to SDL 1.2.

If I replace my (already simple) fragment shader with a completely trivial one (a single assignment), I get exactly the same framerate. If I do the same thing with the vertex shader, I get a very small gain (from 50 fps to 51 or 52). So I guess that means I’m CPU bound, but my render function is really simple: for each of the 256 chunks I do a matrix product, set the model matrix uniform, bind the VAO and call glDrawElements.

Remove your CPU matrix multiplication and try the same test without uniforms. That can be a good start to find out whether you’re CPU limited. But actually I don’t believe you are: you have a slightly better graphics card than mine, so I expect you have at least the same kind of CPU (here I have an Athlon X2).

Do you currently update your VBOs once before rendering, or do you do this regularly?

arts suggested I change that to world coordinates in order to avoid setting the model matrix uniform 256 times per frame (if I understood correctly). This would be possible since the chunks never move, but I was afraid of rounding errors. Your link seems to confirm that.

You can have a precision of 10^-3 with a world size of (20000-1) * (20000-1), which looks more than big enough to me for a Minecraft-style game (that’s a world 20 km wide with a precision of 1 millimeter).

You can improve a bit more with no uniforms, a single VBO (so single calls to glVertexAttrib*), fewer calls to glDrawElements, and more clever indices. And as Dan Barlett suggested, draw from front to back. But you’re right, that’s only optimization; all you could gain in the end might be on the order of 3-5% :)

Also, give instancing a try.

And as a side note, since your shaders are really simple, expect a drop when you add lighting (especially per-pixel lighting), at least if you’re not CPU limited.

PS: can you keep the same world for the tests? I see that you often get different results (you started at 40 fps and now you’re at about 50 fps), I guess due to changes in your Perlin noise functions? That really doesn’t help.

Sorry, I forgot to mention: since my first post I removed the quads at the edges of the world, which is why I am now at 50 fps with the same noise function and settings.

I tried that, but I gained only a few fps (from 50 to 51 or 52 with the original noise function). That’s why I prefer to stay with model-space coordinates and keep the model matrix for each chunk (also, see below).

20 km could be enough for me, but Minecraft goes beyond that. Also, I’ve found in this post from Notch that he uses local (camera-space) coordinates for rendering, so that does not explain the framerate difference.

All my VBOs are constructed and uploaded only once.

Ah yes, I completely forgot about interleaved buffers… That might very well be the source of my problem. I’ll test that.
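
Something like this is what I’m going to try (a sketch; position_attrib, color_attrib and chunk_vbo are just my attribute locations and buffer handle):

struct Vertex {
    float   x, y, z;        // position
    GLubyte color_index;    // index into the 1D color texture
    GLubyte pad[3];         // keep the stride 32-bit aligned
};

glBindBuffer(GL_ARRAY_BUFFER, chunk_vbo);
glVertexAttribPointer(position_attrib, 3, GL_FLOAT, GL_FALSE,
                      sizeof(Vertex), (void*)offsetof(Vertex, x));
glVertexAttribPointer(color_attrib, 1, GL_UNSIGNED_BYTE, GL_FALSE,
                      sizeof(Vertex), (void*)offsetof(Vertex, color_index));
glEnableVertexAttribArray(position_attrib);
glEnableVertexAttribArray(color_attrib);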

What do you mean by “more clever indices”?

From what I’ve read about others’ attempts, that would only hurt performance in my case. This kind of renderer relies on discarding a huge proportion of the faces, so I can’t instance cubes, and AFAIK instancing quads is pointless.

That’s why I try very hard to get the base engine right! ^ ^

Also, I’ll decide what kind of lighting to implement depending on the performance I can get. In a voxel world, I think there are a lot of ways to “cheat” on the lighting and still get decent-looking, dynamic results.

Yay!

Moving to interleaved buffers (32-bit aligned) made the framerate jump from 50 fps to more than 130… I feel silly now; I should have thought of that. But I had no idea it would have so much impact!

I’ll probably play a little with the geometry shader approach to see if I can improve on that, but that’s already good enough.

Thanks everyone for your help!