Performance problem with a small voxel engine

Good to hear you got it sorted.

Out of interest, does having the one-byte color index values aligned on 4-byte boundaries, but still in a separate buffer object, give any improvement over having them tightly packed? Interleaving is usually better when possible, though.

Have you remembered to switch back-face culling on too? (Never mind, I've already seen you mention this in another post.)

Maybe now that the major bottleneck is gone, the other optimizations such as rendering front to back will have an effect, although that one matters more when rendering fragments is expensive.

If matrix multiplication becomes a bottleneck (although 256 matrix multiplications shouldn't be one), then one trick you could use, if the chunks are always aligned with the world axes, is to simplify the matrix multiplication.

If you have a constant view-projection matrix across all chunks:


[a e i m]
[b f j n]
[c g k o]
[d h l p]

And you are multiplying it by the chunk model matrix that is simply a translation from the origin:


[1 0 0 x]
[0 1 0 y]
[0 0 1 z]
[0 0 0 1]

Then the matrix multiplication can be simplified to:


[a e i m][1 0 0 x]   [a e i ax+ey+iz+m]
[b f j n][0 1 0 y] = [b f j bx+fy+jz+n]
[c g k o][0 0 1 z]   [c g k cx+gy+kz+o]
[d h l p][0 0 0 1]   [d h l dx+hy+lz+p]

Only the last column varies across each of the chunks.
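
In code, this amounts to copying the view-projection matrix and patching only its last column for each chunk. A minimal sketch, assuming column-major float[16] matrices as OpenGL expects them (the function name is just illustrative):


#include <cstring>

// Build the per-chunk model-view-projection matrix for a chunk translated
// to (x, y, z), reusing the shared view-projection matrix vp.
void chunk_mvp(const float vp[16], float x, float y, float z, float out[16])
{
    // Columns 0-2 of the result are identical to the view-projection matrix
    std::memcpy(out, vp, 16 * sizeof(float));

    // Recompute only the last column: vp * (x, y, z, 1)
    out[12] = vp[0] * x + vp[4] * y + vp[8]  * z + vp[12];
    out[13] = vp[1] * x + vp[5] * y + vp[9]  * z + vp[13];
    out[14] = vp[2] * x + vp[6] * y + vp[10] * z + vp[14];
    out[15] = vp[3] * x + vp[7] * y + vp[11] * z + vp[15];
}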

Does removing the glBindVertexArray call have much of an impact on performance? If it does, then putting everything into one buffer object, as other people have mentioned, might help: use glBufferSubData to stream in new chunks and glDrawRangeElements to draw each visible chunk. If you can use more recent extensions, then glMapBufferRange (to write to a range of the buffer) plus glDrawElementsBaseVertex (to allow a shared index buffer) could be useful.
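
To make that concrete, here is a minimal sketch of the single-buffer approach (the ChunkSlot bookkeeping, offsets and sizes are purely illustrative, and it assumes a GL loader/header is already in place):


#include <cstdint>
#include <vector>

struct ChunkSlot {
    GLintptr vertex_byte_offset;  // where this chunk's vertices live in the big VBO
    GLintptr index_byte_offset;   // where its indices live in the index buffer
    GLuint   first_vertex;        // index of its first vertex
    GLuint   vertex_count;
    GLsizei  index_count;
};

// Stream a newly loaded chunk into its reserved region of the shared VBO.
void upload_chunk(GLuint vbo, const ChunkSlot& slot, const std::vector<int8_t>& data)
{
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glBufferSubData(GL_ARRAY_BUFFER, slot.vertex_byte_offset,
                    data.size(), data.data());
}

// Draw one visible chunk from its own range of the shared buffers.
void draw_chunk(const ChunkSlot& slot)
{
    glDrawRangeElements(GL_TRIANGLES,
                        slot.first_vertex,
                        slot.first_vertex + slot.vertex_count - 1,
                        slot.index_count, GL_UNSIGNED_INT,
                        reinterpret_cast<void*>(slot.index_byte_offset));
}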

What exactly do you mean by interleaved buffers?

You’re right! I just replaced the one-byte color index with a 32-bit integer, and I also get a framerate of 130 fps. So alignment was the problem.

The problem is that different chunks can have very different sizes, and the maximum size is huge. And because there will be frequent chunk loads/unloads, I don't see how I could put everything in a single buffer without doing some scary memory management to avoid fragmentation.

I will do the matrix optimization, as the CPU will have more work to do in the real application.

It means having coordinates and colors in the same buffer, “mixed” in this order: position of vertex 1, color of vertex 1, position of vertex 2, color of vertex 2, and so on.
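
For example, an interleaved layout could look like this (just a sketch; the struct name and attribute locations are illustrative, and both attributes are sourced from the same VBO):


#include <cstddef>
#include <cstdint>

struct Vertex {
    float   position[3];  // vertex coordinates
    int32_t color;        // color index, padded to 4 bytes
};

// With the interleaved VBO bound to GL_ARRAY_BUFFER, the stride for both
// attributes is the size of one whole vertex:
glVertexAttribPointer(position_attribute, 3, GL_FLOAT, GL_FALSE,
                      sizeof(Vertex),
                      reinterpret_cast<void*>(offsetof(Vertex, position)));
glVertexAttribIPointer(color_attribute, 1, GL_INT,
                       sizeof(Vertex),
                       reinterpret_cast<void*>(offsetof(Vertex, color)));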

But as Dan Bartlett suggested, the problem was that my color buffer was using one-byte colors. Aligning the color buffer to 32 bits takes more memory, but is way faster.

The good news is I'll probably find some use for the extra bytes when I implement lighting.

I've come across the alignment problem when using 3 bytes for colours (red, green and blue) tightly packed. It's better to include the extra byte for alpha, either by using a stride of 4 or by filling the unused alpha values with 255.
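
For example, something like this keeps each color 4-byte aligned even though only three channels are used (a sketch; the attribute index and normalization flag are illustrative):


#include <cstdint>

struct Color4 {
    uint8_t r, g, b;
    uint8_t a;  // unused alpha byte (e.g. filled with 255) keeps sizeof(Color4) == 4
};

glVertexAttribPointer(color_attribute, 4, GL_UNSIGNED_BYTE, GL_TRUE,
                      sizeof(Color4), 0);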

Looking at http://developer.amd.com/media/gpu_assets/ATI_OpenGL_Programming_and_Optimization_Guide.pdf, for data types that take up 2 bytes you'll also have performance problems if you use 1 or 3 elements, since each attribute will then take up 2 or 6 bytes and would need 2 bytes of padding to be 32-bit aligned for better performance (at least on ATI hardware, and probably on others too).

That document is very interesting!

However, that just doesn’t match what is going on with my code.

I've done some more tests, and the only way to get good performance is to use 32-bit integers for the color index. If I use 32-bit-aligned shorts or bytes, performance drops. This happens whether I use two buffers or a single interleaved one (the drop is larger in the latter case).

Which is exactly the opposite of what is stated in the ATI document: that I should avoid using integers in VBOs.

So either I have something horribly wrong in my code, or there’s something I don’t understand correctly in the recommendations.

If I'm not mistaken, the only places where I could have done something wrong are the vertex shader and the buffer format. The vertex shader is really simple:


#version 330 core

in vec3 position;
in int color;

uniform mat4 mvp_matrix;

flat out int vs_color; // integer outputs must use the flat interpolation qualifier
out vec3 vs_position;

void main(void)
{
    gl_Position =  mvp_matrix * vec4(position, 1.0);
    vs_position = position;
    vs_color = color;
}

If I use a color buffer containing 32-bit integers:


vector<int32_t> colors;

glVertexAttribIPointer(color_attribute, 1, GL_INT, sizeof(int32_t), 0);

glBufferData(GL_ARRAY_BUFFER, colors.size() * sizeof(int32_t), 
             &colors[0], GL_STATIC_DRAW);

…then I get good performance (130 fps).

If I use 32-bit aligned bytes instead:


struct VertexColor {
    int8_t color;
    int8_t pad1;
    int16_t pad2;
};
vector<VertexColor> colors;

glVertexAttribIPointer(color_attribute, 1, GL_BYTE, sizeof(VertexColor), 0);

glBufferData(GL_ARRAY_BUFFER, colors.size() * sizeof(VertexColor), 
             &colors[0], GL_STATIC_DRAW);

…everything is correctly displayed, but performance drops to 50 fps.

The important thing is that I know how to get a good framerate, but I'd like to understand what is going on…

However, that just doesn’t match what is going on with my code.

Of course it doesn’t. You are using integral attributes: attributes that are actually integers, rather than those that are converted to floats.

That document is for R300, R400, and R500, none of which could use integral attributes. Though, it should be noted that this is the first time you’ve said you’re using integral attributes.

When I said that you should use integers, I did not mean using glVertexAttribIPointer. I meant using the regular glVertexAttribPointer with integer types and no normalization.

The vertex shader still gets “floats”, but the data in the buffer are “integers”.
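
For example, the only difference is which attrib-pointer call you use for the same byte data (a sketch; the attribute index and stride are illustrative):


// Integral attribute: the shader declares "in int color;" and sees the raw
// integer value.
glVertexAttribIPointer(color_attribute, 1, GL_UNSIGNED_BYTE, 4, 0);

// Converted attribute: the shader declares "in float color;"; with
// normalization off, a byte value of 37 arrives as 37.0.
glVertexAttribPointer(color_attribute, 1, GL_UNSIGNED_BYTE, GL_FALSE, 4, 0);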

Ok, I understand now. Thanks for the explanation. And I’m sorry I forgot to mention using int in the shader…

I switched to integral attributes to be able to use texelFetch in the fragment shader. I should have posted the code, but in my mind it was completely straightforward…

I switched to integral attributes to be able to use texelFetch in the fragment shader.

So you switched to using integral vertex attributes, so your fragment shader could use texelFetch. Couldn’t you just use a cast in the fragment shader?

I just tried with floats and a cast, performance is the same, at least on my ATI.

I prefer to keep integers, because in the future I'll probably use the 3 unused bytes (using masks) for some lighting data, and I'm afraid that conversions from CPU int to CPU/GPU float to GPU int could introduce errors.

(Also, I may move texelFetch to the vertex shader, depending on whether I choose vertex or pixel lighting.)
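
For instance, the packing could look something like this (just a sketch; the field layout is purely illustrative):


#include <cstdint>

// Pack the color index into the low byte and a light level into the next
// byte of the 32-bit attribute; the shader can recover them with the same
// masks and shifts (color & 0xFF, (color >> 8) & 0xFF).
int32_t pack_vertex_data(uint8_t color_index, uint8_t light_level)
{
    return static_cast<int32_t>(color_index) |
           (static_cast<int32_t>(light_level) << 8);
}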