Performance advice for OpenGL 4+?

Dear all,

I’m about to refactor my 3D demo project. It was born around the time OpenGL 1.5 was considered modern. Since then some newer functionality has been added (VBOs, floating-point textures, FBOs and so on), but only in isolated places.

I know that the answer to performance questions is always “it depends”, so I’m trying to narrow the scope of my question. If you were to target OpenGL 4+ compatible NVIDIA graphics cards in the scenario I describe below, what would you recommend?

Here’s the context of my demo: A scenegraph holds some objects which are composed of multiple (say 100) meshes (which may have varying matrices, textures and materials). During rendering the meshes are sorted by state. All meshes are stored in one large interleaved VBO which is bound once per frame. For each mesh the vertex attributes are set up (glVertexAttribPointer), shader uniforms are set (glUniform) and textures might be bound (glBindTexture) before the mesh is rendered (glDrawRangeElements).
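To make the "sorted by state" step concrete, here is a minimal sketch. The DrawItem record, its field names and the sort order (program first, then texture, then material) are assumptions for illustration, not the poster's actual code; which state change is most expensive is something to measure.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <tuple>
#include <vector>

// Hypothetical per-mesh draw record; the field names are illustrative only.
struct DrawItem {
    std::uint32_t program;   // shader program handle
    std::uint32_t texture;   // texture handle (0 = none)
    std::uint32_t material;  // index into a material table
    std::uint32_t meshId;    // which mesh to draw
};

// Sort so that the state assumed to be most expensive to change (the program)
// varies slowest, followed by texture and material.
inline void sortByState(std::vector<DrawItem>& items) {
    std::sort(items.begin(), items.end(),
              [](const DrawItem& a, const DrawItem& b) {
                  return std::tie(a.program, a.texture, a.material) <
                         std::tie(b.program, b.texture, b.material);
              });
}

// Count how many program binds would be issued when drawing in list order.
inline int countProgramSwitches(const std::vector<DrawItem>& items) {
    int switches = 0;
    for (std::size_t i = 0; i < items.size(); ++i) {
        if (i == 0 || items[i].program != items[i - 1].program) {
            ++switches;
        }
    }
    return switches;
}
```

After sorting, the render loop only needs to bind a program or texture when the value differs from the previous item.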

What are your performance observations with UBO vs. glUniform calls, explicit vs. automatic attribute locations in GLSL, one VAO per scene with one VBO per scene or one VBO per object vs. one VAO per object? What about bindless or direct state access?



A few ideas, based on 3 things: (1) minimize GL calls, (2) minimize state switching, (3) better caching.

Using a UBO will be better, especially if you have many uniforms per program, because you save the multiple glUniform calls. In my project a huge overhead is the setup of the program (program bind & glUniform*).

Another thing is that when you set a uniform you don’t have to re-set it the next time you bind the program. The value is retained in the program object.

Another is not to switch state a lot: don’t enable/disable blend, depth and stencil all the time, and don’t switch programs all the time.
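One way to avoid redundant enable/disable calls is a small client-side cache that forwards a change only when the value differs from the last one it saw. This is only a sketch: the CapabilityCache class is hypothetical, and the callback stands in for glEnable/glDisable so the code runs without a GL context.

```cpp
#include <cassert>
#include <functional>
#include <unordered_map>
#include <utility>

// Hypothetical wrapper that filters redundant capability changes.
// In a real renderer `apply` would call glEnable(cap) or glDisable(cap).
class CapabilityCache {
public:
    using ApplyFn = std::function<void(unsigned cap, bool enabled)>;

    explicit CapabilityCache(ApplyFn apply) : apply_(std::move(apply)) {
        assert(apply_ && "an apply callback is required");
    }

    // Returns true if a call was actually issued, false if it was redundant.
    // The first request for a capability is always forwarded, because the
    // cache does not know the driver's current value yet.
    bool set(unsigned cap, bool enabled) {
        auto it = state_.find(cap);
        if (it != state_.end() && it->second == enabled) {
            return false;
        }
        state_[cap] = enabled;
        apply_(cap, enabled);
        return true;
    }

private:
    ApplyFn apply_;
    std::unordered_map<unsigned, bool> state_;
};
```

Such a cache is only valid if every state change goes through it; any direct GL call elsewhere makes the cached values stale.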

As for the VAOs, I use one per mesh. In your case, where you have one big VBO, you probably have only one other bind (the index VBO), so you should be alright. If you call glVertexAttribPointer multiple times per draw, you probably need to move to a VAO per mesh.

Another thing is the way you store in the VBO. For better caching you should use something like this:
mesh_0[position_vec3_0, tex_coords_vec2_0, norm_vec3_0, position_vec3_1, tex_coords_vec2_1, norm_vec3_1…] mesh_1[…]
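As a sketch of that interleaved layout on the C++ side, the struct below mirrors one vertex (position, texture coordinates, normal), and the constants are the stride and byte offsets that would be passed to glVertexAttribPointer. The names Vertex and vertexByteOffset are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>

// One interleaved vertex: position (vec3), texture coordinates (vec2), normal (vec3).
struct Vertex {
    float position[3];
    float texCoord[2];
    float normal[3];
};

// The layout only works as intended if the compiler adds no padding.
static_assert(sizeof(Vertex) == 8 * sizeof(float), "Vertex must be tightly packed");

// Stride and per-attribute byte offsets for glVertexAttribPointer.
constexpr std::size_t kStride         = sizeof(Vertex);
constexpr std::size_t kPositionOffset = offsetof(Vertex, position);
constexpr std::size_t kTexCoordOffset = offsetof(Vertex, texCoord);
constexpr std::size_t kNormalOffset   = offsetof(Vertex, normal);

// Byte offset of vertex `index` of a mesh whose first vertex sits at
// `meshFirstVertex` inside the shared VBO (hypothetical bookkeeping values).
constexpr std::size_t vertexByteOffset(std::size_t meshFirstVertex, std::size_t index) {
    return (meshFirstVertex + index) * kStride;
}
```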

If you are drawing the same mesh many times per frame (e.g. particles) you could use instanced rendering. It’s insanely fast.

As for explicit vs. automatic attribute locations, I haven’t seen any performance difference. GL probably doesn’t care.

You’ve already gotten some good info. In general, if I were you and considering a performance choice, I would test both (or all) options with your GL usage and datasets. Performance tradeoffs can shift with your usage quite a bit. For instance, in my experience with separate VBOs per object, VAOs were better than nothing, but bindless surpassed even VAOs and roughly equaled NVidia’s legendary display list performance. Combining bindless with VAOs yielded a little worse perf than with bindless alone. However, after switching rendering largely from the “separate VBO” model (which is very limiting, particularly when your runtime datasets are too big to fit on the GPU) to a “streaming VBO” model (streaming the data to the GPU in a single streaming VBO prior to batch launch, with caching and reuse when possible), then obviously the “VBO switching” went way down and so of course did the performance benefit of bindless and VAOs. So it’s all in how you choose to store and launch your batches. Just keep in mind that bindless is (unfortunately) still NV only, so if you support it be sure to make its use conditional on whether the extensions are supported.
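A minimal sketch of making the bindless path conditional on extension support follows. The extension list is passed in so the code runs without a GL context; in a real application it would be filled from glGetStringi(GL_EXTENSIONS, i). The helper names and the policy of requiring both extensions are assumptions for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Returns true if `name` appears in the list of extension strings.
inline bool hasExtension(const std::vector<std::string>& extensions,
                         const std::string& name) {
    assert(!name.empty());
    return std::find(extensions.begin(), extensions.end(), name) != extensions.end();
}

// Hypothetical policy: take the bindless vertex path only when both NVIDIA
// extensions are reported by the driver.
inline bool canUseBindlessVertexPath(const std::vector<std::string>& extensions) {
    return hasExtension(extensions, "GL_NV_vertex_buffer_unified_memory") &&
           hasExtension(extensions, "GL_NV_shader_buffer_load");
}
```

The renderer would query this once at startup and fall back to the VAO or plain VBO path otherwise.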

Also, direct state access, like bindless, is a good thing – similar in spirit. It mainly simplifies your code and your use of client GL libraries, but it also lets you stop attaching things to shared bind points before you can operate on them; you just operate on the objects (by handle) directly, which is convenient. And it’s cross-vendor too.

Thank you both. I guess I’ll have to try and measure…

The first thing I tried were UBOs and the result is disappointing.
If I run my demo with just a simple diffuse shader I’m rendering 179 meshes. Parts of the state are updated per shader change (i.e. per frame in this case), others when the transform matrix changes (sometimes), and others per mesh.

Per mesh this meant 3 calls to glUniform4f (diffuse, specular and ambient material color). I replaced those uniforms with a uniform buffer object:

layout(std140) uniform PerObject {
uniform vec4 ambientDiffuse;
uniform vec4 lightDiffuse;
uniform vec4 specular;
};

The size of the UBO is reported to be 48 bytes. It is created once for the application as GL_DYNAMIC_DRAW and updated with glBufferSubData per mesh.
The impact on performance is quite bad. If I call glBufferSubData for each mesh, rendering time is 8.6 msecs. Removing just that glBufferSubData line reduces the time to 6.9 msecs. Rendering time used to be 6.6 msecs when I used only uniforms.
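For reference, the 48 bytes match three tightly packed vec4 members under std140. A CPU-side mirror of the block could look like the sketch below. The alignUp helper is an extra assumption: it is only relevant if one wanted to store one block per mesh in a single buffer and select it with glBindBufferRange, whose offsets must be multiples of GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT. Whether that is faster than a per-mesh glBufferSubData would have to be measured.

```cpp
#include <cassert>
#include <cstddef>

// CPU-side mirror of the std140 block PerObject: three vec4 values, 16 bytes each.
struct PerObject {
    float ambientDiffuse[4];
    float lightDiffuse[4];
    float specular[4];
};

static_assert(sizeof(PerObject) == 48, "three vec4 members occupy 48 bytes");

// Round `size` up to the next multiple of `alignment`. The alignment would be
// queried with glGetIntegerv(GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT, ...); it is a
// plain parameter here so the sketch runs without a GL context.
inline std::size_t alignUp(std::size_t size, std::size_t alignment) {
    assert(alignment > 0);
    return (size + alignment - 1) / alignment * alignment;
}

// Byte offset of the block belonging to mesh `meshIndex` when one aligned
// block per mesh is stored back to back (hypothetical packing scheme).
inline std::size_t perObjectOffset(std::size_t meshIndex, std::size_t alignment) {
    return meshIndex * alignUp(sizeof(PerObject), alignment);
}
```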

Any idea what might go wrong here? Did the standard layout perform well for you? Is GL_DYNAMIC_DRAW or glBufferSubData a bad choice?