OpenGL v2.1 + GLSL v1.2 => OpenGL v3.1 + GLSL v1.4

We’ve been creating a portable 3D game/graphics/simulation engine for nearly two years now. Our computers were limited to OpenGL v2.1 and GLSL v1.20 until just today, when we upgraded to GTX285 cards and drivers that support OpenGL v3.1 and GLSL v1.40.

We worked hard to make our application as fast and efficient as possible given the OpenGL/GLSL versions we had, and now want to upgrade any part of our engine that can be made substantially faster or more flexible. Our current design is 100% based upon the fastest approach we could concoct with IBOs/VBOs/FBOs and v1.20 shaders.

Our question now is this. What new capabilities and features in OpenGL v3.1 and GLSL v1.40 are potentially most fruitful to explore to make our engine faster and more flexible?

Of course, feel free to link to articles and PDFs that already address these questions.

Thanks in advance for all suggestions.

Off-hand, some new stuff you can take advantage of:

Vertex Array Objects (VAOs) can reduce the cost of changing vertex streams.
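As a sketch (the buffer objects `vbo` and `ibo` and the `Vertex` struct are assumed to exist already; names are made up), setup and per-frame use might look like:

```c
/* One-time setup: record vertex attribute state into a VAO. */
GLuint vao;
glGenVertexArrays(1, &vao);
glBindVertexArray(vao);

glBindBuffer(GL_ARRAY_BUFFER, vbo);
glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, sizeof(Vertex), (void *)0);
glEnableVertexAttribArray(0);
glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, ibo);  /* index binding is captured by the VAO */

glBindVertexArray(0);

/* Per frame: one call replaces all of the above binds and pointers. */
glBindVertexArray(vao);
glDrawElements(GL_TRIANGLES, count, GL_UNSIGNED_SHORT, 0);
```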

Uniform Buffer Objects (UBOs) can reduce the cost of uniform updates (one of the slowest parts in modern OpenGL programs).
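A rough sketch of the API side (the block name "Transforms" and the `block_data` struct are hypothetical):

```c
/* One-time setup: attach a named uniform block to binding point 0. */
GLuint ubo;
glGenBuffers(1, &ubo);
glBindBuffer(GL_UNIFORM_BUFFER, ubo);
glBufferData(GL_UNIFORM_BUFFER, sizeof(block_data), NULL, GL_DYNAMIC_DRAW);

GLuint idx = glGetUniformBlockIndex(program, "Transforms");
glUniformBlockBinding(program, idx, 0);
glBindBufferBase(GL_UNIFORM_BUFFER, 0, ubo);

/* Per frame: one buffer upload instead of many glUniform* calls. */
glBufferSubData(GL_UNIFORM_BUFFER, 0, sizeof(block_data), &block_data);
```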

DrawElementsInstanced can accelerate the rendering of similar models.
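For example (names hypothetical), one call can draw many copies of a mesh, with the vertex shader reading the built-in gl_InstanceID to select per-instance data:

```c
/* Draw the same indexed mesh 1000 times in a single call.  In the
   vertex shader, gl_InstanceID runs 0..999 and can index per-instance
   transforms stored in a uniform block or a buffer texture. */
glDrawElementsInstanced(GL_TRIANGLES, index_count, GL_UNSIGNED_SHORT, 0, 1000);
```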

You can rely on new texture formats (e.g. filterable 32-bit floating-point textures, RG-channel textures). You can take advantage of these to reduce memory consumption or increase speed, e.g. in variance shadow mapping.

MapBufferRange is a kick-ass feature!

But it can involve significant changes to your buffer-management code.
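In practice the entry point is glMapBufferRange; a sketch of a partial VBO update that avoids a synchronization stall (`offset`, `length`, and `new_vertices` are hypothetical):

```c
/* Unlike glBufferSubData, the flags let you tell the driver not to
   synchronize with the GPU or preserve data you are about to overwrite. */
void *ptr = glMapBufferRange(GL_ARRAY_BUFFER, offset, length,
                             GL_MAP_WRITE_BIT |
                             GL_MAP_INVALIDATE_RANGE_BIT |
                             GL_MAP_UNSYNCHRONIZED_BIT);
memcpy(ptr, new_vertices, length);
glUnmapBuffer(GL_ARRAY_BUFFER);
```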

Bindless buffer objects can increase application performance by up to 7x.
There is no need to bind your buffers each frame. The driver feeds GPU addresses back to the GL application, so you can use C-style pointers to access buffers.

GL_NV_shader_buffer_load and GL_NV_vertex_buffer_unified_memory

I’m pretty sure the OP asked for OpenGL 3.1 / GLSL 1.4 features, not vendor-specific extensions. :slight_smile:

If you have the motivation and man-power to utilize this sort of thing, you should also check EXT_geometry_shader and custom multisample resolves (both are NV-only at the moment).

To use this, you must limit your program to NVIDIA hardware. You must also rewrite all of your shaders to use a completely different style of data access.

If you want to avoid the platform restriction, you must implement and maintain 2 different sets of shaders. One set works with bindless; the other set works with standard OpenGL.

You could use texture arrays to limit the texture bind calls. In your shaders you can specify which texture layer from the texture array you want to use.
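A sketch of the allocation (sizes hypothetical); in GLSL the sampler type is sampler2DArray and the third texture coordinate selects the layer:

```c
/* Allocate a 2D texture array: 64 layers of 256x256 RGBA8, all in one
   texture object, so no glBindTexture between draws that use different
   layers.  In the shader: texture(my_array, vec3(uv, layer)). */
glBindTexture(GL_TEXTURE_2D_ARRAY, tex);
glTexImage3D(GL_TEXTURE_2D_ARRAY, 0, GL_RGBA8, 256, 256, 64,
             0, GL_RGBA, GL_UNSIGNED_BYTE, NULL);
```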

Thanks for all the great feedback. Keep it coming! I’ll ask followup questions to everyone in this one reply.

— VAOs —
Would I be correct to say that VAOs are simply VBOs with the offsets/strides/datatypes of the contained vertices permanently bound? Presumably this is done to eliminate the need to specify offsets/strides/datatypes each time you make a VBO active and render it. Is that all VAOs are, or did I miss something?

— UBOs —
Would I be correct to say that a UBO is the complete set of uniform variables the shaders expect? If I understand this correctly, an application would need to define a set of offsets/strides/datatypes for the individual elements in a specific “uniform buffer object” just once. Then just before the application calls an OpenGL render function, it would update the values in its “uniform buffer object” image in CPU memory, then tell the driver where it is before the render call. The driver would then load all uniform variables into the GPU before rendering begins. Is this approximately correct?

— DrawElementsInstanced —
I recall reading somewhere that a new built-in shader variable came into existence in some version of OpenGL/GLSL after v2.1/v1.20 — called a vertexID number or similar. My assumption is, this vertexID number identifies which vertex in the VBO is currently being processed by each vertex shader (starting at zero, I assume). I guess this would be the value fetched from the IBO (the VBO that contains indices into the vertex VBO). That would seem to provide what is required for “instancing”. Thus I don’t see a need for special instancing draw calls. What am I missing?
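For concreteness, here is a v1.40 vertex shader sketch showing both built-ins (the per-instance buffer fetch is hypothetical); note gl_VertexID repeats identically for every copy of the mesh, so it cannot distinguish instances by itself:

```glsl
#version 140
uniform mat4 view_proj;
uniform samplerBuffer instance_positions;  // hypothetical per-instance data
in vec4 position;

void main() {
    // gl_VertexID   = the index just fetched from the IBO
    // gl_InstanceID = which copy of the mesh this is (0..instances-1),
    //                 supplied by *DrawElementsInstanced
    vec3 offset = texelFetch(instance_positions, gl_InstanceID).xyz;
    gl_Position = view_proj * (position + vec4(offset, 0.0));
}
```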

— MapBufferRange —
This I do not understand. Our engine calls glBufferSubData() regularly, which we assume updates a portion of a VBO (sometimes the entire IBO or VBO in our case). I must be missing something about the intent of this function.

— OpenGL standard versus extension —
You are both correct. We are interested in opportunities that are extensions today IF they are fairly likely to become core eventually (in similar form). As long as ATI remains a viable and popular source of high-performance video cards, we prefer not to lock our software into nvidia (even though we have been 100% nvidia since the beginning). nvidia has been great, but we have nothing against AMD — all our CPUs are Phenom2s!

— bindless graphics —
I read the nvidia PDF and the two extension text files, but have not gotten my brain around this yet. First, I find it difficult to believe that cache misses in the driver caused by looking up GPU addresses can slow any application by 7%, much less 700%. However, I applaud on principle the practice of letting CPU software control the GPU on the lowest feasible level.

It appears that VAOs eliminate the need to specify the offsets/strides/datatypes before each render. How much more efficiency does this extension offer over VAOs (which presumably are standard OpenGL)?

— texture arrays —
Is a texture array [?object?] different from a 3D texture? Are they different in the sense each texture in a texture-array can be different size [and format]? That would be very nice indeed, and much more convenient than our “hack” with 3D textures.

— maximum speed techniques —
Currently our engine has large IBOs and VBOs (65536 elements each), and typically we render each IBO/VBO pair in one or two OpenGL glDrawElements() or glDrawRangeElements() calls. We can do this because we make the CPU transform every vertex to world coordinates (because we need world-coordinates for collision detection and simulations of several physical processes). Our vertex contains 16 bits of flags (now a true integer type!!!) that can change the behavior of the shader. All this combines to let us render up to 65536 vertices per draw call, thereby amortizing the overhead involved in state changes over 65536 vertices. Every once in a while we think “maybe this way is a mistake”, but so far our analysis and tests say this way is best, all things considered (for our engine, anyway).

Lately we have been wondering whether we should take this approach even further, switch to 32-bit indices, and put all our vertices into one huge IBO/VBO pair (up to ~30 million vertices). We could render large subsets of the IBO/VBO by calling glDrawRangeElements(), then update vertices outside that range by calling glBufferSubData(). That’s what we do now, except we always update the contents of each VBO before we render it (we never modify an IBO or VBO being rendered).

Our main motivation is not to increase performance, since our batch size is already huge, so further increases would likely not improve throughput measurably.

Instead, our main motivation is flexibility: to allow our engine to dynamically regroup “objects” in any way it wishes, simply by reloading modest subsections of the IBO only (vertices in the VBO never need to move when they are all inside one VBO).

Why would we want to do this? Here is one possibility, for example. Imagine a cube/tetrahedron/icosahedron (or other opportunistic shape) centered on the camera/viewpoint, with the camera pointing through the center of one face (or through a vertex). This divides the universe into 6/8/20/more volumes, each containing the centroid of some subset of all [game/simulation] objects. The objects in several to many of these volumes are not visible given the direction the camera is pointing (and the field-of-view of the camera). The engine can simply NOT DRAW the objects in any portion of the IBO that corresponds to these invisible volumes.

As objects move around in the environment from frame to frame, zero to a few objects will pass from one volume into another volume on each frame. The object can be removed from one volume and put into another simply by moving the object indices from one section of the IBO to another (and recompacting the “from” section of the IBO).

This is just one of several opportunities we find interesting, none of which work without switching to a single huge IBO/VBO pair. Any ideas and comments are welcome.

Texture arrays are different because the 3rd dimension is not reduced with increasing mipmap levels. The different layers are really 2D textures with no filtering between them. It is really an array of 2D textures :slight_smile:

Correct. But all textures must have the same dimensions and format, if I recall correctly. So the only difference (I think) is that no filtering is possible in the 3rd dimension. And there is no such thing as border layers (not sure if these are available in 3D textures…).

Your latter two sentences are nearly correct. The first one implies something different.

VBOs (or client arrays) still encapsulate the “contents” of vertex attribute arrays, while VAOs encapsulate the “bindings” of those VBOs (or client arrays) to vertex attributes and the enabling/disabling of those vertex attributes for drawing a batch.

Reduces the number of GL calls needed to render a batch. So in the case where you only need to change a texture between batches, you have:

  glBindTexture( GL_TEXTURE_2D, ... )
  glBindVertexArray( ... )
  glDrawElements( ... )

To use this, you must limit your program to NVIDIA hardware. You must also rewrite all of your shaders to use a completely different style of data access.

Correct if you want to get the most out of bindless graphics. But you can limit yourself to just the OpenGL API side and only use the vertex buffer unified memory part of bindless graphics. This presentation explains this in more detail.

(with my NVIDIA hat on)

I’ve looked at that tutorial, and understand bindless graphics at a general level so far. I have three followup questions related to bindless graphics.

#1: Is “bindless graphics” (in some form) likely to ever achieve core status? I believe it should, but whether it does likely depends upon whether the internal architecture of AMD/ATI is similar enough to nvidia to support a common interface. On a personal note, I am 100% supportive of any feature that gives applications lower-level control of the GPU, even if it requires the application keep track of more details (that seems to always be necessary to be efficient in any case).

#2: Except for a small number of special cases, our current engine renders each 65536 element IBO/VBO pair with one call of glDrawElements(). As long as we keep this architecture (which costs us only modest overhead elsewhere), should we expect only minor performance gains with bindless graphics?

#3: We are planning to switch over to VAOs in the next month or two, but I cannot quite get my brain to fully understand them. On the one hand, it seems like an application can make an IBO/VBO pair, plus all the vertex attribute types/offsets/normalizations active simply by binding one VAO. However, I do not see (in the OpenGL v3.1 specification) where the VAO state contains the VBO identifier.

At first I thought the very last item in table 6.4 was the VBO identifier, but then I noticed the table says it contains up to 16 of these states, which it calls VERTEX_ATTRIB_ARRAY_BUFFER_BINDING.

Either the documentation is wrong and there is only room for one VERTEX_ATTRIB_ARRAY_BUFFER_BINDING, or the VAO can actually have up to 16 separate VBOs attached, or I’m just totally confused.

If the second case is correct, how does a program bind 16 separate VBOs to one VAO? I don’t see how the OpenGL syntax even makes that possible.

    // Bind the vertex data
    glBindBuffer(GL_ARRAY_BUFFER, VertexName);
    glVertexPointer(3, GL_FLOAT, 0, 0);

    // Bind the color data
    glBindBuffer(GL_ARRAY_BUFFER, ColorName);
    glColorPointer(4, GL_UNSIGNED_BYTE, 0, 0);

    // Bind the normal data
    glBindBuffer(GL_ARRAY_BUFFER, NormalName);
    glNormalPointer(GL_FLOAT, 0, 0);

    // Bind the texcoord data
    glBindBuffer(GL_ARRAY_BUFFER, TexCoordName);
    glTexCoordPointer(2, GL_FLOAT, 0, 0);

    // Now enable/disable arrays (yes, in reality you should do this lazily, but we're just illustrating the point)
    glEnableClientState( GL_VERTEX_ARRAY );
    glEnableClientState( GL_COLOR_ARRAY );
    glEnableClientState( GL_NORMAL_ARRAY );
    glEnableClientState( GL_TEXTURE_COORD_ARRAY );
    // And disable any others that might be enabled here
    glDisableClientState( ... )

All this “stuff” would be bound up in a VAO, so that you could just call:

glBindVertexArray( my_vao )

and be done with it.
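In a 3.1 core context the same idea uses generic attributes instead of client state. A rough sketch (buffer names as above; attribute locations 0 and 1 are assumed to match the shader), which also shows how one VAO ends up referencing multiple VBOs:

```c
GLuint vao;
glGenVertexArrays(1, &vao);
glBindVertexArray(vao);

/* Each glVertexAttribPointer call captures the currently bound
   GL_ARRAY_BUFFER into the VAO -- which is how a single VAO can
   reference up to 16 separate VBOs, one per attribute. */
glBindBuffer(GL_ARRAY_BUFFER, VertexName);
glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, 0, 0);   /* position */
glEnableVertexAttribArray(0);

glBindBuffer(GL_ARRAY_BUFFER, ColorName);
glVertexAttribPointer(1, 4, GL_UNSIGNED_BYTE, GL_TRUE, 0, 0); /* color */
glEnableVertexAttribArray(1);

glBindVertexArray(0);
```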