I went over some of this suggestion in another thread, but I wanted to clarify some of the points and create a new discussion rather than derail that thread.
Current Performance Problems in OpenGL
NVIDIA’s work on the bindless graphics API suggests that OpenGL has a number of basic inefficiencies in its vertex specification pipeline, ones that provoke a large number of client-side memory accesses for each draw call.
The purpose of this proposal is to solve these problems without resorting to low-level hackery as in the bindless graphics extensions.
Origin of the Problem
Not being an NVIDIA driver developer, I can only speculate as to the ultimate source of the client memory accesses. This analysis may therefore be wrong, and so lead to a wrong conclusion.
The absolute most optimal case for any rendering command is this: add one or more command tokens to the graphics FIFO (whether the actual GPU FIFO or an internal marshalling FIFO). This is the bare minimum of work necessary to actually provoke rendering.
The first question is this: what is in these tokens?
The implementation must communicate the state information of the currently bound VAO. Which vertex attributes are enabled/disabled, what buffer objects+offsets+stride they each use, etc. Basically, the VAO state block.
However, some of that state block means something different in GPU terms, specifically where buffer objects are concerned. All the GPU cares about is getting a pointer, whether to video memory, “AGP” memory, or whatever else it can access.
The VAO stores a buffer object name, not a GPU address. This is important for two reasons. One, buffer object storage can be moved around by the implementation at will. Two, buffer object storage can be reallocated by the application. If you have a VAO that uses buffer object name “40”, and you call glBufferData on it, the VAO must use the new storage from that moment onward.
Reason two is particularly annoying. Because buffer objects can be reallocated by the user, a VAO cannot contain GPU pointers even if the implementation were not free to move storage around.
This means that, in order to generate the previously-mentioned tokens, the implementation must perform the following:
1: Convert the buffer object name into a pointer to an internal object.
2: Query that object for the GPU address.
3: If there is no GPU address yet… Here be dragons!
The unknown portion of step 3 is also a big issue. Obviously implementations must deal with this eventuality, but exactly how they go about it is beyond informed speculation. Whatever the process is, one thing is certain: it will involve more client-side memory access.
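To make the three steps concrete, here is a toy simulation of the work a driver might do per draw call. Every name, type, and address in it is invented for illustration; real driver internals are not public, and this only models the name-to-object lookup, the address query, and a stand-in for the "dragons" path:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical driver-side sketch. None of these names are real driver
 * internals; they only model the per-draw-call steps described above. */

typedef struct BufferObject {
    uint64_t gpu_address;   /* 0 means "no GPU storage allocated yet" */
    size_t   size;
} BufferObject;

/* Step 1: convert a buffer object name into an internal object.
 * A real driver uses a hash table; a flat array stands in for one here. */
#define MAX_BUFFERS 256
static BufferObject g_buffers[MAX_BUFFERS];

static BufferObject *lookup_buffer(uint32_t name) {
    return (name > 0 && name < MAX_BUFFERS) ? &g_buffers[name] : NULL;
}

/* Steps 2 and 3: query the object for its GPU address, allocating
 * storage on first use. Note that every call touches client memory:
 * the lookup table and the object itself. */
static uint64_t resolve_gpu_address(uint32_t name) {
    BufferObject *buf = lookup_buffer(name);
    if (!buf) return 0;
    if (buf->gpu_address == 0) {
        /* "Here be dragons": stand-in for migrating storage to the GPU.
         * The fake address scheme is arbitrary. */
        buf->gpu_address = 0x100000000ull + (uint64_t)name * 0x10000ull;
    }
    return buf->gpu_address;
}
```

The point of the sketch is not the fake address arithmetic but the shape of the work: two dependent client-memory reads before a single token can be emitted, on every draw call.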
Here is the thing: if an implementation could know, with absolute certainty, that the GPU addresses of all of a VAO’s buffer objects would not change, then it could optimize things. The VAO’s state block could be boiled down into a small block of prebuilt tokens, copied directly into the FIFO. Even in this best case, you still need to:
1: Convert the VAO name into a pointer (generally expected to be done when the VAO is bound).
2: Copy the FIFO data into the command stream.
The second part requires some client-memory access. But it’s the absolute bare minimum (without going to full “let the shader read from arbitrary memory” stuff).
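The optimized path can be sketched the same way. Again, the token layout, block size, and FIFO shape are all invented; the only point is that the per-draw cost collapses to one bounds check and one memcpy:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sketch of the optimized path: the VAO's state block has been baked
 * into a fixed run of command tokens ahead of time, so emitting a draw
 * is a single copy. All sizes and layouts here are invented. */

#define TOKEN_BLOCK_WORDS 8
#define FIFO_WORDS 1024

typedef struct LockedVAO {
    uint32_t tokens[TOKEN_BLOCK_WORDS]; /* prebuilt when the VAO was locked */
} LockedVAO;

typedef struct CommandFIFO {
    uint32_t words[FIFO_WORDS];
    size_t   head;
} CommandFIFO;

/* Step 2 from the list above: copy the prebuilt tokens into the command
 * stream. This is the only client-memory access left per draw call. */
static int emit_draw(CommandFIFO *fifo, const LockedVAO *vao) {
    if (fifo->head + TOKEN_BLOCK_WORDS > FIFO_WORDS) return 0; /* FIFO full */
    memcpy(&fifo->words[fifo->head], vao->tokens, sizeof vao->tokens);
    fifo->head += TOKEN_BLOCK_WORDS;
    return 1;
}
```

Compare this with the previous sketch: no name lookup, no address query, no allocation path. That delta is the whole performance argument.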
How to do This
The bottlenecks of client-side memory access have been identified. So how do we solve this?
We provide the ability to lock VAOs.
When a VAO is locked, this relieves the OpenGL implementation of certain responsibilities. First, a locked VAO is immutable; the implementation no longer has to concern itself with changing things at the user’s whim. A locked VAO that is deleted will continue to exist until it is unlocked.
Second, all buffer objects attached to that VAO at the time of locking are themselves locked. Any attempt to call glBufferData or any other function that gives the implementation the right to change the buffer object’s storage will fail so long as that buffer object is attached to a locked VAO. The same buffer object can be locked by multiple VAOs at once; it stays locked until no locked VAO still references it.
Implicitly locking buffer objects also has the effect of providing a strong hint to the implementation. Unlike the bindless graphics ability to make buffer objects resident, it does not force the implementation to fix the object in memory. But it does strongly suggest to the implementation that this buffer object will be in frequent use, and that it should take whatever measures it needs to in order to keep rendering with this data as fast as possible.
To help separate locked VAOs from unlocked ones, the locking function should return a “pointer” (64-bit integer). It is illegal to bind a locked VAO at all; instead, you must bind the pointer with a special bind call (that automatically disables the standard bind point).
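None of this API exists; the function names below (glLockVertexArray and so on) are invented for the sake of this post. But the semantics described above can be modeled in a few lines of plain C, with a lock count standing in for "attached to a locked VAO":

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of the proposed semantics. The real proposal would live in
 * the GL API (hypothetical glLockVertexArray / glBindLockedVertexArray
 * entry points); this just demonstrates the rules. */

typedef struct Buffer {
    int lock_count;     /* how many locked VAOs reference this buffer */
} Buffer;

typedef struct VAO {
    Buffer *attached[4];
    int     attached_count;
    int     locked;
} VAO;

/* Locking a VAO locks every attached buffer and yields the opaque
 * 64-bit "pointer" that the special bind call would take. */
static uint64_t lock_vao(VAO *vao) {
    vao->locked = 1;
    for (int i = 0; i < vao->attached_count; ++i)
        vao->attached[i]->lock_count++;
    return (uint64_t)(uintptr_t)vao;
}

static void unlock_vao(VAO *vao) {
    vao->locked = 0;
    for (int i = 0; i < vao->attached_count; ++i)
        vao->attached[i]->lock_count--;
}

/* Reallocation (the glBufferData case) must fail while any locked VAO
 * holds the buffer, so its storage can never change under a lock.
 * Returns 1 on success, 0 for a GL error. */
static int buffer_data(Buffer *buf) {
    return buf->lock_count == 0;
}
```

The lock count is what makes the multiple-VAO case work: a buffer only becomes reallocatable again once every locked VAO referencing it has been unlocked.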
Comparison to Bindless
This suggestion cannot achieve 100% of the performance advantage of the full bindless API (that is, just giving vertex shaders a few pointers and having them work). However, it should be able to remove enough of the overhead to achieve performance parity with GL_NV_vertex_buffer_unified_memory.
Speaking of which, GL_NV_vertex_buffer_unified_memory tackles this issue in a different way. It uses the bindless shader_load API to allow you to bind bare pointers rather than buffer objects. This in turn relies on making buffer objects resident, which gives them a guaranteed GPU address.
This is an interesting idea, but it relies on a lot of manual management. You have to make specific buffer objects resident, and you have to track for yourself why each one is resident. It also exposes low-level concepts like raw GPU addresses to the application.
This proposal is much more OpenGL-like. It keeps the low-level details hidden while allowing the implementation to make optimizations where appropriate. It is also much safer; there are a number of pitfalls with GL_NV_vertex_buffer_unified_memory (such as rendering after making a buffer non-resident) that this API can easily catch.
It is a targeted solution to a specific problem.