Lock API

I went over some of this suggestion in another thread, but I wanted to clarify some of the points and create a new discussion rather than derail that thread.

Current Performance Problems in OpenGL

It is inferred from NVIDIA’s work on the bindless graphics API that OpenGL has a number of basic inefficiencies in its vertex specification pipeline that create a large number of client-side memory accesses for each draw call.

The purpose of this proposal is to solve these problems without resorting to low-level hackery as in the bindless graphics extensions.

Origin of the Problem

Not being an NVIDIA driver developer, I can only speculate as to the ultimate source of the client memory accesses. This analysis may well be wrong, and may therefore lead to a wrong conclusion.

The absolute most optimal case for any rendering command is this: add one or more command tokens to the graphics FIFO (whether the actual GPU FIFO or an internal marshalling FIFO). This is the bare minimum of work necessary to actually provoke rendering.

The first question is this: what is in these tokens?

The implementation must communicate the state information of the currently bound VAO. Which vertex attributes are enabled/disabled, what buffer objects+offsets+stride they each use, etc. Basically, the VAO state block.

However, in GPU terms, some of that state block takes a different form, specifically the parts that relate to buffer objects. All the GPU cares about is getting a pointer, whether to video memory, “AGP” memory, or whatever else it can access.

The VAO stores a buffer object name, not a GPU address. This is important for two reasons. One, buffer object storage can be moved around by the implementation at will. Two, buffer object storage can be reallocated by the application. If you have a VAO that uses buffer object name “40”, and you call “glBufferData” on it, the VAO must use the new storage from that moment onward.

The second reason is a really annoying problem. Because buffer objects can be reallocated by the user, a VAO cannot contain GPU pointers even if the implementation weren’t free to move them around.
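
To make the second reason concrete, here is a minimal sketch of the reallocation hazard (the object names 7 and 40, and the newSize/newData/vertexCount variables, are made up for illustration):


//VAO 7 captures buffer object 40 as the source for attribute 0.
glBindVertexArray(7);
glBindBuffer(GL_ARRAY_BUFFER, 40);
glEnableVertexAttribArray(0);
glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, 0, 0);

//Later, the application reallocates the storage of buffer 40.
glBindBuffer(GL_ARRAY_BUFFER, 40);
glBufferData(GL_ARRAY_BUFFER, newSize, newData, GL_STATIC_DRAW);

//The next draw through VAO 7 must source the *new* storage, so the VAO
//can only record the name "40", never a fixed GPU address.
glBindVertexArray(7);
glDrawArrays(GL_TRIANGLES, 0, vertexCount);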

This means that, in order to generate the previously-mentioned tokens, the implementation must perform the following:

1: Convert the buffer object name into a pointer to an internal object.

2: Query that object for the GPU address.

3: If there is no GPU address yet… Here be dragons!

The unknown portion of step 3 is also a big issue. Obviously implementations must deal with this eventuality, but exactly how they go about it is beyond informed speculation. Whatever the process is, one thing is certain: it will involve more client-side memory access.

Here is the thing: if an implementation could know, with absolute certainty, that the GPU addresses of all of a VAO’s buffer objects would not change, then the implementation could optimize things. The VAO’s state block could be boiled down into a small block of prebuilt tokens that would be copied directly into the FIFO. Even in this case, you still need to:

1: Convert the VAO name into a pointer (generally expected to be done when the VAO is bound).

2: Copy the FIFO data into the command stream.

The second part requires some client-memory access. But it’s the absolute bare minimum (without going to full “let the shader read from arbitrary memory” stuff).
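
In driver pseudo-code, that ideal path would collapse to something like the following (pure speculation about driver internals; all of the names are invented):


//Resolved once, when the locked VAO is bound:
LockedVAO *vao = LookupLockedVAO(lockHandle);

//Per draw call: a straight copy of prebuilt tokens, then the draw token itself.
memcpy(fifo.tail, vao->prebuiltTokens, vao->tokenSize);
fifo.tail += vao->tokenSize;
EmitDrawToken(&fifo, GL_TRIANGLES, firstVertex, vertexCount);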

How to Do This

The bottlenecks of client-side memory access have been identified. So how do we solve this?

We provide the ability to lock VAOs.

When a VAO is locked, this relieves the OpenGL implementation from certain responsibilities. First, a locked VAO is immutable; the implementation no longer has to concern itself with changing things at the user’s whim. A locked VAO that is deleted will continue to exist until it is unlocked.

Second, all buffer objects attached to that VAO at the time of locking are themselves locked. Any attempt to call glBufferData, or any other function that gives the implementation the right to change the buffer object’s storage, will fail so long as that buffer object is attached to a locked VAO. Multiple VAOs can lock the same buffer objects.

Implicitly locking buffer objects also has the effect of providing a strong hint to the implementation. Unlike the bindless graphics ability to make buffer objects resident, it does not force the implementation to fix the object in memory. But it does strongly suggest to the implementation that this buffer object will be in frequent use, and that it should take whatever measures it needs to in order to keep rendering with this data as fast as possible.

To help separate locked VAOs from unlocked ones, the locking function should return a “pointer” (64-bit integer). It is illegal to bind a locked VAO at all; instead, you must bind the pointer with a special bind call (that automatically disables the standard bind point).
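
As a sketch, the new entry points could look something like this (the names and the GLlockedobj handle type are only illustrative; the unlock call is an assumed counterpart):


typedef GLuint64 GLlockedobj;  //opaque 64-bit handle returned by the lock call

GLlockedobj glLockObject(GLenum target, GLuint object);         //locks the VAO and its attached buffers
void glBindLockedObject(GLenum target, GLlockedobj lockedObj);  //binds by handle; disables the normal bind point
void glUnlockObject(GLenum target, GLlockedobj lockedObj);      //hypothetical counterpart: releases the lock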

Comparison to Bindless

This suggestion cannot achieve 100% of the performance advantage of the full bindless API (that is, just giving vertex shaders a few pointers and having them work). However, it should be able to remove enough issues that it can achieve performance parity with GL_NV_vertex_buffer_unified_memory.

Speaking of which, GL_NV_vertex_buffer_unified_memory tackles this issue in a different way. It uses the bindless shader_load API to allow you to bind bare pointers rather than buffer objects. This in turn relies on making buffer objects resident, which gives them a guaranteed GPU address.

This is an interesting idea, but it relies on a lot of manual management. You have to make specific buffer objects resident, and you have to remember yourself why you made them resident. It also requires exposing the concept of a “GPU address” and so forth.

This proposal is much more OpenGL-like. It keeps the low-level details hidden while allowing the implementation to make optimizations where appropriate. It is much safer as well; there are a number of pitfalls with GL_NV_vertex_buffer_unified_memory (like rendering after you have made a buffer non-resident, etc.) that this API can easily catch.

It is a targeted solution to a specific problem.

Good, we are moving somewhere.
Let’s discuss what would happen if the locks are implicit.

Fictional(?) driver:
When a new VAO is created, it locks all the buffers and bakes all the information it needs, including GPU addresses, so there is no complicated verification/resolution during rendering. By locking I mean flagging the buffer as used inside a VAO and storing a back-pointer to that VAO.
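
A rough sketch of what such driver-internal bookkeeping might look like (pure speculation, not any real driver’s code; the sizes and helper names are invented):


#define MAX_VAO_USERS    16   //illustrative cap
#define TOKEN_BLOCK_SIZE 256  //illustrative size

typedef struct DriverVAO DriverVAO;

typedef struct DriverBuffer {
    GLuint64   gpuAddress;            //current address of the storage
    int        refCount;              //keeps the storage alive while VAOs reference it
    DriverVAO *users[MAX_VAO_USERS];  //back-pointers to every VAO that baked this address
    int        userCount;
} DriverBuffer;

struct DriverVAO {
    unsigned char bakedTokens[TOKEN_BLOCK_SIZE];  //prebuilt FIFO tokens, GPU addresses included
    GLsizei       tokenSize;
};

void RebakeTokens(DriverVAO *vao);  //rebuilds bakedTokens from the current buffer addresses

//If the driver ever moves a buffer, it re-bakes every VAO that uses it:
void OnBufferMoved(DriverBuffer *buf, GLuint64 newAddress)
{
    buf->gpuAddress = newAddress;
    for (int i = 0; i < buf->userCount; ++i)
        RebakeTokens(buf->users[i]);
}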

Questions:
Q: What if someone deletes buffer used in VAO?
A: No worries, the buffers are reference counted. A buffer ceases to exist when its last user is destroyed.

Q: What if someone modifies buffer used in VAO?
A1: Not allowed. Error is reported. (too strict IMHO)
A2: The driver makes a copy-on-write. The data referenced by the VAO is not changed; a new VAO must be created to use the new data. (The opposite of current OpenGL behavior; too strict IMHO.)
A3: The VAO is updated, including all the baked GPU addresses. This requires back pointers from each buffer object to all the VAOs using it.

Q: What if the buffer is not filled with data at the time the VAO is created?
A1: Error is reported
A2: The buffer is not used until data is available; see A3 above.

Q: Do we need to change OpenGL API/spec?
A: No, it is all done under the hood inside the driver.

Q: What is the benefit?
A: It allows the “create once, use many times” usage pattern.

This way the driver moves the CPU load from usage time to creation time. This would work when one buffer is bound to a limited number of VAOs; otherwise updating the buffer would be too costly.

Let’s discuss what would happen if the locks are implicit.

You can’t have implicit locks. Not without fundamentally (and non-backwards-compatibly) changing how the expected behavior works.

Remember: the driver is free to move buffer object data around as it sees fit. The driver has no way of knowing whether a particular VAO is “currently” in use or will be used in the future. The best it can do is take a few guesses based on usage patterns, but that is very complicated compared to the user providing real information.

One of the purposes of locking a VAO is to tell the driver when it is not a good idea to do that with this object. That communication is important and more direct than any usage pattern guessing.

Further, if implementations could do this implicitly, then they already would and bindless graphics wouldn’t be much of a speed increase. Clearly there are things in the OpenGL specification that make this optimization difficult if not impossible without spec changes.

Q: Do we need to change OpenGL API/spec?
A: No, it is all done under the hood inside the driver.

The spec would definitely have to change. Most of the answers to the questions you ask require spec changes.

Remember: the driver is free to move buffer object data around as it sees fit.

Yes, let it be this way. But once the buffer moves, the driver updates all the VAOs that use this buffer. (The list of back pointers helps here.)

The driver has no way of knowing whether a particular VAO is “currently” in use or will be used in the future.

C’mon. The driver knows if a VAO is in use. What can happen if it is in use? Let it behave the same as a PBO. If in use, then:
a) wait (e.g. when BufferSubData is called)
b) or make a shadow copy (copy on write)

bindless graphics wouldn’t be much of a speed increase

There cannot be a much faster path than the NVIDIA bindless API. I am not surprised it is fast. I would be interested in the speed-up compared to VAOs.
BTW, do not forget that display lists are the fastest way to render static geometry on NV hardware.

The spec would definitely have to change. Most of the answers to the questions you ask require spec changes.

Name it.

I really think there are only two ways.
a) Making very small changes to the current API and optimizing drivers. Look at NV 180.x; they optimized the drivers a lot, so it clearly shows there is still some room.
b) Making more low-level stuff, though maybe not as low level as NVIDIA presented. The biggest problem with their approach is the need to change the shaders.

To help separate locked VAOs from unlocked ones, the locking function should return a “pointer” (64-bit integer). It is illegal to bind a locked VAO at all; instead, you must bind the pointer with a special bind call (that automatically disables the standard bind point).

Then it could be called glMapMemoryGPU(…) and glUnmapMemoryGPU(…) + glBindMemoryGPU(…)

But once the buffer moves, the driver updates all the VAOs that use this buffer. (The list of back pointers helps here.)

That’s insane. Programs should be reasonably free to have tens of thousands of these. To have buffer object modification code iterate through a list of objects to update is ridiculous.

And it doesn’t actually work.

Here is the pseudo-code for what a driver currently has to do when you render with a VAO:


foreach attrib in attributeList
{
  bufferObject = GetBufferObject(attrib.bufferObjectName);
  if(!bufferObject.IsBufferInGPU())
    //HERE BE DRAGONS!
}

//Copy tokens into FIFO.

That loop is the source of the problems. Removing that loop is essential to gaining performance equivalent to bindless.

Don’t forget that the driver can remove a buffer object from video memory entirely if it needs to. At that point it doesn’t have a GPU address, so the backpointers don’t help. It still has to get the buffer object and check to see if it is uploaded. And if not, it has to upload it.

C’mon. The driver knows if a VAO is in use.

I don’t mean being rendered with; I mean something that you intend to use in this current frame.

Because if you don’t plan to use that VAO in this frame, the driver needs to be able to page out the buffer objects that the VAO uses. That gives it more freedom to move unimportant things around. Locking the VAO is a strong indication that you’re going to use it, so the API should make that information explicit.

There cannot be a much faster path than the NVIDIA bindless API. I am not surprised it is fast. I would be interested in the speed-up compared to VAOs.

My goal is to give the driver the information it needs to get the most optimal performance with a reasonable abstraction.

Name it.

A1: Not allowed. Error is reported. (too strict IMHO)
A2: The driver makes a copy-on-write. The data referenced by the VAO is not changed; a new VAO must be created to use the new data. (The opposite of current OpenGL behavior; too strict IMHO.)

Both of these are against the current behavior.

a) Making very small changes to the current API and optimizing drivers. Look at NV 180.x; they optimized the drivers a lot, so it clearly shows there is still some room.
b) Making more low-level stuff, though maybe not as low level as NVIDIA presented. The biggest problem with their approach is the need to change the shaders.

Or do what I suggested. It isn’t low-level at all. It maintains the abstraction while providing drivers the opportunity to optimize things.

Then it could be called glMapMemoryGPU(…) and glUnmapMemoryGPU(…) + glBindMemoryGPU(…)

No. “Mapping” is an operation you do to cause some GPU-local memory to become CPU-accessible. This is nothing like mapping.

step by step

That’s insane. Programs should be reasonably free to have tens of thousands of these. To have buffer object modification code iterate through a list of objects to update is ridiculous.

Yes, why not (have thousands)? It is not a free operation to modify a buffer that is inside a VAO.
How many VAOs are using one particular buffer?
How many times (per frame) do you update such a buffer?

Both of these are against the current behavior.

Yes, that’s why I added A3, which does.

No. “Mapping” is an operation you do to cause some GPU-local memory to become CPU-accessible. This is nothing like mapping.

You are mapping it into GPU space (see the GPU suffix). That “mapped” memory would not be accessible by the CPU at all.

What the NV bindless API is doing IS actually “mapping”. First they force the buffer to be in GPU memory (flag it not to move) via MakeBufferResident, and then they get the GPU address.

How many VAOs are using one particular buffer?

I imagine quite a few. It is often the case that large pieces of scenery are all stored in the same buffer object.

You are mapping it into GPU space (see the GPU suffix).

No, you’re not. The driver is allowed to fix the buffer in GPU space, but that isn’t required. All that is required is that the API prevent you, the user, from doing things that might cause the buffer to be moved (reallocating storage, etc.). This gives the driver the freedom to fix the buffer in GPU space, but that behavior is not required.

See, the crucial difference between MakeBufferResident and locking is that MakeBufferResident is something that forces particular behavior on the driver. Locking doesn’t force this behavior; it simply strongly suggests it. That’s what makes locking higher level than making a buffer “resident”.

BTW, I have found an explanation of why they did not use MapMemoryGPU:


6) What does MakeBufferResidentNV do? Why not just have a 
    MapBufferGPUNV?

    RESOLVED: Reserving virtual address space only requires knowing the 
    size of the data store, so an explicit MapBufferGPU call isn't 
    necessary. If all GPUs supported demand paging, a GPU address might
    be sufficient, but without that assumption MakeBufferResidentNV serves
    as a hint to the driver that it needs to page lock memory, download 
    the buffer contents into GPU-accessible memory, or other similar 
    preparation. MapBufferGPU would also imply that a different address
    may be returned each time it is mapped, which could be cumbersome
    for the application to handle.

So things are a bit more complicated than I thought.

Giggles. Why not just put the cards on the table and make everyone use nVidia’s bindless graphics? Just kidding. The main part that I find scary in this suggestion is that it is quite complicated to explain and to use, much more complicated than nVidia’s bindless graphics. That, and it appears that nVidia’s bindless graphics does it a little better too… it’s just too bad that it sort of makes assumptions about how the GPU does its buffer management magic… What will happen if AMD makes their own bindless graphics extension? Will we all die from extension-itis? Probably not; I found it quite easy to make a little abstraction that let me use bindless graphics if it was available, the only sticky part being that it has to track the current format of a vertex attribute, not exactly rocket science. If AMD makes their own, chances are that since nVidia already made one, AMD will make theirs easy to port to as well, which usually implies it will be easy to make an abstraction layer that maps to any of the three paths (nVidia, traditional non-bindless, and the hoped-for AMD bindless).

The main part that I find scary in this suggestion is that it is quite complicated to explain and to use, much more complicated than nVidia’s bindless graphics.

Is it?


//Create VAO with attached VBOs.

GLlockedobj pVAO = glLockObject(GL_VERTEX_ARRAY, vaoObjName);

glBindLockedObject(GL_VERTEX_ARRAY, pVAO);

//Do rendering.

Really now, was that hard?

This is why I have issue with it:

  1. The VAO interface is kind of awkward in my eyes; worse, the locking does not handle the following very common usage pattern:

Animated keyframe vertex data with non-animated texture co-ordinates.

If the texture co-ordinates were animated as well, one could use the base vertex index added in GL 3.2 (or the appropriate extension), but since that data is not, the offset into the buffer for the texture data is the same across frames while the offset for the animated data is not; thus, to use VAOs one must have a VAO for each keyframe interpolation pair that one uses.

Secondly, there is the side effect of your proposal: locking a VAO implicitly locks the associated buffer objects. This is going to cause bugs because then one needs to be absolutely 100% sure the underlying buffer objects don’t change. Worse, what about when one needs to use transform feedback too? A reasonable usage pattern for transform feedback is to do a very expensive skinning, feed the values into a buffer which in turn is fed into a simpler shader where some skinned object is drawn many times.

However, it is worth noting that a common usage pattern is that the buffer data size stays the same but the values change. Here nVidia’s bindless graphics API deals with this well: use glBufferSubData to change values (for the hardcore, multi-threaded situation, stream to a different buffer and use the copy-buffer API in EXT_direct_state_access). In fact glMapBuffer is legal, and so is modifying the values in the buffer object; the only requirement is that where the buffer object is located does not change, i.e. don’t call glBufferData.

Perhaps a better middle ground would be that when buffer objects are “locked”, it is not their actual data that is locked but rather just “where” they are, which is essentially what glMakeBufferResidentNV does; that, and informing the driver that one will be using the buffer.

Additionally, the bindless graphics API allows one to completely skip the integer-to-object conversion, which is a big deal with respect to cache misses. Your proposal of a new kind of object, GLlockedobj, which you imagine basically being some kind of pointer, handles this, but through an extra layer, whereas the bindless API does not add extra layers. With that in mind, something that is easy to understand and use would be for objects from GL not to be indexed by GLuint (which is absolutely insane in my eyes); much better to just do something like:

typedef struct
{
  void *opaque;
} GLBufferObject;

typedef struct
{
  void *opaque;
} GLTextureObject;

typedef struct
{
  void *opaque;
} GLWhateverObject;

and have the associated calls take the above rather than a GLuint. It also gives the developer a little help, in that the GL objects are now strongly typed. With the above, the cache miss mostly goes away (since the driver can put whatever it wants on the other side of the pointer).
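
For instance, the associated calls might end up looking like this (hypothetical signatures, just to illustrate the idea; nothing here is a real API):


GLBufferObject glCreateBufferObject(void);
void glBufferObjectData(GLBufferObject buffer, GLsizeiptr size,
                        const void *data, GLenum usage);
void glVertexAttribBufferPointer(GLuint index, GLint size, GLenum type,
                                 GLboolean normalized, GLsizei stride,
                                 GLBufferObject buffer, GLintptr offset);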

Animated keyframe vertex data with non-animated texture co-ordinates.

People still do vertex animation in this day and age?

thus, to use VAOs one must have a VAO for each keyframe interpolation pair that one uses.

I don’t see what the problem with that is. Is there something wrong with having lots of locked VAOs that I am not aware of? VAOs are not large objects, and they have no GPU state. There is no reason you couldn’t have hundreds of thousands of them if you needed.

This is going to cause bugs because then one needs to be absolutely 100% sure the underlying buffer objects don’t change.

I don’t understand what you mean here.

A reasonable usage pattern for transform feedback is to do a very expensive skinning, feed the values into a buffer which in turn is fed into a simpler shader where some skinned object is drawn many times.

You seem to misunderstand; maybe I didn’t explain it well enough.

This paragraph only forbids the use of functions that change the buffer object’s type or storage. Basically, glBufferData and glMapBufferRange with the invalidate flags (GL_MAP_INVALIDATE_BUFFER_BIT and friends), unless the invalidate part is simply ignored. All other functions work as advertised. So glBufferSubData, doing a glReadPixels into such a buffer, or doing transform feedback into such a buffer: all of those work just fine.

It is the inability to use glBufferData and mapping with INVALIDATE that gives implementations the freedom to make buffers “resident”.
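
To spell that out (hypothetical behavior of the proposed lock; buf, dataSize and newValues are placeholders, and the exact error is only a guess):


//Buffer "buf" is attached to a locked VAO.
glBindBuffer(GL_ARRAY_BUFFER, buf);

//Still fine: the contents change, the storage does not.
glBufferSubData(GL_ARRAY_BUFFER, 0, dataSize, newValues);

//Forbidden: this reallocates the storage, so the lock would make it fail
//(presumably with GL_INVALID_OPERATION).
glBufferData(GL_ARRAY_BUFFER, dataSize, newValues, GL_DYNAMIC_DRAW);

//Forbidden as well (or the invalidate flag is simply ignored):
void *ptr = glMapBufferRange(GL_ARRAY_BUFFER, 0, dataSize,
                             GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_BUFFER_BIT);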

Additionally, the bindless graphics API allows one to completely skip the integer-to-object conversion, which is a big deal with respect to cache misses. Your proposal of a new kind of object, GLlockedobj, which you imagine basically being some kind of pointer, handles this, but through an extra layer, whereas the bindless API does not add extra layers.

Bindless graphics and VAO usage are orthogonal. Because the bindless state is VAO state, you can capture this state in VAOs. This is the expected common usage, as it allows the driver to store optimal data.
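
For example, here is roughly how the bindless vertex state could be captured by a VAO, assuming the extension’s enables and address ranges really are part of VAO state as described (a sketch using NV_vertex_buffer_unified_memory and NV_shader_buffer_load; vao, buf, bufSize and the attribute layout are placeholders):


glBindVertexArray(vao);
glEnableClientState(GL_VERTEX_ATTRIB_ARRAY_UNIFIED_NV);

glBindBuffer(GL_ARRAY_BUFFER, buf);
glMakeBufferResidentNV(GL_ARRAY_BUFFER, GL_READ_ONLY);

GLuint64EXT addr;
glGetBufferParameterui64vNV(GL_ARRAY_BUFFER, GL_BUFFER_GPU_ADDRESS_NV, &addr);

glEnableVertexAttribArray(0);
glVertexAttribFormatNV(0, 3, GL_FLOAT, GL_FALSE, 3 * sizeof(GLfloat));
glBufferAddressRangeNV(GL_VERTEX_ATTRIB_ARRAY_ADDRESS_NV, 0, addr, bufSize);

glBindVertexArray(0);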

And what “extra layer” does the locked object have?

With that in mind, something that is easy to understand and use would be for objects from GL not to be indexed by GLuint (which is absolutely insane in my eyes); much better to just do something like:

That requires all of the work of EXT_direct_state_access plus changing a lot of functions. That’s a lot of new functions, as well as deprecating a lot of old ones.

Longs Peak tried to do this. That effort failed. The ARB is clearly not capable of making changes that are that far-reaching. The Lock-API is a good compromise between the nothing we have now and having pointer-based objects.

This paragraph only forbids the use of functions that change the buffer object’s type or storage. Basically, glBufferData and glMapBufferRange with the invalidate flags (GL_MAP_INVALIDATE_BUFFER_BIT and friends), unless the invalidate part is simply ignored. All other functions work as advertised. So glBufferSubData, doing a glReadPixels into such a buffer, or doing transform feedback into such a buffer: all of those work just fine.

My bad, I definitely did not read it carefully; I think I zeroed in on the immutable part, and worse, I read it as “data of buffer object cannot change”.

With that in mind, all my objections are pants… though I have to admit that having so many VAOs floating around just to vary which frame one is using seems odd…

For what it is worth, keyframe-animated thingies are perfectly fine for small objects that are drawn lots of times, but not so many times as instancing demands.

I think you’re making the problem out to be way worse than it really is. Making VBOs lockable, yeah, I can go for that, but it’s not really a huge problem, as I don’t allow my app to go around messing with my buffers. There is also no real performance problem, as both VAO and bindless rendering are approximately equally fast (compared to doing it manually); in fact I have had no problems whatsoever with any of my apps that use VAOs.
The only way of improving on this would be to add something like this:
glDrawVertexArray(GL_TRIANGLES, 0, 143, vao);
which would solve some issues regarding leaving VAOs open so the app can mess with them.

BTW, bindless graphics is currently an experimental extension, and I like the concept of it, but I would rather first have unified buffers than a way to micromanage them.

There is also no real performance problem, as both VAO and bindless rendering are approximately equally fast

You have actually stressful benchmarks that show that bindless graphics provides no performance benefit over VAOs? I’d love to see those metrics.

I have no experience with bindless rendering, but as far as I have read, some applications have gained significant speed-ups from it, due to fewer cache misses. Now I don’t follow the whole discussion here, but I am a bit surprised about the statement that

“VAO and bindless rendering are approximately equally fast”

I did use VAOs a few months ago and did not gain ANYTHING from them. And from some other threads, that seemed to be the general consensus. Did that change recently?

Jan.

I only remember benching VAOs on 190.57 with a GF8600GT, and VAOs were 5% slower than the plain VBO way. Granted, .57 was a hotfix for VAOs IIRC, so it’s understandable. 190.62 reintroduced a bug from previous versions (gl_ClipDistance[]), which makes me wonder which version includes the more up-to-date GL code.
Anyway, I’ll make a note to bench again these days on several different drivers and on RadeonHDs.

Nope, only anecdotal evidence in this case, plus some logic.

But what I do know is that in normal cases bindless rendering could not gain any significant speedup (as buffer switching is already pretty fast), so even if it takes half the time of a VAO it’s still not a big difference over the whole frame.
Thus it’s only in those cases where you’re doing an unusually high number of buffer switches that such a speedup will stack up to be significant.

And Jan, well, I don’t know how drivers treat VAOs today, but I can’t see why it wouldn’t be faster; in fact, logically, if it just replays commands it would be as fast as the normal path, so I don’t see why this can’t be optimized.
And if nothing else, it sure helped to clean up my rendering thread.

Edit: I ran a mini benchmark and it seems that VAO is just as fast as going without it, though I’m using 190.15, so I have to upgrade and see what happens when I add bindless into the mix.
But I think my theory still stands.

But what I do know is that in normal cases bindless rendering could not gain any significant speedup (as buffer switching is already pretty fast), so even if it takes half the time of a VAO it’s still not a big difference over the whole frame.

NVIDIA seems to think there is something to be gained out of it. And since they made the hardware, wrote the drivers, designed the extensions, and provided benchmarks of them, I feel it’s more reasonable to accept what they say about it, rather than what one might think.

It seems clear from the design of the bindless graphics API that they feel VAOs alone are not sufficient to provide these kinds of optimizations. As part of the original post, I laid out my ideas on what it is that NVIDIA was trying to avoid.