Lock API

It seems you could just add a usage hint to glBufferData, GL_IMMUTABLE (replacing GL_STATIC_DRAW and friends). When this hint is specified, the GL is not required to support calling glBufferData again, or mapping with the invalidate bits, on that buffer.
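
Roughly, something like this; GL_IMMUTABLE is of course made up here, and vbo/vertices are just whatever buffer and data you happen to have:

glBindBuffer(GL_ARRAY_BUFFER, vbo);
/* GL_IMMUTABLE is hypothetical: a promise that this storage is never respecified */
glBufferData(GL_ARRAY_BUFFER, vertexCount * sizeof(Vertex), vertices, GL_IMMUTABLE);
/* from here on, another glBufferData call or an invalidating map on this
   buffer would simply be an error */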

It’s not as flexible as your idea, but much simpler.

Come to think of it, ‘not required to’ may not be strong enough; ‘will not’ might be better.

Regards
elFarto

Nice talk, guys! I didn’t have time to read this thread before.

It follows on well from these threads:
http://www.opengl.org/discussion_boards/ubbthreads.php?ubb=showflat&Number=256729
And other older topics!

I still think that VAOs were a mistake: API sugar that, as everyone has noticed, doesn’t provide anything…

I noticed something in the thread that I would like to clarify. Drivers and GPUs change the contents of buffers and images, twiddling or even compressing the data, for a hundred good reasons that all come down to one thing: memory bandwidth is golden!

1: Any kind of VAO lock API should allow the drivers and the GPUs to use these fancy features that ‘optimize’ the data.
2: The API needs to fit at least some actual uses of programmers (unlike VAOs!!!), like your example with skinning; instancing is another one.

Ok ok my underlying idea: Standard lossless image and buffer compression formats.

It seems you could just add a usage hint to glBufferData, GL_IMMUTABLE (replacing GL_STATIC_DRAW and friends). When this hint is specified, the GL is not required to support calling glBufferData again, or mapping with the invalidate bits, on that buffer.

That’s not a good idea. For several reasons.

You want to be able to turn it on and off as you need to. If you know a particular object isn’t going to be used for a while, but you want to keep it around (rather than having to rebuild its buffers), you can unlock its VAO. This gives the GL the freedom to move the buffer object out of video memory if it needs to.

There is also another problem. The usage hints are still important when a buffer object is locked. You can still map the buffer, and you should still reasonably be able to stream data into locked buffers.
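
To sketch the usage pattern I have in mind (glLockVertexArray and glUnlockVertexArray are hypothetical names, not real entry points, and streamVBO/size/newVertexData are whatever you happen to be streaming):

glBindVertexArray(vao);
glLockVertexArray(vao);    /* hypothetical: no respecification allowed while locked */

/* mapping and streaming are still legal on a locked buffer,
   which is why the usage hint still matters */
glBindBuffer(GL_ARRAY_BUFFER, streamVBO);
void *ptr = glMapBufferRange(GL_ARRAY_BUFFER, 0, size,
                             GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_RANGE_BIT);
memcpy(ptr, newVertexData, size);
glUnmapBuffer(GL_ARRAY_BUFFER);

/* ... render with the VAO for a while ... */

glUnlockVertexArray(vao);  /* object is idle: the GL may move or evict its buffers again */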

I still think that VAOs were a mistake: API sugar that, as everyone has noticed, doesn’t provide anything…

Why? And in what way does it “not provide anything?” The only thing keeping VAOs from giving performance improvements is problems with buffer objects: that they can be created/destroyed/respecified, so when you render with them, the code must fetch the buffer object and get its GPU address. That’s not a problem of VAOs specifically; that would still happen if you were doing all the binding yourself.

I noticed something in the thread that I would like to clarify. Drivers and GPUs change the contents of buffers and images, twiddling or even compressing the data, for a hundred good reasons that all come down to one thing: memory bandwidth is golden!

Drivers are allowed to do this with images, but not buffer objects. They cannot compress buffer object attribute data in any way.

Any kind of VAO lock API should allow the drivers and the GPUs to use these fancy features that ‘optimize’ the data.

That should be left to some other API that allows the driver to accept a number of rendering commands to create a special drawable object. That API might look something like this:


glBeginRender(VAO, GL_OWN_BUFFER_OBJECTS);
  glDrawElements(*);   /* rendering commands are recorded into the object */
  glDrawElements(*);
GLobject theObj = glEndRender();

glRender(theObj);      /* replay the recorded commands later */

The GL_OWN_BUFFER_OBJECTS means that the GL driver is free to modify the buffer objects attached to the VAO. However, it also means that these buffer objects can no longer be accessed by the user; they behave as if the user deleted them. The VAO is also deleted.

There might also be a GL_LOCK_BUFFER_OBJECTS flag that forces the buffer objects to become locked while this render object exists. This would then behave much like a locked VAO, but with implicit rendering commands.

The fact that it collects rendering calls rather than just a VAO is important. That allows the driver the freedom to cull out buffer object data that happens not to be used in those rendering commands.

Really? Do you have any reference for this?
I would say that even if it’s not allowed, it’s been done before and it’s going to become more and more present in future GPUs.
A well-known and basic example was the int-to-short conversion of index buffers on nVidia chips/drivers, when it was possible. It’s been effectively demonstrated in the past!

Typically, this glMakeBufferResident function could be the place where the untwiddling / decompression is requested. It would hide some memory latency and present a buffer that can actually be read. This is where glMapBuffer is an issue: the whole buffer must be ready to use (untwiddled / uncompressed) before the function returns; glMapBufferRange is better here.
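
Something like this is what I mean (offset/length being whatever sub-range you actually need); with glMapBufferRange only the requested range has to be made readable:

/* glMapBuffer: the entire buffer must be ready (untwiddled/decompressed) before it returns */
void *whole = glMapBuffer(GL_ARRAY_BUFFER, GL_READ_ONLY);

/* glMapBufferRange: only the mapped range has to be made ready */
void *part = glMapBufferRange(GL_ARRAY_BUFFER, offset, length, GL_MAP_READ_BIT);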

Buffer compression and twiddling are great for GPUs; it’s easy to reach 50% memory bandwidth savings, so even if it’s not allowed yet by OpenGL, this API should relax that constraint when the developer asks for it … and there are many situations where this is possible! (Apply a wavelet to a mesh and display the histogram; the results are stunning, some data could be compressed down to 1 bit per memory burst, using as little as 1/64th of the bandwidth.)

nVidia announced 7x with bindless graphics … I don’t really believe it is just the result of a new API. There is something going on behind this.

I still think that VAOs were a mistake: API sugar that, as everyone has noticed, doesn’t provide anything…

Why? And in what way does it “not provide anything?” The only thing keeping VAOs from giving performance improvements is problems with buffer objects: that they can be created/destroyed/respecified, so when you render with them, the code must fetch the buffer object and get its GPU address. That’s not a problem of VAOs specifically; that would still happen if you were doing all the binding yourself.

Because most of the time, all it does for each mesh is make you write more code, for 0% efficiency gain. Sometimes (quite often, actually) it results in an explosion of object count.

It just behaves like a function-call wrapper. It doesn’t let you lock the buffer access or even the memory address, because of reallocations, or just because of the way the memory controller works to keep memory access efficient. This is just an API wrapper. Some sugar.

Second case where it should do something: the attribute descriptions (offset, type, stride, etc.). Too bad you can’t change the buffer in the VAO and assume that you are going to use the same buffer format. (Typically, if you have 10 different meshes for 10 different animated characters, each buffer may be different, and for each buffer there is a single VAO.) You can’t assume that when you change your VAO the buffer attributes are the same; you could check each attribute, but it is just easier and faster to bind everything again. That’s where the bindless graphics API is so good: you bind your format once and display your 10 characters!
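
A sketch of what I mean, using NV_vertex_buffer_unified_memory; assume the 10 characters share one layout, only attribute 0 is shown, and the characterAddr/characterSize/characterVertexCount arrays are placeholders for the per-character GPU addresses (from glGetBufferParameterui64vNV after glMakeBufferResidentNV), sizes and vertex counts:

glEnableClientState(GL_VERTEX_ATTRIB_ARRAY_UNIFIED_NV);
glEnableVertexAttribArray(0);

/* bind the vertex format once... */
glVertexAttribFormatNV(0, 3, GL_FLOAT, GL_FALSE, sizeof(Vertex));

for (int i = 0; i < 10; ++i)
{
    /* ...then only swap GPU addresses per character: no glBindBuffer,
       no glVertexAttribPointer, no format re-validation */
    glBufferAddressRangeNV(GL_VERTEX_ATTRIB_ARRAY_ADDRESS_NV, 0,
                           characterAddr[i], characterSize[i]);
    glDrawArrays(GL_TRIANGLES, 0, characterVertexCount[i]);
}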

“nVidia announced 7x with bindless graphics”

Yep, the algorithm they use is called “marketing”.

Usually it fails big time.

A well-known and basic example was the int-to-short conversion of index buffers on nVidia chips/drivers, when it was possible. It’s been effectively demonstrated in the past!

When you upload an image, the driver has the right to corrupt your data. It does not guarantee in all cases that the exact colors you specify are what you get out.

With buffer objects, they do make that guarantee. Which is why nVidia only does this conversion when it can. That is, when it will not affect the absolute value of the data.

It is considered bad form for drivers to do this. That is because the driver must do special processing on the first render with this element buffer. This helps contribute to NVIDIA’s love for “first render” hitches.

Because most of the time, all it does for each mesh is make you write more code, for 0% efficiency gain.

The purpose of the beginning section of my post was to propose an explanation for that “0% efficiency gain”. Do you have an alternative explanation? Do you have reason to believe that the lock API would not solve the problem?

Sometimes (quite often, actually) it results in an explosion of object count.

I don’t understand how object count is an issue. These are small structs; they don’t take up much room. And they’re client-side data, so it’s not like you’re using up precious GPU memory.

“nVidia announced 7x with bindless graphics”

Yep, the algorithm they use is called “marketing”.

Usually it fails big time.

I get the point you’re making. But NVIDIA would be foolish to put numbers out that are so easily proven wrong. They can be exaggerated. But they wouldn’t bother with bindless graphics unless there was some significant speedup.

A reasonable question to ask is this: does bindless graphics mean that NVIDIA will make no effort to use VAOs to improve performance?

It seems to me that this locking API is awkward and won’t yield any performance benefits. Locking can already be done in a driver automatically, assuming everything is locked by default. When a buffer is reallocated or a vertex array state is changed, the appropriate VAO can be marked “dirty” (unlocked) and can be locked again once it is used to render something. That means VAOs which aren’t changed stay locked forever. The proposed explicit locking seems to be just another hint with some sugar here and there.
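
Conceptually, that is nothing more than this kind of bookkeeping on the driver side (a made-up sketch with invented types and names, obviously not real driver code):

#define MAX_ATTRIBS 16

typedef struct DriverBuffer { unsigned long long gpu_address; } DriverBuffer;

typedef struct DriverVAO {
    DriverBuffer      *source[MAX_ATTRIBS];    /* buffers feeding each attribute */
    unsigned long long resolved[MAX_ATTRIBS];  /* prevalidated GPU addresses     */
    int                locked;                 /* resolved[] is still valid      */
} DriverVAO;

/* glBufferData / glDeleteBuffers path: mark every VAO using this buffer dirty */
void driver_buffer_respecified(DriverVAO *vaos, int vao_count, DriverBuffer *buf)
{
    for (int i = 0; i < vao_count; ++i)
        for (int a = 0; a < MAX_ATTRIBS; ++a)
            if (vaos[i].source[a] == buf)
                vaos[i].locked = 0;
}

/* draw path: re-resolve addresses only for dirty VAOs, then keep them "locked" */
void driver_draw(DriverVAO *v)
{
    if (!v->locked) {
        for (int a = 0; a < MAX_ATTRIBS; ++a)
            v->resolved[a] = v->source[a] ? v->source[a]->gpu_address : 0;
        v->locked = 1;
    }
    /* submit v->resolved[] plus the vertex format state to the hardware here */
}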

First, we need to resolve some design issues in VAOs, and that is to decouple the vertex format from the vertex data. It’s one of the things bindless graphics came up with, and it certainly had an impact on its success, among other things.

Locking can already be done in a driver automatically, assuming everything is locked by default. When a buffer is reallocated or a vertex array state is changed, the appropriate VAO can be marked “dirty” (unlocked) and can be locked again once it is used to render something.

Yes, a driver could do this. But they don’t. It takes too much effort and requires a lot of back-pointers from buffer objects to VAOs. There may even be internal reasons why it can’t be done.

The purpose of the lock API is to give the implementation the freedom it needs to do this easily, by taking freedom away from the user.

First, we need to resolve some design issues in VAOs, and that is to decouple the vertex format from the vertex data. It’s one of the things bindless graphics came up with, and it certainly had an impact on its success, among other things.

Bindless graphics doesn’t uncouple vertex formats from vertex data. All it does is allow you to use pointer values rather than buffer object names.

What?!! When I was speaking of image and buffer twiddling and compression I didn’t say anything about lossy: I said lossless! I would be really surprised if there is anything lossy done to buffers and images on GPUs these days.

Come on! Have a look at both extensions, you will see it does!

I think bindless graphics got it right on both issues I was talking about before, and I do believe in this 7x thing. Probably a very specific case with a lot of draw calls with the same format and static buffers!

A VAO for buffers and vertex format is like a texture for image and filter. In the current state of the OpenGL spec, I want this feature deprecated.

Come on! Have a look at both extensions, you will see it does!

No, it doesn’t. In the standard case, VAOs store buffer object names and an offset. In the bindless case, VAOs store buffer object addresses and an offset.

All bindless does is remove the buffer object’s name itself from the equation. Which means that, when rendering with bindless, the rendering system no longer has to test the buffer objects (to see if they exist, get GPU addresses, etc).

You still have to either store the bindings in a VAO or keep calling “glBufferAddressRangeNV” and “glVertexAttribFormatNV” to build your attribute data. This is directly equivalent to “glBindBuffer” and “glVertexAttribPointer”.
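
To be clear about the one-to-one correspondence (attribute 0, a tightly packed float vec3; vboAddr/vboSize are assumed to come from the usual glMakeBufferResidentNV / glGetBufferParameterui64vNV setup):

/* classic path: name-based */
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, 0, (void *)0);

/* bindless path: address-based, but structurally the same two calls */
glVertexAttribFormatNV(0, 3, GL_FLOAT, GL_FALSE, 0);
glBufferAddressRangeNV(GL_VERTEX_ATTRIB_ARRAY_ADDRESS_NV, 0, vboAddr, vboSize);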

You really should read the original post. I spent a great deal of time explaining where I think NVIDIA gets their speedup from with bindless, and how locking mimics this almost exactly. If you have a problem with my reasoning there, please explain what it is.

A VAO for buffers and vertex format is like a texture for image and filter.

That analogy does not work. Buffer objects already store the vertex data. So there is already separation between the raw data and how that data gets used. Because of that, you can have many, many VAOs that all use the same buffer objects. Textures can’t do that.

The part of VAOs that you seem to be having a problem with is the storage of buffer object name+offset, or in bindless, buffer object GPU address+offset. But the only “problem” this creates is making lots of VAOs. And I don’t understand why this constitutes a problem.

If they are unable to optimize it, they never will, no matter what API you throw at them. I have already given you an idea of how to make it efficient, and I disagree that some locking “hint” will make drivers faster than ever.

The reason bindless graphics is here in the given form is that NVIDIA might have admitted that even though GPU pointers may have provided some improvements, changing the vertex format is still costly, so this is the reason it’s separate from the rest. To make the best of bindless graphics, using just GPU pointers will not make your applications dance. If you had taken a look at the vertex_buffer_unified_memory spec, you would know that the only example that’s there sets the vertex format once and renders many times. EDIT: Storing the address+offset in VAOs is done in the same way regardless of availability of bindless graphics. Due to the aforementioned reasons, VAOs in conjunction with bindless graphics might actually hold you back.

Decoupling buffer bindings and vertex formats might not just improve performance the same way bindless graphics does; more importantly, it would improve usability. Notice that D3D has it too.

And about this GPU pointer stuff, I’d like to see a more direct way of handling buffers in OpenGL: explicit allocation of device memory and page-locked system memory, with memcpy between RAM and VRAM performed by the user, not the driver, which implies having GPU pointers by design. This is quite common in CUDA and would come in handy in OpenGL too. Just dreaming. :wink:

Apparently, I didn’t read the extension thoroughly enough. I was under the impression that it basically replaced the buffer object binding with a pointer.

I’m not sure presently what this would mean for a cross-platform API for improving vertex specification performance.

After thinking about it for a while, this opens up possibilities. And serious problems.

Now, everything I said in my original post may still be valid. That is, when I explained what I thought was the reasoning behind bindless graphics for rendering, that may still be correct. Indeed, I imagine it’s a significant cache issue one way or another. The lock API as it currently stands may get, say, 80% of the performance of bindless.

However, there is also this potential problem: that, for whatever reason, vertex format changes in hardware cost more performance than changing the buffers used by that rendering.

The examples in the bindless graphics spec suggest this is the case. But consider this.

The justification given for bindless graphics was a cache issue, not an issue with vertex formats being attached to the GPU addresses for them. Specifically, it was the CPU’s cache. How exactly does the vertex format affect the CPU’s cache?

It may be the case that there’s simply more data. That FIFO chunk I mentioned, if you’re using the same vertex format, would be smaller than if you changed vertex formats. Vertex format information takes up room that’s clearly larger than the GPU addresses that are the source of those attributes.

Cache lines these days are 64-bytes. That’s big enough for 16 32-bit values (the buffer addresses, if every one of the 16 attributes comes from a different buffer). So in the worst-possible case, you’re guaranteed that the format+address data will be larger than one cache line.

Really, I think the only way to know is to test it. To write an application that completely flushes the CPU’s cache. Then have it do some rendering stuff. One way with the “common” form of bindless (one vertex format, lots of pointer changes). Then with constant format changes, once per render operation. And see what is fastest. The mesh data itself isn’t at issue; indeed, it’s better to just render a single triangle from 200,000 buffer objects. And of course, cull all fragments.
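
A rough sketch of what the timed portion of such a test might look like; the cache “flush” here is only approximate, bufAddr[]/bufSize[] are assumed to come from the usual bindless residency setup, and you would wrap a timer around each loop:

#define NUM_BUFFERS 200000
static char junk[32 * 1024 * 1024];

static void trash_cpu_cache(void)
{
    /* touch one byte per 64-byte cache line of a large array to evict most of the cache */
    for (unsigned i = 0; i < sizeof(junk); i += 64)
        junk[i]++;
}

/* case A: one vertex format, only the GPU address changes per draw */
glVertexAttribFormatNV(0, 3, GL_FLOAT, GL_FALSE, 0);
trash_cpu_cache();
for (int i = 0; i < NUM_BUFFERS; ++i) {
    glBufferAddressRangeNV(GL_VERTEX_ATTRIB_ARRAY_ADDRESS_NV, 0, bufAddr[i], bufSize[i]);
    glDrawArrays(GL_TRIANGLES, 0, 3);   /* one triangle per buffer object */
}

/* case B: the format is respecified before every draw as well */
trash_cpu_cache();
for (int i = 0; i < NUM_BUFFERS; ++i) {
    glVertexAttribFormatNV(0, 3, (i & 1) ? GL_FLOAT : GL_SHORT, GL_TRUE, 0);
    glBufferAddressRangeNV(GL_VERTEX_ATTRIB_ARRAY_ADDRESS_NV, 0, bufAddr[i], bufSize[i]);
    glDrawArrays(GL_TRIANGLES, 0, 3);
}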

Unfortunately, my knowledge of cache architecture on x86 chips is insufficient to do something that actually flushes the cache fully.

Also, this won’t answer the other important question: is this an NVIDIA-only issue, or is this something that ATI implementations could use some help on too?

If vertex specification is a performance problem, couldn’t you fix it by moving the specification into the vertex shader? I.e. have a shader like:

in struct {
   vec3 pos;
   vec3 normal;
   vec2 tex;
} vertex;

Then in the draw call you only have to pass in a buffer/pointer. In fact the current bindless extensions would allow you to do this, by making vertex a pointer, binding a buffer to it and using gl_VertexID to do the lookup.

Regards
elFarto

If vertex specification is a performance problem, couldn’t you fix it by moving the specification into the vertex shader?

Well, that already exists. The vertex shader inputs.

Vertex specification is about providing appropriate buffer objects and interpretations of those buffer objects (ie: the format), so that the pre-vertex shader code can know where “vec3 pos” actually comes from.

What do you mean by ‘pre-vertex shader code’?

Regards
elFarto

What do you mean by ‘pre-vertex shader code’?

There is hardware in GPUs that fetches attributes from memory and converts the particular format (normalized byte, unnormalized signed short, float, etc) into the expected attribute data for the vertex shader. This hardware understands the format of each attribute and knows which GLSL variable to store it in.

It isn’t necessarily programmable, so “code” was the wrong word.

Ah, Ok, I understand now. Then I suggest we steal DirectX 10’s CreateInputLayout, IASetInputLayout and IASetVertexBuffers API. This might actually help the driver, since calling SetVertexBuffers hints to the driver you’re about to use them (why else would you be calling it?).
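
Something along these lines, perhaps. Every name below is invented (including the GLvertexElement type); it is just a straight translation of the D3D10 calls into a GL style:

/* hypothetical: describe the vertex layout once, as its own object */
GLvertexElement elements[] = {
    /* attrib, size, type,     normalized, relative offset */
    {  0,      3,    GL_FLOAT, GL_FALSE,    0 },   /* pos    */
    {  1,      3,    GL_FLOAT, GL_FALSE,   12 },   /* normal */
    {  2,      2,    GL_FLOAT, GL_FALSE,   24 },   /* tex    */
};
GLuint layout = glCreateVertexInputLayout(3, elements);

/* hypothetical: bind the format and the buffers separately,
   like IASetInputLayout / IASetVertexBuffers */
glSetVertexInputLayout(layout);
glSetVertexBufferSlot(0, meshVBO, 32 /* stride */, 0 /* offset */);
glDrawElements(GL_TRIANGLES, indexCount, GL_UNSIGNED_SHORT, NULL);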

Thinking about it a bit more, we’re actually working around a design problem with buffer objects. If the problem is that we can supply completely new data for a buffer, effectively changing its address, then we should stop that from happening.
Modify the API so that you can only size the buffer once (essentially giving you a malloc/free API).

This removes the need to lock the buffer (you can’t change it, therefore you don’t need to lock it).
If you make the new API return an int64, it can directly return the GPU address, giving you all the benefits of the bindless extensions.
You’ll still need MakeBuffer[Non]Resident. The driver can’t guess which buffers you’ll need and when; you need to tell it.
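
For the sake of argument, the kind of API I’m picturing; every function here is invented:

/* invented: allocate once, the size is fixed for the lifetime of the allocation */
GLuint64 addr = glAllocBufferGPU(vertexDataSize);       /* directly returns the GPU address */
glBufferSubDataGPU(addr, 0, vertexDataSize, vertices);  /* fill it */

glMakeAddressResidentGPU(addr);     /* tell the driver you're about to use it */
/* ... draw using 'addr' with the bindless-style attribute setup ... */
glMakeAddressNonResidentGPU(addr);

glFreeBufferGPU(addr);              /* "resizing" means free it and allocate a new one */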

The only issue I can’t figure out is how to make it easy/fast for the driver to swap buffers in/out of GPU memory by just using the GPU address. Any ideas?

Regards
elFarto