Toward Version 4.3

Damn… :wink:

NVIDIA’s bindless APIs seem to be a clear proof that OpenGL is flawed and cannot perform well as far as the overall CPU overhead is concerned. See:

Such hacks would not be needed if the design of OpenGL allowed drivers to have low enough CPU overhead in the first place.

OpenGL is flawed. That is true. But you cannot generally state that OpenGL implementations cannot perform well in regard to CPU overhead, because the CPU overhead is highly dependent on the use case. For instance, VAOs are a great way to reduce CPU overhead, and with a clever buffer manager, instancing, and multi-draws, all of which are of course backed by the current API, you can already reduce bindings and draw calls to a minimum. It’s the same for textures. If you don’t use textures at all, i.e. you render everything procedurally, you don’t have any CPU overhead in that regard.
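As a rough illustration of that point, here is a minimal sketch (assuming a GL 3.x+ context, a function loader such as GLEW, and a VAO whose vertex and index buffers are already set up; the counts and offsets are made up): one bind and a call or two replace a long series of individual binds and draws.

#include <GL/glew.h>   /* or any other loader that exposes GL 3.x entry points */

/* Hypothetical helper: issue several draws from one index buffer with a single
   call, then draw 1000 instances of one mesh with another single call. */
static void draw_batched(GLuint vao)
{
    const GLsizei counts[3] = { 36, 36, 36 };
    const GLvoid *offsets[3] = {
        (const GLvoid *)(0  * sizeof(GLuint)),
        (const GLvoid *)(36 * sizeof(GLuint)),
        (const GLvoid *)(72 * sizeof(GLuint))
    };

    glBindVertexArray(vao);   /* one bind instead of one per mesh */
    glMultiDrawElements(GL_TRIANGLES, counts, GL_UNSIGNED_INT, offsets, 3);

    /* One call, 1000 meshes; per-instance variation comes from gl_InstanceID
       or instanced attributes in the shader. */
    glDrawElementsInstanced(GL_TRIANGLES, 36, GL_UNSIGNED_INT, (const GLvoid *)0, 1000);
}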

It’s not the GL that allows drivers to achieve low overhead. It’s the hardware. OpenGL and the drivers that implement the spec are merely exposing hardware features at a higher level. The extension is nothing more than a mapping of current hardware features to the GL. That’s not a hack; it has been at the center of the extension mechanism for all eternity. Of course, this ‘hack’ could and should be core, but as this thread shows, very few of us believe it’ll happen anytime soon.

NVIDIA’s bindless APIs seem to be a clear proof that OpenGL is flawed and cannot perform well as far as the overall CPU overhead is concerned.

No, NVIDIA’s bindless APIs are clear proof that OpenGL can be improved with regard to CPU caching performance. “cannot perform well as far as the overall CPU overhead is concerned” is errant nonsense; OpenGL is significantly better than D3D in this regard, as it allows the driver to do draw call marshalling.

Just check out NVIDIA’s 10,000 draw call PDF. Page 14 clearly shows that NVIDIA’s OpenGL implementation has significantly better CPU batch behavior, to the point where batch size is clearly not the dominating factor in performance.

Bindless is, first and foremost, about exploiting NVIDIA hardware, providing access to lower-level aspects of what their hardware can do. Obviously lower-level APIs will be faster than higher-level ones.

Of course, this ‘hack’ could and should be core

No it shouldn’t.

Bindless as far as uniforms are concerned is very hardware-specific. You’re not going to be able to generalize that much beyond NVIDIA’s hardware. Bindless vertex rendering might be possible, but even then, it’s iffy. It’s mostly about making buffers resident, and what restrictions you place on that. In NVIDIA-land, a resident buffer can still be mapped and modified; do you really want to force everyone to be able to do that?

In either case, we should not be giving people integer “GPU addresses” that they offset themselves. That’s way too low-level for an abstraction. It should simply be a name for a buffer that has been made resident.
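For readers who haven’t seen it, here is roughly what the current NV approach looks like for vertex pulling, based on my reading of the NV_shader_buffer_load and NV_vertex_buffer_unified_memory specs (treat the exact calls as approximate and check the specs before relying on them); it is exactly the raw-GPU-address style being argued against above. `buf`, `buf_size` and `vertex_count` are assumed to exist already.

/* Sketch: make a vertex buffer resident, fetch its raw GPU address, and source
   attribute 0 straight from that address instead of from a bound buffer name. */
GLuint64EXT addr;

glBindBuffer(GL_ARRAY_BUFFER, buf);                        /* set-up time only */
glMakeBufferResidentNV(GL_ARRAY_BUFFER, GL_READ_ONLY);     /* pin it in GPU-addressable memory */
glGetBufferParameterui64vNV(GL_ARRAY_BUFFER, GL_BUFFER_GPU_ADDRESS_NV, &addr);

glEnableClientState(GL_VERTEX_ATTRIB_ARRAY_UNIFIED_NV);    /* attribs now come from raw addresses */
glVertexAttribFormatNV(0, 3, GL_FLOAT, GL_FALSE, 3 * sizeof(GLfloat));
glBufferAddressRangeNV(GL_VERTEX_ATTRIB_ARRAY_ADDRESS_NV, 0, addr, buf_size);

glDrawArrays(GL_TRIANGLES, 0, vertex_count);               /* no buffer name bound at draw time */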

[QUOTE=Alfonse Reinheart;1237469]No, NVIDIA’s bindless APIs are clear proof that OpenGL can be improved with regard to CPU caching performance. “cannot perform well as far as the overall CPU overhead is concerned” is errant nonsense; OpenGL is significantly better than D3D in this regard, as it allows the driver to do draw call marshalling.

Just check out NVIDIA’s 10,000 draw call PDF. Page 14 clearly shows that NVIDIA’s OpenGL implementation has significantly better CPU batch behavior, to the point where batch size is clearly not the dominating factor in performance.[/QUOTE]

Oh please, are you kidding me? It’s a PDF comparing GeForce 2/3/4/FX performance on some ancient OpenGL implementation against DirectX 8 (I guess?) on some ancient version of Windows. You would be very surprised how much APIs, OS driver interfaces, drivers, and hardware have evolved since then. It’s a whole new world today. :wink:

Anyway, the point is the existence of the “bindless” extensions shows how desperate the driver developers are. They are obviously very aware that the whole ARB won’t unanimously agree on a complete OpenGL rewrite, so they had to find another way. I don’t blame them; it’s logical. However, if the OpenGL API could be reworked such that it achieves at least 80% of the performance increase that the bindless APIs advertise, I’d call it a huge win.

You would be very surprised how much APIs, OS driver interfaces, drivers, and hardware have evolved since then. It’s a whole new world today.

And which of these changes would suddenly cause cache miss rates to increase? Obviously, if something has changed that would cause performance to degrade, you should be able to point to exactly what it was.

Obviously, modern CPUs outstrip the speed of memory to feed them by significantly more now than they did then. Thus cache misses hurt proportionately more nowadays. But that doesn’t invalidate the previous data. It simply means that there are additional concerns besides batch size.

Batches still count, especially in D3D land.

The fact that bindless provides a remarkable speedup alone is not evidence that OpenGL performance has degraded. After all, for all you know, that level of performance speedup could have been possible back then if bindless vertex rendering had been implemented.

Anyway, the point is the existence of the “bindless” extensions shows how desperate the driver developers are.

By your logic, the existence of NV_path_rendering would mean that driver developers are “desperate” to get 2D rendering via OpenGL.

The problem with your claim is the incorrect assumption that NVIDIA == driver developers. If AMD and/or Intel had their own competing “bindless” specs out there, your claim might hold some weight. But as it stands, no; the absolute best you can conclude is that NVIDIA is “desperate”.

Another point against this is that NVIDIA has shown a clear willingness to work with others on EXT extensions to expose shared hardware features, like EXT_shader_image_load_store. Indeed, EXT_separate_shader_objects was basically all NVIDIA’s spec, with a bit of consulting with the outside world (the ARB version is what happens when a committee comes along and ruins something that wasn’t terribly great to begin with). And yet, both of those are EXT extensions, not NVs.

Coupled with the possible patent on bindless textures, it’s much more likely that NVIDIA is simply doing what NVIDIA does: expose hardware-specific features via extensions. That’s what they’ve always done, and there’s little likelihood that they’re going to stop anytime soon. Bindless isn’t some “desperate” act to kick the ARB in the rear or circumvent it. It’s just NVIDIA saying “we’ve got cool proprietary stuff.” Like they always do.

D3D10 or 11 no longer have the old draw call overhead; that’s been dead for positively ages. D3D9 on Vista or 7 also shares this performance characteristic.

Every hardware vendor’s marketing department would prefer you to be using their proprietary stuff. That was the entire motive behind AMD’s fit of “make the API go away” a while back. It’s a trifle disingenuous to make AMD look blameless in this.

The key important thing here is not NV bindless vs some other hypothetical implementation of same - the key important thing is addressing the bind-to-modify madness that has afflicted OpenGL since day one. There’s absolutely nothing proprietary or even hardware-dependent about that; D3D doesn’t have bind-to-modify, AMD hardware can do it, Intel hardware can do it.

It’s not good enough to release a flawed, wonky or flat-out-insane first revision of a spec (buffer objects, GL2) and hope to patch it with extensions later on. Even in the absence of significant new hardware features (and it’s by no means safe to predict that there won’t be any), future versions of GL must focus on removing barriers to productivity.

API design in general should be an abstraction driven by usability, rather than by which features the hardware can support or by what happens to be optimal for the hardware. Hardware changes, and it’s unpredictable how it will change. API designers should focus on how the API will be used rather than on how it’s implemented. Usability and elegance should take first priority here.
Anyway, we can wait until September this year and see. There will be some big changes :wink:

Every hardware vendor’s marketing department would prefer you to be using their proprietary stuff. That was the entire motive behind AMD’s fit of “make the API go away” a while back. It’s a trifle disingenuous to make AMD look blameless in this.

It wasn’t a “fit”; it was a comment. A “fit” would have been many comments over a long period of time. And considering the fact that they didn’t do anything about it, it obviously wasn’t a grave concern for them. Simply stating a fact.

Actions speak louder than words. And AMD’s actions aren’t saying much.

The key important thing here is not NV bindless vs some other hypothetical implementation of same - the key important thing is addressing the bind-to-modify madness that has afflicted OpenGL since day one.

But… that has nothing to do with the problem that NVIDIA’s bindless solves.

Bindless doesn’t get its performance from removing bind-to-modify. Indeed, it doesn’t remove this at all. The problem isn’t bind-to-modify. The problem is that binding an object for rendering requires a lot of overhead due to the nature of the objects themselves. Objects in OpenGL are not pointers to driver-created objects. They’re references to pointers to driver-created objects. That’s two levels of indirection and thus more opportunities for cache misses.
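A toy sketch of that double indirection (purely hypothetical structures, not any vendor’s actual driver code): the app-visible name is only a key into a lookup table, and only the record it points at holds the GPU address.

/* Hypothetical driver-side record for a buffer object. */
typedef struct DriverBuffer {
    GLuint64   gpu_address;   /* where the storage actually lives */
    GLsizeiptr size;
    int        valid;
    /* ... usage hints, fences, validation state ... */
} DriverBuffer;

/* Indirection #1: the GLuint name indexes a table (a real driver would use a
   hash map; a plain array keyed by name is enough for the sketch). */
static DriverBuffer *name_table[65536];

static GLuint64 resolve_gpu_address(GLuint name)
{
    if (name >= 65536)
        return 0;                           /* not a valid name in this toy table */
    DriverBuffer *obj = name_table[name];   /* possible cache miss on the table entry */
    if (obj == NULL || !obj->valid)
        return 0;                           /* error path: name doesn't refer to a buffer */
    return obj->gpu_address;                /* indirection #2: possible miss on the object itself */
}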

Bindless goes from 2 indirections to zero by directly using a GPU address. D3D has (possibly) one fewer indirection. Removing bind-to-modify would go from 2 indirections to… 2. Because it doesn’t address the number of indirections.

To reduce the number of indirections, you have to deal with the objects themselves, not how you modify them. OpenGL objects would have to stop being numbers and start being actual pointers. You would have to be forbidden to do things like reallocate texture storage (which we already almost completely have with ARB_texture_storage), reallocate buffer object storage (ie: being able to call glBufferData more than once with a different size and usage hint) and so forth.
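For reference, the immutable-storage model already looks like this with ARB_texture_storage (a sketch with made-up sizes, assuming `pixels` points at 256×256 RGBA data): the allocation can never be redefined for this texture name, only its contents.

GLuint tex;
glGenTextures(1, &tex);
glBindTexture(GL_TEXTURE_2D, tex);

/* Allocate every mip level once; size and format are now fixed for the object's lifetime. */
glTexStorage2D(GL_TEXTURE_2D, 8, GL_RGBA8, 256, 256);

/* The contents may still be updated freely. */
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, 256, 256, GL_RGBA, GL_UNSIGNED_BYTE, pixels);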

The fact that you’d be using glNamedBufferData instead of glBufferData to reallocate buffer object storage does nothing to resolve the underlying problem. The driver has to check to see if the buffer object has changed since the last time you talked about it. It has to resolve the buffer object into a GPU pointer, which also may mean DMA-ing it to video memory. And so forth.

These checks have nothing to do with bind-to-modify. Getting rid of bind-to-modify will not make OpenGL implementations faster.

Bind-to-modify pollutes GPU and driver caches, it pollutes state-change filtering, it affects design decisions in your program, it pollutes VAOs. This isn’t some theoretical “more opportunities for cache misses” thing, this is “cache misses do happen, as well as all of this other unnecessary junk”. An extra layer of indirection is down in the noise of any performance graph compared to this.

We’re not talking about drawing teapots on planar surfaces here. We’re not talking about loading everything once at startup and using static data for the entire program here. Real-world programs are very dynamic. Objects move into and out of the scene, transient effects come and go, on-screen GUI elements are updated and change, and this happens in every single frame.

Getting rid of bind-to-modify will make GL programs go faster.

Bind-to-modify pollutes GPU and driver caches

I cannot imagine a sane implementation of OpenGL that actually causes GPU changes from the act of binding an object. Unless it’s a sampler object, and even then it’s kinda iffy. Driver caches being polluted is far more about the extra indirection.

If I do this (with a previously created buf):


glBindBuffer(GL_UNIFORM_BUFFER, buf);
glBufferSubData(GL_UNIFORM_BUFFER, ...);

The first line will set the currently bound buffer object to refer to buf. That involves changing an effectively global value. There may be some memory reads to check for errors, and the actual object behind buf may need to be tracked down and allocated if it doesn’t exist. So, that’s a hash-table access to get the actual object behind buf, followed by some reads of the buffer’s data (to see if it exists), followed by a memory write to the global value.

That’s 3 memory accesses. And let’s assume they’re all uncached. So 3 cache misses.

The second line will do some uploading to the buffer. So we fetch from the global value the buffer object. We’ll assume that this implementation was not written by idiots, so the global value being written was not buf, but a pointer to the actual internal buffer object. So we access the global pointer value, get the buffer’s GPU and/or CPU address, and do whatever logic is needed to schedule an upload.

That’s 2 memory accesses (outside of the scheduling logic). However, fetching the global pointer value is guaranteed to be a cached access, since we just wrote that value in the glBindBuffer call. Also, we may have brought the buffer object’s data into the cache when we did those bind-time validation checks. So worst-case, this is only 1 cache miss. Best case, 0 misses, but let’s say 1.

Total cache misses: 4. Total number of different cache lines accessed: 4.

Now, let’s look at this:


glNamedBufferSubDataEXT(buf, ...);

So, first we need to resolve buf into a buffer object. That requires accessing our hash table to turn it into a pointer to our internal buffer object data. Following this, we must check to see if this is a valid object. After that, we have to get the buffer’s GPU and/or CPU address, and do whatever logic is needed to schedule an upload. That’s 3 memory accesses.

Total cache misses: 3. Total number of different cache lines accessed: 3.

So, the difference is 4:3 in both number of cache misses and how many separate lines are accessed. Fair enough.

Now, let’s look at what happens when the hash table is removed and we deal directly with opaque pointers.

The first one goes from 4 cache misses down to 3. The second goes from 3 down to 2. So, that “extra indirection” seems to be a pretty significant thing, as removing it reduced our number of cache misses by 25% in the bind-to-modify case and by 33% in the DSA case.

DSA alone only reduces cache misses by 25%.

But wait; since we’re dealing with “GL programs”, we need to consider how often these cache misses will happen. How often will a bind point actually not be in the cache?

Obviously the first time you use a bind point in each rendering frame, it will not be in the cache. But after that? Since so many functions use these bind points, the number of cache misses is probably not going to be huge.

What about the cache hit/miss rate for the indirection, the hash table itself? That is in fact rather worse. Every time you use a new object (by “new”, I mean unused this frame), that’s a likely cache miss on the hash table. That’s going to be pretty frequent.

As you say, “Real-world programs are very dynamic.” You’re going to have thousands of textures used in a frame. You may have a few dozen buffer objects (depending on how you use them). You may have hundreds of programs.

So which miss rate is going to be higher: new object within this frame? Or bind points?

My money is on new objects. So getting rid of that indirection ultimately gets you more. So this:

An extra layer of indirection is down in the noise of any performance graph compared to this.

doesn’t seem to hold up under scrutiny.

Real-world programs are very dynamic. Objects move into and out of the scene, transient effects come and go, on-screen GUI elements are updated and change, and this happens in every single frame.

This seems like a non sequitur. Objects moving out of a scene is simply a matter of which programs and buffers you bind to render with. It has nothing to do with what object state gets modified (outside of uniforms set on those programs, but I’ll get to that in a bit).

GUI elements and particle systems are a matter of buffer streaming. The predominant performance bottleneck associated with that (when you manage to find the fast path) is in the generation of the new vertex data and in the uploading of it to the GPU. The extra time for binding that buffer in order to invalidate and map it is irrelevant next to issuing a DMA. So those don’t count.
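A sketch of the streaming path being described (hypothetical names: a pre-created streaming VBO, this frame’s generated vertex data, and its size in bytes): the invalidate-and-map bind is a rounding error next to building the data and the resulting transfer.

#include <string.h>   /* memcpy */

static void stream_vertices(GLuint stream_vbo, const void *vertices, GLsizeiptr bytes)
{
    glBindBuffer(GL_ARRAY_BUFFER, stream_vbo);

    /* Invalidating ("orphaning") the old contents lets the driver hand back fresh
       memory instead of stalling while the GPU still reads last frame's data. */
    void *dst = glMapBufferRange(GL_ARRAY_BUFFER, 0, bytes,
                                 GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_BUFFER_BIT);
    if (dst != NULL) {
        memcpy(dst, vertices, (size_t)bytes);   /* generating/copying the data is the real cost */
        glUnmapBuffer(GL_ARRAY_BUFFER);
    }
}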

As for uniforms within programs, these are set all the time. It’s not that you necessarily reset all the uniforms every time you render a new object of course. The fact is that uniform setting is a constant operation, something you do all the time when rendering.

Indeed, uniforms are probably the object state most likely to be modified when rendering. And that last part is the key: when you change uniform state, it is only because you want to render with that program.

It’s not bind-to-modify with programs; it’s bind-to-modify-and-render. Modifying uniforms without binding the program doesn’t change the fact that when you change uniforms, you’re also almost certainly going to immediately render with it. Which means you need to bind the program. So you’ve lost nothing with bind-to-modify for programs; you were going to bind it anyway.
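The argument above, in code (hypothetical `prog`, `mvp_loc`, `mvp` and `index_count`; the second variant uses the GL 4.1 / ARB_separate_shader_objects glProgramUniform* entry points):

/* Classic path: you bind the program anyway, because you are about to draw with it. */
glUseProgram(prog);
glUniformMatrix4fv(mvp_loc, 1, GL_FALSE, mvp);
glDrawElements(GL_TRIANGLES, index_count, GL_UNSIGNED_INT, (const GLvoid *)0);

/* DSA-style path: the uniform can be set without binding... */
glProgramUniformMatrix4fv(prog, mvp_loc, 1, GL_FALSE, mvp);
/* ...but you still bind the program in order to render with it. */
glUseProgram(prog);
glDrawElements(GL_TRIANGLES, index_count, GL_UNSIGNED_INT, (const GLvoid *)0);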

it pollutes state-change filtering

In what way? No OpenGL implementation would actually trigger any GPU activity based solely on binding. Except for sampler objects, and even then, probably not.

The only time it would affect filtering is if you are in the middle of the render loop, and you do something stupid. Like bind an object, modify it, and then bind something else over top of it.

And I can’t think of a sensible reason to do this in a real program.

Getting rid of bind-to-modify will make GL programs go faster.

There is strong evidence that bindless has a significant effect on GL programs. Where is your evidence that bind-to-modify has a similar effect?

Yes, it will probably have an effect. But I seriously doubt it’s going to be anywhere near what you get with bindless.

I really don’t know whether binding or bindless should be faster, or whether it makes no difference. But one thing I’m sure about is that NVIDIA’s drivers run on another thread, and whatever call I make is routed to the driver’s thread(s) to be served. Getting the number of calls down should count.

Besides everything else, in my opinion one of the best outcomes of binding GL objects is that the driver ensures any previous commands regarding the object are complete.

And to be honest, DSA is only really good for wrapping OpenGL objects with language constructs. Let’s say you want to make a Texture class; there is almost no way to do it cleanly without DSA. Take a Texture.SetImage(…) method, for example: for each call of such a method you need to make sure your texture is bound, and then you need to restore the previous texture binding, etc. That is kind of painful without DSA (using a Context object, which ends up being spaghetti :)). Do you need such language constructs? Well, it is easier for migrating/mirroring DX code, and it is easier to reason about the code, especially if you are educated like me…
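To make that concrete, here is roughly what such a wrapper has to do without DSA versus with EXT_direct_state_access (hypothetical helper functions; GL_RGBA8 chosen arbitrarily):

/* Without DSA: save whatever is bound, bind just to modify, then restore. */
static void texture_set_image(GLuint tex, GLsizei w, GLsizei h, const void *pixels)
{
    GLint prev;
    glGetIntegerv(GL_TEXTURE_BINDING_2D, &prev);
    glBindTexture(GL_TEXTURE_2D, tex);
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, w, h, 0, GL_RGBA, GL_UNSIGNED_BYTE, pixels);
    glBindTexture(GL_TEXTURE_2D, (GLuint)prev);   /* don't break the caller's state */
}

/* With EXT_direct_state_access: no binding, no save/restore. */
static void texture_set_image_dsa(GLuint tex, GLsizei w, GLsizei h, const void *pixels)
{
    glTextureImage2DEXT(tex, GL_TEXTURE_2D, 0, GL_RGBA8, w, h, 0,
                        GL_RGBA, GL_UNSIGNED_BYTE, pixels);
}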

From a software engineering perspective, DSA is just ugly. It might be faster or easier to implement, but it has nothing to do with how the hardware works. Whether it works faster is something implementation dependent, and is determined by how much the implementation does not suck :smiley: Remember the situation with VBO? Theoretically it should outperform the old drawing approaches, but in reality it’s completely different. Being direct does not guarantee a practical performance gain.

And to be honest, DSA is only really good for wrapping OpenGL objects with language constructs.

And not polluting the global bind points just because you want to change something. And for good programming practice (ie: not using global variables to pass invisible parameters around, which causes all kinds of errors).

From a software engineering perspective, DSA is just ugly. It might be faster or easier to implement, but it has nothing to do with how the hardware works. Whether it works faster is something implementation dependent, and is determined by how much the implementation does not suck

Nonsense. Implementations have to not suck to be fast, but that doesn’t mean that the API has zero effect on performance. Bad APIs can very easily induce performance difficulties.

DSA does match how the hardware works, because objects are objects in the hardware. Textures are pointers to various GPU memory blocks plus some state. These are actual objects with actual GPU state. Modifying these objects has nothing to do with the context or context state. The OpenGL bind-to-modify abstraction is 100% wrong compared to the actual underlying implementation.

Remember the situation with VBO? Theoretically it should outperform the old drawing approaches, but in reality it’s completely different.

Huh? The only “old drawing approach” that outperforms buffer objects (on static data) is display lists, and even then only on certain hardware. As for dynamic data, the main problem here isn’t the object itself, but the difficulty of finding the fast path for streaming buffer objects.

Which, incidentally, is a very good example of how a poor API can ruin performance. A good API would prevent you from taking the slow path altogether.

Nonsense. Implementations have to not suck to be fast, but that doesn’t mean that the API has zero effect on performance. Bad APIs can very easily induce performance difficulties.

Do you even have an idea how drivers work?

The OpenGL bind-to-modify abstraction is 100% wrong compared to the actual underlying implementation

Nonsense. You’re clearly underestimating the whole idea behind binding. It’s a bigger problem than you think.

Which, incidentally, is a very good example of how a poor API can ruin performance.

I agree. DirectX is a very good example.

And do you have any idea how drivers work? Actually, what VBOs abstract is the only thing that the hardware can actually do. That means all the rest, including immediate mode, client-side vertex arrays, and display lists, is emulated through VBOs, which means an additional copy is involved. If VBOs are ever slower than the other alternatives, it is because you, as the developer, chose a worse data layout than the driver internally chooses when creating the “VBO” that emulates the old mechanisms.

No, actually Alfonse got it pretty well. Binding usually does not do anything in practice, except setting some global state. This is because at this point a driver might not know whether this bind happened because of an upcoming state change or because the application actually wants to bind the resource for use by the GPU. Of course, this is not the only approach and drivers might handle it differently for different object types and also various implementations might use different policies. However, a bind is not required by the hardware to modify object state. This is a 100% wrong abstraction of the underlying hardware in all cases, as Alfonse said.

DirectX? Which D3D are you talking about? D3D9, which is deprecated, or modern D3D, i.e. D3D10+? D3D10 and D3D11 reflect better how the hardware works (even though they’re not perfect), and that’s more or less what OpenGL 3+ core plus DSA does. Obviously, D3D9 suffers from many of the same issues as legacy OpenGL, but why bring up the example of a deprecated API?

You mean that DirectX should have bind to modify like OpenGL has? Are you nuts?

Direct3D is about as straight-to-the-metal as you can get. D3D8 blew away GL. D3D9 was even better, although its shaders made it difficult for the IHVs to optimize, but GL’s asm shaders were in the same situation.
GLSL was nice since it was a high-level language and provided more optimization opportunities for the IHVs (eventually, D3D got its HLSL), but GLSL had a lot of suckiness, which GL 3.0 and above eventually addressed.

Although a lot more can be said, I’m going to stop here.

If VBOs are ever slower than the other alternatives, it is because you, as the developer, chose a worse data layout than the driver internally chooses when creating the “VBO” that emulates the old mechanisms.

Then please, as a driver developer, provide us, the developers, with the layout your driver likes. Though I still don’t get your point here. How on earth do you expect millions of developers starving for performance to choose the data layout your driver expects? Are you telling me that driver developers restrict the driver to work best with only one data layout?
These are not professional coders.

No, actually Alfonse got it pretty well. Binding usually does not do anything in practice, except setting some global state. This is because at this point a driver might not know whether this bind happened because of an upcoming state change or because the application actually wants to bind the resource for use by the GPU. Of course, this is not the only approach and drivers might handle it differently for different object types and also various implementations might use different policies. However, a bind is not required by the hardware to modify object state. This is a 100% wrong abstraction of the underlying hardware in all cases, as Alfonse said.

Your prophet got it right; we cannot argue with that anymore. Could you please stop being 100% sure all the time? If it’s a 100% wrong abstraction, then why has OpenGL maintained this approach for so many years?

D3D10 and D3D11 reflect better how the hardware works…

Then it’s a bad API :slight_smile: You just said it. This simply means a poor level of abstraction.

Have you ever actually written any modern D3D code? Even at version 9 - once you rip out all the fixed pipeline rubbish - it’s actually a much cleaner and easier to use API than OpenGL. True, with D3D9 there were still some parts of it that looked as though they were designed by a tribe of monkeys on LSD (queries, lost devices, weird float to DWORD casts), but they’re all gone or resolved in 10 and 11. It’s clean, it’s easy to use, it’s fast to get stuff done in, you can be genuinely productive with it.

The core problem with bind-to-modify is that it messes up internal states. You want to modify a resource, you need to bind it, do your change, then put it back the way it was. The concept of binding doesn’t differentiate between whether you’re binding to modify or binding to draw, and that’s an AWFUL abstraction; it doesn’t reflect the way you actually USE resources in your program. Modifying a resource and using it to draw are two very different things, and should not be treated as if they were the same. It’s even worse with buffer objects as there is just a single binding-point for each buffer object type. Suddenly the API is no longer helping you to be productive, it’s getting in your way instead. You’re no longer designing your code around the job you need your program to do, you’re designing it around a bunch of arcane rules from the paleolithic era. How could that possibly be a good thing?

[QUOTE=Janika;1237653]Then please, as a driver developer, provide us, the developers, with the layout your driver likes. Though I still don’t get your point here. How on earth do you expect millions of developers starving for performance to choose the data layout your driver expects? Are you telling me that driver developers restrict the driver to work best with only one data layout?
These are not professional coders.[/QUOTE]
It’s not the driver that expects a proper layout but the hardware. You should learn about, or simply benchmark, NVIDIA and AMD GPUs with various VBO layouts (alignment, stride, data format, component count, etc.). In general, don’t use data that isn’t 4-byte aligned; also, some older hardware might not like 3-component data. And there are some GL3-capable GPUs that, even though they support half floats, don’t perform very well with half-float vertex attributes. There are a lot of aspects and, of course, some of them are non-trivial. Testing and benchmarking is usually the best way to reveal them.
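As a concrete illustration of those rules of thumb (a sketch, not a guarantee for any particular GPU): an interleaved layout where every attribute starts on a 4-byte boundary and the whole vertex is a round 32 bytes. Benchmark it against your alternatives, as suggested above.

#include <stddef.h>   /* offsetof */

typedef struct Vertex {
    GLfloat position[3];   /* 12 bytes, 4-byte aligned */
    GLfloat normal[3];     /* 12 bytes */
    GLfloat texcoord[2];   /*  8 bytes -> 32-byte stride in total */
} Vertex;

/* Assumes the VBO holding an array of Vertex is bound to GL_ARRAY_BUFFER. */
static void setup_vertex_layout(void)
{
    glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, sizeof(Vertex),
                          (const GLvoid *)offsetof(Vertex, position));
    glVertexAttribPointer(1, 3, GL_FLOAT, GL_FALSE, sizeof(Vertex),
                          (const GLvoid *)offsetof(Vertex, normal));
    glVertexAttribPointer(2, 2, GL_FLOAT, GL_FALSE, sizeof(Vertex),
                          (const GLvoid *)offsetof(Vertex, texcoord));
    glEnableVertexAttribArray(0);
    glEnableVertexAttribArray(1);
    glEnableVertexAttribArray(2);
}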

Because OpenGL failed to remove any features, even through deprecation. Also, no rewrite happened that could eliminate this. Finally, even though the DSA extension is nice, I can understand why it has not been included in the core spec yet, as it has many problems too.

I’m not objecting to you just because I feel good about it, but because you are misleading the readers of this forum.