Display Lists - The Next Generation (CBO)

Here’s an ancient and incomplete benchmark I did for this thread:
http://www.opengl.org/discussion_boards/ubbthreads.php?ubb=showflat&Number=255602#Post255602

My experience is limited, as it’s just a hobby for me, and the actual data I work with isn’t much; the frames are easily gpu-bound. I sort by material_pass->mesh_buffer->instance, so there are few buffer-bind/vtxsetup calls per viewport per frame.
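
For illustration, here is a minimal sketch of what I mean by that sort order (the struct and field names are made up, not taken from my actual code):

    /* Sort draw records so that everything sharing a material pass, and within
     * that a mesh buffer, ends up adjacent; instances come last.  This is what
     * keeps buffer-bind/vtxsetup calls per viewport per frame low. */
    #include <stdlib.h>

    typedef struct {
        unsigned material_pass;   /* shader + render-state bucket */
        unsigned mesh_buffer;     /* VBO/IBO pair */
        unsigned instance;        /* per-instance index */
    } DrawRecord;

    static int cmp_draw(const void *pa, const void *pb)
    {
        const DrawRecord *a = pa, *b = pb;
        if (a->material_pass != b->material_pass)
            return a->material_pass < b->material_pass ? -1 : 1;
        if (a->mesh_buffer != b->mesh_buffer)
            return a->mesh_buffer < b->mesh_buffer ? -1 : 1;
        return a->instance < b->instance ? -1 : (a->instance > b->instance);
    }

    /* Usage: qsort(draws, drawCount, sizeof(DrawRecord), cmp_draw); */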

You must admit that in an ideal world ATI/Intel would just sort out their dlist compilers (or in ATI’s case, just use the one from the FireGL drivers)?

No. In an ideal world, we would have one method for rendering that simply works as fast as possible with as great flexibility as possible.

The glVertexAttribPointer call’s the bottleneck (which you alluded to, but seemed to be suggesting it’s the bind followed by the attribptr call). That call is just as expensive whether you’ve changed the currently bound buffer or not.

Where is your evidence on that? In their discussion of why they implemented NV_vertex_buffer_unified_memory, NVIDIA specifically called out the cache issue of fetching the GPU address from the internal buffer object.

Furthermore, if glVertexAttribPointer is a problem, then wouldn’t it make more sense to simply divide the vertex format from the buffer object+offset the way that NV_vertex_buffer_unified_memory does? After all, if there is both a cache problem and a glVAP problem, then that glVAP problem must have to do with the cost of changing vertex formats.

Rather than making some gigantic change that requires lots of IHV support to make work and may not actually help, just make small, targeted, specific changes that fix the specific problems.

And while I don’t like bindless for breaking the buffer object abstraction, I have to say, it is a very specific, targeted change to solve a specific problem.

Ilian, glDrawRangeElementsBaseVertex isn’t supported on older cards…unfortunately.

CBO won’t be supported on them either. And BaseVertex is supported on DX9 cards; at least, the ones that are still receiving driver support from the IHVs. It’s a feature that has been available on D3D for some time.

No, you’re diluting this thread by confusing the issue.

Beating display lists isn’t the goal…

No, beating display lists isn’t the goal. Maximal application performance is!

If display list perf is the fastest route on one vendor for static pre-loaded geometry, then we will use it, no matter what you say! If another route is faster on another vendor, then we will use that! This is not rocket science. This is common sense. And switching rendering paths is easy. Only academia can afford to lean back and “settle” with lowest-common-denominator across all vendors. Performance and content sells – that’s reality.

So the fact that vendor X does a sad job on render path Y is a really dumb reason to say that path Y doesn’t matter.

Now (ignoring your destructive bashing), Aleksandar’s point in starting this thread was a very good one. Paraphrasing:

NVidia’s display lists provide huge speed improvements, but display lists are deprecated. How do we expose that speed-up in next-gen OpenGL?

and he proposed CBOs. …after which this degenerated into a food-fight over what the “cause” of the speed-up is, and (in some cases) how you could kludge around those causes rather than fix the underlying problem(s).

This is something the vendors will have to decide. On NVidia in my experience, we have our modern “display list” perf solution: it is called the Bindless extensions. If other vendors want that perf to make their cards look good and compare well, they’ll implement it (or something like it) too. I don’t see a need for another layer of objects/abstraction on top of this (display lists, CBOs, etc.) but vendor driver internals may steer the shared solution to that. Again, the vendors will have to decide.

And here’s hoping they are working through the ARB to facilitate this discussion, so we can get one EXT or ARB extension from this, not 2-3 vendor-specific extensions. A point which you also agree with:

In an ideal world, we would have one method for rendering that simply works as fast as possible with as great flexibility as possible.

So I agree with Aleksandar, and I’m glad he started this thread. Something is needed (API support) to fill this performance gap in a simple, cross-vendor way.

So far I’ve yet to hear a good reason why bindless (using 64-bit buffer handles, which just so happen to be GPU addresses on some hardware) isn’t “it”.

So the fact that vendor X does a sad job on render path Y is a really dumb reason to say that path Y doesn’t matter.

It’s all a matter of effort vs. reward.

NVIDIA’s graphics card division is… well… things aren’t going well for them. They’re 6 months late with a DX11 card, the card they eventually released is not exactly shipping in quantity, it runs fantastically hot, etc.

ATI by contrast was able to ship 4 DX11 chips in 6 months, and they’re able to meet demand in selling those chips. They’re selling DX11 hardware to the mainstream market, while NVIDIA can’t even produce mainstream (sub-$200) DX11 cards after a 6 month delay.

One company is winning, and the other is losing.

The simple economic reality is this: development resources are not infinite. It’d be great if we could optimize everything, everywhere, for every piece of hardware. But what matters most is doing the greatest good for the greatest number. Adding a rendering path for display lists only helps NVIDIA card users; for most people, that means some percentage of their customer base less than 100%. This rendering path requires testing, debugging, and other care and feeding.

Or, one could spend those development resources tweaking shaders to make them faster and gain a performance benefit there. Alternatively, since performance is being lost anyway, one could make the game look better at the same performance. Maybe make the shaders more complex, or add in HDR or bloom, or whatever. Unlike the display list optimization, both of these will be useful for 100% of the customer base.

Where are the development resources better spent? On the slowly dwindling population of NVIDIA card holders? Or on all of the potential customers? Yes, it’d be nice if development resources could be spent on both. And for some, they can afford it; more power to them.

The rest of the developers would rather have a single path that both NVIDIA and ATI are willing to optimize as much as possible. Right now, that path is VBOs.

and he proposed CBOs. …after which this degenerated into a food-fight over what the “cause” of the speed-up is, and (in some cases) how you could kludge around those causes rather than fix the underlying problem(s).

That’s how you see it, but that’s not what the actual discussion is.

First, identifying the cause of the performance increase from display lists or bindless is vital to determining how to actually achieve it. If the cause of the increase is not what was identified in the original post, then CBOs will not help! And proposing something that will not actually solve the problem is a waste of everyone’s time.

If you want to consider any discussion of whether CBOs will actually solve the problem to be missing the point, well, that’s something you’ll have to deal with yourself.

Second, “kludging” around the problem is more likely to solve it than inventing an entire new API. Bindless is nothing if not a gigantic kludge, yet you seem totally happy with it.

Something is needed (API support) to fill this performance gap in a simple, cross-vendor way.

This thread is not about “something” that solves the problem. It is not a thread for discussing arbitrary solutions to the problem. It is about a specific solution. A solution whose efficacy is far from settled.

using 64-bit buffer handles, which just so happen to be GPU addresses on some hardware

Those are not 64-bit handles; they are actual GPU addresses. Even if you completely ignore the fact that the function is called glBufferAddressRangeNV and the fact that the spec constantly refers to them as “addresses”, glBufferAddressRangeNV doesn’t take an offset. So that 64-bit value must be an address: either the one returned from querying the buffer, or that queried address plus an offset the application adds itself.

If it looks like an address, acts like an address, and everyone calls it an address, then it is an address. So please don’t act like bindless is something that could be trivially adopted by the ARB or something that doesn’t break the buffer object abstraction.
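
For reference, here is roughly where those 64-bit values come from, an untested sketch based on the NV_shader_buffer_load / NV_vertex_buffer_unified_memory specs (bufData and bufSize are placeholders):

    GLuint buf;
    GLuint64EXT addr = 0;

    glGenBuffers(1, &buf);
    glBindBuffer(GL_ARRAY_BUFFER, buf);
    glBufferData(GL_ARRAY_BUFFER, bufSize, bufData, GL_STATIC_DRAW);

    /* Pin the buffer so its location stays valid, then query that location. */
    glMakeBufferResidentNV(GL_ARRAY_BUFFER, GL_READ_ONLY);
    glGetBufferParameterui64vNV(GL_ARRAY_BUFFER, GL_BUFFER_GPU_ADDRESS_NV, &addr);

    /* At draw time, the 64-bit value handed to the driver is that address
     * (plus whatever offset the app adds in), not a buffer object name. */
    glEnableClientState(GL_VERTEX_ATTRIB_ARRAY_UNIFIED_NV);
    glBufferAddressRangeNV(GL_VERTEX_ATTRIB_ARRAY_ADDRESS_NV, 0, addr, bufSize);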

I’m using glMultiDrawElements(). glDrawElements() would further increase the number of function calls. Maybe it is true that glMultiDrawElements() just iterates through many glDrawElements() calls inside the driver, but I still believe that it is a little bit faster than if I do the iteration myself. I have also mentioned that I’m using bindless, so binding a VBO is not critical any more.
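
Just to be explicit about the trade-off I mean (a sketch; counts, offsets, and primcount are placeholders for my per-viewport arrays):

    /* What I do now: one entry point, and the driver iterates internally. */
    glMultiDrawElements(GL_TRIANGLES, counts, GL_UNSIGNED_INT,
                        (const void *const *)offsets, primcount);

    /* The alternative: I iterate myself and cross the API boundary
     * primcount times. */
    for (GLsizei i = 0; i < primcount; ++i)
        glDrawElements(GL_TRIANGLES, counts[i], GL_UNSIGNED_INT, offsets[i]);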

Would it be faster if I had just one VBO that is not static, instead of many static VBOs? I’m not sure. But there is definitely a need to update some parts of it. There is also the problem of indexing such a big and complex structure. Anyway, thank you all for the useful advice! I have tried to draw attention to something else…

Obviously I chose a wrong example. Maybe the next illustration will be better. Can anyone tell me why a single glCallLists() is faster than several independent glCallList() calls? I’ve changed the application so that it draws 65K DLs in two different ways. A single glCallLists() call is 60% faster than thousands of glCallList() calls. Measuring is done using QueryCounter (on the GPU). Maybe the answer to this question will help clarify what I wanted to say.
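
For reference, the two ways I measured look roughly like this (a sketch; listIDs holds the 65K display-list names, and I assume the default glListBase of 0):

    /* Way 1: a single call dispatches every display list. */
    glCallLists(numLists, GL_UNSIGNED_INT, listIDs);

    /* Way 2: one call per display list. */
    for (GLsizei i = 0; i < numLists; ++i)
        glCallList(listIDs[i]);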

@Alfonse: NVidia is ‘losing’? A slowly dwindling population of NVIDIA card holders? Seriously? I use both ATI and NVidia cards every day and buy the latest ones every year or so, and I can tell you, it’s a one-up game for both of them. Six months later, with Fermi, it could very well be ATI that seems to be ‘losing’. Neither of them is…

Your entire section on the ‘reality’ of reward vs. time is misplaced. It’s true, but misplaced. The gaming industry has been customizing its render-paths to specific cards since the birth of the GPU era. There are only two major brands of cards to worry about: NVidia and ATI. For major game development companies (which produce perhaps 80% of the professional games?), one more programmer who works on tweaking the render paths for two brands of cards is not a big deal. In fact, it’s a competitive advantage.

On bindless and buffer objects:

It’s clear that bindless achieves what it sets out to, at least on NVIDIA hardware. Thus, in order to decide how best to create a platform neutral extension that gives bindless performance without the bad parts of bindless, it stands to reason that the first step is to examine why bindless works. And that starts with the basic differences between rendering with bindless and rendering without it.

There are really only 2 differences between the bindless API and the regular one.

1: The division of vertex format (type, normalization, stride, etc) from GPU location (buffer + offset). In bindless, these are set by different API calls, whereas in regular GL, they are not.

2: The explicit locking of buffer objects, which prevents them from being moved. This also means that the buffer has an explicit address for the duration that it is bound.

NVIDIA did not have to do #1. Indeed, they went out of their way to do #1 in the implementation: they added a bunch of new functions just to add this functionality. This suggests that, for NVIDIA hardware/drivers at least, performing a glVertexAttribArray/Format call is expensive. Indeed, in the bindless examples, they specifically minimize this kind of state change. Setting the GPU address is done much more frequently.

And this makes some degree of sense for the user. Vertex formats don’t change nearly as frequently as which buffer object + offset you use. Indeed, you could imagine some applications that only use maybe 7 or 8 vertex formats per frame, if that. And with clever enable/disable logic, one imagines that you could set up a vertex format once and pretty much never change it (though if you’re making heavy use of attributes, this may not be possible).

So just from analyzing how bindless changes rendering, we can already see something that the ARB should be looking into: separating vertex formats from buffer object+offset.

I would suggest a VFO: vertex format object. This should work similarly to the way that sampler objects work: if a VFO is bound, it overrides the equivalent VAO settings. Like sampler objects, it should be DSAified by nature; binding only to use.

This would also require adding an API to assign a buffer object+offset to an attribute. While this data would be VAO data, it should probably not be DSAified.
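
To make that concrete, here is a purely hypothetical sketch of what using a VFO plus the per-attribute buffer+offset call might look like. None of these entry points exist; the names are invented for illustration only:

    /* HYPOTHETICAL API, modeled on sampler objects.  A VFO holds only format
     * state; which buffer object + offset feeds each attribute stays VAO state. */
    GLuint vfo;
    glGenVertexFormats(1, &vfo);              /* DSA by nature: no bind-to-edit */

    /* attrib 0: position, 1: normal, 2: texcoord; 32-byte interleaved stride */
    glVertexFormatAttrib(vfo, 0, 3, GL_FLOAT, GL_FALSE, 32);
    glVertexFormatAttrib(vfo, 1, 3, GL_FLOAT, GL_FALSE, 32);
    glVertexFormatAttrib(vfo, 2, 2, GL_FLOAT, GL_FALSE, 32);

    /* Bind only to use; like a sampler object, it overrides the bound VAO's
     * format settings. */
    glBindVertexFormat(vfo);

    /* Separately, attach buffer object + offset to each attribute (VAO state). */
    glVertexAttribBufferOffset(0, bufferObj,  0);
    glVertexAttribBufferOffset(1, bufferObj, 12);
    glVertexAttribBufferOffset(2, bufferObj, 24);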

NVidia is ‘losing’? A slowly dwindling population of NVIDIA card holders? Seriously? I use both ATI and NVidia cards every day and buy the latest ones every year or so, and I can tell you, it’s a one-up game for both of them. Six months later, with Fermi, it could very well be ATI that seems to be ‘losing’.

As of right now, what I said was true. I made no claim that this would continue in perpetuity. Only that, right now, ATI’s cards are selling better than NVIDIA’s.

My point was that you can’t ignore ATI. During the embarrassing years of the R520 and R600, there was a reasonable case to be made for ignoring ATI cards. That case cannot currently be made.

As an aside, you may want to investigate the Fermi problems (and other NVIDIA/TSMC manufacturing problems) in more depth. It’s very fascinating, and it does not offer a rosy outlook for NVIDIA in the near term. NVIDIA might come up with some mainstream Fermi-based cards that trump ATI’s Evergreen/Southern Islands stuff. But NVIDIA’s manufacturing problems really don’t suggest that this is likely.

For major game development companies (which produce perhaps 80% of the professional games?), one more programmer who works on tweaking the render paths for two brands of cards is not a big deal. In fact, it’s a competitive advantage.

I don’t believe I mentioned game developers. However, not all game developers are equal. And “one more programmer” is a lot of money for many game developers. If your project has 7 people on it, adding 1 more is pretty substantial.

There are several possible answers:

  • the display list optimizer can do a better job inside one DL than across several DLs (for instance, it can remove unused attributes, it can reindex, …)
  • or, if you are CPU bound, writing a few dwords once is faster for the driver than writing a few dwords a thousand times

Indeed, but glCallLists() does not create a single display list; it just iteratively calls separate DLs.

Completely agree. That’s one of the reasons I’ve proposed CBO.

This thread has become both an NV vs. ATI and a DL vs. VBO fight. That was not my intent. I just wanted to ask the community whether a command buffer object (or whatever its name would be) could be beneficial for boosting rendering speed. So far, it seems that the community is not interested in such an extension. :frowning:

That’s one of the reasons I’ve proposed CBO.

And that’s the problem: it only solves that particular issue. It has most of the limitations of display lists and a lot less optimization potential. It doesn’t guarantee optimal performance, just like display lists. It can’t even offer display list performance in the example you yourself used.

So it’s not a good idea.

I don’t understand what optimization you are talking about. Data is already stored in VBOs. Commands can be compiled (and optimized by reordering) into a separate buffer, and called as a batch in just one call. Where is the problem?

Of course, I’m unaware of all possible problems, and that’s the reason I’ve started this debate. All suggestions are welcome.

No, your point was that you can ignore NVidia because they are doing poorly anyway. It’s not reasonable to ignore NVidia because of their current performance, as this may just be a temporary turn, as it usually is. BTW, I already know about the problems with Fermi. Haven’t we seen ATI struggle with similar problems in the past?

I don’t believe I mentioned game developers. However, not all game developers are equal. And “one more programmer” is a lot of money for many game developers. If your project has 7 people on it, adding 1 more is pretty substantial.

You are right, you didn’t mention game developers. I also agree that on smaller projects, adding one more person is substantial. However, the reality still is that:

  1. Most professional/commercial software is made by large corporations.
  2. By consequence, commercial software will continue to support multiple render-paths, i.e., optimize for each card separately, since one more developer is not as much of a cost as losing out to the competition with a sub-optimized product on a particular GPU.

Aleksandar, they re-order or optimise whole frames of display lists. Telling the driver you want to draw a big contiguous block of display lists in a single call is gold dust to them. It gets sent to a worker thread which re-optimises the whole thing, given a batch id, and the next frame that optimised block is used instead. (speculation based on some observations).

I don’t understand what optimization you are talking about. Data is already stored in VBOs. Commands can be compiled (and optimized by reordering) into a separate buffer, and called as a batch in just one call. Where is the problem?

And this is the problem. You believe that the problem is function call overhead. That each function call itself is necessarily creating a noticeable performance drop. That it doesn’t matter which functions you call thousands of times per frame.

A display list is free to do the following, which CBOs cannot:

1: Put all of the mesh data into a single buffer.

2: Be directly tied to this buffer, so that when it is moved, the display list is notified.

3: Analyze the mesh data and modify the vertex format for optimal performance (interleaving, etc).

4: Minimize vertex format state changes during rendering.

And that’s just what I came up with off the top of my head.

You specifically stated, “The new DLs (or CBOs) would be just an efficient way to draw VBOs.” This means that CBOs must be using the same buffer objects that were used when compiling them. So there is no chance for format changes or reordering or anything.

No, your point was that you can ignore NVidia because they are doing poorly anyway.

My point was that you can’t let NVIDIA alone guide your decision making about where to spend your money. NVIDIA-specific optimizations are reaching less of your customer base.

Display lists are fine. No need to change them. Restrict what can be compiled into them, and maybe ATI will produce a better implementation for consumer cards (doubt it, as I said, I believe they’ve deliberately crippled dlists on non-workstation cards in order to sell more workstation cards).

I love the way display lists ‘describe’ a frame of drawing ops. If your scene is basically static, the whole thing can be a single ‘display list’ as far as the driver is concerned. That could include compiling all display lists called contiguously without intermittent state updates into a single display list. So it goes beyond the compile stage of the display list mechanism - the draw part is also easier to optimise.

None of this would be possible with your description of CBOs. But change your description to “content is dereferenced at compile time” and you’re in business again. But then that’s display lists you’re describing. Or at least, display lists as most people use them (as geometry display lists, not any of the state-change stuff).

NVidia was pretty blatant about that, as you well know. Buffer handle->addr lookups causing CPU cache pollution. CPU-side inefficiency.

There are really only 2 differences between the bindless API and the regular one.

1: The division of vertex format (type, normalization, stride, etc) from GPU location (buffer + offset). In bindless, these are set by different API calls, whereas in regular GL, they are not.

2: The explicit locking of buffer objects, which prevents them from being moved. This also means that the buffer has an explicit address for the duration that it is bound.

NVIDIA did not have to do #1. Indeed, they went out of their way to do #1 in the implementation: they added a bunch of new functions just to add this functionality.

If we ignore the legacy, deprecated vertex attributes (as you usually do), then there is only one new API for that:

glVertexAttribFormatNV

In general:

glVertexAttribPointer = glVertexAttribFormatNV + glBufferAddressRangeNV

So yes, they separated setting the vertex attribute format from setting the buffer address. Consequently, they can use the same API to “bind the buffer” via address (glBufferAddressRangeNV) just as we use the same API now to “bind the buffer” via buffer handle without bindless (glBindBuffer).

So for a modern OpenGL app that uses new-style vertex attributes, there are only these 2 new APIs that matter (glVertexAttribFormatNV and glBufferAddressRangeNV).
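
Put as a sketch (attrib, buf, stride, offset, bufAddr, and bufSize are placeholders; with bindless the app does the offset arithmetic itself and passes an address range):

    /* Classic path: one call sets the format *and* the buffer+offset, pulled
     * from the currently bound buffer. */
    glBindBuffer(GL_ARRAY_BUFFER, buf);
    glVertexAttribPointer(attrib, 3, GL_FLOAT, GL_FALSE, stride,
                          (const void *)offset);

    /* Bindless path: the same state, split across the two new calls. */
    glVertexAttribFormatNV(attrib, 3, GL_FLOAT, GL_FALSE, stride);
    glBufferAddressRangeNV(GL_VERTEX_ATTRIB_ARRAY_ADDRESS_NV, attrib,
                           bufAddr + offset, bufSize - offset);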

This suggests that, for NVIDIA hardware/drivers at least, performing a glVertexAttribArray/Format call is expensive.

Well, I can’t speak for all hardware, but I can tell you I tried doing both lazy sets of the vtx attr format with bindless vs. setting it every time regardless (both via glVertexAttribFormatNV, of course), and saw basically no significant difference. I tried this on systems with two different latest-gen CPU/CPU mem/MB combinations, one slow, one moderately fast (2.0/2.6 GHz Core i7s).

Nearly all (> 98%) of the benefit to be had is just through using buffer handles vs. buffer addresses. Nothing to do with lazy setting vtx attr formats.
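
For what it’s worth, the “lazy” variant was nothing fancier than a cached compare like this (a sketch; the cache struct is mine, not part of the extension):

    /* Skip glVertexAttribFormatNV when the requested format matches what was
     * last set for that attribute index. */
    typedef struct {
        GLint size; GLenum type; GLboolean normalized; GLsizei stride;
    } AttribFormat;

    static AttribFormat lastFormat[16];

    static void set_format_lazy(GLuint index, GLint size, GLenum type,
                                GLboolean normalized, GLsizei stride)
    {
        AttribFormat *c = &lastFormat[index];
        if (c->size == size && c->type == type &&
            c->normalized == normalized && c->stride == stride)
            return;                              /* redundant set; skip it */
        glVertexAttribFormatNV(index, size, type, normalized, stride);
        c->size = size; c->type = type;
        c->normalized = normalized; c->stride = stride;
    }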

Indeed, in the bindless examples, they specifically minimize this kind of state change.

Yeah, I noticed that. Weird.

So just from analyzing how bindless changes rendering, we can already see something that the ARB should be looking into: separating vertex formats from buffer object+offset.

That’s premature, unless it just makes the API cleaner, which in bindless’s case it seems to do; it apparently doesn’t matter to perf either way.

Yeah, you work for them, we get it. :wink: Marketing dept?

This “was” a technical discussion about how to expose display list perf in next-gen OpenGL. Let’s get back to that… And stop bashing people…

It’s not that there’s no interest. It’s just that it’s another object (like display lists) in the driver, that needs to be created and managed. If that’s the fastest/cleanest approach, great, but do we really need this level of abstraction?

…from what I’ve seen (using geometry-only display lists), the underlying gain is almost purely from switching from buffer handles to buffer addresses.

We maintain buffer handles now. No reason we can’t maintain buffer addresses (or 64-bit handles) in addition, in the same app data structures, especially when it buys you so much perf. Ignoring Alfonse’s senseless bashing, I haven’t heard a good reason why this “isn’t” a good idea.

…the only reason I see for needing another level of abstraction is if the underlying concept of “GPU buffer addresses” is purely an “NVidia quirk” and not the way other GPUs work.

Then maybe we go with CBOs or some abstraction which can pre-resolve buffer handles to addresses once and then reuse them many times.
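
In app terms, pre-resolving is just something like this (a sketch; the struct is hypothetical and error handling is omitted):

    /* Keep the resolved GPU address next to the buffer handle we already
     * store, resolve it once, and reuse it every frame thereafter. */
    typedef struct {
        GLuint      handle;   /* what we maintain today */
        GLuint64EXT addr;     /* resolved once via bindless, reused many times */
        GLsizeiptr  size;
    } MeshBuffer;

    static void resolve_buffer_address(MeshBuffer *mb)
    {
        glBindBuffer(GL_ARRAY_BUFFER, mb->handle);
        glMakeBufferResidentNV(GL_ARRAY_BUFFER, GL_READ_ONLY);
        glGetBufferParameterui64vNV(GL_ARRAY_BUFFER, GL_BUFFER_GPU_ADDRESS_NV,
                                    &mb->addr);
    }

    /* Per draw, no handle->address lookup inside the driver:
     *   glBufferAddressRangeNV(GL_VERTEX_ATTRIB_ARRAY_ADDRESS_NV, 0,
     *                          mb->addr, mb->size); */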

Ignoring Alfonse’s senseless bashing, I haven’t heard a good reason why this “isn’t” a good idea.

Because it breaks the fundamental abstraction of buffer objects.

It’s the same reason the ARB used VBOs (which behave similarly to ATI_vertex_array_object) instead of NV_vertex_array_range.

I really doubt it works that way. With glCallLists you can pass any combination of DL IDs through the ID vector, and it works perfectly. I think the driver has no time to reorganize anything during that call. It happens in a fraction of a millisecond.

Of course you shouldn’t change the layout of a VBO without recompiling the CBO, but you can change the content of the buffer without affecting the CBO.

Dereferencing at compile time could totally remove the need for bindless, because the CBO would manage physical addresses. I don’t understand one thing: why must VBOs used with bindless be made resident every time we change their content? The layout of the buffer does not change. The size is the same. Why should we do all the work of getting physical addresses and making buffers resident if we just want to change the content?

Anyway, I think that the major advantage of CBO over DL is that it decouples commands from the data. A DL cannot be modified. A VBO can! By modification I don’t mean changing the size or the vertex format, just the content. If we want to change the layout of the vertex buffer or a vertex format, the CBO should be recompiled. But even a recompilation can be optimized.

Something about Bindless: Recent experiments with my (“more optimized”) application showed that the maximum overall speedup using Bindless is slightly below 50%. I can post charts for three NV cards. A very interesting phenomenon is that Bindless is slower than ordinary VBOs for small scenes, and has a jump for middle-range scenes. Of course, this suggests that finding the bottleneck is not easy, because in most cases it is distributed across different stages of execution. My application is definitely CPU/driver bound, and some kind of command buffer could be beneficial for boosting its speed.