how to manage states in ng ogl

l_belev · February 25, 2015, 2:26am

The classic model is separate states while the d3d11+ model is with the states grouped in several packets, which are immutable after creation.
From my experience i am convinced that the d3d11+ model is the very wrong thing to do.
At first glance it appears to be closer to the hardware (hardware often does group states in one way or another) and to minimize the expensive validations, so it should be good.
The reality is quite different though.

Consider 2 points of view: the application perspective and the driver perspective.
First, for the applications (games) the state grouping of the api is practically random and so it’s not useful at all.
Most of the time games need to change several states which according to the api are not related, and when the state groups are considered, no single one of them is fully changed (that is, for all related groups, some of their states need to be changed while others don’t).
So the game engines end up implementing a state-group-object caching layer by hashing that really implements the classic opengl/d3d9 style api over the d3d11.
This layer could just be part of the api itself without any drawback (there is no additional knowledge that the engines possess over the api, that could make this layer more efficient when implemented by the engine instead of the api).
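
A minimal sketch of the kind of engine-side caching layer described above, assuming a blend-state group only. `BlendDesc`, `BlendStateCache` and `fake_create` are hypothetical names made up for illustration and are not part of any real API; `fake_create` stands in for whatever call creates an immutable state object.

```python
from dataclasses import dataclass, replace

# Hypothetical blend-state description; the field names are illustrative only.
@dataclass(frozen=True)
class BlendDesc:
    enabled: bool = False
    src_factor: str = "ONE"
    dst_factor: str = "ZERO"
    equation: str = "ADD"

class BlendStateCache:
    """Exposes separate setters on top of immutable, grouped state objects."""

    def __init__(self, create_object):
        self._create_object = create_object   # stand-in for the API's create call
        self._objects = {}                    # BlendDesc -> immutable state object
        self._desc = BlendDesc()

    def set(self, **changes):
        # Classic-style call: change only the named states, keep the rest.
        self._desc = replace(self._desc, **changes)

    def current_object(self):
        # Hash lookup; the immutable object is created only on first use.
        obj = self._objects.get(self._desc)
        if obj is None:
            obj = self._objects[self._desc] = self._create_object(self._desc)
        return obj

created = []

def fake_create(desc):
    # Pretend to create an immutable API state object and count the creations.
    created.append(desc)
    return ("blend_object", len(created))

cache = BlendStateCache(fake_create)
cache.set(enabled=True, src_factor="SRC_ALPHA")
first = cache.current_object()
cache.set(equation="SUBTRACT")
second = cache.current_object()
cache.set(equation="ADD")
third = cache.current_object()   # same description as `first`, served from the cache
```

Three combinations are requested here but only two objects are created; the third request is a pure hash lookup. The layer uses nothing that the api itself would not also know, which is the point made above.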

From the driver perspective, D3D11’s grouping does not map to any existing hardware. Real hardware has different ways of combining the states. This means that the drivers have no choice but to do state remapping again, with a hashing and caching layer. This is exactly the same work as if the states were just separate as in the classic model. Note that it is not possible to choose the api groups in such a way that they will map well to the hardware, because all hardware is different.

So at the end of the day the api’s artificial state groupings are not only useless but are worse, because in practice they inflict 2 state remappings (one by the game engines and another by the drivers).
What would be most convenient for the application is the classic model, while for the drivers the classic and the d3d11 models are equally good (drivers have to re-map states in all cases).

So my suggestion is: keep the classic model of separate states. Don’t attempt to devise artificial state groups that can’t possibly map to all hardware while being awkward to use for the applications.

mhagain · February 25, 2015, 3:55am

The following description of D3D12 states is probably relevant here: Intel Developer Zone

Under this model there is a single state object combining all (well, “most” really, but “all” will suffice for this discussion) states used for a draw. So what that means is that the driver (and the program) needs no caching, no dirty bits, no hashing and all validation of all combined states is done up-front and at creation time. At draw time all that the driver needs to do is swap in the new states.
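
A rough sketch of that model, assuming a heavily reduced set of states. `PipelineDesc`, `PipelineState` and `CommandList` are hypothetical stand-ins and not the D3D12 API; the checks shown are invented examples of creation-time validation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PipelineDesc:
    # Hypothetical, heavily reduced set of states; real APIs carry far more.
    shader: str
    blend_enabled: bool
    depth_test: bool
    sample_count: int

class PipelineState:
    """Immutable once built: all checks and translation happen at creation."""

    def __init__(self, desc):
        if desc.sample_count not in (1, 2, 4, 8):
            raise ValueError("unsupported sample count")
        if not desc.shader:
            raise ValueError("a shader is required")
        # Pretend this tuple is the pre-baked hardware command block.
        self.baked = (desc.shader, desc.blend_enabled,
                      desc.depth_test, desc.sample_count)

class CommandList:
    def __init__(self):
        self.commands = []

    def set_pipeline(self, pso):
        # No validation, hashing or dirty tracking here: just record the baked block.
        self.commands.append(("SET_PIPELINE", pso.baked))

    def draw(self, vertex_count):
        self.commands.append(("DRAW", vertex_count))

opaque = PipelineState(PipelineDesc("lit", False, True, 4))
blended = PipelineState(PipelineDesc("lit", True, True, 4))

cl = CommandList()
cl.set_pipeline(opaque)
cl.draw(36)
cl.set_pipeline(blended)
cl.draw(6)

# An invalid combination is rejected when the object is built, never at draw time.
try:
    PipelineState(PipelineDesc("lit", False, True, 3))
    creation_rejected = False
except ValueError:
    creation_rejected = True
```

Note that switching from `opaque` to `blended` re-records the whole baked block even though only one state differs, which is the fixed cost this model trades for predictability.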

IMO this is also a valid model. Where the D3D 11 model didn’t work well was that it tried to hit a middle ground between this and the old “lots of separate states” model, and ended up suiting neither program flexibility nor optimal driver performance. Of course it’s still potentially an “awkward to use in programs” model, but whether or not that’s an acceptable tradeoff for the improved performance it will give is a matter of personal preference.

l_belev · February 25, 2015, 9:44am

In my first post where i wrote d3d11+ i actually meant d3d10+ (that is d3d10 and d3d11) but i doubt anyone got confused by that

I forgot to mention the “expensive” validations. First, they are not THAT expensive as some make them out to be (many of them can be reduced to 2 CPU instructions or so, as usually the valid parameter range is from 0 to some positive integer and most often they are independent, so checks for strange combinations are rarely needed).
Second, i assume the new opengl would follow the mantle validation model, that is, a separate layer that does the checks and which is normally turned off for shipped products, so the runtime checks are non-existent on the user machines.
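
A small sketch of both points, assuming a made-up entry point. `Device`, `ValidationLayer`, `set_active_viewport` and `MAX_VIEWPORTS` are hypothetical and do not come from mantle or any other real api; the only thing illustrated is that the check is a single range comparison and that it lives in an optional wrapper.

```python
MAX_VIEWPORTS = 16          # hypothetical implementation limit

class Device:
    """Unchecked core entry points (what would ship to users)."""

    def __init__(self):
        self.active_viewport = 0

    def set_active_viewport(self, index):
        self.active_viewport = index

class ValidationLayer:
    """Optional wrapper that checks arguments, then forwards to the core."""

    def __init__(self, device):
        self._device = device

    def set_active_viewport(self, index):
        # A typical independent check: one range comparison from 0 to a limit.
        if not 0 <= index < MAX_VIEWPORTS:
            raise ValueError(f"viewport index {index} out of range")
        self._device.set_active_viewport(index)

def make_device(debug):
    # Development builds get the checking wrapper, release builds the bare core.
    core = Device()
    return ValidationLayer(core) if debug else core

dev_debug = make_device(debug=True)
dev_release = make_device(debug=False)

dev_debug.set_active_viewport(3)
try:
    dev_debug.set_active_viewport(99)
    caught = False
except ValueError:
    caught = True

dev_release.set_active_viewport(99)   # not checked at all in the release path
```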

As for the single-object-to-rule-them-all model in d3d12, i don’t like it either. The applications will be forced again to implement their caching layers, which could instead be done just once and for all in the api (or the driver). The difference is in the validation checks (which for separate states will be more), but since they should be completely turned off for the release versions of the applications (as i mentioned above), it doesn’t matter.
As for the drivers, they will need to keep a list of their hardware state groups that are associated with each of the big api state objects, and on change they will need to check each of the hw state groups that are associated with the new api object, whether it is the same as the current one or not, so their work won’t be zero either. If they don’t do that, they will have to re-set ALL the hw state groups EVERY time even a single state has been changed, which may be overkill.
If the caching layer is instead implemented by the driver (instead of the application), this work will be eliminated because the driver will map from separate states directly to its true hardware state groups, and there will be no interaction between independent groups. When a single state is changed, only its group will need to be updated and the rest won’t be touched at all.
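
A sketch of that driver-side mapping, assuming an invented hardware layout. The `HW_GROUP_OF` table and the `Driver` class are hypothetical; real hardware groups differ per vendor and are not public in this form.

```python
# Hypothetical hardware layout: which API states share a hardware state group.
HW_GROUP_OF = {
    "blend_func": "color_output",
    "blend_equation": "color_output",
    "color_mask": "color_output",
    "depth_func": "depth_stencil",
    "stencil_op": "depth_stencil",
    "cull_mode": "raster",
}

class Driver:
    def __init__(self):
        self.state = {}        # shadow copy of every API state
        self.dirty = set()     # hardware groups that need re-emitting
        self.emitted = []      # stand-in for the hardware command stream

    def set_state(self, name, value):
        if self.state.get(name) == value:
            return                          # redundant change, nothing to do
        self.state[name] = value
        self.dirty.add(HW_GROUP_OF[name])   # only this state's group is touched

    def draw(self):
        # Re-emit only the groups that changed since the previous draw.
        for group in sorted(self.dirty):
            values = {k: v for k, v in self.state.items()
                      if HW_GROUP_OF[k] == group}
            self.emitted.append((group, values))
        self.dirty.clear()
        self.emitted.append(("draw", None))

drv = Driver()
drv.set_state("blend_func", "SRC_ALPHA")
drv.set_state("blend_equation", "ADD")
drv.set_state("cull_mode", "BACK")
drv.draw()
drv.set_state("blend_func", "ONE")     # only color_output becomes dirty
drv.set_state("cull_mode", "BACK")     # unchanged, ignored
drv.draw()

groups = [g for g, _ in drv.emitted]
```

The second draw re-emits a single group; the raster and depth-stencil groups are not touched.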
Overall this sounds to me like microsoft are again trying to “innovate” the same way they used to in d3d10, that is, without really having any extensive insight on the matter.

Alfonse_Reinheart · February 25, 2015, 11:00am

The three possibilities here are:

1. Big, monolithic, immutable (or at least with controlled mutability) context object. From what I understand, this is the D3D12 model, so let’s call this PSO.
2. Each individual piece of user-facing state is its own entity. This is (effectively) the current OpenGL model, and it seems to be what you’re suggesting.
3. Group sections of state into immutable state objects. This is the D3D10 model.

Each one has its strengths and weaknesses.

PSO

The PSO model hides everything, allowing the driver the full freedom to do what it wants. Each immutable PSO can be stored in whatever way is optimal. And each PSO change will have a known, fixed cost to it. PSOs also are integral to making display lists (of generic state) work, since a PSO change is a big, monolithic entity.

The biggest drawback is that each PSO change will have a known, fixed cost. If the only state you’re changing is blending state, then you still have to change the entire PSO. And it’s rather unlikely that drivers will do some kind of diff between the two PSOs and only change the elements the user actually changed. So every PSO change is basically a gigantic state transition.

There’s no way to optimize that for specific use cases. If you only need to change vertex formats, and such changes are less expensive than blending state changes, too bad.

OpenGL

The OpenGL model has the greatest flexibility for optimization of specific use cases. Each individual state change is its own command. Therefore, if some particular state is more expensive to change than others, it’s easy to avoid changing just that state.

However, there are a lot of potential drawbacks. Consider blending state.

Hardware blending state could very well be stored in one register for each output. So each output has a register, and this register controls both the glBlendFunci and glBlendEquationi state.

So tell me: how will this hardware handle a call to glBlendFunci? Since each output’s blend state is a single register, changing the blend state would require fully changing the register. But glBlendFunci only changes part of the register. Therefore, the driver would have no choice but to shadow the current blending state. It would read the shadow blend state, modify some bits, and then issue a hardware command to change the blend state accordingly.

And how would the hardware handle a call to glBlendEquationi followed by glBlendFunci? Well, it wouldn’t know that one follows the other. So it would have to send two hardware commands when one would be sufficient. An OpenGL driver would cache these things internally and detect the redundant command. But these APIs should be more low-level; an API command should translate directly into one or more hardware commands written into a command queue.
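
To make the register example concrete, here is a sketch assuming a single output and an invented bit layout (function in the low 8 bits, equation in the next 4). `BlendRegisterShadow` and the masks are hypothetical and do not describe any real GPU; `blend_func` and `blend_equation` merely play the roles of glBlendFunci and glBlendEquationi.

```python
# Hypothetical packing for one output's blend register.
FUNC_MASK = 0x0FF
EQ_SHIFT = 8
EQ_MASK = 0xF << EQ_SHIFT

class BlendRegisterShadow:
    def __init__(self, defer):
        self.defer = defer       # True: flush at draw time; False: write per call
        self.shadow = 0          # CPU-side copy of the register contents
        self.pending = False
        self.hw_writes = []      # values actually sent to the hardware

    def _update(self, mask, bits):
        # Read-modify-write on the shadow, since the API call covers only part
        # of the register but the hardware needs the whole value.
        new = (self.shadow & ~mask) | (bits & mask)
        if new == self.shadow:
            return
        self.shadow = new
        if self.defer:
            self.pending = True
        else:
            self.hw_writes.append(new)

    def blend_func(self, func_bits):
        self._update(FUNC_MASK, func_bits)

    def blend_equation(self, eq_bits):
        self._update(EQ_MASK, eq_bits << EQ_SHIFT)

    def draw(self):
        # The deferred variant sends one combined write just before drawing.
        if self.pending:
            self.hw_writes.append(self.shadow)
            self.pending = False

direct = BlendRegisterShadow(defer=False)
direct.blend_equation(0x2)
direct.blend_func(0x15)

cached = BlendRegisterShadow(defer=True)
cached.blend_equation(0x2)
cached.blend_func(0x15)
cached.draw()
```

The direct variant issues two full register writes for the two calls. The deferred variant issues one, but only because it keeps CPU-side state and waits until draw time, which is exactly the bookkeeping a command-queue API wants to avoid.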

This is one of the reasons why display lists (for general commands) are untenable in OpenGL. A display list that contained only glBlendFunci calls is supposed to work with whatever equation blend state was already set. For the above hardware, executing a display list isn’t as simple as firing off a command queue; it requires a lot of CPU gymnastics.

D3D10

Approach #3, the D3D10 immutable state object approach, is really little more than an attempt to fix the problems with #2 by guessing (or worse, enforcing) which state is grouped together. It’s basically the worst of all possible worlds. It has all of the OpenGL model’s problems with none of its user flexibility. And while it may be faster in some cases compared to PSO state changes, it is still very much not display list-able.

The D3D10 model is also equivalent to functions like glViewportArrayv or glBindVertexBuffers, which also try to guess at what state is grouped with which other state.

So yeah, even though my predictions included approach #3, I’m going to have to backtrack on that. It’s a terrible idea.

Considering that both D3D and Metal use the PSO approach, I’d suggest that you get used to the idea. It seems to be the least bad of the alternatives.

elFarto · February 25, 2015, 12:18pm

[QUOTE=Alfonse Reinheart;1264559]PSO

The PSO model hides everything, allowing the driver the full freedom to do what it wants. Each immutable PSO can be stored in whatever way is optimal. And each PSO change will have a known, fixed cost to it. PSOs also are integral to making display lists (of generic state) work, since a PSO change is a big, monolithic entity.[/QUOTE]
This appears to be how the NV_command_list extension works, and the slides on it suggest that the driver will cache the state changes.

Regards
elFarto

Alfonse_Reinheart · February 25, 2015, 1:05pm

So that’s 3-for-3 command-queue-based APIs that use some form of the PSO approach. It would be ridiculous to accuse either Apple or NVIDIA of not “really having any extensive insight on the matter”.

So it seems rather certain we’ll see something similar for glNext.

mhagain · February 25, 2015, 2:45pm

I’ll add to this that at the very least the behaviour and performance becomes predictable.

Hidden costs of a potentially hardware-mismatched state change go away, you wind up knowing that if you want to issue a state change it’s going to have a fixed cost, and you can benchmark that and make design decisions around that cost. An API design like this is going to encourage you to keep on the fast path by making it pretty damn obvious where the fast path is, and that’s something useful.

l_belev · February 26, 2015, 1:44am

[QUOTE=Alfonse Reinheart;1264563]It would be ridiculous to accuse either Apple or NVIDIA of not “really having any extensive insight on the matter”.
[/QUOTE]

Back in the day when d3d11 was new, everyone was jointly praising the api and strongly recommending it. Especially nvidia. Now they have all jointly changed their tune.
While nvidia probably have quite good insight, they still can be wrong.

As for mostly everybody doing the same, it doesn’t mean that they all thought it out thoroughly.
They just look at what the others do and do the same, because the instinct of the herd tells them that if most of the others are doing it, it should be the right thing and no more thinking is needed.
This way even if everyone is doing the same thing, it still can be very very wrong.
Of course i can be wrong too. Only time will tell.

mhagain · February 26, 2015, 2:09am

I’m not sure where you get the “herd following” thing from any of this.

The thing is, D3D 10/11 was a good API design in its day. But today it’s an almost 10 year old API design, and in the meantime GPU performance has continued to increase faster than CPU performance, with multi-core becoming commodity, and that 10 year old API design is just unsuited to today’s hardware. In other words, the API design was static, hardware evolution was not, and we’ve gone past the stage where this is an issue and something needs to be done.

States were also only part of the API design.

It’s also the case that the new APIs are more closely modelled after console APIs, and they’re already proven in the field. None of this is theoretical; what we’re talking about is something that is already out there and is already known to work.

l_belev · February 26, 2015, 6:48am

I don’t agree that d3d10/11 was good in its time. In particular the problems with the way it handles the states were exactly the same from the first day the api existed. They are not specific to any particular GPU or CPU generation or anything else.

About the herds - it’s just that people don’t like to think, like it is painful for them. They prefer just to follow the herd and rely on others to do the thinking.
I said this as a reply to the argument that if most do the same thing (apparently both d3d12 and metal have a single object for all states) then it should be the right thing.
It could be that the first one does it in some [random] way and the others just imitate.
Then again they may have good reasons. I’m not very familiar with this matter yet. I just don’t buy the herd’s argument (that if others do it then it should be good only for this reason alone).
But we will see if there are real good reasons.

For now i don’t see any counters to my criticism against this model (which i stated earlier), so i remain sceptical.

Alfonse_Reinheart · February 26, 2015, 8:30am

No, only some of the problems with the D3D10 method were there at first. The state correlation issue, in an immediate mode API, is basically a minor inconvenience. Yes, IHVs have to deal with it, but so what? They probably had to deal with similar issues in OpenGL too. It’s something you handle at render-time. You’ll notice that the state change penalty is usually assessed when you next draw something, not when you actually change the state.

The D3D10 method gets most of its disadvantages when you’re trying to build a command-queue-style API with it. For an immediate mode API, it’s more or less equivalent to OpenGL, in terms of IHV implementation and overall performance. But in a command queue API, you get all of the downsides of the OpenGL model, with none of the upsides of the PSO model. It’s basically the OpenGL model, where you have to do more work.

So I would say that the D3D10 approach was not significantly better or worse overall than the OpenGL model for that style of API. But in a command queue API, it’s strictly worse.

Furthermore, NVIDIA has demonstrated a great willingness to use extensions to replace any part of the OpenGL API that they feel isn’t fast enough. If they believed that the D3D10 method was significantly superior to the OpenGL model for their hardware, wouldn’t they have introduced immutable state objects via some extension?

I didn’t see them come up with extensions to replace the blend state or viewport state with state objects. ARB_sampler_objects was a collaborative effort, with far more AMD people involved than NVIDIA ones. And NVIDIA has always been rather skittish on VAOs. So I see no evidence that NVIDIA was sold on the D3D10 method being better for their hardware.

You may be confusing love for D3D10 overall with love for any particular element of the API.

As for your notion of “herd” mentality on PSOs…

AMD’s Mantle was really the first of these next-gen APIs, and sadly there’s very little information readily available about it. However, there is some evidence that Mantle uses something rather like PSOs (the line about rolling shader stages into a “single object” is telling).

Apple Metal and D3D12 could be said to be taken from Mantle, as they were all announced well after AMD’s effort. However, it should be noted that there are significant differences here.

Metal in particular is clearly designed for mobile hardware; it’s not blindly following a “herd”. Specifically, their equivalent of a PSO doesn’t include one very important thing: framebuffers. Why?

Because RPSs (render pipeline states, Metal’s equivalent of a PSO) are allowed to change within a command queue, but framebuffers cannot. This is done because changing framebuffers on most mobile hardware is a very, very costly operation. So they designed their API to force you to start a new queue (clearly a heavy-weight operation) if you want a different framebuffer.

NV_command_list does something somewhat similar; you can’t change the framebuffers themselves within a single token stream, nor can you change the images you’re rendering to. But their PSO doesn’t really capture the framebuffer; it captures the image formats and binding qualities, not the specific bound images. So you can use the same PSO with different sets of images, so long as those sets of images are all compatible, though you do have to use a new token stream (aka: command queue).

D3D12 by contrast appears to stick framebuffers entirely into the PSO’s state. And therefore, you can change framebuffers as often as you change any other PSO state.

So it seems clear that the details of these APIs differ. And that suggests careful thought, rather than succumbing to some form of “herd” mentality. Sure, they all use the PSO approach, but their differences suggest that they’re not blindly applying something.

Furthermore, while I don’t trust NVIDIA to play nice with others (unless it serves their interests), there is one field of endeavor in which NVIDIA has proven themselves highly adept: making their stuff go as fast as possible. They are perfectly willing to make numerous changes to the API, whether via proprietary extensions or in tandem with others, that make their hardware perform beautifully. They make no compromises on this, and they do not succumb to “herd” mentality when it comes to performance (see their bindless graphics stuff as an example. Pointers in shaders?).

NV_command_list is all about performance. So if they adopted the immutable PSO approach to this extension, it is reasonable to assume that they have actual working knowledge of what’s faster on their hardware. Whether it’s faster on everyone’s hardware is up for debate.

Oh sure, it’s possible that they’re all following the same wrong idea from Mantle. But that would require that AMD got it wrong first; Apple copied it, changed it, and still got it wrong; and then NVIDIA copied it and kept it wrong.

It just seems rather unlikely.

l_belev · February 26, 2015, 9:15am

Mantle uses a d3d11-like scheme. While there is no official publicly available info, you can gather this by examining the exported functions in mantle32.dll. Among them there are ones for creating various state packets, e.g. raster states, blend states, depth-stencil states and others. They are all separate state objects. While this model is not good for portable vendor-independent api, I guess those state packets match their real hardware, so it must be good for them. Unfortunately that may make it not very portable to other GPU architectures.
So the single-object model does not originate from AMD.

Who invented it and who copied it (the single-object model) we may never know, but what i say about it is what i can conclude a priori. That is, it is NOT application friendly and will force all its users to make wrapper layers over it.
Sometimes the GPU vendors, while being very concerned about performance, make great efforts to optimize their own stuff while neglecting the inefficiencies in the actual applications (game engines) that use their GPUs and apis, inefficiencies caused by awkward api designs and rules. But i guess they don’t care much about those problems since they are equally bad for their competitors. I.e. if a game engine is more complicated and runs slower than what would be possible with a better api, this slowness will be the same on all vendors since it’s the same engine. And what they really care about is NOT to maximize the performance but to maximize their market share. This only translates to maximizing performance RELATIVE to the competition.

Alfonse_Reinheart · February 26, 2015, 10:55am

[QUOTE=l_belev]Unfortunately that may make it not very portable to other GPU architectures.
So the single-object model does not originate from AMD.[/QUOTE]

It is interesting to note that the APIs that have to be at least somewhat hardware independent (Metal and D3D12) both take the PSO route.

[QUOTE=l_belev]That is it is NOT application friendly and will force all its users to make wrapper layers over it.[/QUOTE]

Um, low level APIs are not supposed to be application friendly. You’re intended to write wrappers around them. The point is that your wrapper will be designed for your specific application, so the API will not attempt to force a one-size-fits-all solution on you.

That’s why they’re going the intermediate representation route with the shading language. Only engine programmers are intended to write to it directly. They expose whatever higher-level semantics someone desires.

What a low-level API needs to do is:

1. Provide access to the hardware.
2. Make it clear what the fast path is. Preferably by making non-fast paths illegal or at least use a clearly different API (the way Metal requires you to use a new command buffer when changing FBOs).
3. Make sure that applications can be fast.

Being easy to develop for is not a requirement. Indeed, that often gets in the way of #3.

[QUOTE=l_belev]I.e. if a game engine is more complicated and runs slower than what would be possible with a better api, this slowness will be the same on all vendors since it’s the same engine.[/QUOTE]

What makes you think that game engines will be either more complicated or run slower because of this design choice?

glNext and D3D12 are being designed based on input from both game developers and hardware vendors. glNext in particular is a collaborative effort involving representation from companies like Valve, Unity, Epic, Activision/Blizzard, as well as AMD, NVIDIA, Intel, Qualcomm, and ARM. In fact, the presentation next week will not include a representative from any IHV; it’s all game developers.

So what exactly makes you think that you know more about what a proper low-level API should look like, for the purposes of performance, than the people who make those game engines and the people who make the hardware they run on?

I’m not saying that they can’t make mistakes. I’m just pointing out that you’re pitting “guy on a forum” against “people who undeniably know what they’re talking about.” And if “people who undeniably know what they’re talking about” all choose option A, and “guy on a forum” says to do option B, it is far more likely that option A is better.

OpenGL has gotten markedly less stupid in recent years. And part of that has been increased input not just from IHVs but from developers.

l_belev · February 26, 2015, 11:44am

Sorry, i got memories of past “arguments” with you (they are utterly pointless) and i put an end to it before it gets ugly