Dynamic states and Vulkan

Immutable pipeline state objects are a solution to sneaky shader recompilations. That said, there are times when flipping some bits in the GPU state shouldn't require creating a whole new PSO, which is likely why pDynamicState exists.

Aside from my specific use case with Vulkan, which makes caching PSOs an expensive and annoying task, this Twitter post made me realize this is not a problem exclusive to the kind of application I work on.

VkDynamicState has the problem that it only exposes a small portion of the entire pipeline state, and most desktop GPUs (or at least Nvidia GPUs) can modify more state than what Vulkan exposes. Sure, pipeline derivatives exist, but they are not the same thing; otherwise pDynamicState wouldn't exist.

To name a few: blending, front facing, and face culling state. These three can be modified on Nvidia hardware without triggering recompilations, and likely on other desktop hardware too, but I don't have the experience to confirm that.

Both proprietary console APIs and OpenGL expose this dynamic state, but Vulkan doesn’t. It’s true that there may exist some hardware where changing these can lead to recompilations, killing the whole purpose of PSOs. This leads me to the purpose of this thread.

I wanted to know what the community thinks about having more dynamic states, as optional features, for currently immutable portions of the pipeline, when these are guaranteed not to trigger recompilations.

Well, it is exactly what extensions are for. If NV supports it, it should expose it.

I also wondered whether dynamic state could be set outside command buffers (with the value instead being trivially patched in at vkQueueSubmit time). If that were possible, command buffer recreation would not be needed in many cases, which means even less CPU usage.

That’s not the problem they exist to solve. They exist to solve the problem of interrelations between different sets of state on different hardware. Or more specifically, different GPUs have “bits in the GPU” that are related to other “bits in the GPU”. You don’t know which is which, and there is a wide degree of variability here.

They aren’t supposed to be the same thing. They exist partially to mitigate the problem you’re dealing with: having to make multiple PSOs with slight variations in their state. Derivative PSOs are (presumably) cheaper to construct.
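For reference, a derivative pipeline is requested purely through creation flags. A minimal sketch, using the real Vulkan flag and field names but hypothetical surrounding variables:

```cpp
// Parent: opt in to being used as a base for derivatives.
VkGraphicsPipelineCreateInfo base_info = /* ... usual pipeline setup ... */ {};
base_info.flags |= VK_PIPELINE_CREATE_ALLOW_DERIVATIVES_BIT;

// Child: same setup with slight state variations, referencing the parent.
VkGraphicsPipelineCreateInfo derived_info = /* ... */ {};
derived_info.flags |= VK_PIPELINE_CREATE_DERIVATIVE_BIT;
derived_info.basePipelineHandle = base_pipeline; // VkPipeline built from base_info
derived_info.basePipelineIndex = -1;             // unused when a handle is given
```

Whether the driver actually constructs the child more cheaply is up to the implementation; the API only expresses the relationship.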

… so what? How true is it of other hardware? For example, most mobile GPUs do blending as a fragment shader operation.

Vulkan needs to be a cross-platform API, so having extremely GPU-specific features like that wouldn't be helpful. Features available across a broad swath of GPUs should be optional features, but not those restricted to specific vendors.

That’s extension territory.

That being said, if a broad swath of hardware (not just one vendor) does indeed support some of these things as being distinct, then there would be a case for allowing that state to be optionally dynamic. But care should be taken to make sure that it isn’t restricting future development. We wouldn’t want GPU makers to avoid doing something faster just to avoid having front-facing become tied to some other rasterizer state.

There comes a point where flexibility in the abstraction creates more cross-platform incompatibility than is warranted. It's a tradeoff: the more optional features you provide, the less likely it is that people will be able to make their code work across platforms. Unextended Vulkan obviously exposes more hardware variability to the user than OpenGL does, but the API seems to try to make sure that everyone is writing more-or-less the same kind of code.

So on the one hand, you have to query a lot about image formats and their data sizes to be able to use the API. But you have to do this for all hardware equally. You have a set of features that may or may not be available, but most of these are peripheral and directly tied into a specific piece of rendering functionality.

By contrast, there's no query for whether you can bypass the render pass architecture. Your GPU may get nothing out of it, but you still have to use it, no differently than if you were writing for a TBR.

What you're talking about (particularly with regard to blending) is more like skipping the render pass system than asking whether or not the GPU can do tessellation. It feels like you would be writing code that isn't Vulkan.

This would be far better done by allowing special kinds of CBs to not be isolated from one another state-wise. That way, you can build a short CB that just changes some dynamic state, and the following CBs in submission order inherit this state. This would also improve issues involving dynamic UBO/SSBOs and the like.

Leaking state across command buffers would go against Vulkan’s design. If that were the case, there would be no need for patches in vkQueueSubmit, you could change everything from a previous command buffer. This is how some proprietary APIs work.

Enabled dynamic states have to be passed at pipeline creation in pDynamicState.
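For completeness, that looks like the following. The enums and struct are core Vulkan; the variable names (and the commented-out front-face entry, which is exactly what doesn't exist in core) are illustrative:

```cpp
const VkDynamicState dynamic_states[] = {
    VK_DYNAMIC_STATE_VIEWPORT,
    VK_DYNAMIC_STATE_SCISSOR,
    // a hypothetical VK_DYNAMIC_STATE_FRONT_FACE would be listed here too
};

VkPipelineDynamicStateCreateInfo dynamic_info{};
dynamic_info.sType = VK_STRUCTURE_TYPE_PIPELINE_DYNAMIC_STATE_CREATE_INFO;
dynamic_info.dynamicStateCount = static_cast<uint32_t>(std::size(dynamic_states));
dynamic_info.pDynamicStates = dynamic_states;

// graphics_pipeline_info.pDynamicState = &dynamic_info;
// Any state not listed here stays baked into the pipeline object.
```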

Do you mean fewer? Even extended Vulkan exposes fewer hardware features than OpenGL.

From an application perspective, this would look like this:

PSOKey key;
key.rasterizer = GetRasterizerState(state);
// query the rest
if (device->dynamic_front_facing_supported) {
    // with some tracking to avoid dumb binds
    vkCmdSetFrontFacing(cmdbuf, GetFrontFaceState(state));
} else {
    key.front_face = GetFrontFaceState(state);
}

// skip hashing of unused entries
VkPipeline pipeline = GetPipelineFromCache(cache, &key);

// ...

As you can see, everyone would still write more-or-less the same code, while also taking advantage of capable hardware.

Yes, I am aware blending is emulated on some mobile devices. So what? Optional capabilities don't force IHVs to implement them, nor software developers to use them. Otherwise no one would use logical operations, which are not widely supported on mobile and are still an optional core feature of Vulkan.

I think you were responding to what I said, but my point would be that you would specifically ask for such “leaks” across CBs. For example, D3D12 has a specific kind of command buffer (bundle) that specifically inherits certain state from prior commands outside of that buffer. I’m suggesting that Vulkan could have a similar kind of buffer.

Variability, not functionality. In Vulkan for example, you have to ask if a particular image format is available. In OpenGL, it is available, and if some piece of hardware can’t do it, they have to find a way around it.

Vulkan exposes you to the variations in hardware, forcing you to make the choice of how to deal with the presence or absence of something. OpenGL defines more baseline functionality, forcing implementations to make up the difference.

If you have all of that infrastructure (PSOKeys, caches, and the like)… what’s the point? I thought the point of all this was to not have these things, for each object to pick pipelines based on what’s being rendered, not trivial things like front facing and the like.

If this isn’t about cleaning up the interface so that we can alleviate the burden on users, if it’s not about avoiding having a user pipeline cache with arbitrary pipeline keys… then what is it about? What’s the advantage? Having one fewer key in your pipeline cache? All of the infrastructure and complexity is still there, just with one fewer parameter.

You cite “taking advantage of capable hardware” but what advantage is being taken here? Is it going to improve rendering performance in some meaningful way?

It can still be queried if it’s natively supported.

Front facing is trivial; it's just 2 bits. But blending isn't: it takes 216 bits (27 bytes) if everything is perfectly packed:

Enable: 1 bit
RGB equation: 3 bits
A equation: 3 bits
Source RGB func: 4 bits
Source A func: 4 bits
Dest RGB func: 4 bits
Dest A func: 4 bits
Enabled components: 4 bits

27 bits per entry, 8 attachments: 27 bytes. With an unpacked approach it can go up to 256 bytes.

It’s not taking “one fewer key” in the pipeline cache, it takes a quarter out of it; and this is just blending. I’m not counting other major parts of the key like vertex formats.

My point is that all of that code needed to pull this off still needs to be there. Any hypothetical dynamic blending state would be a runtime test, so you can’t just remove that stuff at compile time. You might find a way to remove the 216 bits, but you could never remove the code behind that data.

Which means that you will need to write, test, and maintain the code needed to handle pipeline blending state. So where is the advantage in sometimes not using the code that will always need to be there and be tested? It seems like the only thing you gain is that sometimes, you don’t need to build some extra pipelines.

Is that a significant performance problem?

I could understand the desire for the feature if you wanted it to be required, or if you intended to ditch support for any GPUs that don't support it. But supporting both paths just doesn't make sense to me.

Isn’t that something your code should have some control over? Since the shader logic is ultimately bound to the vertex data it is getting, the only things you would change are things like the various sizes and interpretation of arrays or their arrangement in memory. Wouldn’t it be easier to just standardize on one format of vertex data? Or at least, one format for each fashion of rendering (skinned meshes vs. non-skinned)?

Or are you writing a system where the user can throw anything at it and your code is expected to consume it as-is?

Templated functions and layouts, function pointers or virtual functions, code generated at runtime (JIT)… There are many ways to keep that code from executing at runtime.

Isn’t this true for all extensions? If you want to support hardware with or without an extension both paths have to be implemented and tested. If that were the case, no one would ever use extensions not available on all vendors.

That "sometimes" would be 85% of the time the application is executed, if only the green vendor supports it.

I would ditch the immutable path for a specific state if all three Windows vendors and the two major Mesa drivers supported it.

Pipeline caching sits at 1.04% according to MSVC's profiler; with blending excluded it went down to 0.58%.

I'm emulating the state of a modern GPU, so yes, the "user" can throw whatever it wants at it. In other words, I have to be able to generate every pipeline state a modern GPU can use. But I didn't want to bring up my specific use case, since it is a niche category in graphics programming.