Vulkan and bindless

#1

A question to those who have progressed further in the documentation than me: how does one do ‘real’ bindless resource utilisation in Vulkan? I have skimmed through the descriptor set APIs, and they don’t seem that fundamentally different from the traditional binding-slot paradigm — just that they allow you to specify the binding slot layout to begin with. But what if I want to put pointers to buffers/textures in a buffer and/or texture instead?

#2

I have barely dived in, but I remember reading some slides about bindless being pointless with the Vulkan model. You are expected to reuse command buffers as much as possible, so there is not much explicit binding happening in the first place. And when you do need to change resources, you do it by switching a sub-DescriptorSet containing all the textures and buffers you need. Page 87: https://www.amd.com/Documents/Mantle-Programming-Guide-and-API-Reference.pdf

#3

Thanks for your answer. As mentioned earlier, I don’t see much difference between descriptor sets and traditional binding — you still have slots to which you bind resources. Only in the traditional model you have a predefined table with slots and with the descriptors model you build up the tables yourself. What I am asking about is the ‘real’ bindless stuff — where for example you pack pointers to textures in a different texture and then use that.

#4

What I am asking about is the ‘real’ bindless stuff — where for example you pack pointers to textures in a different texture and then use that.

You don’t. And this shouldn’t be a surprise, since that was not a feature of Mantle, D3D12, or Metal, nor has the Khronos Group ever suggested that they would allow direct “bindless”.

I think you’re too focused on the general concept of “bindless” instead of the question of why bindless was an improvement in OpenGL. Because those reasons don’t apply in Vulkan.

Bindless mattered in OpenGL for 2 general reasons:

  1. Binding is costly.

  2. There is a limit on the number of binding points, even though hardware has no such limitation.

The ARB could have just added an extension to say that the number of binding points is arbitrarily large. By doing so, you could bind lots of textures and use indices to pick the ones that were appropriate for a particular model. That would allow you to make larger rendering calls without changing state in between.
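As a toy sketch (plain C, not any real API; all names are invented for illustration), that hypothetical approach would look like this: bind everything once into one large table, then select per draw with an index instead of rebinding.

```c
#include <assert.h>
#include <stdint.h>

/* Toy sketch, not a real API: with an arbitrarily large binding table,
 * the per-draw "state change" shrinks to passing an integer index. */
typedef uint64_t TextureHandle;

typedef struct {
    TextureHandle slots[1024];  /* everything bound once, up front */
} BindingTable;

/* What the shader conceptually does: index into the table, no rebind. */
static TextureHandle select_texture(const BindingTable *t, uint32_t material) {
    return t->slots[material];
}
```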

However, they didn’t just do that. They chose bindless for OpenGL because merely alleviating #2 would not fix all the problems of #1 (there were also mutability issues for textures that would make this problematic).

In Vulkan, problem #1 does not exist. Changing descriptor sets is approximately as costly as setting uniform values (the mutability issues also don’t exist). So why bother with getting handles to samplers/images when you can just change a whole group of state and move on?
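A minimal model of why this is cheap (illustrative C, not the actual Vulkan API): switching a descriptor set replaces a whole group of bindings in one step, rather than one driver call per slot.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model: a descriptor set is a prebuilt table of opaque resource
 * handles; "binding" it swaps the entire group at once. */
typedef struct {
    uint64_t resources[8];  /* textures, samplers, uniform buffers, ... */
} DescriptorSet;

typedef struct {
    const DescriptorSet *current;  /* what the next draw will read */
} CommandState;

static void bind_descriptor_set(CommandState *cs, const DescriptorSet *set) {
    cs->current = set;  /* O(1) no matter how many resources changed */
}
```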

OpenGL needs bindless; Vulkan does not.

It should also be noted that one of the reasons bindless is not core OpenGL is that Intel can’t implement it. Or at least not the way that ARB_bindless_texture is defined (handles that get passed to GLSL and converted into samplers/images). But they can implement descriptor sets adequately.

#5

I’m curious why Intel can’t implement bindless. Is it from a lack of expertise on their part?

#6

Maybe in their hardware, a 64-bit integer is not enough data to represent a unique texture/image. Maybe their texturing hardware is too separate from their shading hardware, such that the fetch unit has its own memory storage that the shader can’t write to.

Who knows? What matters is that they don’t allow it.

#7

Alfonse, thank you for your (as always) very detailed and informative answer. At the same time, you are kind of jumping ahead of my question here :slight_smile: The main reason I ask is because I am interested in understanding the “real” feature parity between OpenGL, Vulkan and Metal (especially the latter, as I am primarily working on the Apple platform and have really started to enjoy Metal). Especially since Khronos states in their presentation that Metal ‘retains the traditional binding model’ (which to me sounds like a criticism, or at least implies that Vulkan does not retain it). So far, the resource binding models of Metal and Vulkan seem of equal power to me. Both allow you to bind resources to previously defined arrays of slots (in Metal the slot layout is defined in the API, in Vulkan you need to create the layout yourself). Vulkan obviously offers more possibility for optimisation here (because of batch updates). Again, this is more a question of curiosity than anything practical. I just want to compare things and understand the differences.

BTW, Mantle supports nested descriptor sets; I didn’t find a mention of that in the Vulkan specs either (but then again, I haven’t had enough time to read them properly, unfortunately — and it’s quite a lot to read :slight_smile: )

#8

It all depends on how you define things; the “traditional binding model” is about more than attaching an object to a context to render. The “traditional binding model” includes limits. Vulkan does not. The “traditional binding model” changes object settings individually; Vulkan can change them in groups. The “traditional binding model” is a single, linear array of binding points; Vulkan’s descriptor sets are bundled into groups of “binding points” (though apparently no longer nested).

Of course, there’s also the issue of memory with Metal vs. Vulkan, and this is probably the biggest “binding model” difference. As I understand it, in Metal each texture and buffer is considered a separate memory object. Whereas in Vulkan, you’re supposed to allocate memory in slabs and section it as you desire into textures/buffers. This allows you to attach specific memory slabs to a queue. Whereas in Metal, the driver recognizes that memory allocations are in use by a command buffer/queue solely because you bind them to render.

In Vulkan, if you try to execute a command buffer on a queue that doesn’t have access to that memory slab, it blows up.

Metal is less explicit in this regard, much more like OpenGL than Vulkan/D3D12/Mantle’s lower-level approach.

#9

If I read this correctly, Vulkan too has binding limits (see 30.2). Disregarding batch updates (which are of course a very useful feature), it seems to me that the binding models of Vulkan and Metal are isomorphic. Any Vulkan descriptor layout can be mapped to a Metal array of argument tables and vice versa.

[QUOTE=Alfonse Reinheart;39815]
Of course, there’s also the issue of memory with Metal vs. Vulkan, and this is probably the biggest “binding model” difference. As I understand it, in Metal each texture and buffer is considered a separate memory object. Whereas in Vulkan, you’re supposed to allocate memory in slabs and section it as you desire into textures/buffers. This allows you to attach specific memory slabs to a queue. Whereas in Metal, the driver recognizes that memory allocations are in use by a command buffer/queue solely because you bind them to render.[/QUOTE]

True, I also think that this is the most significant difference (even though you can do ‘manual’ memory management in Metal by allocating one big buffer and creating multiple textures backed by that buffer). Also, unless I have missed something, the resource residency APIs from ARB_bindless_texture and Mantle seem to have gone away in Vulkan. From what I understand, the developer is now completely responsible for the ‘memory dance’, e.g. manually shuffling data to and from device memory to make sure it is there when needed. With the residency API, you could at least offload that work to the driver. Overall, when they said that Vulkan was going to be low level they certainly were not kidding around :slight_smile: BTW, I am not very fond of the vkBind*Memory APIs for non-sparse resources; they seem a bit odd compared to the entire immutable object paradigm.

#10

Hi Greyfox,

Here are the key differences in the binding models, I think:

  1. Metal and GL, because they let you bind individual resources to individual binding points, don’t expose the combinatorial complexity of mixing and matching resources between draw calls.

So for example, if I have a shader that uses 8 texture units and between each draw call I bind a new texture to a different unit, this looks tame in GL or Metal. Hey, I’m only doing one bind.

Vulkan exposes that this isn’t trivial for the driver; a new descriptor set* has to be built for each draw call, with its contents mostly unchanged.

Vulkan lets you optimize this by using multiple descriptor sets. The assumption is that you (the app developer) know the frequency of updates, e.g. one table for environment resources (per frame), one for materials (per draw call), etc.

  2. Vulkan, because it exposes the binding table as an object, lets you reuse bindings. (Think VAO, only good. :slight_smile: ) So in the above example, the descriptor set for a given object’s material (containing the object’s albedo texture and normal map and maybe a uniform buffer with various material parameters) would be created once and saved forever. This removes from the per-frame loop the work of building that descriptor table each time we want to use the material.

(A Metal or GL app will still have to bind the textures, and that GL bind call is going to result in a descriptor set being created or copied for the hardware to use to draw using that bound resource.)
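The reuse described above can be sketched as follows (toy C, with invented names, not real API calls): the material’s descriptor set is built once up front, and binding it each frame does no rebuilding.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of persistent descriptor sets: creation does the expensive
 * work once; per-frame binding just references the saved result. */
typedef struct {
    uint64_t albedo, normal_map, params_ubo;  /* illustrative contents */
} MaterialSet;

static int g_sets_built = 0;  /* counts trips through the expensive path */

static MaterialSet create_material_set(uint64_t albedo, uint64_t nrm,
                                       uint64_t ubo) {
    g_sets_built++;  /* runs once per material, not once per frame */
    MaterialSet s = { albedo, nrm, ubo };
    return s;
}

static const MaterialSet *bind_material(const MaterialSet *s) {
    return s;  /* cheap: nothing is rebuilt here */
}
```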

I’ve been looking at what a cross-API GL/Metal/Vulkan app might look like, and descriptor sets are one of those “impedance mismatches.” I can see two routes:

  • Ignore descriptor sets in the app - at draw time, build a descriptor set on the fly from whatever “bound” objects the app has. This could make a lot of descriptor sets and wastes time if some of the bindings are unchanged for that object (e.g. a material that never changes for a mesh). This approach doesn’t require app changes - so if you’re looking at this approach, yeah, it’s just a set of bindings instead of one binding.

To run this approach on Vulkan you have to manage memory yourself, e.g. how big should the descriptor set pool be, what do you do if you run out, when do you know you can recycle, etc. Metal/GL do this “for free”.

  • Require the app to provide descriptor sets, simulate them in GL by issuing a bunch of resource binds. This exposes descriptor sets to the app, which means the app can fully leverage Vulkan features e.g. to make static materials bind faster. The danger here is that binding the descriptor set might produce unneeded GL binds (which thrash the GL driver), so you have to use a little bit of app code to track and eliminate this case, wasting CPU. This is a total app change, so it’s only a good idea if descriptor sets are a good fit or you want to be Vulkan long term.

Cheers
Ben

* It is possible that a shader has so few bind points that a separate set can be used for each bind point. I haven’t looked at any Vulkan limits yet, but I wouldn’t expect this to work for large shaders.

#11

Thank you for a detailed explanation, Ben! Without doubt, the Vulkan model is more efficient. I guess that Metal fills up a new descriptor set on every command buffer submit (as all bindings need to be repeated anyway) — and the descriptor set layout can be extracted from the shader, so it’s already part of the pipeline state. I have experimented a bit with writing a descriptor-set-style wrapper on top of Metal; this is quite trivial to do and makes resource binding more convenient. To be honest, I am a bit surprised that Apple did not go this way from the start: it would map very well to the object-based API they are offering, and it’s certainly a more elegant solution. I’m sure we’ll see this feature introduced in the next version of Metal. On an unrelated note, after reading the Vulkan spec I feel like I understand why Apple left the Vulkan bandwagon, but that’s a different topic :slight_smile:

#12

Vulkan exposes that this isn’t trivial for the driver; a new descriptor set* has to be built for each draw call, with its contents mostly unchanged.

This assumes that a “descriptor set” represents something real in the GPU/driver/etc, rather than a useful abstraction that Mantle/Vulkan/Direct3D12 created. I imagine that the actual hardware doesn’t look like this.

A better way to look at it is that descriptor sets minimize the cost of a number of such state changes by making them all at once, via an efficient memcpy. You copy the entirety of the data in the descriptor set into… well, wherever it is that the GPU uses it.
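That “one bulk copy” view can be made concrete with a toy sketch (plain C; where the GPU actually reads its descriptors from is hardware-specific and simply modeled here as a second struct):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Toy model: committing a descriptor set = one memcpy of the whole table
 * into GPU-visible memory, instead of N individual state updates. */
typedef struct {
    uint64_t descriptors[16];
} DescriptorSet;

static void commit_set(DescriptorSet *gpu_visible, const DescriptorSet *set) {
    memcpy(gpu_visible, set, sizeof *set);  /* one copy covers all 16 slots */
}
```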

Coupled with concepts like Push Constants and such, you are able to more effectively express your intent. If you have lots of changes, you make a single descriptor set change. If you are making minimal changes to small amounts of data, you use push constants.

I’m sure we’ll see this feature be introduced in the next version of Metal.

… why? Apple had the chance to change their resource binding model back when they released Metal on OSX. They didn’t.

#13

[QUOTE=Alfonse Reinheart;39825]This assumes that a “descriptor set” represents something real in the GPU/driver/etc, rather than a useful abstraction that Mantle/Vulkan/Direct3D12 created. I imagine that the actual hardware doesn’t look like this.
[/quote]

The GCN hardware does. :slight_smile: UBOs, textures and samplers are all accessed through “descriptors” that have a known internal format and take some number of SGPRs.

Since using more SGPRs means less occupancy, the descriptor table sits in VRAM and is accessed through a memory fetch. This is why we can have really huge descriptor sets.

I’m not sure about Intel or NV hardware, but the AMD/GCN way of doing things isn’t at all surprising - the story of GPUs over the last few generations is moving from lots of specialized registers on chip to just talking to memory (and having more memory caches and things like that). Go back to the R300 and everything was on chip. :slight_smile:

A better way to look at it is that descriptor sets minimize the cost of a number of such state changes by making them all at once, via an efficient memcpy. You copy the entirety of the data in the descriptor set into… well, wherever it is that the GPU uses it.

I think you’re burying the lede just a little bit… making the changes all at once is good, but having the descriptor sets be -persistent- is better.

The rules in the API for descriptor sets make it straightforward for implementors to store the results of CPU work and reuse them:

  • Descriptor sets are opaque.
  • Descriptor set layout and size are totally immutable.
  • The contents of the descriptor set change only when the descriptor set is not queued or in flight.

So if I’m implementing GCN Vulkan, maybe I build a descriptor set in VRAM and leave it there and reuse it. The work of building the descriptor set is amortized (e.g. figuring out the actual base addresses for these resources, converting API settings to bit fields for the chipset) and the work of getting the descriptor from the CPU to GPU-local memory can be done only once.

If I have an older architecture that’s register based (e.g. R500-like) I can still write a series of command packets that set up my registers and save it in GPU local memory, then just queue the command packet when I need to “bind” the descriptor set. The fixed immutable layout makes this possible - there’s no just-in-time hardware mapping. So the win isn’t just changing things all at once, it’s getting to precompute and save some of that work.
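A sketch of that amortization (toy C; the “hardware encoding” here is invented purely for illustration): the translation from API settings to the chipset’s format runs once at build time, and every subsequent “bind” reuses the saved result.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model: translating API settings into a pretend hardware descriptor
 * format is done once when the set is built, then the result is reused. */
typedef struct {
    uint32_t hw_words[8];  /* pretend chipset-format descriptors */
} HwDescriptorSet;

static int g_translations = 0;  /* counts the expensive conversion */

static void build_hw_set(HwDescriptorSet *out,
                         const uint64_t *api_handles, int n) {
    g_translations++;  /* amortized: once per set, not once per bind */
    for (int i = 0; i < n && i < 8; i++)
        out->hw_words[i] = (uint32_t)(api_handles[i] ^ 0xC0DEu); /* fake encoding */
}

/* "Binding" just points the queue at the saved, pre-translated table. */
static const HwDescriptorSet *bind_hw_set(const HwDescriptorSet *s) {
    return s;
}
```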

Coupled with concepts like Push Constants and such, you are able to more effectively express your intent. If you have lots of changes, you make a single descriptor set change. If you are making minimal changes to small amounts of data, you use push constants.

Right… -intent- expression is the other big win besides persistence; apps know the relative frequency of resource updates, and descriptor sets let them partition bindings by that frequency.

We already have variable frequency and persistence for UBOs (where we can put the environment uniforms in one UBO and the material in another) - descriptor sets make this possible for data types that aren’t “straight C”.

cheers
Ben