Guidelines for selecting queues and families

xaxazak · May 3, 2017, 11:15am

I’m trying to decide on an algorithm to pick queues and families. Are there any generic, vendor-agnostic conventions around choosing these?

[ul]
[li]Should you use earlier-enumerated families first?
[/li][li]Should you use single-capability families before mixed ones?
[/li][li]If there are multiple families with the required capabilities, should you take your work queues from all these families or keep it on the same one?
[/li][/ul]
Also, do priorities affect the scheduling of queues from different families?

[HR][/HR]
The spec says (@4.3.1):

Note
The general expectation is that a physical device groups all queues of matching capabilities into a single family.
However, while implementations should do this, it is possible that a physical device may return two separate queue families with the same capabilities.

Does the use of “should” mean there’s never a valid reason for a GPU to offer multiple families with the same capabilities?
What about if a card had 2 separate GPUs? Should it just combine queues from both into the same family?

Can I use the above note to mean I should be able to write, with no performance penalty, a queue-selection algorithm that only considers total queues for each permutation of capabilities (eg: bundling all available graphics+compute queues together regardless of family)?
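Here is a minimal Python sketch of the bundling idea in the question. It uses mock data, not real Vulkan calls. `QueueFamily` is a hypothetical stand-in for the `queueFlags`/`queueCount` fields of `VkQueueFamilyProperties`, and the flag constants reuse the `VkQueueFlagBits` values. The spec lets graphics- and compute-capable families omit the transfer bit even though they support transfers, so the sketch normalizes masks before grouping. Whether pooling across families is free is exactly what the spec note does not promise.

```python
from collections import defaultdict
from dataclasses import dataclass

# Same bit values as VkQueueFlagBits in the Vulkan C headers.
GRAPHICS, COMPUTE, TRANSFER = 0x1, 0x2, 0x4

@dataclass(frozen=True)
class QueueFamily:
    # Hypothetical stand-in for VkQueueFamilyProperties (only the fields used here).
    queue_flags: int
    queue_count: int

def effective_flags(flags):
    # Graphics/compute families may leave TRANSFER unreported but still support it.
    return flags | TRANSFER if flags & (GRAPHICS | COMPUTE) else flags

def bundle_by_capabilities(families):
    """Map each capability mask to every (family_index, queue_index) offering it,
    regardless of which family the queue belongs to."""
    pools = defaultdict(list)
    for family_index, family in enumerate(families):
        mask = effective_flags(family.queue_flags)
        for queue_index in range(family.queue_count):
            pools[mask].append((family_index, queue_index))
    return dict(pools)

# Hypothetical device exposing two families with (effectively) identical capabilities.
families = [
    QueueFamily(GRAPHICS | COMPUTE, 2),
    QueueFamily(GRAPHICS | COMPUTE | TRANSFER, 2),
    QueueFamily(TRANSFER, 1),
]
pools = bundle_by_capabilities(families)
```

The family index stays attached to each queue. `vkGetDeviceQueue` and `VkDeviceQueueCreateInfo` are keyed by family, and resources created with `VK_SHARING_MODE_EXCLUSIVE` need an ownership transfer to move between families. So the pool cannot erase family boundaries entirely.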

[HR][/HR]
Thanks for all assistance.

Alfonse_Reinheart · May 3, 2017, 12:37pm

I think you’re going about this the wrong way. You should start with what your application needs and how it intends to work with Vulkan.

For example, many applications need a graphics queue and a transfer queue. So the first question is whether the implementation actually has separate graphics and transfer queues. This could be through two separate queue families or one “everything” queue family with two separate queues. Either way will work; it’s simply a matter of what the implementation provides and what your application needs.

Some applications need compute capabilities, but if the compute operations feed graphics operations, it’s generally best to stick them in the same queue. Other applications have compute operations that are genuinely asynchronous.

Invariably, however, you’re going to have to accept the possibility of hardware that has exactly one queue. So you still need to write your code so that it can work in those cases.
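One way to sketch that advice uses mock family data as `(queue_flags, queue_count)` tuples with `VkQueueFlagBits` values. The preference order is an illustrative assumption, not a rule from the spec: first a pure transfer family, then a second queue in the graphics family, then sharing the graphics queue. The last fallback works because any graphics-capable queue also accepts transfer commands.

```python
GRAPHICS, COMPUTE, TRANSFER = 0x1, 0x2, 0x4  # same bit values as VkQueueFlagBits

def pick_graphics_and_transfer(families):
    """families: list of (queue_flags, queue_count) per queue family.
    Returns ((family, index) for graphics, (family, index) for transfer),
    or None if no family supports graphics."""
    gfx = next((i for i, (flags, _) in enumerate(families) if flags & GRAPHICS), None)
    if gfx is None:
        return None
    # 1. A transfer-only family: likely a separate copy engine.
    for i, (flags, count) in enumerate(families):
        if flags & TRANSFER and not flags & (GRAPHICS | COMPUTE) and count > 0:
            return (gfx, 0), (i, 0)
    # 2. A second queue in the graphics family.
    if families[gfx][1] > 1:
        return (gfx, 0), (gfx, 1)
    # 3. Exactly one usable queue: both roles share it.
    return (gfx, 0), (gfx, 0)
```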

[QUOTE]Should you use earlier-enumerated families first?[/QUOTE]

There is nothing in the standard that suggests that the order the families are enumerated in is relevant to their performance or functionality.

[QUOTE]Should you use single-capability families before mixed ones?[/QUOTE]

You should use the queue families that provide the least functionality you need.
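“Least functionality you need” can be sketched as a scoring rule over plain `VkQueueFlagBits`-style masks. Among families that have every required bit, the sketch takes the one with the fewest extra bits and breaks ties by enumeration order. Counting bits is a heuristic chosen for illustration; the spec attaches no performance meaning to it.

```python
GRAPHICS, COMPUTE, TRANSFER, SPARSE_BINDING = 0x1, 0x2, 0x4, 0x8  # VkQueueFlagBits values

def effective_flags(flags):
    # Graphics/compute families support transfer even if they do not report it.
    return flags | TRANSFER if flags & (GRAPHICS | COMPUTE) else flags

def least_capable_family(family_flags, required):
    """Index of the family that has all `required` bits and the fewest extra
    bits; ties go to the lower index. Returns None if no family qualifies."""
    candidates = []
    for index, flags in enumerate(family_flags):
        flags = effective_flags(flags)
        if (flags & required) == required:
            extra = bin(flags & ~required).count("1")
            candidates.append((extra, index))
    return min(candidates)[1] if candidates else None
```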

[QUOTE]do priorities affect the scheduling of queues from different families?[/QUOTE]

They may. The specification doesn’t say that it can’t, so it’s possible that it can.

xaxazak · May 4, 2017, 1:14am

I have an application in mind but I’m trying to wrap Vulkan nicely (inside a larger library) without any vendor-specific or app-specific ideas. An application can specify what its queue needs are and the wrapper picks the queues and provides mutexes etc.

But just for a better idea, my current application has:

[ul]
[li]Multiple independent per-frame tasks like shadowmap generation, UI overlay, main geometry, plus optional rear/side view geometry.
[/li][li]One big renderpass to assemble everything and do post-processing.
[/li][li]Occasional asynchronous transfers, potentially multiple at the same time.
[/li][li]I’m not wanting compute yet, but I might in the future.
[/li][/ul]

Ok thanks, that’s what I figured - similar to memory types.

[ul]
[li]If there’s one dedicated transfer queue but I can use multiple transfer queues, should I just shove all transfer ops down the dedicated one, or should I use a non-dedicated queue when the dedicated one is in use?
[/li][li]If there are two identical-capability queue families with two queues each, and I want two queues, should I use both from the same family or one from each?
[/li][/ul]

Salabar · May 4, 2017, 6:32am

You should get a working prototype first and optimize second. As a rule of thumb, you PROBABLY won’t benefit from using more than one graphics queue in rasterization-heavy tasks, and it’s PROBABLY pointless to use more than one transfer queue for GPU-HOST transfer operations, but it’s impossible to tell for certain without profiling.

Alfonse_Reinheart · May 4, 2017, 7:20am

The whole point of a low-level API like Vulkan is that applications can tailor their use of the API to their needs. Such a wrapper is antithetical to that, providing a lowest-common-denominator solution instead of one specific to the application domain.

The best way to write such libraries is to give them the power to allow applications to ask the questions that they need to ask. For example, an application should be able to ask for up to 2 transfer queues that are completely independent of graphics operations. A different application should be able to ask for up to 4 transfer queues that operate independently of graphics operations, but are allowed to be within the same queue family as the graphics queue.

If a request cannot be satisfied, then the library says so.
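A request-based interface like this could look as follows, using mock family data as `(queue_flags, queue_count)` tuples. `QueueRequest`, `satisfy`, and the `reserved` parameter are hypothetical names for a wrapper library, not Vulkan API. Returning `None` is how this sketch “says so”.

```python
from dataclasses import dataclass

GRAPHICS, COMPUTE, TRANSFER = 0x1, 0x2, 0x4  # same bit values as VkQueueFlagBits

@dataclass(frozen=True)
class QueueRequest:
    flags: int                     # capabilities every returned queue must have
    max_count: int                 # "up to" this many queues (at least 1)
    exclude_graphics_family: bool  # reject families that also expose GRAPHICS

def effective_flags(flags):
    # Graphics/compute families support transfer even if they do not report it.
    return flags | TRANSFER if flags & (GRAPHICS | COMPUTE) else flags

def satisfy(request, families, reserved=frozenset()):
    """Return up to request.max_count (family, queue) pairs, skipping pairs in
    `reserved` (e.g. the main graphics queue), or None if none qualify."""
    chosen = []
    for family, (flags, count) in enumerate(families):
        flags = effective_flags(flags)
        if (flags & request.flags) != request.flags:
            continue
        if request.exclude_graphics_family and flags & GRAPHICS:
            continue
        for queue in range(count):
            if (family, queue) in reserved:
                continue
            chosen.append((family, queue))
            if len(chosen) == request.max_count:
                return chosen
    return chosen or None
```

The two applications described above map to two requests:
- `QueueRequest(TRANSFER, 2, True)` asks for up to two transfer queues outside any graphics family.
- `QueueRequest(TRANSFER, 4, False)`, with the graphics queue passed in `reserved`, asks for up to four transfer queues that may share its family.

A caller that gets `None` can retry with a looser request.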

There’s no way to know without testing. Broadly speaking, separate queue families will represent separate and independent pieces of hardware. But not always. Nor does this mean that queues within the same family always contend for the same resources.

This is why applications should be flexible; they may want to change how they arrange their queues based on specific hardware.

Try it and find out. I’ve looked at the Vulkan hardware database, and I have yet to find a card that exposes two queue families with identical capabilities.

xaxazak · May 4, 2017, 10:20am

I want to write the wrapper so that the application can tailor its use to its needs by describing the queues it wants. Similar to what you’re describing.
However, I wasn’t considering letting the application explicitly require independent operation - that seems like it would really restrict hardware.
It seems to me that you’d have to be very dedicated to craft your application so that it would benefit from knowing whether a queue is independent, especially considering we’ve got explicit scheduling priorities.
However, I guess things like whether to pick (when necessary) a compute vs a graphics queue for transfer ops depend on the app. I guess I’ll need to ask the app questions like this.

I have to say this feels wrong to me. APIs shouldn’t require the application developers to check each piece of hardware because A) they can’t test for future hardware, and B) any aspiring new entrants to the market will have a major handicap.
Sure, OpenGL had this problem in some areas too, but way less IMHO. If things stay like this then it even seems possible that applications may, some time in the future, find OpenGL faster than Vulkan.

It feels like, currently, the only ways to avoid this are to A) allow manual/end-user configuration, or B) write an automated performance tester that will try every configuration.
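Option B can be sketched as a brute-force search. It assumes the application supplies a `measure(config)` callback that runs a representative workload and returns seconds; in Vulkan that number could come from timestamp queries written with `vkCmdWriteTimestamp`. `best_configuration` and the option names are hypothetical. The search space is the product of all options, so it grows quickly.

```python
import itertools

def best_configuration(options, measure, repeats=3):
    """options: dict mapping a setting name to its candidate values.
    measure: callback returning the cost in seconds for one configuration.
    Returns (best_config, best_cost), taking the minimum of `repeats` runs
    per configuration to reduce noise."""
    names = list(options)
    best_config, best_cost = None, float("inf")
    for values in itertools.product(*(options[name] for name in names)):
        config = dict(zip(names, values))
        cost = min(measure(config) for _ in range(repeats))
        if cost < best_cost:
            best_config, best_cost = config, cost
    return best_config, best_cost
```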

Flexible is fine as long as there is some way to determine the best configuration.

Overall it seems we’re swapping hardware knowledge (OpenGL) for application knowledge (Vulkan). Isn’t there some way we can have both? (This goes for both queues and memory types).

Alfonse_Reinheart · May 4, 2017, 11:04am

[QUOTE=xaxazak;42243]I want to write the wrapper so that the application can tailor its use to its needs by describing the queues it wants. Similar to what you’re describing.
However, I wasn’t considering letting the application explicitly require independent operation - that seems like it would really restrict hardware.[/quote]

How? The user asks for an independent queue, and if there isn’t one, then it asks for a non-independent queue.

Welcome to low-level API programming. If you want insulation between the app and the hardware, you should use an engine.

Vulkan isn’t for everyone.

No, it has way more of them. You just don’t see them at the API level. Look at the OpenGL AZDO presentation. Now realize how much of OpenGL you have to not use in order to achieve what Vulkan gives you by design.

Even more importantly, think about something like how best to use copy engines (aka: transfer queues). OpenGL has no way to expose even the concept of copy engines; all it has is asynchronous transfers. Which means that the implementation has to make assumptions about how you’re going to use PBOs in order to efficiently use them.

In Vulkan, even if you only use a single copy engine in hardware that has two of them, at least you’re the one making decisions on how to use it. You know that you’re streaming 8 textures and 4 buffers of data, back-to-back, into some memory; OpenGL doesn’t. You know exactly when you need to access those results; OpenGL doesn’t. Just that bit of low-level control, even if you’re not using the copy engines to their fullest potential, still allows you to do better than an OpenGL implementation’s guess about what you’re doing.

… No, they won’t.

Remember, one of the main purposes of Vulkan is to alleviate the API burden on the CPU. OpenGL can’t do that and won’t do that. So it might be possible that someone comes out with some new hardware that a particular Vulkan application doesn’t take full advantage of which OpenGL does. But that wouldn’t change the fact that the OpenGL application will still be horribly CPU bound.

Most end-users have no idea what it would mean to select how to distribute work among queues, let alone what the appropriate distribution for their particular hardware is.

Really, I think you’re overthinking this stuff. Reasonable ways to use the queues available can be deduced simply by looking at what queues are actually provided. If there’s an independent transfer queue, use it. If there are multiple independent transfer queues, maybe use them all, maybe just one. It all depends on how transfer bound you are. But at the end of the day, simply using an independent queue rather than a combined queue where reasonable gets you enough savings that “try every configuration” won’t be necessary for most applications.
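The “maybe use them all, maybe just one” choice can be reduced to a single tuning knob. This sketch assumes `transfer_queues` is a list of independent transfer queue handles (plain tuples here). `max_queues_used` is a hypothetical setting the application tunes to how transfer bound it is.

```python
from itertools import cycle

def assign_transfers(jobs, transfer_queues, max_queues_used=1):
    """Round-robin `jobs` over the first `max_queues_used` independent transfer
    queues. Returns {queue: [jobs in submission order]}."""
    if not transfer_queues:
        raise ValueError("no independent transfer queue; use the graphics queue instead")
    active = transfer_queues[:max(1, max_queues_used)]
    assignment = {queue: [] for queue in active}
    for job, queue in zip(jobs, cycle(active)):
        assignment[queue].append(job)
    return assignment
```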

There’s no such thing as a free lunch. If you want performance, you have to work for it. If you aren’t willing to do that work, then use an engine that someone else did the work on.

But that work has to be done, one way or another.

Also, memory types are pretty trivial to work out. You can tell which memory types deal with GPU memory vs. CPU memory. You can ask how much memory each pool has. You can ask about coherent vs. non-coherent. There is nothing left to chance or question about the behavior of a piece of memory.

The only thing that might confuse you with memory stuff is AMD’s 256MB device-local-yet-CPU-accessible heap. And that you can simply ignore if you don’t want to write specialized code to take advantage of it.
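Here is the usual selection loop, sketched in Python with mock data. `memory_type_bits` stands for `VkMemoryRequirements::memoryTypeBits`, and `memory_types` for the `propertyFlags` of each entry in `VkPhysicalDeviceMemoryProperties::memoryTypes`. The constants reuse the `VkMemoryPropertyFlagBits` values. The type list is hypothetical, loosely shaped like a discrete GPU that also exposes the small device-local, host-visible heap. The spec orders memory types so that, roughly, types with fewer property flags come first, which is why taking the first match is the common approach.

```python
# Same bit values as VkMemoryPropertyFlagBits.
DEVICE_LOCAL, HOST_VISIBLE, HOST_COHERENT, HOST_CACHED = 0x1, 0x2, 0x4, 0x8

def find_memory_type(memory_type_bits, memory_types, required):
    """Return the first memory type index that the resource allows (its bit is
    set in memory_type_bits) and whose flags include every bit in `required`,
    or None if there is no such type."""
    for index, flags in enumerate(memory_types):
        allowed = memory_type_bits & (1 << index)
        if allowed and (flags & required) == required:
            return index
    return None

# Hypothetical device: VRAM, system RAM, cached system RAM, small DEVICE_LOCAL|HOST_VISIBLE heap.
memory_types = [
    DEVICE_LOCAL,
    HOST_VISIBLE | HOST_COHERENT,
    HOST_VISIBLE | HOST_COHERENT | HOST_CACHED,
    DEVICE_LOCAL | HOST_VISIBLE | HOST_COHERENT,
]
```

Ignoring the AMD-style type then just means never asking for `DEVICE_LOCAL | HOST_VISIBLE` together.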

But at the end of the day, low level means low level: Vulkan does not abstract away details of hardware that applications can take advantage of for performance.

Salabar · May 4, 2017, 11:57am

[QUOTE]APIs shouldn’t require the application developers to check each piece of hardware because A) they can’t test for future hardware, and B) any aspiring new entrants to the market will have a major handicap.[/QUOTE]

A) If you manage to implement, e.g., post-processing running in parallel with shadow mapping, you can at the very least rely on future hardware not performing WORSE. It will have smarter schedulers, bigger caches, smarter compilers. Reaching optimal performance, on the other hand, always requires fine tuning, and it always will.
B) If you can afford to develop a modern engine and a half-decent SDK using Vulkan, you can afford to buy a dozen GPUs. Real-time graphics is increasingly expensive both technologically and artistically, so it is safe to assume that the API is definitely not the biggest handicap.

xaxazak · May 4, 2017, 1:49pm

Would they use the non-independent queue any differently? If not, there’s no need to ask. Is this worth considering? I don’t think I would use it, but I guess some people might.

It feels like what we have now is an API that lets you select between X, Y, Z, but doesn’t give you enough info to do so intelligently. So people will end up just hard-coding the one that worked best when they tested it. If things change the applications won’t adapt automatically like they do in OpenGL.
IMHO this has the effect of discouraging hardware innovation, as any divergence from the status quo will reduce performance.

Want to make a new type of VRAM with slow writes but concurrent reads? Want to use multiple GPUs that share memory but have better access to certain bits? - Vulkan apps will not take advantage of these. In OpenGL the driver can use them properly.

To me it seems the fastest and most future-proof solution would be somewhere between OpenGL and Vulkan.
Go ultra-low-level only when it measurably improves performance.
Allow drivers to make (or help make) decisions when they have info that can help improve performance (e.g. memory type selection based on application-provided usage info).

Aside: On the flip side, I do wonder if Vulkan needs to track images and buffers as it does, rather than just issuing commands that require the data to be correct. Just say “copy an image of format/layout/dimensions Q from memory A offset X to memory B offset Y”. Let the application track what objects are where - it usually does anyway. And require that all the GPU synchronization stuff is explicitly specified so it doesn’t need to track image/buffer use.

[QUOTE=Alfonse Reinheart;42244]No, it has way more of them. You just don’t see them at the API level.[/QUOTE]I’m not sure if we’re talking about the same thing, but you may be right. I haven’t yet watched the video but I’ll take a look soon.

[QUOTE=Alfonse Reinheart;42244]There is nothing left to chance or question about the behavior of a piece of memory.[/QUOTE]Relative speed isn’t provided. That could be important, although it’s hard to quantify. Even relative-speed-per-task might vary. Read vs write speed. Or some memory might be better as random-access read-heavy texture memory. Other memory might be better for sequential read-and-write. Those seem like ideas that hardware of the future could exploit if it is allowed to.

[QUOTE=Alfonse Reinheart;42244]Really, I think you’re overthinking this stuff[/QUOTE]Quite probably. It just feels like bad coding practice to just assume stuff. But you’ve helped me (lots) to decide how to do it - so thanks heaps for your input.

[QUOTE=Salabar;42245]B) If you can afford to develop a modern engine and a half decent SDK using Vulkan, you can afford to buy a dozen of GPUs.[/QUOTE]I meant new entrants to the GPU market, not the application market. Sorry, I should’ve specified that.

Alfonse_Reinheart · May 4, 2017, 2:44pm

As a matter of reasonable API use, they would. You don’t want to have any more vkQueueSubmit calls than you strictly need, so if you use the same queue for transfers and graphics, you should submit them in the same call. You can build the command buffers asynchronously, but you would be submitting them on the same queue, in the same thread.

Whereas if the transfer queue was independent, then the same thread that built the command buffer could submit it.
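Here is that batching, sketched with queues and command buffers as plain strings rather than real handles. Each list in the result would become the `pCommandBuffers` array of one `VkSubmitInfo` passed to a single `vkQueueSubmit` call. The spec requires host access to a `VkQueue` to be externally synchronized, so submitting from one thread per queue also avoids a lock.

```python
from collections import defaultdict

def batch_by_queue(recorded):
    """recorded: (queue, command_buffer) pairs in the order recording finished.
    Returns {queue: [command buffers]} so each queue needs one submit per
    frame; order within each queue is preserved."""
    batches = defaultdict(list)
    for queue, command_buffer in recorded:
        batches[queue].append(command_buffer)
    return dict(batches)
```

When transfers share the graphics queue, their command buffers simply land in the graphics batch.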

[QUOTE=xaxazak;42246]It feels like what we have now is an API that lets you select between X, Y, Z, but doesn’t give you enough info to do so intelligently. So people will end up just hard-coding the one that worked best when they tested it. If things change the applications won’t adapt automatically like they do in OpenGL.
IMHO this has the effect of discouraging hardware innovation, as any divergence from the status quo will reduce performance.[/quote]

You speak of the OpenGL way as though it were some perfect solution. It’s not. Why? Because in that world, driver quality is, at best, pure luck. If your OpenGL application works across all implementations, then you managed to dodge the thousands of driver bugs that you otherwise would have encountered.

The prime reason to keep this stuff in the application instead of the driver is to make drivers simpler and therefore less bug-prone. And it’s working, to varying degrees. AMD’s Windows Vulkan driver is much better than their OpenGL driver.

A secondary reason is your assumption that OpenGL drivers get these things right. If that were the case, then there wouldn’t need to be application-specific drivers, would there? Drivers that know exactly how an application works and are tailored for that application’s particular rendering methodology.

By removing this code from the driver, we put the application-specific code in the application, not the driver.

First, Vulkan apps will take advantage of them when they’re coded to do so (through the use of extensions/later Vulkan versions that expose the functionality). If you’re using an engine, once the engine maker updates the code, you’ll be fine. Second, you assume that the OpenGL driver will be able to use them properly; if that were the case, we wouldn’t need Vulkan.

And third, you assume that the OpenGL driver won’t have the same “quality” that we all “enjoy” today.

[QUOTE=xaxazak;42246]To me it seems the fastest and most future-proof solution would be somewhere between OpenGL and Vulkan.
Go ultra-low-level only when it measurably improves performance.
Allow drivers to make (or help make) decisions when they have info that can help improve performance (e.g. memory type selection based on application-provided usage info).[/quote]

We’ve tried “application-provided usage info” before. We called them buffer object usage hints. They failed. Miserably. Even more modern approaches, like D3D’s usage parameters and glBufferStorage’s flags, can’t capture it: such parameters lack the sophistication to let users actually explain their application’s behavior.

Say you want to stream to a buffer, render with it, then invalidate it and write some more. You cannot describe that pattern of usage. All you can say is, “I need to be able to write to it in this way”. That doesn’t help the driver select the proper memory type for that buffer, nor does it inform the driver that it should allocate twice the space so that invalidation can work. Or maybe you need it triple-buffered, so the driver should allocate 3x the space.

Usage hints do not work.
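In Vulkan, the multi-buffering that usage hints cannot express is done by the application itself. Here is a minimal sketch. It assumes one allocation sized `frames_in_flight * region_size` and a hypothetical `StreamingRing` helper that only computes offsets. Waiting on the fence of the frame that last used a region, before overwriting it, is left to the caller.

```python
class StreamingRing:
    """Hypothetical helper: one buffer allocation split into per-frame regions."""

    def __init__(self, region_size, frames_in_flight=3):
        self.region_size = region_size
        self.frames_in_flight = frames_in_flight
        self.total_size = region_size * frames_in_flight  # size to allocate
        self.frame = 0

    def current_offset(self):
        # Byte offset of the region the CPU writes this frame. Only safe once
        # the GPU work that last read this region has completed.
        return (self.frame % self.frames_in_flight) * self.region_size

    def advance(self):
        self.frame += 1
```

Double or triple buffering is then just `frames_in_flight=2` or `3`, chosen by the application instead of guessed by the driver.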

That’s pretty much what Vulkan does presently. A VkImage is just a location in a memory allocation, plus some format information. The purpose of it being an object rather than just a memory allocation + offset is that doing so makes descriptor setting faster. Converting a VK_FORMAT enumerator into the actual bits needed for a texture descriptor takes time; by sticking that in a VkImage object, you can just copy those bytes into the descriptor.

Also, have you looked at what VkImageCreateInfo contains? That’s a lot of stuff; you wouldn’t want to have to pass a bunch of those to vkUpdateDescriptorSets calls.

Um, Vulkan already requires that “all the GPU synchronization stuff is explicitly specified”. VkImage is not tracking anything about the image; that’s why you have to explicitly tell those APIs that take images what layout the image is in. Well, tracking can be done by layers, but those are debugging tools. They’re not meant for release builds.

And the ability to turn on/off such tracking is another reason why those objects need to exist. Vulkan’s hard enough to code in even with debugging layers; taking away the ability to even detect such errors would be insane.

To what end?

The performance of a piece of memory, for the GPU, is based entirely on which heap it comes from. Right now, GPUs have at most 3 heaps: a device-local heap, main CPU memory, and that oddball heap that AMD offers that’s both device local and CPU accessible.

Why do I need to know the exact read speed of any of these heaps? I know which one is the fastest for the GPU to read from (any heap marked device-local). I know that using non-device-local heaps means slower GPU access. What do I gain by knowing exactly how fast these accesses will be?

You put stuff in the fastest heap for the GPU, pursuant to your needs to modify that data from the CPU and of course whether it will fit.
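That rule can be sketched as an ordered list of property requirements plus a fit check. The inputs are mock data:
- `memory_types` holds `(property_flags, heap_index)` pairs, like `VkMemoryType`.
- `heap_free` is hypothetical per-heap bookkeeping of bytes the application believes are still free. Core Vulkan 1.0 reports heap sizes, not current free space.

The preference order is an assumption for illustration. Many applications instead upload CPU-written data through a staging buffer into plain device-local memory.

```python
# Same bit values as VkMemoryPropertyFlagBits.
DEVICE_LOCAL, HOST_VISIBLE, HOST_COHERENT = 0x1, 0x2, 0x4

def choose_memory_type(memory_type_bits, memory_types, heap_free, size, cpu_writes):
    """Prefer the fastest memory for the GPU (device-local), requiring host
    visibility only when the CPU writes the data directly, and fall back to
    host memory when no device-local heap has room."""
    if cpu_writes:
        preferences = [DEVICE_LOCAL | HOST_VISIBLE | HOST_COHERENT, HOST_VISIBLE | HOST_COHERENT]
    else:
        preferences = [DEVICE_LOCAL, HOST_VISIBLE | HOST_COHERENT]
    for required in preferences:
        for index, (flags, heap) in enumerate(memory_types):
            allowed = memory_type_bits & (1 << index)
            if allowed and (flags & required) == required and heap_free[heap] >= size:
                return index
    return None
```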

The only question that leaves is whether cached access is worth bothering with. And it’s very hard to describe that sort of thing, since it has to do with access patterns and the like. It isn’t just a number.

How would future hardware exploit them? The only way that could happen is at the level of the actual memory chips themselves. Which means these represent different pools of memory, which limit what you can do with them. Memory that only works for vertex arrays or textures or whatever is memory that can’t be used for other things.

I don’t see IHVs making hardware that pigeonholed. GPUs are getting more general purpose, not more special-cased.

Salabar · May 4, 2017, 3:02pm

[QUOTE]Relative speed isn’t provided. That could be important, although it’s hard to quantify. Even relative-speed-per-task might vary. Read vs write speed. Or some memory might be better as random-access read-heavy texture memory. Other memory might be better for sequential read-and-write. Those seem like ideas that hardware of the future could exploit if it is allowed to.[/QUOTE]

Trying to predict the future is pointless. In 15 years our normal GPU might turn into some sort of FPGA and Vulkan will become entirely obsolete, so we’ll have to invent an API from scratch. And once again, future hardware will not run current applications slower under any circumstances, so developers should focus on what can be done now.

[QUOTE]I meant new entrants to the GPU market[/QUOTE]

They have the luxury of looking at actual applications using Vulkan and tailoring their chips accordingly. Until the patent system eats them alive, that is.

xaxazak · May 25, 2017, 5:27pm

Sorry about the ultra-late reply. I was kinda distracted by events recently and I’m just getting back to coding.

[QUOTE=Alfonse Reinheart;42247]As a matter of reasonable API use, they would. You don’t want to have any more vkQueueSubmit calls than you strictly need, so if you use the same queue for transfers and graphics, you should submit them in the same call. You can build the command buffers asynchronously, but you would be submitting them on the same queue, in the same thread.

Whereas if the transfer queue was independent, then the same thread that built the command buffer could submit it.[/QUOTE]Sorry, I thought you meant independent as in different hardware, not just different queue. Yes, if the queue is the same I understand that.

I didn’t mean to come across that way, sorry. I moved from OpenGL because it was messy and evolved and slower. It desperately needed an overhaul or a replacement. And yes, the drivers are far from perfect. I’m just talking about areas where to me it looks like we unnecessarily stepped backward.
I’m mostly thinking about potential. A new clean API could avoid many of those issues.

Whereas the OpenGL apps could take advantage immediately (assuming the OpenGL drivers come with that functionality). Games these days aren’t well maintained for very long - often around a year or two until they’re mostly abandoned unless they’re big studio ones or have dedicated devs.
That directly relates to the incentive for innovation.

Buffer. Sequential access. Frequent use (perhaps a scalar usage variable). Partial write. Read-once-per-write. Very different from Texture, Frequent use, Write-once.
And as for the space for invalidation - well, you’re requesting a specific total size so that’s already taken care of.
I didn’t know they failed miserably before. Is there any reason why they cannot be made to work well?

Yes. But that’s right now.
If we only want to consider 3 heaps, why bother with having an API that can handle more?
And if the API can handle more, isn’t it better if it can handle more well?

When you add an explicit barrier the barrier doesn’t specify what command it’s waiting on. It does specify an image, so there must be some way to tell if certain operations on an image are finished.
So doesn’t that mean there must be some sort of concurrency object (mutex?) linked (directly or indirectly) to the image.
But even if that’s right, I guess I shouldn’t be saying tracking if I’m just talking about a mutex. But (again, if that’s right) that image-mutex link is what feels unnecessary to me. Instead just use a raw mutex or other synch object, it’s more flexible.

They’re optimized for purpose so they’re likely to be far faster for their preferred role. They will probably be able to work for other roles, too. We already have major differences, structurally and performance-wise, between CPU RAM and GPU RAM - yet they’re both capable of acting like the other even if they’re slower at it.

Considering the vast majority of GPU memory in taxing games is probably read-only textures you’d probably gain a lot and lose very little by making half your VRAM texture-optimized. Of course you’d need a GPU and bus that could handle the increased data flow - but memory speed is the limiting factor IIRC.
Tessellation was added somewhat recently. That’s special-cased. It all depends on how much of an advantage you can get from special cases.

But that’s kinda the problem - it’s not about predicting the future, but shaping it. They won’t innovate unless there’s an incentive, and with Vulkan that incentive is reduced because new ideas (especially ones that could work in OpenGL but not in Vulkan) won’t get the boost they otherwise would.
It’s like making roads that are optimized for petrol (don’t ask me for an explanation of how). You’ll discourage inventing electric vehicles.

I don’t think Vulkan is broken or beyond repair. I definitely think OpenGL is.
Some advanced hinting extensions/additions could reduce or fix most of my issues.

Ok, enough ranting. Thanks for all the comments and help you’ve given.

Alfonse_Reinheart · May 25, 2017, 7:09pm

[QUOTE=xaxazak;42362]Whereas the OpenGL apps could take advantage immediately (assuming the OpenGL drivers come with that functionality). Games these days aren’t well maintained for very long - often around a year or two until they’re mostly abandoned unless they’re big studio ones or have dedicated devs.
That directly relates to the incentive for innovation.[/quote]

People generally do not buy new hardware to play old games. So long as the new hardware does not run the old games worse than the old hardware, they’ll be more or less satisfied with it. Gamers want to know how well it works with new games. And new games mean new development, which means they’ll be using an appropriately new version of Vulkan that has your new, innovative features.

In 20 years of PC GPU development, I have seen exactly one instance of hardware innovation that could cause a genuine regression in the performance of old applications: tile-based renderers. Their decidedly unorthodox method of rendering means that some things which would be fast on an immediate renderer will die on a TBR.

But the thing is, this problem existed despite the higher level OpenGL/Direct3D approach. It existed because those high level abstractions were still the wrong abstractions for this new approach to rendering. The Vulkan render pass model matches TBRs much more than OpenGL/D3D’s framebuffer model.

Which proves that if there is such a radical innovation, you’d need to update your API anyway. Every abstraction has underlying assumptions about what it models. It will model some things more effectively than others. And if something innovative comes along that defies your model, then you’ll need a new one.

[QUOTE=xaxazak;42362]Buffer. Sequential access. Frequent use (perhaps a scalar usage variable). Partial write. Read-once-per-write. Very different from Texture, Frequent use, Write-once.
And as for the space for invalidation - well, you’re requesting a specific total size so that’s already taken care of.
I didn’t know they failed miserably before. Is there any reason why they cannot be made to work well?[/quote]

Your example shows precisely the problem. What does “partial write” mean; how much writing does it take before it becomes “partial”? How “sequential” must access be to qualify for “sequential access”? How “frequent” does the use need to be to qualify as “frequent”?

Every user will have different ideas about what these things mean. For one user, “sequential access” means each byte one after the other. Which means that they won’t use this flag for indexed vertex rendering, since it won’t read things sequentially. Was that the right call? Who knows. Is a memory copy operation “sequential access”? Who knows?

This means that users will invariably get things wrong. And because it doesn’t cause the API to give an error (since there is no “wrong” or “right”, as far as correct behavior is concerned), applications will ship that use the wrong bits. Indeed, different IHVs can and will interpret these differently.

This will lead to a proliferation of application-specific optimizations written and maintained by IHVs rather than the actual developers of those applications. Alternatively, this will require IHVs to create complex heuristics to figure out what the user really meant. Either way, you get large, bulky, and buggy drivers.

Buffer objects in OpenGL had similar usage hints. And the same thing happened there. What’s the difference between DYNAMIC_DRAW and STATIC_DRAW? Or between DYNAMIC_DRAW and STREAM_DRAW? The specification doesn’t make it clear. So users would just use whatever they thought best.

This led to AMD (at least) simply ignoring the usage hint; they’d move the memory around based on how you actually used it.

OpenGL abandoned this nonsense in favor of usage [i]parameters[/i]: the user specifies exactly what operations they will use, and it is a hard error to use a buffer created with glBufferStorage in a way counter to these parameters.

FYI: “invalidation” refers to the ability to designate that a buffer/image’s storage no longer matters, and therefore, the implementation is free to orphan it if it is currently in use by any rendering operations. That also means the implementation will allocate memory on the spot for you to use in your next operation, if the current memory is still in use. Basically, it’s multi-buffering behind the scenes, but you have zero control over it.
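Here is a toy model of the orphaning behavior described above; it does not model any particular driver. Storage blocks are plain integers, and `in_use` stands in for whatever the driver uses to know the GPU may still be reading a block. The model also shows the lack of control: the caller never decides how many blocks exist. A real driver would recycle orphaned blocks once they are free; this model never does.

```python
class OrphaningBuffer:
    """Toy model of invalidation: reuse the current storage if the GPU is done
    with it, otherwise orphan it and hand out a fresh block."""

    def __init__(self):
        self.blocks_allocated = 0
        self.in_use = set()
        self.current = self._allocate()

    def _allocate(self):
        block = self.blocks_allocated
        self.blocks_allocated += 1
        return block

    def draw(self):
        # A draw that reads the buffer keeps its storage busy until the GPU finishes.
        self.in_use.add(self.current)

    def gpu_finished(self, block):
        self.in_use.discard(block)

    def invalidate(self):
        if self.current in self.in_use:
            self.current = self._allocate()  # old block is orphaned, not waited on
        return self.current
```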

[QUOTE=xaxazak;42362]Yes. But that’s right now.
If we only want to consider 3 heaps, why bother with having an API that can handle more.[/quote]

There’s the zero/one/infinity rule. Though really, Vulkan devices can have at most 32 memory types, since vkGetBufferMemoryRequirements/vkGetImageMemoryRequirements report the memory types a resource can use as a 32-bit bitfield (memoryTypeBits).

So that was not the best decision ever…

“Well” is defined by expectations. My expectation as a Vulkan user is that I will have to put in a lot of work to make my application fast. That my application will have to adjust itself to the needs of the hardware. Whether that’s in memory heaps, heap sizes, which kinds of images can be allocated from which heaps, queue selection, etc.

So the API handles more heaps well enough for my needs. Well, except for the 32 memory type limitation…

[QUOTE=xaxazak;42362]When you add an explicit barrier the barrier doesn’t specify what command it’s waiting on. It does specify an image, so there must be some way to tell if certain operations on an image are finished.
So doesn’t that mean there must be some sort of concurrency object (mutex?) linked (directly or indirectly) to the image.
But even if that’s right, I guess I shouldn’t be saying tracking if I’m just talking about a mutex. But (again, if that’s right) that image-mutex link is what feels unnecessary to me. Instead just use a raw mutex or other synch object, it’s more flexible.
[/quote]

That’s not the way barriers work. Barriers are commands; they are recorded into the command buffer. What a barrier waits on is therefore everything that came before it in the command buffer/queue being executed, limited to the source stages it names.

Barriers do not wait on images to be finished. Indeed, it is very possible for you to specify the vertex shader as an execution barrier’s source stage, but the fragment shader stage is the one that actually updates the image. The result will be undefined behavior, not an error.

Images matter for memory barriers. But as in the above example, that doesn’t work if you fail to get the execution dependency correct (ie: the source pipeline stage must be the one accessing the image). So there is nothing in Vulkan to tell when operations on images are finished; you tell Vulkan when operations on images are finished.
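Here is a toy model of the point above. It uses hypothetical stage bits and command records and does not reflect how any driver represents work. The barrier never names a command or an image's "owner". It only covers earlier commands whose stage falls within its source mask. If you name the vertex stage when the fragment stage did the write, that write is left uncovered. The API itself does not flag this, and the result is undefined behavior.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stage bits (not Vulkan's actual values). */
enum { STAGE_VERTEX = 1u << 0, STAGE_FRAGMENT = 1u << 1, STAGE_TRANSFER = 1u << 2 };

typedef struct {
    uint32_t stage;     /* stage in which this command touches the image */
    bool writes_image;  /* does it write the image? */
} Command;

/* A barrier recorded at position `barrier_pos` only orders earlier work
   whose stage is in src_stage_mask. Returns true if every earlier write
   to the image is covered, i.e. the dependency is sound. */
bool barrier_covers_writes(const Command *cmds, size_t barrier_pos,
                           uint32_t src_stage_mask) {
    for (size_t i = 0; i < barrier_pos; ++i) {
        if (cmds[i].writes_image && !(cmds[i].stage & src_stage_mask))
            return false; /* a write the barrier does not wait for */
    }
    return true;
}
```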

This is after all why you must manually control image layout (a concept that doesn’t exist in OpenGL). And note that when you change image layouts, you have to tell Vulkan what both the destination and the source layout are. Because Vulkan genuinely does not know what the layout will be by the time that command gets executed.

This is the essence of an explicit API.
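Because Vulkan doesn't track layouts for you, applications typically keep their own record of each image's expected layout and derive the old/new pair for each transition from it. Below is a minimal sketch in which a hypothetical `Layout` enum stands in for `VkImageLayout`. A real tracker would also have to follow submission order across command buffers.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical layout names; real code would use VkImageLayout. */
typedef enum {
    LAYOUT_UNDEFINED,
    LAYOUT_COLOR_ATTACHMENT,
    LAYOUT_SHADER_READ_ONLY,
    LAYOUT_TRANSFER_DST
} Layout;

typedef struct {
    Layout current; /* layout the app expects at this point in submission order */
} TrackedImage;

typedef struct {
    Layout old_layout;
    Layout new_layout;
} Transition;

/* Fills in the old/new pair an app would put in its image barrier and
   updates its own record. Returns false when no transition is needed. */
bool plan_transition(TrackedImage *img, Layout wanted, Transition *out) {
    assert(img && out);
    if (img->current == wanted)
        return false;
    out->old_layout = img->current;
    out->new_layout = wanted;
    img->current = wanted;
    return true;
}
```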

I dispute your assumption here. With deferred rendering and various other techniques, it seems clear that “read-only textures” are not nearly as significant a bottleneck as you suggest. The more light sources you have, the less “read-only textures” are your bottleneck, and the more reading back data you’ve written matters for performance. Further, people are now starting to employ limited ray tracing, which makes “read-only textures” even less important to performance.

This is precisely why specialized hardware is being avoided rather than pursued. You never know where the next great rendering advance is going to come from. If you start trying to guess, you often guess wrong.

GPUs are becoming like CPUs, with standardized hardware and interfaces. They may get faster at what they do, but they’re not going to be doing it in a significantly different way year-to-year.

I wouldn’t classify something from 8 years ago as “somewhat recently”. Yes, the Radeon HD 5xxx series and Direct3D 11 really are that old.

So that means it has been about 8 years since the last major instance of special case hardware. If hardware has been stagnant for that long, odds are good that we’ve hit a local minimum and won’t be making any major shifts in the near future. And thus, any major advance would have to be coupled with a major API change, even for something as high level as OpenGL.

You can’t shape the future without deciding what the future you want to shape will look like. And every time people try, it goes badly.

[QUOTE=xaxazak;42362]They won’t innovate unless there’s an incentive, and with Vulkan that incentive is reduced because new ideas (especially ones that could work in OpenGL but not in Vulkan) won’t get the boost they otherwise would.
It’s like making roads that are optimized for petrol (don’t ask me for an explanation of how). You’ll discourage inventing electric vehicles.[/quote]

As GPUs become more generic, innovation will come from the people it should: software developers. That is what Vulkan encourages.

One other thing should also be noted. If Vulkan stifles innovation so much… why did the IHVs choose it? You have to remember that Mantle was what got the ball rolling on these lower-level APIs. And Mantle came… from an IHV. Vulkan was designed primarily by a consortium of IHVs.

If they truly believed that a higher level approach left them some significant room for innovation, I’m pretty sure it would be in Vulkan.

xaxazak · May 26, 2017, 6:51pm

Not if it were just something like memory module specialization. OpenGL could do that without any API changes.

[QUOTE=Alfonse Reinheart;42363]What does “partial write” mean; How “sequential” must access be to qualify for “sequential access”? How “frequent” does the use need to be to quality as “frequent”?[/QUOTE]I didn’t explain those in detail, sorry. By “partial write” I meant partial vs. complete writes, where “complete” means either writing the whole buffer at once or doing one write before use; I’m not sure which. I wasn’t thinking too hard about it. For the others, scalar variables might be better. The spec could then define some sort of usage mapping, perhaps based on expected bytes per second (for “frequent”) and average sequential run length in bytes (for “sequential”). Of course it’s up to the app to supply good info. But usage patterns differ so widely that approximations should work fine. Just make sure you define it well; perhaps that was the issue previously?
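The scalar-hint idea above might look something like this. It is purely hypothetical: no such structure exists in Vulkan or OpenGL. The thresholds are arbitrary placeholders, and defining them precisely is exactly the part a spec would have to get right.

```c
#include <assert.h>

/* Hypothetical scalar usage hints; not part of any real API. */
typedef struct {
    double host_write_bytes_per_sec;  /* expected CPU->GPU update rate */
    double avg_sequential_run_bytes;  /* typical contiguous access length */
} UsageHint;

typedef enum {
    PLACE_DEVICE_LOCAL,         /* GPU-only memory */
    PLACE_HOST_VISIBLE_DEVICE,  /* device memory the CPU can map */
    PLACE_HOST_MEMORY           /* system memory */
} Placement;

/* Illustrative mapping only; the thresholds are arbitrary placeholders. */
Placement map_hint(UsageHint h) {
    assert(h.host_write_bytes_per_sec >= 0.0);
    if (h.host_write_bytes_per_sec < 1e3)
        return PLACE_DEVICE_LOCAL;         /* rarely updated: keep on GPU */
    if (h.avg_sequential_run_bytes >= 64.0)
        return PLACE_HOST_VISIBLE_DEVICE;  /* streamed, sequential writes */
    return PLACE_HOST_MEMORY;              /* frequent, scattered writes */
}
```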

[QUOTE=Alfonse Reinheart;42363]I dispute your assumption here. With deferred rendering and various other techniques, it seems clear that “read-only textures” are not nearly as significant of a bottleneck as you suggest.[/QUOTE]I meant they’re the bulk of RAM use, not bandwidth. Unless you exhaust the non-RO-texture RAM and are forced to put non-RO-texture data in RO-texture RAM you lose nothing by creating dedicated RO-texture RAM. But maybe you won’t gain too much at present in many circumstances - although you should gain a little. However, it was just an example.

I’m not talking about the API shaping the future, but talking about giving hardware vendors more freedom to shape the future themselves in the way they want to - i.e. the opposite of deciding what the future looks like. Vulkan is currently the one trying to shape the future, IMHO a bit too restrictively.

Why not hardware vendors also? The main aim of GPUs is to give us better graphics. IHVs have innovated over the last decades to give us a ton of improvements on that front - not just in scaling and clockspeed. Unless you want a pure compute unit with no TMUs, ROPs, HW-optimized pipeline stages, etc.

IHVs are businesses, and people. Businesses usually make decisions to optimize profit, not necessarily quality (competition gives you that, and standards reduce competition), which is perhaps why we have the TBR compromises rather than a TBR extension, although maybe that’s a bit cynical. People are imperfect (e.g. the 32-memory-type limit you mentioned). One possibility is that they found OpenGL spent too much time on high-level processes and decided on a low-level API. Being low-level then became (perhaps subconsciously, perhaps just for consistency) part of the focus, rather than just building the quickest and most flexible API they could.

If there were more IHVs, there’d be a smaller set of common architecture, and Vulkan would need to be higher level to accommodate them all.

Alfonse_Reinheart · May 27, 2017, 7:52am

My point is that they have been tried and they do not work. Even the “scalar” thing has been tried: NV_vertex_array_range had that for GPU memory allocations (wglAllocateMemoryNV took read frequency, write frequency, and priority as floats). How well did it work? Not well at all; NVIDIA basically had to tell developers, “these numbers correspond to these memory types.”

After trying something multiple times and watching it fail, at some point you have to recognize that it just isn’t working.

I don’t understand. Your original point was that this special read-only texture RAM would be special because it would be faster to access:

It’s clear you were speculating on the development of faster memory.

And what you lose by having dedicated RO-texture RAM is that you can’t use it for other things as effectively. You may be increasing performance of one use of a GPU, but you’re limiting the creativity of your users to find other ways to do what they want. The time and money you invested into that dedicated RO-texture RAM is time and money you didn’t invest into more generally useful things.

IHVs have not given us tons of improvements on graphics. Since the advent of programmable hardware, IHVs have given us more power; it’s the users who have given us “a ton of improvements” towards better graphics, using the tools IHVs have provided.

IHVs did not add special hardware for stencil shadows or shadow maps; they simply provided stencil buffers and depth comparison texture fetches. IHVs did not add specialized hardware to make deferred rendering possible; they just allow us to render to textures and later read from them. IHVs did not add specialized hardware to allow raytracing; they just gave us generalized reading and writing in shader instantiations. And so forth.

Image load/store, SSBOs, compute shaders, etc do not give us better graphics. They allow programmers to give us better graphics.

Just like CPUs don’t have a “show window” opcode; they allow programmers to write such things themselves. That is the evolution of hardware, whether CPU or GPU: towards greater programmability, flexibility, and generic functionality.

The only reason GPUs haven’t gotten rid of those particular things you cite is because they’re sufficiently performance-critical and high-performing that generic operations would be slower and thus impact performance. And it’s not like TMUs have not gotten simpler; some hardware does anisotropic filtering as primarily a shader-driven computation process. Also, as I understand it, TBRs don’t have ROPs of any kind.

OK, let’s follow your idea down to its inevitable conclusion.

Let’s rip out of Vulkan everything that was added to it to allow TBRs to work efficiently. Let’s redesign Vulkan so that it does not model TBRs at all. Let us call this API something odd and nonsensical:

Direct3D 12.

The sole purpose of the render pass model is to make TBRs render efficiently. Direct3D 12 has no render passes or anything really like them. D3D12’s pipelines are not constructed based on a specific subpass of a render pass. D3D12’s secondary command buffers are not given a specific subpass of a render pass to build commands for. In fact, D3D12 doesn’t have secondary command buffers at all. There’s no reason for them to exist; if you want asynchronous command building, just send more direct command buffers in your submit call. D3D12 has no input attachments either.

Given that, what would a TBR extension to Direct3D 12 have to look like?

Well, it would have to add all of that stuff. Render passes would have to be added. The extension would have to annotate every command as to whether it could execute within a render pass instance or not. There would have to be a structure for building pipelines against a specific subpass of a render pass.

A TBR needs to be able to modify how it builds commands based on the nature of the subpass. As such, we now need a way to asynchronously build command buffers that know they’re destined for a particular subpass. That means a new kind of command buffer would have to be added. Plus, we need annotations as to which commands can appear in these new CBs.

A new shader construct, input attachments, would have to appear, along with the functions needed to interact with it.

Oh and one more thing. Normally, extensions are things you ask for. But a TBR cannot efficiently implement D3D 12 as written. So on a TBR implementation, you would essentially be forced to use the TBR-based D3D 12. So you would be unable to render the normal way.

This would be a mandatory extension based on the specific details of the implementation.

Which means if you want to write code for both kinds of platforms, you have to do a lot of work. You would have to write a significantly modified backend for TBR and non-TBR platforms. Indeed, you’d have to use different shaders. A non-TBR deferred renderer would attach the written textures as regular textures; a TBR one would make them input attachments.

You can argue about whether some ideas in Vulkan are good or bad, or whether improvements could be made towards some purpose. But its render pass system is neither a mistake nor a compromise. It’s a good thing for everyone; having a single API that works efficiently across TBR and non-TBR platforms is good.

And it’s not like the render pass model is useless for non-TBR hardware. Read/modify/write operations were possible on older APIs, but these were esoteric things that required lots of knowledge and care. In Vulkan, input attachments take the arcane nature of them and make them much simpler and easier to use.

In their defense, the 32 memory type limitation isn’t really that bad. Let’s look at the numbers.

Vulkan strictly defines the possible propertyFlags bitfield combinations that are available. There are exactly 9 valid bit combinations. Which means that, in order to even possibly pass the 32 type limitation, you would need to have more than 3 heaps.

And even that is unlikely, because there’s really no reason for a single heap to offer all 9 property combinations. In realistic scenarios, a heap would probably offer at most 4 memory types, spanning the full range of host visibility options.

Which means that in reasonable implementations, you could get up to 8 memory heaps before you would start to run into limitation issues.
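Here is the arithmetic behind those numbers. It assumes the 9 propertyFlags combinations allowed by the Vulkan 1.0 spec, with at most one memory type per combination per heap:

```latex
\begin{aligned}
9 \times 3 &= 27 \le 32 && \text{(3 heaps, all 9 combinations: fits)}\\
9 \times 4 &= 36 > 32 && \text{(4 heaps, all 9 combinations: exceeds the limit)}\\
4 \times 8 &= 32 \le 32 && \text{(8 heaps, 4 realistic types each: fits exactly)}
\end{aligned}
```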

So the choice is to use a limit that’s high enough that it’s unlikely to be hit, or force the user to allocate memory for the return value. And since getting these bits is something applications have to do quite a lot (you always have to check image/buffer requirements, so that you can plan your memory allocation strategy), they probably decided that using a high enough limit was adequate.

Or maybe the collective leaders of this industry have some idea what they’re doing.
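The trade-off described above, a fixed limit versus making the user allocate memory for the result, can be sketched side by side. Option B follows the count-then-fill convention of Vulkan's `vkEnumerate*` functions. The function names and return codes here are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Option A: fixed-width result. One call, no allocation, capped at 32 types. */
uint32_t compatible_types_mask(const uint32_t *compatible, uint32_t n) {
    assert(n <= 32);
    uint32_t mask = 0;
    for (uint32_t i = 0; i < n; ++i)
        if (compatible[i]) mask |= 1u << i;
    return mask;
}

/* Option B: unbounded result via the two-call idiom. Call with out == NULL
   to get the count, allocate, then call again to fill. Returns 1 if the
   caller's array was too small (analogous to VK_INCOMPLETE), else 0. */
int compatible_types_list(const uint32_t *compatible, uint32_t n,
                          uint32_t *count, uint32_t *out) {
    uint32_t written = 0, total = 0;
    for (uint32_t i = 0; i < n; ++i) {
        if (!compatible[i]) continue;
        if (out && written < *count) out[written++] = i;
        ++total;
    }
    if (!out) { *count = total; return 0; }
    *count = written;
    return written < total;
}
```

With Option B, the caller pays for two calls and an allocation every time it queries requirements. A 32-bit mask avoids that overhead.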

xaxazak · May 27, 2017, 9:50am

[QUOTE=Alfonse Reinheart;42368]My point is that they have been tried and they do not work. Even the “scalar” thing has been tried. NV_vertex_array_range had that for GPU memory allocations. How well did it work? Not well at all; NVIDIA basically had to tell developers, “these numbers correspond to these memory types.”

After trying something multiple times and watching it fail, at some point you have to recognize that it just isn’t working.[/QUOTE]So it’s basically just programmers misusing it, then? Is that really an unsolvable problem? Could we not, say, have a layer report back “you’re misstating your image/buffer memory usage”?
Plus, history’s full of things that didn’t work for ages until they did.

[QUOTE=Alfonse Reinheart;42368]I don’t understand. Your original point was that this special read-only texture RAM would be special because it would be faster to access[/QUOTE]Yeah, I think I misunderstood your point about the bottleneck. I did mean they were faster to access.

[QUOTE=Alfonse Reinheart;42368]And what you lose by having dedicated RO-texture RAM is that you can’t use it for other things as effectively. You may be increasing performance of one use of a GPU, but you’re limiting the creativity of your users to find other ways to do what they want.[/QUOTE]True. These are GPUs though, which already feature a ton of texture-specific hardware.

[QUOTE=Alfonse Reinheart;42368]The time and money you invested into that dedicated RO-texture RAM is time and money you didn’t invest into more generally useful things.[/QUOTE]That’s a fair point. The RO-texture RAM was only an example idea of mine though. I have no idea what IHVs would want to add.

[QUOTE=Alfonse Reinheart;42368]they simply provided stencil buffers and depth comparison texture fetches.[/QUOTE]I don’t see why you don’t classify them as graphics improvements. They’re new features that improve graphics. Ok, they need programmers to enable them - is that what you’re getting at here?
But I’m sure there are also under-the-hood improvements to many areas - eg anisotropic filtering algorithms - that don’t need any programmer input.

[QUOTE=Alfonse Reinheart;42368]The only reason GPUs haven’t gotten rid of those particular things you cite is because they’re sufficiently performance-critical and high-performing that generic operations would be slower and thus impact performance. And it’s not like TMUs have not gotten simpler; some hardware does anisotropic filtering as primarily a shader-driven computation process. Also, as I understand it, TBRs don’t have ROPs of any kind.[/QUOTE]It’s not all going the same direction. E.g. GPUs are also adding fixed decoders for new texture compression algorithms.

[QUOTE=Alfonse Reinheart;42368]OK, let’s follow your idea down to its inevitable conclusion.
…
Which means if you want to write code for both kinds of platforms, you have to do a lot of work.
…
[/QUOTE]

I don’t think it’d be as hard as you suggest. There’s no need for different shaders. Most of it is just moving what’s being removed into the extension. Most of the other fields you’re supplying now could simply be supplied to a new structure via the pNext parameter.
Yes, it would be an almost-mandatory extension for TBRs, but it would mean non-TBR apps could ignore it all. Plenty of software is made that will never see a TBR, and yet its developers have to do a lot of work that won’t be used.

[QUOTE=Alfonse Reinheart;42368]You can argue about whether some ideas in Vulkan are good or bad, whether improvements could be made towards some purpose. But its render pass system is neither a mistake or a compromise. It’s a good thing for everyone; having a single API that works across TBR and non-TBR platforms efficiently is good.[/QUOTE]Significant extra developer workload is a downside. A significantly steeper learning curve is a downside. Forcing concepts that are irrelevant to non-TBRs onto non-TBR apps is a downside.
Many developers will now look at D3D12 and Vulkan and choose D3D because it’s less work, which then harms non-D3D platforms. Vulkan has a reputation for being tricky and, going by comments, renderpasses are a big contributor.
I’d probably be using D3D12 now if it were cross-platform, and I bet a lot of other devs would be too - the ones that don’t care about TBRs.

[QUOTE=Alfonse Reinheart;42368]And it’s not like the render pass model is useless for non-TBR hardware. Read/modify/write operations were possible on older APIs, but these were esoteric things that required lots of knowledge and care. In Vulkan, input attachments take the arcane nature of them and make them much simpler and easier to use.[/QUOTE]I find them less arcane than renderpasses, simply because they map better to my understanding of the underlying mechanisms.

Is D3D12/13 going to add renderpasses? I doubt it.

[QUOTE=Alfonse Reinheart;42368]In their defense, the 32 memory type limitation isn’t really that bad. Let’s look at the numbers …[/QUOTE]Well, it was your example. But I did agree, mostly because hard limits are so out of vogue, and because of the lesson of “640K ought to be enough”.

Possibly, but there are tons of industry groups out there that get big things wrong, the RIAA for instance.
I’m sure there’ll be a lot of changes to Vulkan in the future. That means things aren’t perfect now. OpenGL made a ton of early changes.
Perhaps I didn’t need to say it though. The only reason I said that is because it’s quoted all over the place that they aimed to make an ultra-low-level API that closely mapped to current hardware.

Alfonse_Reinheart · May 27, 2017, 12:48pm

How do you define “misstating”? The whole problem with hints is that the API itself doesn’t know it’s being misused, because hints can’t explicitly define what “misused” would be. The user only finds out because they’re getting inadequate performance. So there can’t be a layer that reports something back.

And users have no real way to fix these problems besides just trying all the hints and seeing which works fastest. Which they would have to do for every implementation. And every driver release would potentially change what works fastest.

If you define “misused” explicitly enough that there is no ambiguity, then what you find is that you have to plan for every usage pattern. Either that, or you’ll have usage pattern “holes”. Like maybe someone updates a buffer frequently, but not every “frame”; it’s every 1-3 frames, depending on what’s going on. Do you have a pattern to fit that?

Heuristics do not work. And usage patterns are heuristics. The only people who know enough to be able to correctly place allocations are the IHVs and the programmers behind the application. So Vulkan doesn’t try to get between the programmer and the implementation’s memory pools.

Um, no it isn’t. History has some things that didn’t work for ages until someone found a way to actually make them work. But none of your suggestions are any different from things that have already been tried. So where is the unexplored space in this domain that would lead to a solution?

At some point, you have to cut your losses and stop trying to get the non-working solution to work.

No. You don’t just turn on the stencil test and magically get stencil shadows. You get stencil shadows by rendering things in a very specific way. You have to implement the algorithm, which involves using the stencil test, but also involves many other things (like forward rendering). Just as the ability of a CPU to perform arithmetic does not make windows appear, a programmer has to actually use that ability in a specific way.

That “specific way” is what makes the graphical improvement actually happen. We had stencil testing in hardware for years before people discovered the stencil shadow algorithm.

Yes, you would need different shaders; that’s not even debatable. Input attachments would not be a thing that exists without the render pass concept. And input attachments are a different kind of thing from a regular image fetch, which is why they have a different shader “dimension” in SPIR-V from any other image type. So there’s no way to avoid having to make shader changes when writing cross-platform applications.

Let’s turn this around. Let’s say that you know for a fact that all of the customers you’re interested in use NVIDIA hardware. That means you can take full advantage of NVIDIA’s OpenGL extensions. So you write your programs to use NV_vertex_buffer_unified_memory, NV_shader_buffer_load, and so forth. You don’t use Vertex Array Objects, Uniform Buffer Objects, SSBOs, and so forth. Maybe you even believe that these things “map better to my understanding of the underlying mechanisms” than OpenGL’s abstractions.

Is that sufficient justification for saying that OpenGL shouldn’t have VAOs, UBOs, and such? That OpenGL should have just been NVIDIA_GL, with AMD/Intel-based extensions for their hardware stuff?

Because that’s what you’re saying. You’re pointing at an entire class of hardware and saying, “no, your hardware isn’t important enough to be directly supported by our abstraction. Go make your own.”

If you have the luxury of ignoring those classes of hardware, that’s fine. Other people don’t. But why should your luxuries dictate what Vulkan is? Why shouldn’t an API exist that is cross-platform across TBRs and non-TBRs, where you can write code for both platforms that will work equally efficiently on both?

Can you provide any foundation for this claim? I am not aware of any particular developer backlash against Vulkan for anything that could not just as equally be attributed to D3D12. And do be aware that ISVs were also a part of Vulkan’s development, so they had plenty of input in this regard as well. If render passes really are “a big contributor” to problems in the API, why didn’t they solve them back then?

I’ve seen plenty of presentations on Vulkan from a large number of engine developers. None of them seem particularly critical of the render pass architecture.

Indeed, it seems that “many developers” does not include Stardock, who announced Vulkan support for Ashes of the Singularity despite it being a Windows-only game (at present; that could change, but there’s no immediate push to do so). Nor does it include CIG, who decided to ditch Direct3D entirely in favor of Vulkan. Indeed, they made the statement, “The APIs really aren’t that different though, 95% of the work for these APIs is to change the paradigm of the rendering pipeline, which is the same for both [DX12 and Vulkan].”

So where is this disapproval and flight from Vulkan of which you speak?

And who exactly would those people be who could ignore mobile platforms? It wouldn’t be the makers of Unity. Or Epic. Or most other game engines. The mobile market is not something most engine developers can just ignore these days.

Also, let’s not forget that the game development market is changing. The fact is, most smaller game developers don’t care about render passes or things like that. They’re using off-the-shelf engines to get their game out the door. Trivialities like Vulkan or D3D12 matter only in whether their game will run on the target platform.

Engine makers and high-end game developers are the people who care about things like Vulkan and D3D12. And I rather suspect they care that porting to mobile platforms, even if they’re not taking advantage of it right now, will not be a significant burden in the future.

And that’s granting your belief that render passes offer nothing to non-TBRs. Actual IHVs have said that even non-TBR hardware can take advantage of aspects of the render pass architecture.

As someone who has explained how texture barriers work, and has tried to explain how the render pass model/input attachments work, I find the latter is a hell of a lot easier. Once you understand the basic model of the render pass architecture (ie: hardware can only render to internal rendering memory, rather than texture memory directly), it’s quite simple.
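Here is a toy model of the render pass mental model described above. It assumes a hypothetical 8×8 framebuffer split into 4×4 tiles and does not describe any particular GPU. The intermediate value written by the first subpass lives only in per-tile storage. The second subpass reads it at the same pixel, which is the only access an input attachment allows. As a result, only the final color reaches external memory.

```c
#include <assert.h>
#include <stdint.h>

#define W 8
#define H 8
#define TILE 4 /* hypothetical tile size */

typedef struct {
    uint32_t color[H][W];     /* final image in external memory */
    unsigned external_writes; /* count of writes to external memory */
} Framebuffer;

/* Two subpasses per tile: subpass 0 writes an intermediate (e.g. a G-buffer
   channel) into on-chip tile storage; subpass 1 reads it back at the same
   pixel and stores only the final color. */
void render_tiled(Framebuffer *fb) {
    assert(W % TILE == 0 && H % TILE == 0);
    for (int ty = 0; ty < H; ty += TILE)
        for (int tx = 0; tx < W; tx += TILE) {
            uint32_t tile_gbuffer[TILE][TILE]; /* on-chip, never stored */
            for (int y = 0; y < TILE; ++y)     /* subpass 0 */
                for (int x = 0; x < TILE; ++x)
                    tile_gbuffer[y][x] = (uint32_t)((ty + y) * W + (tx + x));
            for (int y = 0; y < TILE; ++y)     /* subpass 1 */
                for (int x = 0; x < TILE; ++x) {
                    fb->color[ty + y][tx + x] = tile_gbuffer[y][x] * 2u;
                    fb->external_writes++;
                }
        }
}
```

An immediate-mode renderer doing the same two passes through regular textures would also store the intermediate to memory, roughly doubling `external_writes` in this model.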

xaxazak · May 27, 2017, 10:59pm

[QUOTE=Alfonse Reinheart;42370]How do you define “misstating”? The whole problem with hints is that the API itself doesn’t know its being misused, because hints can’t explicitly define what “misused” would be. The user only finds out because they’re getting inadequate performance. So there can’t be a layer that reports something back.[/QUOTE]Measure usage. I guess it’d be pretty tricky to measure shader usage of resources though unless you altered the shaders to count uses. But there are other tools that can do similar stuff. Of course there are usage situations that won’t work well, but most should. But the developer is at fault if they give bad usage info, just like they’re at fault if they write inefficient shaders or allocate resources poorly.

We talk, in Vulkan, about giving the app developers more control because they know better what they’re doing. Now you’re saying they don’t know better and we should avoid giving them chances to make mistakes. Is there some reason why they suck at figuring out approximate usage but are decent at everything else? Can we not fix that with a bit of education and clearer documentation?

[QUOTE=Alfonse Reinheart;42370]And users have no real way to fix these problems besides just trying all the hints and seeing which works fastest. Which they would have to do for every implementation. And every driver release would potentially change what works fastest.[/QUOTE]If the drivers work right, all developers have to do is be somewhat accurate. There’s no reason they should need to try different hints unless the drivers are getting things wrong (which I guess will probably happen to some degree).

[QUOTE=Alfonse Reinheart;42370]If you define “misused” explicitly enough that there is no ambiguity, then what you find is that you have to plan for every usage pattern. Either that, or you’ll have usage pattern “holes”. Like maybe someone updates a buffer frequently, but not every “frame”; it’s every 1-3 frames, depending on what’s going on. Do you have a pattern to fit that?[/QUOTE]If it’s measured in bytes per second, that’s most of what you need. You’re not going to describe all the usage quirks, but it’s still far better than zero usage information.

[QUOTE=Alfonse Reinheart;42370]Heuristics do not work. And usage patterns are heuristics. The only people who know enough to be able to correctly place allocations are the IHVs and the programmers behind the application. So Vulkan doesn’t try to get between the programmer and the implementation’s memory pools.[/QUOTE]I’m not asking for Vulkan to decide. At present, of the two you mention only one is being asked, the other (IHV) is being ignored. I want both to have input - best would be some hinting extension so that the programmer still has final say.

[QUOTE=Alfonse Reinheart;42370]Um, no it isn’t. History has some things that didn’t work for ages until someone found a way to actually make them work. But none of your suggestions are any different from things that have already been tried. So where is the unexplored space in this domain that would lead to a solution?[/QUOTE]They were tried in the same way that different lightbulb components were tried. Yes, they didn’t work. No, we don’t have a good reason to say they can’t work. One is simply failing because otherwise competent developers are getting mixed up; surely that’s a sign of something that can be fixed. As for the other, has RO-texture memory (or similar) been tried? And it was only an example anyway.

[QUOTE=Alfonse Reinheart;42370]No. You don’t just turn on the stencil test and magically get stencil shadows. You get stencil shadows by rendering things in a very specific way. You have to implement the algorithm, which involves using the stencil test, but also involves many other things (like forward rendering). Just like the ability of a CPU to perform arithmetic does not make windows appear; a programmer has to actually use them in a specific way.[/QUOTE]That wasn’t my point. You were saying IHVs aren’t improving graphics features, only increasing performance of current features. But stencils etc were new features. They were part of a solution to a problem, the programmer had to do the other part, sure, but that’s true of most new technology. SIMD CPU extensions didn’t help existing code, programmers had to implement them. Yet it was a new feature that let compilers and coders improve efficiency of certain math tasks.

[QUOTE=Alfonse Reinheart;42370]Yes, you would need different shaders; that’s not even debatable. Input attachments would not be a thing that exists without the render pass concept. And input attachments are a different kind of thing from a regular image fetch, which is why they have a different shader “dimension” in SPIR-V from any other image type. So there’s no way to avoid having to make shader changes when writing cross-platform applications.[/QUOTE]Well, currently we have one shader that works for both. There are potential exploitable benefits on non-TBR hardware to describing input attachments as different from textures in the shader. So just leave them as is. You’re right that that is keeping a change that was intended for TBRs, but you could argue they’re just an all-round benefit.

OpenGL initially was targeted at different rendering techniques. You could raytrace it. They soon realized they had to specialize if they wanted to improve performance and features.

I’m not saying “go make your own”, I’m saying - if you need extra stuff specific to you don’t force it on everyone that doesn’t need it. Have it as an extension. The extension won’t be much more complex than the stuff that’s already there.
Most of the stuff will be the same, just extension calls rather than API calls, data in pNext rather than the initial structure. Little extra work (compared to the current Vulkan) - if you keep the shaders the same.

[QUOTE=Alfonse Reinheart;42370]Can you provide any foundation for this claim? I am not aware of any particular developer backlash against Vulkan for anything that could not just as equally be attributed to D3D12. And do be aware that ISVs were also a part of Vulkan’s development, so they had plenty of input in this regard as well. If render passes really are “a big contributor” to problems in the API, why didn’t they solve them back then?[/QUOTE]Because TBR manufacturers have money and influence. The downside of an extension (rather than a core requirement) for them is that some games that could potentially work on TBRs won’t because the developers aren’t interested in doing the extra work to implement the extension. I don’t see that as unfair, but they’d see that as a big decrease in potential revenue.

[QUOTE=Alfonse Reinheart;42370]I’ve seen plenty of presentations on Vulkan from a large number of engine developers. None of them seem particularly critical of the render pass architecture. … So where is this disapproval and flight from Vulkan of which you speak?[/QUOTE]As I see it, big studios aren’t the ones who’d have problems. It’s small-time developers, university courses, people wanting to teach themselves online, etc.
I wouldn’t expect you to hear much disapproval (on forums you’ll hear more); you’re just going to get fewer people considering or attempting Vulkan. It won’t be a flight, of course, since it’s new technology; it’ll just be reduced adoption.

[QUOTE=Alfonse Reinheart;42370]And who exactly would those people be who could ignore mobile platforms? It wouldn’t be the makers of Unity. Or Epic. Or most other game engines. The mobile market is not something most engine developers can just ignore these days.[/QUOTE]Everyone currently making D3D stuff. GPU-heavy games like Doom. Anyone who wants to use shading techniques (like depth of field?) that don’t suit TBRs. But middleware engine developers have less to lose and more to gain so they’re less likely to have issues with it.

[QUOTE=Alfonse Reinheart;42370]Also, let’s not forget that the game development market is changing. The fact is, most smaller game developers don’t care about render passes or things like that. They’re using off-the-shelf engines to get their game out the door. Trivialities like Vulkan or D3D12 matter only in whether their game will run on the target platform.[/QUOTE]But is that what we want: game developers just using jack-of-all-trades middleware that reduces flexibility, potential performance, and understanding, and that means far less development effort goes into writing engines, which means less innovation?

[QUOTE=Alfonse Reinheart;42370]Actual IHVs have said that even non-TBR hardware can take advantage of aspects of the render pass architecture.[/QUOTE]I heard the AMD people give an example of scheduling clears and similar things. That seems like something that should be achievable elsewhere, perhaps using queue priorities. And they’re acting on less information than the developers have: they don’t know how long a subpass will take. I haven’t been following IHV comments very closely, though. Are there other examples?

[QUOTE=Alfonse Reinheart;42370]As someone who has explained how texture barriers work, and has tried to explain how the render pass model/input attachments work, I find the latter is a hell of a lot easier. Once you understand the basic model of the render pass architecture (ie: hardware can only render to internal rendering memory, rather than texture memory directly), it’s quite simple.[/QUOTE]That might be true, I might be a special case - but which gives you a better understanding of what’s actually happening internally?

Alfonse_Reinheart · May 28, 2017, 10:02am

That is a mischaracterization of the situation of OpenGL and heuristics/hints.

The problem with hints is not that app developers don’t know what they want. It’s that hints are the wrong medium for them to adequately describe what they want. To a developer, usage hints are like trying to have a highly technical conversation in Latin. The language simply doesn’t have the vocabulary to describe a computer, let alone technical aspects of programming.

Oh sure, you can muddle through, most of the time. But the result is going to be extremely awkward and compromised. It is most certainly not going to be an efficient conversation.

The memory “vocabulary” that has proven most effective for achieving performance is the “vocabulary” of the actual hardware. The specific pools of memory that can be allocated and their general performance characteristics. The specific ways of allocating that memory for access and usage.

Is this a perfect solution for all possible hardware? No. But it’s a better solution for prior hardware, current hardware, and predictable-future hardware than usage hints. And that’s good enough.

And you still ignore the advantages of the low level approach. Being able to know exactly how much storage of different kinds is available allows developers to know what’s there and to adjust their applications accordingly. They can plan for different scenarios and come up with solutions that work best if contention for limited memory is an issue. Usage hints can’t do that; they’re the wrong vocabulary.

At the end of the day, the purpose of such a hint would be to allow the implementation the liberty to decide where an allocation comes from. But Vulkan is, by design, an explicit API. If you allocate X bytes from memory pool Y, then X bytes are allocated from pool Y.

So what exactly would a hint accomplish? The memory still has to be allocated from pool Y, since that’s what the user said to do. So the only way a hint could change something is if implementation lies to the user. It would have to expose multiple hardware memory pools as a single memory pool, with the implementation selecting the actual hardware pool based on your usage hint.

That’s the exact opposite of being an explicit API. If you want a usage hint based API, then it can no longer be explicit.

I said no such thing. I said, “IHVs have not given us tons of improvements on graphics. Since the advent of programmable hardware, IHVs have given us more power; it’s the users who have given us “a ton of improvements” towards better graphics, using the tools IHVs have provided.”

My point is that graphics only gets better by programmers using the tools that GPUs provide. The tools by themselves are not “graphics features”, any more than a hammer is a house.

And the most important tool that IHVs have provided is generality: giving more control over the rendering process to the user. That’s not a “graphics feature”, and yet it is the thing most responsible for the improvements in visual fidelity.

You can’t leave them “as is”, because input attachments don’t make sense without the rest of the render pass machinery. D3D 12 doesn’t have input attachments, for example.

Who are you to decide what is “extra specific” stuff and what isn’t? You’re talking about an entire class of hardware. My analogy holds because it’s the same thing. AMD and Intel hardware are different; they have “extra specific” stuff that NVIDIA does not. Why not excise all non-NVIDIA hardware from OpenGL’s API and make the AMD/Intel-only parts an extension?

Every abstraction over a range of hardware is going to have “extra specific” stuff in it, some concession to one piece of hardware or another. Render passes barely register on my radar; Vulkan has far more annoying things than that.

Like the primitive topology being a fixed and immutable part of the pipeline. Why? Because some pieces of hardware out there need that. D3D 12 puts the basic primitive type (point, line, triangle, patch) in the pipeline, but the specific primitive used (triangle strip vs. list, etc) is command buffer state. By contrast, Mantle, Metal, and Vulkan put the entire primitive type in the pipeline. Is that due to tile-based renderer needs? Well, Mantle putting it there seems to suggest that the need is broader than TBRs.

Should we consign any hardware that needed the full primitive type at pipeline building time to an “extension”? No. As much as I personally might want to, the gain from doing so is not significant enough to offset the potential costs. It creates a huge distinction between writing to different kinds of hardware.

Now, if all hardware could do it, that’d be great. But if all hardware couldn’t, I’d rather Vulkan’s core API support the lowest-common denominator, with an extension or optional feature allowing the more specific case.

Like other low-level APIs, Vulkan was not made for those people. Small-time game developers should be focused less on the minutiae of their rendering and more on what players actually care about: gameplay. University graphics courses should be focused less on the low-level details that Vulkan uses and more on high-level concepts of graphics. And so forth.

Vulkan exists to serve the needs of big studios and such. Others might be able to benefit, but Vulkan’s primary audience is big developers.

That’s rather off-topic. If you don’t like the changes in the industry, if you think that the enginification of game development leads to less innovation, that’s your prerogative. Granted, there’s plenty of evidence that, by lowering the barrier to entry, the enginification of game development has actually increased gameplay innovation, but if you want to believe that this is a drag on the industry, I won’t debate you.

But my point still stands. Because of these changes in the industry, small-time developers are far less dependent on and concerned by low-level details like OpenGL or Vulkan than they ever have been. Therefore, whether Vulkan uses render passes or not is a detail that is ultimately beneath their notice.

They will not abandon the rendering system because they don’t choose rendering systems at all. They choose engines. Good or bad, that’s reality.

A better understanding of which hardware? Because texture barrier (and the OpenGL memory model in general) gives you zero understanding of what a tile-based renderer is doing internally.

Understanding the abstract model is not supposed to tell you what the implementation is actually doing under the covers. That’s why you’re writing to an abstract model rather than writing your own graphics driver.

[HR][/HR]

And now for the rest.

You did not present any evidence for any of the claims you made. Instead, you started describing what is effectively a conspiracy theory. That “TBR manufacturers have money and influence” and thus they have forced non-TBRs to break their API just for the needs of a powerful minority. You offer no evidence that the non-TBRs were forced into anything.

I don’t find conspiracy theories particularly compelling. And this one makes less sense than most, since it portrays AMD and NVIDIA as minor players who can’t stand up against the “money and influence” of TBR makers. NVIDIA, I would remind you, has sufficient “money and influence” in this industry to create their own NVIDIA-specific compute API that is doing far better in the compute market than the supposed industry standard OpenCL. And you expect me to believe that NVIDIA would willingly impose something they thought was bad because Imagination or Qualcomm wanted it?

Occam’s Razor tells us that it’s far more likely that non-TBR developers went along quite willingly with the render pass architecture.

For a more specific example:

But you have NO EVIDENCE FOR THIS! This is all 100% speculation.

You talk about this as though Vulkan hasn’t been out for a year or something. If there were any foundation to what you were saying, then we would already have seen some of the effects of this, right? If render passes were really that terrible of a thing, if their complexity was so great that it genuinely inhibited smaller/individual developers from adopting Vulkan, then wouldn’t we already have seen lower use of Vulkan from these smaller users?

So, let’s try to find some actual evidence. Consider Stack Overflow. SO has a Vulkan tag with 284 questions. SO has two separate tags for D3D12: DX12 and D3D12. The sum total of questions is… 100. And some questions are tagged with both, so even that is probably double-counting.

What about GameDev.StackExchange? The total Direct3D12 and DirectX12 questions is… 12. The Vulkan question count? 13.

And all of that ignores that Vulkan came out a year later than D3D12. So surely, D3D12 naturally should have more questions. And yet, it doesn’t.

You could claim that SO probably attracts more cross-platform developers. You could claim that SO probably attracts more open source developers. You could explain this away with a lot of things: maybe users of low-level APIs in general either gave up, or were clever enough not to need to ask questions.

But my point is that you have no evidence for this phenomenon that you insist will happen any day now. And there is some evidence that it is not happening and will not happen.

xaxazak · May 30, 2017, 4:09am

[QUOTE=Alfonse Reinheart;42375]That is a mischaracterization of the situation of OpenGL and heuristics/hints.

The problem with hints is not that app developers don’t know what they want. It’s that hints are the wrong medium for them to adequately describe what they want. To a developer, usage hints are like trying to have a highly technical conversation in Latin. The language simply doesn’t have the vocabulary to describe a computer, let alone technical aspects of programming.

Oh sure, you can muddle through, most of the time. But the result is going to be extremely awkward and compromised. It is most certainly not going to be an efficient conversation.

The memory “vocabulary” that has proven most effective for achieving performance is the “vocabulary” of the actual hardware. The specific pools of memory that can be allocated and their general performance characteristics. The specific ways of allocating that memory for access and usage.

Is this a perfect solution for all possible hardware? No. But it’s a better solution for prior hardware, current hardware, and predictable-future hardware than usage hints. And that’s good enough.[/QUOTE]We don’t have their general performance characteristics, though. There are no scalar values except size and granularity: no bandwidths to or from different targets, and no info about sequential access, although that one’s difficult to specify without lots of variables.

[QUOTE=Alfonse Reinheart;42375]And you still ignore the advantages of the low level approach. Being able to know exactly how much storage of different kinds is available allows developers to know what’s there and to adjust their applications accordingly. They can plan for different scenarios and come up with solutions that work best if contention for limited memory is an issue. Usage hints can’t do that; they’re the wrong vocabulary.[/QUOTE]The developer has access to this too. If you wanted to go to extreme detail the hints could actually score each memory type for each hint request. But you’ve already got access to everything in current Vulkan.

[QUOTE=Alfonse Reinheart;42375]At the end of the day, the purpose of such a hint would be to allow the implementation the liberty to decide where an allocation comes from. But Vulkan is, by design, an explicit API. If you allocate X bytes from memory pool Y, then X bytes are allocated from pool Y.

So what exactly would a hint accomplish? The memory still has to be allocated from pool Y, since that’s what the user said to do. So the only way a hint could change something is if implementation lies to the user. It would have to expose multiple hardware memory pools as a single memory pool, with the implementation selecting the actual hardware pool based on your usage hint.[/QUOTE]That wasn’t how I was thinking of doing it. I was thinking of letting the programmer request a hint, totally separate from the actual memory interaction. The memory interaction is explicit, but the programmer could just use the memory type returned from the hint request. It’s a standalone extension that doesn’t change anything.

[QUOTE=Alfonse Reinheart;42375]I said no such thing. I said, “IHVs have not given us tons of improvements on graphics. Since the advent of programmable hardware, IHVs have given us more power; it’s the users who have given us “a ton of improvements” towards better graphics, using the tools IHVs have provided.”

My point is that graphics only gets better by programmers using the tools that GPUs provide. The tools by themselves are not “graphics features”, any more than a hammer is a house.

And the most important tool that IHVs have provided is generality: giving more control over the rendering process to the user. That’s not a “graphics feature”, and yet it is the thing most responsible for the improvements in visual fidelity.[/QUOTE]
I think this is probably just a semantic debate. I think they have given us tons of improvements because I’m counting things like stencil, TMU and ROP improvements, tessellation, etc. (not saying I like them all) as graphics features. The final real-world improvements are usually a combined effort between the programmer, the IHVs and the API.
And some (e.g. filtering improvements, Z compression, etc) didn’t require changes to the APIs or the applications.
I’m often in favor of increasing generality, but I don’t mind going the other way if the benefits warrant it. (Plus, things like renderpasses are the opposite of generality.)

For the shader code I don’t see why not. You would need to attach the images differently to regular textures. But you certainly don’t need to know the whole renderpass to do so.
And there are many hardware optimizations that you could do with that knowledge - you need less image info and you can potentially access the data in bulk without all the heavy TMU work.

[QUOTE=Alfonse Reinheart;42375]Who are you to decide what is “extra specific” stuff and what isn’t? You’re talking about an entire class of hardware. My analogy holds because it’s the same thing. AMD and Intel hardware are different; they have “extra specific” stuff that NVIDIA does not. Why not excise all non-NVIDIA hardware from OpenGL’s API and make the AMD/Intel-only parts an extension?

Every abstraction over a range of hardware is going to have “extra specific” stuff in it, some concession to one piece of hardware or another. Render passes barely register on my radar; Vulkan has far more annoying things than that.

Like the primitive topology being a fixed and immutable part of the pipeline. Why? Because some pieces of hardware out there need that. D3D 12 puts the basic primitive type (point, line, triangle, patch) in the pipeline, but the specific primitive used (triangle strip vs. list, etc) is command buffer state. By contrast, Mantle, Metal, and Vulkan put the entire primitive type in the pipeline. Is that due to tile-based renderer needs? Well, Mantle putting it there seems to suggest that the need is broader than TBRs.

Should we consign any hardware that needed the full primitive type at pipeline building time to an “extension”? No. As much as I personally might want to, the gain from doing so is not significant enough to offset the potential costs. It creates a huge distinction between writing to different kinds of hardware.

Now, if all hardware could do it, that’d be great. But if all hardware couldn’t, I’d rather Vulkan’s core API support the lowest-common denominator, with an extension or optional feature allowing the more specific case.[/QUOTE]
It’s a good point I guess. I do acknowledge that I have a sociological bias against mobile (computers used to be tools owned by their owners, but almost all mobile devices are anticompetitive, exploitative walled gardens, and this means people can’t learn computing by toying around like the previous generation did. They’re filled with software that just wants to harvest their private data for profit. I do find it difficult to care).

So I do get that it comes down to what you decide to consider as your lowest common denominator. But the lower you go, the more compromises you need to make. Potential new shader types that are totally unsuitable for TBRs: do they get overlooked? Or scheduling hints that TBRs can’t follow? I don’t know what’s around the corner, but I know OpenGL ditched many alternate rendering modes when adding new features. For decades, APIs have had a higher lowest common denominator, and lowering it when we could have just added an extension, without much extra cost, seems bad.

But sure, it’s a line drawn in the sand, and they chose to draw it somewhere I don’t like.

Getting back to specifics, renderpasses to me just stick out as something totally counter to the goals Vulkan otherwise aims for: explicitness and close-to-the-metal-ness. They also inflict lots of annoyances that I haven’t mentioned yet, for example:

[ol]
[li]Breaking modularity. You need them when creating framebuffers and pipelines, even though conceptually there’s no need. Wouldn’t most people want to create the basic rendering framework (including framebuffers) before building scene-specific objects? Yes, you can just create dummy renderpasses (compatible ones), but it’s a pain. The same goes for pipelines: you can use a pipeline in multiple renderpasses, yet you need to create a renderpass first. That’s an N-to-1 relationship where you need one of the N before you can build the 1; again, dummy renderpasses solve it.[/li][li]Inflexible subpass list. You can’t use information generated during previous subpasses to choose which later subpasses need to be done (e.g. running a shader only if a certain surface type isn’t totally occluded). You can just ignore some, but that has costs.[/li][li]Converting software. Almost everything else should map fairly easily from OpenGL to Vulkan, but if you haven’t built your rendering system around a sequential set of passes, then lots of refactoring is required. For example, you need to supply textures at different times, which can mean moving the code that finds them. The rest of the API doesn’t really need much refactoring when converting unless you’re adding parallelization.[/li][/ol]

[QUOTE=Alfonse Reinheart;42375]Like other low-level APIs, Vulkan was not made for those people. Small-time game developers should be focused less on the minutiae of their rendering and more on what players actually care about: gameplay. University graphics courses should be focused less on the low-level details that Vulkan uses and more on high-level concepts of graphics. And so forth.[/QUOTE]I don’t agree, but it is subjective I guess. Previously, small-time developers became big ones via learning stuff while making games. Carmack etc. IMHO we want game developers learning this stuff so that they can have big ideas that cover both engine and gameplay.

[QUOTE=Alfonse Reinheart;42375]Vulkan exists to serve the needs of big studios and such. Others might be able to benefit, but Vulkan’s primary audience is big developers.[/QUOTE]Do we need another new cross-platform API for the others, then? OpenGL is decrepit.

But many don’t care about tile-based renderers. Is that wrong, if they only want to work on non-mobile stuff?

[QUOTE=Alfonse Reinheart;42375]You did not present any evidence for any of the claims you made. Instead, you started describing what is effectively a conspiracy theory. That “TBR manufacturers have money and influence” and thus they have forced non-TBRs to break their API just for the needs of a powerful minority. You offer no evidence that the non-TBRs were forced into anything.[/QUOTE]I should’ve used a question mark. I didn’t mean to state it was so. And “money and influence” doesn’t necessarily mean strong-arming anyone. Money just gives them a say - membership to Khronos.

It could possibly be something like “Hey, do you think you could just squeeze in a little something so that we can get an overview of the whole rendering process? It’s really important for us - it’ll really speed things up for us”. “Well ok then, it doesn’t seem to cause us any slowdown and if you really need it”. Done. No hostility, no bad behavior or shenanigans, but still an outcome based on influence.

The gains to TBRs are big. The losses to non-TBRs are small. And the TBRs aren’t just Imagination and Qualcomm but also (indirectly) Apple and Samsung and Google and many other mobile-focused Khronos members. They’re a huge part of the membership so they’re not even really the minority - it’s their API too.

[QUOTE=Alfonse Reinheart;42375]You talk about this as though Vulkan hasn’t been out for a year or something. If there were any foundation to what you were saying, then we would already have seen some of the effects of this, right? If render passes were really that terrible of a thing, if their complexity was so great that it genuinely inhibited smaller/individual developers from adopting Vulkan, then wouldn’t we already have seen lower use of Vulkan from these smaller users?[/QUOTE]Lower than what though? We don’t have a control group.

If you divided up all the learning required for Vulkan, some percentage would be on renderpasses. When migrating from D3D12 it will be a large percentage. When coming from OpenGL it will be, at a guess, about 10%. When learning from scratch, again IMHO, about 4%.
The more there is to learn, the harder it is to learn it, the more its reputation for difficulty increases, the fewer people that try it. Add in a few positive-feedback loops too (support, word-of-mouth, etc). Measuring the actual effect is incredibly hard, and requires surveying.

I’m not sure what your data shows. D3D12 vs Vulkan doesn’t provide much info. What is the null hypothesis (D3D1-11 vs OpenGL?)? More “indie” use = more questions, simpler API and better documentation = fewer questions, website AI biases: there’s so much bias (and noise) in both directions that I don’t think you can get any info that way.