Continuation of GitHub #650

This is a continuation of the discussion that evolved here: Relaxing queue family index requirements on buffer/image barriers. · Issue #650 · KhronosGroup/Vulkan-Docs · GitHub

My apologies for the extended discussion on GitHub. I did not have a forum account, as I’m normally non-vocal about issues like these.

There is, and has been, confusion about what I’m trying to accomplish and why.

Basic overview of requirements:

  1. The rendering engine must be able to recover from a “soft” device loss (e.g. resolution change) with the cooperation of the client application. Meaning: All objects are reconstructed, but the application must repopulate resources with data.
  2. Plug-ins must be able to interface with the rendering engine at the finest possible granularity (e.g. no one-size-fits-all “mesh” class).
  3. The rendering engine must be usable in two modes of operation: Free-threaded and staged.
  4. In free-threaded mode, entry points must not block client threads for the duration of a transaction being completed on another thread, unless that operation is a Join. The only blocking action permitted is the acquisition of any resource locks required to assemble the transaction for submission.
  5. In staged mode, all resources are under the exclusive control of the rendering engine. No transactions may take place concurrently while a staged mode execution is completing. All entry points are allowed to block client threads until a staged mode execution completes.
  6. All image resources, with the exception of swapchain images, must be usable for any purpose, or it should be possible to configure the image up front at creation time.
  7. All buffer resources must be usable for any purpose (buffers are sub-allocated from arenas, so we don’t have a choice).
  8. The rendering engine may take advantage of an application-controlled thread pool for any long-running work. This includes command buffer submission. If provided a thread pool, the rendering engine must distribute as much work as possible across as many threads as it has been configured to use (thread pool bindings come with task limits, etc…).

Some implementation details:
• A task is one of two things: A transaction between the host and device involving a resource, or a compiled display list.
• A display list is a sequence of scoped operations which include renderpass invocations, a series of bare compute program invocations, and device-local resource transfers. When compiled, a display list will automatically detect dependencies between its constituent scopes, and may re-order them to minimize the number of queue submissions. This includes determining where to insert resource and memory barriers. All internal command sequences and scope barriers are written into secondary command buffers, where only those affected by an external change of any resource are re-compiled prior to submission.
• A resource may only be accessed by the host if it is mapped. A resource must be “Fetched” prior to reading any of its contents on the host, and “Stored” if any changes made by the host need to be reflected on the device. This may involve only flushing part of a memory mapping, or a host<->device copy, which requires a set of queues.
• A “Fetch” or a “Store” is a transaction task (the API exposes two functions, Fetch and Store, which pull from various pools to assemble a transaction). The internal transaction object consists of a workspace containing multiple buffers of frequently needed API structures and, upon submission, a set of queue controls, optionally endowed with VkCommandBuffers and VkCommandPools allocated for each respective family if requested by the task. Display lists have their own command buffers, but may request additional buffers from the task engine if resources need to undergo ownership/layout transitions prior to invocation. A task is given exactly one queue from every family it has requested.
• All tasks are guaranteed exclusive access to all VkCommandBuffers/Pools/Queues they have requested for the duration of their onPush calls. Again, tasks are guaranteed exclusive access to all related transients.
• The precise manner in which a task uses the queues it has requested is entirely task-specific. For example, a display list will simply iterate over all command buffers destined for each queue in an optimal order respecting dependencies between scopes.

Problems:
• VkXXX structures are “rigid” in the sense that we can’t get away with just putting them inside another structure and having an array of those instead. For example, with VkXBarrier, the display list scheduler needs to keep track of pipeline stage masks, object references, and various other pieces of dependency information in addition to all of the barrier data, and having that in a parallel array is causing a noticeable problem with cache pollution.
• Pursuant to requirement #1, we need to keep a considerable number of VkXXX initialization structures around, which present similar difficulties to barrier structures when, for example, performing a validation pass on a display list.
• This goes all the way back to the first appearance of indexed drawing. When dealing with dynamic geometry (e.g. an adaptive mesh), we need to have more information in an element than just indices, and parallel arrays are a source of heap fragmentation and cache-related performance issues. Consequently, a very large amount of index data needs to be duplicated in a very cache-unfriendly way (3 shorts or ints do not line up nicely with cachelines). Having a stride on an indexed draw operation would have solved this problem.
• VkFences are halfway useful. We need a way to signal a fence from the host, or we need to use some kind of completion port or epoll handle to manage concurrent task completion. The use of the fence-then-select paradigm currently leads to a problem where a task pop thread will end up waiting on a batch of fences obtained from long-running tasks, but fail to respond to the completion of a short-running task. This is currently handled by having one pop thread per priority level (there are currently 8). The amount of work involved in a task pop is minuscule, and having multiple threads for this purpose is wasteful.

Concerns:
• I feel there is too much focus on interactive media. Normally, I’m not vocal about the concerns of other industries, but we’re looking at a future where the hardware we have to work with does not meet our requirements, and does so in such a way that we need to approach those inconsistencies with grossly inefficient patterns or to use unpredictable libraries. I would like to have seen “graphics on compute” instead of “compute shoehorned into graphics”. There is an enormous application domain awaiting a tetrahedron voxelizer that can work directly with index quadruplets. Rasterization just doesn’t cut it here.

It kinda feels like you’re trying to replicate what Khronos was trying to get away from: a library that can do everything out of the box. Vulkan, by contrast, feels designed for engines built around specific use-cases.

We need a way to signal a fence from the host

VkEvent is basically a fence that can be signaled from the host, if I’m not mistaken.

I would like to have seen “graphics on compute” instead of “compute shoehorned into graphics”. There is an enormous application domain awaiting a tetrahedron voxelizer that can work directly with index quadruplets. Rasterization just doesn’t cut it here.

It literally always has been this way. You have some hardware that can do array processing well. Since triangle rasterization cannot be efficiently mapped to array processing, there is a middle man that turns triangles into arrays that can be fed back into the array processing HW. There is literally nothing else in a GPU that is meant to work specifically with graphics. And if hardware designed for triangle rasterization does not cut it, why are you even using APIs whose sole purpose is to access triangle rasterization hardware? Just use OpenCL or CUDA.

There is, and has been, confusion about what I’m trying to accomplish and why.

Well, that confusion is partially because you never explain what it is you’re actually trying to accomplish. Indeed, even here in this thread, where you’ve given the clearest description yet of what you’re writing, you’ve left out one of the most important things about it:

What it is.

From the description, I can deduce that you’re writing some kind of general-purpose rendering middleware. But that sort of thing ought to be the very first thing you say, not something that should have to be deduced from the list of requirements.

Your description essentially amounts to, “I want to write OpenGL, but with more threading and better performance”. As I mentioned on GitHub, Vulkan is simply not a good means to do that. Vulkan as an API is designed for applications that have some basic, fundamental control over what they’re doing. The code that interfaces with Vulkan may not exactly know how much data is being rendered or whatever, but that code should impose some form of structure on the higher level code.

Because that’s how most high-performance rendering applications work.

Also, some of your requirements work against high performance. In particular:

  1. All image resources, with the exception of swapchain images, must be usable for any purpose, or it should be possible to configure the image up front at creation time.
  2. All buffer resources must be usable for any purpose (buffers are sub-allocated from arenas, so we don’t have a choice).

Being able to use any image/buffer for any purpose is essentially throwing away potential performance for the sake of user convenience.

OpenGL does this sort of thing a lot, and it directly leads to that API’s performance inconsistencies. You can put an image in an FBO, render to it, then immediately turn around and read from it, without a second thought to how expensive that is (in terms of hardware synchronization and so forth). Even your simple “use an image for any purpose” idea is dangerously non-performant.

Vulkan achieves relatively consistent performance by forcing you to decide when such things need to happen. By making you explicitly do them, you are saying when you want that slowdown.

The same is true for image/buffer usages. For an application where the Vulkan-interfacing code has some idea of what that buffer/image is for, there is no reason for it to provide usage values outside of how that buffer/image will actually be used. So if you need to change usages mid-stream, you can do that, but you have to explicitly create a new buffer and copy its data manually. That makes sure that you know that what you just did will be slow.

I’m working on a blog-style article about why Vulkan needed to exist, which goes into detail about the problems with “immediate rendering APIs” (stuff like OpenGL and pre-12 D3D). It’s all really way too big to post here, but one of the key points is removing variances like these.

For applications which actually know what a particular buffer or image means (this is a buffer for vertex data, that is a buffer that I’ll be writing vertex data into, this other one is a buffer for uniform data, etc.), Vulkan is a boon. Because these things can be specified to the low-level API, the implementation has every chance to inform the user about its capabilities.

For example, if the hardware doesn’t allow an SSBO to live in non-device-local memory, but does allow vertex buffers to be non-device-local, that’s really important to know. If I’m streaming vertex data from the CPU, I probably want to use non-device-local memory. If I’m streaming vertex data from the GPU (generated via a compute shader, for example), then I need to use device-local memory. So my memory budgeting needs to take that into account.

Vulkan thrives in this kind of environment. Of course it does; it was written specifically for this kind of environment. However much you may dismiss such programs, there is a lot of code out there that knows more about how its buffers/images will be used than the driver does.

In an environment where the Vulkan-facing code has no idea what a particular buffer/image means or will be used for, Vulkan’s explicit nature works against you. You have to pick lowest-common-denominator settings. And if hardware doesn’t allow for the lowest-common-denominator settings, you have to find ways to work around it through slower processes. You have to keep track of what layout images are in. And so forth.

Image layout is a great example of this. In an environment where you know what an image is for, its layout is never really in question. If an image is a typical texture, then 99.9% of the time it will be in the SHADER_READ_ONLY_OPTIMAL layout. It will only ever not be in that layout when it is first created or is being actively uploaded into. And the latter circumstance is essentially ephemeral: the process that initiates the upload would likely transition the layout from read-only to the transfer-destination layout and then back to read-only again.

Most texture images will never be used as render targets. Indeed, in many engines, the code path for creating a render target image is essentially alien to the code path that creates a texture. In the latter case, they get created due to loading a model or from streaming data for a part of a scene. In the former case, they’re a fundamental part of the renderer, created to serve a very specific purpose in the scene. Using a shadow map has nothing to do with which objects exist in the scene; it’s based on how the renderer decided to render shadows.

So in controlled environments like these, dealing with image layout transitions is quite simple.

Your overall problem is that your program’s goals and Vulkan’s design goals are mismatched. Vulkan is about empowering applications that know what they’re doing, so that they can efficiently communicate that to the hardware. Your application explicitly doesn’t know what rendering will happen. Your program is trying to provide a hardware abstraction that allows users to use the API without communicating any form of meaning.

Decades of OpenGL and D3D have proven that such an abstraction cannot achieve consistent performance. Even ignoring how Vulkan gets in your way, even if you could code right to the metal, without application-specific knowledge of what all of these pieces of memory actually mean, there is no way you can achieve consistent, cross-platform performance.

All that being said, there are a couple of other misconceptions to be addressed:

• VkXXX structures are “rigid” in the sense that we can’t get away with just putting them inside another structure and having an array of those instead. For example, with VkXBarrier, the display list scheduler needs to keep track of pipeline stage masks, object references, and various other pieces of dependency information in addition to all of the barrier data, and having that in a parallel array is causing a noticeable problem with cache pollution.
• Pursuant to requirement #1, we need to keep a considerable number of VkXXX initialization structures around, which present similar difficulties to barrier structures when, for example, performing a validation pass on a display list.

You don’t need to keep any “VkXXX initialization structures” around. These can be created as needed on the stack. Remember: unless otherwise stated, any pointers that you feed to a Vulkan interface function will have been completely used by the time the function returns. So there’s no need to keep those objects around. Create them on the stack, call Vulkan functions, and you’re done.

Most of the time, such structures should be ephemeral.

• This goes all the way back to the first appearance of indexed drawing. When dealing with dynamic geometry (e.g. an adaptive mesh), we need to have more information in an element than just indices, and parallel arrays are a source of heap fragmentation and cache-related performance issues. Consequently, a very large amount of index data needs to be duplicated in a very cache-unfriendly way (3 shorts or ints do not line up nicely with cachelines). Having a stride on an indexed draw operation would have solved this problem.

Well, hardware can’t do that. Index lists are consumed directly by the hardware, and the hardware has no notion of an index stride.

You can implement that yourself, using VertexIndex to fetch from a strided index array accessed through an SSBO. But the hardware isn’t implementing it for you.

• I feel there is too much focus on interactive media. Normally, I’m not vocal about the concerns of other industries, but we’re looking at a future where the hardware we have to work with does not meet our requirements, and does so in such a way that we need to approach those inconsistencies with grossly inefficient patterns or to use unpredictable libraries. I would like to have seen “graphics on compute” instead of “compute shoehorned into graphics”. There is an enormous application domain awaiting a tetrahedron voxelizer that can work directly with index quadruplets. Rasterization just doesn’t cut it here.

I think you’re looking at the relationship between Vulkan and hardware incorrectly. Vulkan does not define the hardware; the hardware defines Vulkan. Much of Vulkan is low-level, so most of Vulkan’s design decisions are based on exposing hardware capabilities in a cross-platform way.

If you cannot easily do something in Vulkan, that is because some or all of the hardware Vulkan abstracts cannot easily accommodate such things. Why is there no strided index list in Vulkan? Because no hardware exposes it. Why are renderpasses so hard-coded and attached to pipelines and such? Because there is hardware that needs this abstraction to be efficient. And so on.

If hardware becomes popular which can perform “a tetrahedron voxelizer that can work directly with index quadruplets”, then Vulkan will adapt to support it. But Vulkan does not tell the hardware what to do; it exposes what hardware can do.

As for the “focus on interactive media”, you’re misinterpreting Vulkan’s design in this regard. It’s not necessarily about “interactive media” (though I’m not sure what you would be doing with rendering that is not in some way “interactive media”. Even displaying an MRI or whatever is still “interactive media”). What it’s about is working in an environment where the Vulkan-interfacing code has some idea of what’s going on.

Also, FYI Salabar:

It literally always has been this way. You have some hardware that can do array processing well. Since triangle rasterization cannot be efficiently mapped to array processing, there is a middle man that turns triangles into arrays that can be fed back into the array processing HW. There is literally nothing else in a GPU that is meant to work specifically with graphics.

No, it has not “literally always” been this way. While this may be an accurate description of a modern GPU (and I’m not even sure it’s that accurate, especially with respect to TBRs, pseudo-TBRs, and the like), it certainly has not “always” been how GPUs worked.

Even on a modern GPU, there are still plenty of operations “meant to work specifically with graphics”. ROPs perform graphics operations like blending, logic ops, etc. The fixed-function vertex processing steps (including clipping) are very much graphics operations. Specialized depth-culling hardware is very much graphics hardware.

GPUs are not just rasterizers with free-form programming logic. While the free-form programming may be the bulk of a GPU’s die, there’s still plenty of fixed-function graphics happening there.

My apologies for the extended discussion on GitHub. I did not have a forum account, as I’m normally non-vocal about issues like these.

Yeah, we should strive to keep the GH clean. An engineer borrowed by Khronos for one day a week or something probably won’t read our extended philosophical tractates; it’s just asking for the Issue to be closed without a reason.

In here though, we can open the floodgates. And it seems yours was already overflowing.

<hr>

What I feel is missing here are some Vulkan paradigms. Let’s have a non-exhaustive list:

  1. Vulkan is minimal. If an app/lib can do something equally well, Vulkan should not do it itself.
  2. In Vulkan you do not pay for what you do not (intend to) use.
  3. In Vulkan, performance should be predictable (i.e. calling something with some parameters should have comparable performance each time; it should not sometimes unpredictably trigger GC or anything else that would cause a hitch).
  4. Vulkan does not memoize state unless it has to. If it has state, it tries to keep it constant and object-scoped where possible.
  5. Vulkan tries to support all contemporary HW.

We can discuss whether those are the right goals to strive for. But all of those are derived from OGL criticisms, so they are probably directionally right.

<hr>

I don’t necessarily see how your Problems directly follow from your Requirements.
One could contest some of those requirements; I reserve that right for later. For now I would just attack it from a conceptual level:
This reeks a bit of top-down engineering. The problem is, this domain requires bottom-up engineering. Vulkan has to conform to whatever the contemporary HW in the wild is (no matter how good the users are with a soldering iron, they won’t be able to change their current GPUs). Consequently the app/lib must conform to whatever Vulkan has, and the requirements must be built with that in mind.

Vulkan is different from OpenGL, which means apps and libs on top of it need to be re-engineered as well (to be efficient). The abstractions have to be built from the bottom up again in order for them to be zero-overhead (or close to it).

Some of the requirements also seem like they would require violating the above paradigms in the Vulkan API (e.g. requirement 7 may require giving up Paradigm 2 and/or 5).

<hr>
Ad Problems:

VkXXX structures are “rigid” in the sense that we can’t get away with just putting them inside another structure and having an array of those instead.

I do understand this. The problem is the lack of better alternatives, and I am not sure there are many precedents for solving this. Most C APIs are “suck it up and give me the data in the layout I expect” style.
So, the current state requires you (in the worst case) to copy out the structures, or to keep the custom metadata elsewhere. Well, copying is fast. It is nice for the cache in the driver (it gets a sequential array of only the data it needs). Keeping the metadata separate may not be so bad either (the only hypothetical risk is that it may contend for the same cache line).

So, what are the alternatives?

C++ solves this using iterators. You can access whatever, wherever, with them. You can choose whether your data is sequential in memory, sequential only abstractly, randomly accessed, or whatever your imagination can come up with. The problem is that iterators have overhead if they cannot be inlined (and here they cannot, because we are at the DLL boundary). Secondly, we are in C, not C++, which makes them a pain to implement in a useful way. We would only move the above app overhead into the driver, and make the driver more complex at the same time. Bad.

I think you proposed that Vulkan should own the structures and you would set them with vk commands. Well, that is basically making a copy of your own data. And if the setters are per-field, that’s horrible compared to a simple, nice, (almost) sequential copy. Also, the calls cannot be inlined, so all the API calls would cost you much more than a plain copy. Also, the structs would have to be allocated and could not be simple stack objects. Extra bad.

Providing a void* array and a stride seems like a viable compromise. It may have some overhead (the stride is a variable, not constexpr as it is for a normal pointer). It seems horrible to use (all the void* instead of proper types, and all the sizeofs for extra parameters). And let’s remember we are talking about the worst case, so a regular user using the structures in the normal way would have to pay for something he does not use (Paradigm 2). Also, for me it is not really settled that the app cannot do the same or better itself (Paradigm 1), though I yielded that point in the first paragraph of this section by assuming you may need to copy. It would take some extra convincing for me to subscribe to this idea (I am not Khronos, so my opinion does not matter; but they will need some rationalization to sell it too).

We need a way to signal a fence from the host

Why? That’s against Paradigm 1. Using a VkFence to signal from the host to a waiter that is also on the host is well outside the scope of Vulkan. Seems like a job for std::condition_variable or some such.

The use of the fence-then-select paradigm currently leads to a problem where a task pop thread will end up waiting on a batch of fences obtained from long-running tasks, but fail to respond to the completion of a short-running task.

Why? vkWaitForFences has both modes. If you set waitAll to false, it returns as soon as any fence signals, so you can pop whichever task finished first.

we need to use some kind of completion port

Seems to me it should be possible to implement a completion port from the primitives Vulkan provides (Paradigm 1).

I feel there is too much focus on interactive media.

It is not like we consider it the pinnacle of humankind and the only thing ever worth doing.
The thing is that “interactive media” has become so complex that it practically exercises every technique or use case you would find in other kinds of apps. Hence the sentiment: “it works well for games, so it is probably OK generally”.

we’re looking at a future where the hardware we have to work with does not meet our requirements

Vulkan is not a “specification by committee” as DX sometimes tends to be. If you are worried about future hardware, you need to talk to the HW vendors. Vulkan will always follow whatever the HW currently is. It cannot afford to alienate users by telling them their GPU is “too old”, much less dictate what next-gen GPUs will be and wait until they are released (if ever).

I would like to have seen “graphics on compute” instead of “compute shoehorned into graphics”.

I used to think like that, but if you think about it, it actually matches reality. Compute is basically a glorified fragment shader. Graphics is a superset of compute: it contains all the generalized compute machinery plus some very specialized graphics accelerators that are otherwise useless for anything else. It is the typical generalized-but-slow vs. specialized-but-fast trade-off.

And well, it is shoehorned. People buy GPUs to play (graphical) games, not to compute the last digit of Pi or something. I am sure you could find a pro card that only implements OpenCL for some supercomputer, but it probably is not for us mere mortals.

There is an enormous application domain awaiting a tetrahedron voxelizer that can work directly with index quadruplets. Rasterization just doesn’t cut it here.

And what does? The reason surface representation with triangles won out is that it is vastly more efficient.
Still doable, though. Hell, Minecraft is just a bunch of voxels.

Some of your quotes are out of order.

[QUOTE=Alfonse Reinheart;42985]Well, that confusion is partially because you never explain what it is you’re actually trying to accomplish. Indeed, even here in this thread, where you’ve given the clearest description yet of what you’re writing, you’ve left out one of the most important things about it:

What it is.

From the description, I can deduce that you’re writing some kind of general-purpose rendering middleware. But that sort of thing ought to be the very first thing you say, not something that should have to be deduced from the list of requirements.[/QUOTE]

I’m not able to edit the post, it seems.

It is part of a distributed simulation architecture that supports plugins. Plugins are allowed to do just about anything (because they might be wrappers around just about anything). There is a limit to the amount of information I’m able to disclose, though. I’ll admit the stated requirements for resources are a bit hyperbolic, in the sense that if I don’t make that restriction, someone will come along and offer a “but, for what reason?” without contributing to the conversation. I’m not perfect, and I don’t expect perfection of anyone, though I long ago lost my patience with that behavior.

There are two “modes” of usage: Free-threaded and staged. Free-threaded is meant for single-resource or small batch uploads in different parts of the application pipeline and within plugins. Staged can already out-perform its OpenGL+CL counterpart by an order of magnitude (on average), simply because it has explicit control over resources and can deduce an optimal queue submission schedule - something the Vulkan API was designed to allow a human developer to do by hand with a somewhat workable degree of predictability for a particular device. Staged mode is very, very fast. No complaints here about device-side performance.

Every “low-level” task in Vulkan can be automated in such a way that the difference in performance compared to a hand-written implementation is negligible or nonexistent (based on the limited set of devices I tested, anyway). This depends on a few key pieces of information about the device that aren’t readily available from the API, and on a couple of design asymmetries (you can’t signal a fence from the host; more on that below). I’ve managed to get by with a handful of XML files outlining device characteristics for the machines I’m working with. The general case, however, is typically about 3/4 the latency of OpenGL (parallelizing across shared contexts doesn’t always do what it should).

“Display lists” basically do all the same work a developer does while building a renderer that handles only the category of resources provided up-front (not a set, a category: we can swap out image bindings and descriptor sets outside the list, change loop iterations, etc.), and compile it into a structured sequence of secondary buffers. This includes loops over “descriptor sets” and, soon, branches. This is why it would be nice to have a standardized glslangValidator switch that lets us omit all the “layout” gibberish from shader code (and produce SPIR-V placeholders where they are otherwise expected). Given how shaders are used in this framework, it is possible to infer from the SPIR-V which attachments and descriptors are used, and in what manner, in the currently bound pipeline, together with knowledge about where loops will divide descriptor bindings. From that, we can build a renderpass without any additional input; that is the power of the ages-old “bind by name”.

I’ll add that SPIR-V is far and away the best thing that has come out of this (despite the utter uselessness of OpEntryPoint’s reference list), second only to the consistent VkResult return on everything critical.

If I have a chance to effect a change in a direction I believe will offer my particular use-case better performance or ease of implementation, while ostensibly (from my perspective) having little or no negative effect on any other application, I will take it. I’ve dealt with OpenGL since v1.1, more than long enough to say it is overdue for a complete re-write or outright deprecation in light of how the hardware has evolved.

The immediate gain in performance, even with a naive implementation, is why I’ve spent the past year working with it, and why I’m carping about nonsensical design obstacles. This is not, by any stretch, the uninformed knee-jerk choice the typical response casts it as.

[QUOTE=Alfonse Reinheart;42985]You don’t need to keep any “VkXXX initialization structures” around. These can be created as needed on the stack. Remember: unless otherwise stated, any pointers that you feed to a Vulkan interface function will have been completely used by the time the function returns. So there’s no need to keep those objects around. Create them on the stack, call Vulkan functions, and you’re done.

Most of the time, such structures should be ephemeral.[/QUOTE]

Most of them are (anything containing pointers). Anything attachment- or barrier-related ends up getting re-used, though. I don’t require, or even like, the idea of strided APIs everywhere. I would suggest instead some way to insert a barrier optimization hint into the command stream. vkCmdPipelineBarrier is a very busy function for which an optimal call is difficult to schedule.

[QUOTE=Alfonse Reinheart;42985]Well, hardware can’t do that. Index lists are a function of the hardware, and it doesn’t have the ability to have a stride.

You can implement that yourself, using VertexIndex to fetch from a strided index array accessed through an SSBO. But the hardware isn’t implementing it for you.[/QUOTE]

Already tried that. It’s actually slower than copying elements in one at a time and double-buffering it. Still slower than it needs to be.

Time for hardware more appropriate for the task at hand. Yet we keep pouring money into these unsatisfactory ideas. These APIs have to be designed and used assuming the device is on another planet. The latency alone is why so much work has to be done on the device, and why this simple problem has been given such complex attention. I would much rather find acceptable the idea of writing a software renderer from the ground up than having to contort an entire framework to fit in some high-latency Procrustean bed. This will probably not happen within my lifetime, or perhaps never, just to spite my mentioning it, but that will not stop me from complaining about it. x86 needs a thorough once-over as well.

The “on another planet” quip takes us to the VkFence problem. You mentioned VkEvent; I need to use a VkFence to control another thread already waiting on a set of fences, like a condition variable. Not the intended purpose of the object, but a consequence of free-threaded mode. I imagine this can be done with a spare queue, a command buffer with a VkEvent, and throwing the VkFence into the wait set every iteration. Kludgey but doable; this isn’t going to add latency to anything unless it ends up stuck there for a few hundred milliseconds. Just one more entry point nobody has to import or use, unless they encounter this particular use-case.

Alternatively, we could be using something like a completion port. Create a port, bind a queue to it with a cookie, and whenever something finishes on that queue, the driver will post that cookie to the port. I’ve used completion ports to great effect in high-performance servers, and they are much more responsive and easier to manage than select loops (WaitForMultipleObjects). I see the problem of communicating with a remote piece of hardware, graphics-oriented or otherwise, as no different than managing a connection. This would mean 5-ish more entry points and a new kind of handle nobody is required to use if they don’t need it.
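
To make the semantics concrete, here is a minimal CPU-side model of the post/wait contract I have in mind, written with standard threading primitives (the class and method names are mine; nothing here is real Vulkan):

```cpp
#include <condition_variable>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <queue>
#include <vector>

// CPU-side model of the proposed completion-port semantics: producers
// post opaque cookies, and a consumer thread blocks until at least one
// cookie is available, then drains a batch of them.
class CompletionPort
{
public:
    void Post( std::uintptr_t cookie )
    {
        {
            std::lock_guard<std::mutex> lock( m_mutex );
            m_cookies.push( cookie );
        }
        m_cv.notify_one();
    }

    // Blocks until something is available, then returns up to maxCount
    // cookies in submission order.
    std::vector<std::uintptr_t> Wait( std::size_t maxCount )
    {
        std::unique_lock<std::mutex> lock( m_mutex );
        m_cv.wait( lock, [this]{ return !m_cookies.empty(); } );

        std::vector<std::uintptr_t> out;
        while( !m_cookies.empty() && out.size() < maxCount )
        {
            out.push_back( m_cookies.front() );
            m_cookies.pop();
        }
        return out;
    }

private:
    std::mutex m_mutex;
    std::condition_variable m_cv;
    std::queue<std::uintptr_t> m_cookies;
};
```

A driver-side version would post the bound queue’s cookie whenever a submission retires; the pop thread never needs to know which submissions are in flight, which is the whole point.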

The hardware is precisely what I was complaining about in that statement.

That was daydreaming on my part. 3D textures, as I understand them, would need to be something very different for that to work.

The whole thing is tailored for fixed architectures (games and benchmarks). There is potential in a few places for quite a bit more at what I think is a reasonable expense (or potential that simply isn’t there due to oversight).

If OpenGL + OpenCL (or cuda, for that matter) did what I needed, I would have been done with this and moved on to something else last year. Don’t mistake my criticism for condemnation, or pointing to the elephant in the room as naivete. I said nothing about OpenGL during the whole time I had to deal with it, as it was already a smoldering dumpster fire by the time I started using it. Overall, Vulkan is a tremendous improvement. There are just a number of holes that need to be patched.

[QUOTE=krOoze;42986]Yeah, we should strive to keep the GH clean. An engineer borrowed by Khronos for one day a week or something probably won’t read our extended philosophical tractates; it’s just asking for the Issue to be closed without a reason.

In here though, we can open the floodgates. And it seems yours was already overflowing.[/QUOTE]

No kidding. I’ve already addressed a number of your statements in my last post. This site has an annoying timeout bug, so I’m having to compose my responses elsewhere.

[QUOTE=krOoze;42986]1. Vulkan is minimal. If an app/lib can do something equally well, Vulkan should not do it itself.
2. In Vulkan you do not pay for what you do not (intend to) use.[/QUOTE]

You are preaching to the choir.

Not possible without being able to “prove” your application is running on a particular category of device. Knowing about specific devices is one thing, but being able to classify yet-unknown devices is where we can substantially “future-proof” things at very little expense to the application.

I forgot to mention this in my last post, but I once brought up defragmentation. Technically, an implementation is allowed to embed absolute addresses and other problems within device-local data. This would obviously preclude “in-place, shift down” defragmentation, which is something this framework desperately needs (unless, of course, you don’t mind about 30 seconds of streaming a few GiB out to a network disk and back again). Knowing whether this is possible would allow an application to choose the most appropriate strategy for the hardware.

Right now, that isn’t even possible with queue families: we can have an unbounded number of different queue families with random capabilities, and the only way to distinguish them for any purpose is to run mini-benchmarks. That is a one-time cost at installation, or whenever the driver changes, but there needs to be a better way than having to analyze the hardware to that extent.

[QUOTE=krOoze;42986]4. Vulkan does not memoize state unless it has to. If it has state, it tries to be constant object-oriented state where possible.
5. Vulkan tries to support all contemporary HW.[/QUOTE]

And that is why it is excellent, for the time being. At the same time, this sets a foreboding precedent: “Hey vendors, it’s okay to keep selling this kind of equipment. Just look at all the developers happy to deal with it.”

The problem: A plugin can be anything, and needs to be able to create and manage arbitrary objects on the device (mainly for compute). There is no way around this. As I mentioned, I was a bit hyperbolic about the requirements.

To be specific (and I get complaints about this), clients need to specify use-cases per arena. It’s easier for me to argue with “that sucks, oh well” than with a thread filled with “but, why?”. Another foot in my mouth, but I can’t edit the post. There are limitations, obviously. For example: images allocated in arenas that are both device and host visible simultaneously (as with a memory mapping) can only ever assume VK_IMAGE_LAYOUT_GENERAL and have VK_IMAGE_TILING_LINEAR. Consequently, every usage so far has been on-device with separate host staging, which allows for virtually any kind of resource allocation.

As for strided inputs: The need for structures here could be eliminated with separate vkCmdPipelineXBarrier functions and a separate vkOptimizePipelineBarriers function. That way, the application can use whatever it wants, and the implementation is then free to optimize redundant barriers, but only when that function is called. Supposedly, it already re-orders barriers anyway.
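
To illustrate the kind of folding a vkOptimizePipelineBarriers pass could do (both function names above are my proposal, not existing entry points), here is a toy CPU model that drops duplicate barriers and fuses chained transitions on the same resource:

```cpp
#include <vector>

// Toy model of the proposed record-then-optimize split. Stages are plain
// ints standing in for VkPipelineStageFlags; none of this is real Vulkan.
struct Barrier
{
    int resource;  // which resource the barrier covers
    int srcStage;  // stage(s) to wait on
    int dstStage;  // stage(s) that must wait
};

// Drop exact duplicates, and fold back-to-back barriers on the same
// resource: the chain A->B followed by B->C becomes a single A->C.
std::vector<Barrier> OptimizeBarriers( const std::vector<Barrier> &recorded )
{
    std::vector<Barrier> out;
    for( const Barrier &b : recorded )
    {
        if( !out.empty() && out.back().resource == b.resource )
        {
            Barrier &prev = out.back();
            if( prev.srcStage == b.srcStage && prev.dstStage == b.dstStage )
                continue;  // exact duplicate, drop it
            if( prev.dstStage == b.srcStage )
            {
                prev.dstStage = b.dstStage;  // fuse the chain
                continue;
            }
        }
        out.push_back( b );
    }
    return out;
}
```

A real pass would also have to respect access masks and cross-resource ordering, but the shape is the same: the application records whatever is convenient, and pays for the folding only when it asks for it.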

Back to concurrency: asking for a VkCompletionPort isn’t going to force anyone to re-write anything (beyond adding a few lines to the header), and it’s far from impossible to implement, even on mobile devices. With that, there is no need for a VkFence anywhere in free-threaded mode, and the engine can pre-allocate a fixed number of fences up-front for staged mode, as with any other implementation.

It already waits for any fence. The problem is that a short-running task could have been submitted to the pop thread after it entered a vkWaitForFences call. It needs to loop back to pick up another batch of fences; otherwise said task will end up waiting on those prior. This is nearly identical to usage of select(), and is why it is called a “select loop”. The idea of using an fd for the sole purpose of waking up the selector thread is why it’s a concurrency anti-pattern in some contexts, mine in particular. This is also why we like things like epoll() and similar.
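
For comparison, here is what that workaround looks like when you do have fds: the classic “self-pipe” trick, where one pipe end sits in the poll set purely so another thread can kick the selector into rebuilding its wait set (POSIX-only sketch; the wake-up write is done inline to keep the demo deterministic):

```cpp
#include <poll.h>
#include <unistd.h>

// A selector blocked in poll() cannot pick up new waitables, so we plant
// a pipe's read end in the poll set; writing one byte to the other end
// wakes the selector, which drains the byte and rebuilds its wait set.
bool DemoSelfPipeWake()
{
    int pipefd[2];
    if( pipe( pipefd ) != 0 )
        return false;

    // Another thread would do this write when it adds a new waitable.
    char wake = 'w';
    (void)write( pipefd[1], &wake, 1 );

    pollfd pfd = { pipefd[0], POLLIN, 0 };
    int ready = poll( &pfd, 1, 1000 /* ms timeout */ );

    bool woken = ( ready == 1 ) && ( ( pfd.revents & POLLIN ) != 0 );
    if( woken )
    {
        char buf;
        (void)read( pipefd[0], &buf, 1 );  // drain the wake byte
        // ...the selector would now re-collect its fds and poll again...
    }

    close( pipefd[0] );
    close( pipefd[1] );
    return woken;
}
```

vkWaitForFences offers no analogue of that wake fd, which is why the workaround there needs a spare queue and a sacrificial fence.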

[QUOTE=krOoze;42986]It is not like we consider it a pinnacle of humankind and the only thing ever worth doing.
The thing is that “interactive media” got so complex that it practically does whatever technique or use case you would find in other kinds of apps. Therefore the sentiment “it works well for games, so it is probably OK generally”.[/QUOTE]

I wasn’t expecting perfection, but not gaping holes, either. With an explicit API, we need explicit knowledge. What we have for API-supplied device wisdom is basically a slight superset of what OpenGL could provide, albeit in an admittedly more convenient format. I’m not averse to building something complicated; the specification just leaves too much open-ended to address this in general. Queue family selection is ultimately undecidable without a benchmark, which is really overkill unless it’s built into a re-usable framework.

Overall, just give me VkCompletionPort and I’ll shut up about most of these problems. That, and a way to discover POD-ness of on-device allocations and I’ll just go away.

They’re part of the ARB, and could very well stumble upon these musings. I don’t have any contacts in that industry, so I have no idea who to speak with. However, as you noted, we’re looking at decades of hardware adoption nearing the crest of its hysteresis curve, which means it would require either a revolutionary new and inexpensive technology to get people to switch over right away, or hundreds (yes, hundreds) of years of marginal, evolutionary changes that gradually erode the cruft. I blame phones and consoles.

[QUOTE=krOoze;42986]I used to think like that, but if you think about it, it actually matches reality. Compute is basically a glorified fragment shader. Graphics is a superset of compute. It contains all the generalized compute stuff plus some very specialized graphics accelerators that are otherwise useless for anything else. It is the typical generalized-but-slow vs. specialized-but-fast.

And well, it is shoehorned. People buy GPUs to play (graphic) games, not to compute the last digit of Pi or something. I am sure you could find a pro card that only implements OpenCL for some supercomputer, but it probably is not for us mere mortals.[/QUOTE]

In my hypothetical world of industry driven to create functionally correct, simple, and efficient designs, we would have only compute kernels and programmable simplex voxelizers. Need triangles? That’s just a set of 2-simplexes voxelized into an NxMx1 volume. Need tetrahedra to interpolate a scalar field? Bind and go! Look ma, no depth peeling! Need blending? Make a kernel to combine the output of the prior stage or two into another volume. Want to do depth testing for an old-fashioned rasterizer effect? Make a kernel for it and use a garden-variety flat scalar volume. Don’t need it? Well, it isn’t there, so the concept isn’t taking up space in the pipeline state object - having to do nothing at all is the best kind of optimization. Also, you can have as many stages as you want.

For hardware designed with these concepts in mind, and with enough information about its peculiarities, it is theoretically possible to optimize compositions of kernels and voxelizer invocations the same way renderpasses were justified, and without all the graphical baggage. Extend renderpass to “composition”, once again demand that an application work out resource scheduling ahead of time, and it could be just as performant. This, for me in this context, is the crux of “don’t pay for what you don’t use”.

[QUOTE=differentiable]programmable simplex voxelizers. Need triangles? That’s just a set of 2-simplexes voxelized into an NxMx1 volume.[/QUOTE]

Out of curiosity, how do you even propose to implement hardware acceleration of voxelization? All I know about the technology is that octrees are somehow involved. And memory-heavy operations, such as searching a tree or adding elements to one, are not tasks a specialized ASIC will handle any better than, say, compute shaders.

Let’s assume that this is true. Let’s assume that a Vulkan implementation could keep track of these things and handle them just as well as the external code. That it can detect where images get used, that it synchronizes between queues automatically based on commands in CBs, etc.

So what?

If the external code doesn’t need the implementation to keep track of this, if “keeping track” of it is trivial for the external code, then any performance the implementation spends on managing this stuff is a waste of that external code’s time. Every conditional branch in a vkQueueSubmit call, every “check the list of resources we’re about to use”, every “see if I need to insert a semaphore here” is time that does not need to be spent for this application.

It’s not making their applications any better. Why should developers for apps that don’t need such tracking pay this price?

Not only that, what if there’s a bug in this complex chain of mutexes and synchronizations in the implementation? Can you go in and fix it? No; you’re stuck with it until a new driver release comes out.

This is the OpenGL problem all over again.

As for your assumption (that implementations can automate “low-level tasks” with no loss of performance compared to the high level), let’s see some evidence. Because I have evidence to the contrary.

On Github, you mentioned wanting an “in-between” API. Well, you had that: it was Direct3D 11’s deferred contexts. Those were basically command buffers; they could be asynchronously built and managed. Why didn’t that take?

Because AMD never implemented it. Their hardware uses image layouts more than any other hardware out there. And thus, it is their hardware and their implementations that will suffer from having to manage layout transitions. When you submit a deferred context, the implementation would have to have checked the layout of every resource and invoked a transition if it was not where it needed to be.

Automatic layout management simply was not a reasonable thing for them to implement. Why would that be any different for Vulkan than D3D11?

Vulkan is already very well designed in the direction you describe. Points of CPU/GPU synchronization are few and far between. Those few points of synchronization are well documented and not easily triggered. And a well-coded application can avoid such synchronization except when absolutely necessary.

Your problem is that you keep wanting the CPU and your “display list” design to impinge on GPU stuff. It is not Vulkan or the hardware that is creating this “high-latency Procrustean bed”; it’s your application’s design.

Vulkan is very low-latency. Your Vulkan application is not.

But that contradicts what you just said. If the device is “on another planet”, then having the CPU wait on the device to reach a certain point is very much not something you should want to do. If the GPU is “on another planet”, tight coupling between the CPU and GPU should be discouraged, not encouraged.

Furthermore, you’re thinking of the problem backwards. You already have a way to have one CPU thread signal another: mutexes or condition variables or somesuch. Your essential problem is that you want a CPU thread to wait on both GPU fences and CPU mutexes. You have APIs to wait on multiple CPU mutexes, and you have APIs to wait on GPU fences. But you can’t do both in one call.

What you say you want is to make CPU mutexes into GPU fences. But making GPU fences into CPU mutexes would solve the problem just as effectively.

And you can do that already, if the implementation allows it. You can export those GPU fences as CPU-waitable handles, via vkGetFenceFdKHR/vkGetFenceWin32HandleKHR. Then you can use existing CPU APIs to wait on them.

“Fixed architectures” covers 95+% of all high-performance graphics tasks. This includes things like augmented reality overlays, MRI visualization, VR home walkthroughs, and so forth. None of these application domains need the kind of flexibility that you’re talking about.

So what are the “few places” you’re referring to?

Vulkan as an API provides enough information to gain reasonable predictability of performance. Is it everything? No, but that’s an unreasonable standard. Not only that, making decisions based on hardware performance metrics is at best dubious. At worst, it leads to the application doing something suboptimal based on some number.

And Vulkan promises nothing that is “future-proof”.

Being able to do that is not why you are allowed to allocate memory for the implementation. Also, the implementation is still allowed to allocate memory without your allocation functions (that’s what the “internal” allocation callbacks are for: the implementation must inform you of such allocations, but you can’t control them).

Besides, how would that even be possible? If the driver allocates some memory, then it has to store that pointer somewhere, right? For example, if you create a logical device, the driver has to allocate memory for it. Probably several objects, all linked back to that VkDevice object. So you can’t just move one of these objects to some other location without the implementation’s knowledge and consent.

Being “POD” is not a viable test.

Also, I’m curious as to how much device memory you’d be using that provokes several gigabytes of disk access.

Analyze the hardware to what extent? Queue families tell you what they can do. What more do you need to know?

For example, take a typical NVIDIA queue setup: one graphics queue family and one transfer queue family. Do you really need to benchmark the performance of the transfer queue to know that if you have asynchronous transfer needs, you probably ought to use that rather than your graphics queue?
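
That choice is a couple of flag tests, no benchmark required. A sketch with stand-in flag bits (the values mirror VK_QUEUE_GRAPHICS_BIT and friends; note that in real Vulkan, graphics and compute families implicitly support transfer, which this toy version checks explicitly):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-ins for VkQueueFamilyProperties::queueFlags bits.
constexpr std::uint32_t GRAPHICS_BIT = 0x1;  // VK_QUEUE_GRAPHICS_BIT
constexpr std::uint32_t COMPUTE_BIT  = 0x2;  // VK_QUEUE_COMPUTE_BIT
constexpr std::uint32_t TRANSFER_BIT = 0x4;  // VK_QUEUE_TRANSFER_BIT

// Prefer a dedicated transfer family (transfer without graphics or
// compute); otherwise fall back to any family that can transfer.
// Returns the family index, or -1 if none qualifies.
int PickTransferFamily( const std::vector<std::uint32_t> &familyFlags )
{
    int fallback = -1;
    for( std::size_t i = 0; i < familyFlags.size(); ++i )
    {
        const std::uint32_t f = familyFlags[i];
        if( !( f & TRANSFER_BIT ) )
            continue;
        if( !( f & ( GRAPHICS_BIT | COMPUTE_BIT ) ) )
            return static_cast<int>( i );  // dedicated - best candidate
        if( fallback < 0 )
            fallback = static_cast<int>( i );
    }
    return fallback;
}
```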

Oh sure, it is theoretically possible that the cost of transferring objects between queues and semaphore waits will be more expensive than just using the graphics queue. But that’s really unlikely.

And note that even this possible inefficiency doesn’t make performance unpredictable, as krOoze defined it. A particular queue usage pattern will never randomly cause a hitch on a piece of hardware. Or at least, not without you knowing that is possible (since you put in the semaphore wait that caused the hitch). You choose that queue usage pattern; you set the semaphore wait; you knew what you were getting into.

The problem with your hypothetical world is that nothing useful happens there. People in that world spend so much time trying to generalize everything and predict the future that they forget to actually do anything right now.

In your hypothetical world of “sets of 2-simplexes voxelized into an NxMx1 volume”, we might nowadays have scenes of 50,000 to maybe 100,000 polygons per second. Blending would basically halve your performance (at best). Depth testing would reduce it again.

Your way of thinking leads to the idea that the days before programmability were just wasted effort. That the Voodoo 1 was utterly worthless because it was just a triangle rasterizer.

But in reality, the Voodoo 1, and the marketplace that it created, was absolutely essential to the modern programmable graphics world. Those non-programmable steps were a necessary part of the evolution of the modern GPU. At all times, practicality must trump idealism. We still eventually got to programmability.

Mobile GPUs are a great example too. They adopted programmability much faster than desktop GPUs. But that’s because people had already figured out how to do programmability; mobile GPUs were just copying bits out of desktop GPUs. Without having those bits to copy, they wouldn’t be able to get where they were nearly as quickly.

Trying to summon a future that is not ready for the present solves neither future problems nor present ones.

It’s easy to say that something is “theoretically possible” on a forum. Unless you’re actually bringing a product to market, or at least creating prototype hardware that can actually do it, you have no right to arbitrarily decide that hardware vendors aren’t doing their job.

I think I know what this is about now. He does not actually want that (and prefers his IOCP). He wants it only as a workaround/hack to enable adding VkFence to a wait operation after it was started. I.e. to add this fake host-host VkFence with all the real ones in vkWaitForFences. Then when he wants to add a new fence into the wait, he would interrupt the wait with a signal to the fake and start a new wait including the newly added fence.

@differentiable Maybe you can emphasize the point in your IOCP GitHub Issue that you want to be able to add a VkFence to vkWaitForFences after it was started. Seems reasonable to me at first sight. Knowing which VkFence woke the wait (without a for-loop of vkGetFenceStatus) also seems beneficial to the IOCP case.

Yeah, thanks. As nice as it would be to know everything about performance in a god-like way up front, the Paradigm as I defined it is much humbler, as Alfonse here says.

You can still make a (bad) implementation that hitches. And performance characteristics can still differ across different devices.

The only thing that is different is that the Vulkan API does not encourage (or even necessitate) the hitching like OpenGL did.

OK. Paradigm 6: Vulkan is extensible, to show off new techniques and features and not to stifle innovation.

I’ll update it when I can get around to it.

Here’s a very rough sketch:


VkCompletionPortCreateInfo info;
// blah...
VkCompletionPort hPort;

result = vkCreateCompletionPort( hDevice, &info, &allocators, &hPort );
HANDLE_ERROR_OR_EXPLODE( result );

result = vkBindQueueCompletionPort( hDevice, hPort, hQueue );
HANDLE_ERROR_OR_EXPLODE( result );

// etc...

An example “task->onPush”:



// onPush and onPop are like std::function<>s, NOT member functions
//
// This callback is where a task can submit to whatever queues it requested
// in whatever order is appropriate. In this example, the submitted task
// requested just dstQueueFamilyIndex

VkSubmitInfo info;
// ...
// command buffers and other things
// ...
info.completionCookie = reinterpret_cast<uintptr_t>(pTaskBlock);

return vkQueueSubmit( pTaskBlock->pQueueControls[dstQueueFamilyIndex]->hQueue, 1u, &info, VK_NULL_HANDLE );


An example “pop thread”:



#define TERMINATION_COOKIE_VALUE ((uintptr_t)1)

uintptr_t cookies[POP_COOKIE_COUNT];
do
{
    uint32_t count = POP_COOKIE_COUNT;

    // vkWaitForCompletion( VkDevice, VkCompletionPort, uint32_t *pInOutCookieCount, uintptr_t *pOutCookies, uint64_t timeout );

    result = vkWaitForCompletion( hDevice, hPort, &count, cookies, UINT64_MAX );
    HANDLE_ERROR_OR_EXPLODE( result );

    ASSERT( count <= POP_COOKIE_COUNT );

    while( count-- )
    {
        auto cookie = cookies[count];

        if( TERMINATION_COOKIE_VALUE == cookie )
        { return; } // task cleanup done by join thread

        auto pTaskBlock = reinterpret_cast<TaskBlock*>(cookie);

        pTaskBlock->onPop();
        pTaskPool->ReturnTask( pTaskBlock );
    }
}
while(true);


An extension that lets me somehow treat a VkQueue like a file handle suitable for IOCP would probably work as well.

This might be useful for receiving vblank events, too.

… that just sounds like an awful design. As unpleasant as a polling interface may sometimes be, polling sounds like a far more reasonable solution for cases where you need to do task X after A, B, and C are done, but there may be an interrupt-level priority task Y.

I’m not an expert when it comes to threading and synchronization, but I’ve always believed that threading code is at its best (in terms of performance, correctness, and maintainability) when inter-thread dependencies are as minimal as humanly possible. Even if it means restricting what people can do. This design overall seems to revel in inter-thread dependencies. I’m also a fan of lock-less design, and this design has lots of locks.

Extensions kinda fail at it. At least for relatively radical ideas.

Now, there are some things that would be fairly simple to do via Vulkan extensions. For example, TBRs that can allow the fragment shader to do blending. That’s fairly easy to specify, and it would work well with the existing barrier/synchronization system.

But consider something more radical. Consider a new compute pipeline that allows you to bind arbitrary compute stages in a well-defined sequence, like a flexible graphics pipeline. How would that work with barriers? After all, the barrier structures all take enumerators as their stages; with this arbitrary compute shader sequence, you’d want to use indices (and probably not a fixed number, so none of that GL_TEXTURE0+i garbage). So now, for every data structure and function that does synchronization, you need to have a new version of that struct/function which can take indices instead of VkPipelineStageFlags.

That’s a lot of functions and data structures.

Oh yes, you can make an extension that does this. But it would be so enormous that you’d be rewriting a fair portion of the specification just to deal with it.

Every abstraction has limitations, and Vulkan is no exception. If substantial change comes to GPUs, Vulkan will have to be redesigned to change with them. It won’t just be some extensions or a point release or whatever.

[QUOTE=Alfonse Reinheart;42993]… that just sounds like an awful design. As unpleasant as a polling interface may sometimes be, polling sounds like a far more reasonable solution for cases where you need to do task X after A, B, and C are done, but there may be an interrupt-level priority task Y.
[/QUOTE]

Just interpreting @differentiable here.

I imagine it is intended for some kind of producer-consumer situation.
I.e. a buncha producers are randomly submitted to a queue with a fence, and a buncha consumers are standing by for the results.
So you need some kind of event system that says there is a product available (i.e. pop on one fence signaled at a time). And you need to be able to add fences as you go (when additional producers are added).

Seems to me you’ve only proven that radical ideas are radically hard to implement.

The alternative is to start completely from scratch, which is probably worse/harder. I mean, eventually clean-slate engineering may be warranted (OpenGL -> Vulkan), but extensions are good enough to make a prototype (and oftentimes more than OK for production use). And the clean-slate approach is all the better if the radical ideas accumulate over time and already have some real-world use behind them.

Also, it seems to have worked fine historically. If I am not mistaken, the programmable pipeline started as an extension. How can you get more radical than that?

Queue submission is not necessarily the only source of a VkFence. It might be better and more explicit to just accept VkFences directly.

Even so, it leaves the threading problem unresolved.
I am gonna need Paradigm 7: Vulkan does not perform internal synchronization (if it can be avoided).
It seems this would require some internal synchronization. The queue completions can come from any thread (presumably the same one that did vkQueueSubmit). And the wait is presumably on yet another thread.

I don’t want to come off as picking anyone apart, and I’m too lazy to use quote tags for absolutely everything right now.

[QUOTE=Alfonse Reinheart;42990]Vulkan is already very well designed in the direction you describe. Points of CPU/GPU synchronization are few and far between. Those few points of synchronization are well documented and not easily triggered. And a well-coded application can avoid such synchronization except when absolutely necessary.

Your problem is that you keep wanting the CPU and your “display list” design to impinge on GPU stuff. It is not Vulkan or the hardware that is creating this “high-latency Procrustean bed”; it’s your application’s design.

Vulkan is very low-latency. Your Vulkan application is not.[/QUOTE]

And I forgive that pointy tl;dr from you, good sir.

Display lists work very well in staged mode, which I’ve already explained. I wouldn’t have spent any significant time sweating over Vulkan if it had turned out to be yet another vapid marketing gimmick. I’ve already said it is surprisingly good at what it’s designed to do, even when doing things “wrong”, but I’m not the sort of person who likes to go handing out compliments.

My complaints are about some of the gaping holes regarding device information I mentioned, and some trivial features that seem to severely confuse or upset some people with their mere mention. If this were a library meant only for a small class of proprietary devices for which the vendor supplied reasonably up-to-date and readily available metadata (timings, sanity checks, dos and don’ts, etc.), I’d just let it fly.

This is literally advertised to novices and consultancy firms alike as a panacea for their graphics and compute performance problems, at the expense of more complexity up-front and losing the “convenience” of older abstractions (oh please, OpenGL is an anti-convenience; this is already a breath of fresh air). Oh, and it’s not suitable for everything (what is?), just don’t pay any serious mind to all the very CAD- and simulation-oriented imagery. If the hype leading up to this had been strictly games, games, games, benchmarks, and movies, in your face all the time, and emphatically not research, visualization, or parallel computing, and if OpenGL weren’t the impossibly unpredictable mess that it is, Vulkan would not have a book sitting on my shelf.

So, some complexity we have, but it looks like we need a little more to close things up.

We use other libraries in conjunction, as this is required for some plug-ins w/o source. Given the scope of the problem, as will be described further below, it should not be surprising that we’d like to squeeze every last drop of performance out of this one, while also respecting the generality required of an “everything system”. I would not be surprised if I’ve missed a thing or ten, since I’m close to my limit with this.

Why GiB on the device? Many variables * many different representations * many nodes == lots of data going everywhere. Some nodes are involved in solving g*H^-1 for a step of some sub-problem, and even though we approximate H^-1 (à la L-BFGS) for most of these, it still needs a considerable amount of space on average. Some nodes are involved in updating pieces of an adaptive mesh for the next step, based on information that just became available from another cluster that was busy with some scalar field that now needs to be “sampled”. A mesh update is totally CPU-bound, since this process is unreliable or not possible in compute due to non-conformant FP hardware and poor branching performance. To give an idea of what is going on: this is meant to solve multi-objective optimization problems where the input is a “constrained design space for a thing” and the output is an “optimal design of a thing”. Simulation is part of how it decides where to evolve the system.

The POD-ness problem applies strictly to images, since they have to be bound per-image. Buffers are trivial, as they are already sub-allocations.

The “mini-benchmarks” I mention are concerned primarily with transfer rates, and employ the heuristic of “fewer valid timestamp bits might mean bigger steps” in conjunction with a “large” minImageTransferGranularity when choosing which families to test as possible “best transfer” candidates for certain workloads. This happens only if there is more than one family sharing an identical set of capabilities. Once again, all because there is no guarantee that exposed queue families with identical capabilities have universally desirable performance characteristics for all supported operations, and image data might be stuck where it is first bound. Admittedly, benchmarking ambiguous queue families is not robust at all: they could be distinct for different reasons, like bizarre device topology, including separate address spaces, streaming MMUs meant just for video, or even an experimental neurophone-like device that fries the user’s brain if used more than once. There is no way to deal with this without having that very device-specific information available up-front, which is the vendor’s responsibility. Maybe an XML file full of device characteristics and non-trivial features otherwise not knowable from the API? That is a hole currently filled by specialized people, who are not always available, and not always right.

By the way: On my GTX 1080, I get 3 families: 1 general, 1 transfer-only, and 1 compute-only. No ambiguity, no need to test anything. Family 1 gets selected for all “large” transfers, unless all of its queues are busy, in which case we fall back to a queue from the general-purpose family if available, then compute, and finally we wait for anything.
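That fallback order can be sketched as follows; the Family struct and its fields are illustrative stand-ins for engine state, not Vulkan types:

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical per-family state for the fallback policy described above.
struct Family {
    bool transferOnly;
    bool general;
    bool computeOnly;
    uint32_t busyQueues;
    uint32_t totalQueues;
    bool hasIdleQueue() const { return busyQueues < totalQueues; }
};

// Pick a family for a "large" transfer: dedicated transfer first, then
// general-purpose, then compute; nullopt means "wait for anything".
std::optional<size_t> pickTransferFamily(const std::vector<Family>& fams) {
    auto firstIdle = [&](auto pred) -> std::optional<size_t> {
        for (size_t i = 0; i < fams.size(); ++i)
            if (pred(fams[i]) && fams[i].hasIdleQueue()) return i;
        return std::nullopt;
    };
    if (auto f = firstIdle([](const Family& x) { return x.transferOnly; })) return f;
    if (auto f = firstIdle([](const Family& x) { return x.general; }))      return f;
    if (auto f = firstIdle([](const Family& x) { return x.computeOnly; }))  return f;
    return std::nullopt; // everything busy: wait for any queue to drain
}
```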

Why this is important for defragmentation: We’ll need to either go from one arena to another while shifting things down, or, where we can safely assume an image is POD in a particular layout, transition all images in an arena to that layout, destroy their handles, shift everything down with a few buffer transfers, then re-create them. If we can’t allocate an arena large enough, and can’t allocate a host-local scratch space large enough, we’ll need to write things out to disk. Sometimes, a node’s workspace disk is too close to full with cached information from plugins, so we have to use a network drive in the worst case (someone’s data has to go somewhere, and it’s easier to send out hot data than cold). Is this fast? No, but it saves hard-earned data. Steps are expensive (as in, minutes-per-step), and failure means having to restart a step (or many, depending on which snapshots are “complete”). This is the worst real-time database-of-everything problem you can imagine.
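The buffer side of that “shift everything down” step reduces to computing new aligned offsets; a minimal sketch, with illustrative types:

```cpp
#include <cstddef>
#include <vector>

// One sub-allocation inside an arena; sizes and alignments are illustrative.
struct Block { size_t offset, size, align; };

// Compute compacted offsets, preserving order and alignment. Pairing each
// old offset with its new one is all a sequence of buffer-to-buffer copies
// needs; the image side (transition, destroy, copy, re-create) hinges on
// the layout guarantee the spec does not give.
std::vector<size_t> compact(const std::vector<Block>& blocks) {
    std::vector<size_t> newOffsets;
    size_t cursor = 0;
    for (const Block& b : blocks) {
        cursor = (cursor + b.align - 1) / b.align * b.align; // round up
        newOffsets.push_back(cursor);
        cursor += b.size;
    }
    return newOffsets;
}
```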

Why Vulkan? Unlike OpenGL, we don’t have to keep rebuilding command lists. Unlike OpenCL or OpenGL, we can precisely control device residency with supposedly one less layer of high-latency abstraction under our resource system. If we have to, we aren’t constrained to do this in just one thread, and it doesn’t cost thousands of redundant library calls to submit a command list. The other problem is the complete uselessness of shared contexts. For example: there is usually a problem with disparate image uploads serializing w.r.t. one another in different threads.

As with IOCP: I’m not sure why this incites such revulsion or confusion in some responses. It seems like an obvious addition to me, since this is a problem of inverted control, and it isn’t always reasonable to expect all applications to know about every ongoing task at every moment. In this framework, it is possible to do that by collecting submission requests and batching them at appropriate application stages (there would need to be many places where this happens), though this adds another kind of object that needs to be passed around (might be a good idea, actually). Still, “just wait for anything to finish and unlock all related resource locks” offers a scalable fallback where the batching option might not be reasonable.

Am I really approaching this problem so wrongly on all levels? If it’s that painful, I’ll settle for an extension. I may complain, but I don’t think carping on git or a forum is going to get anyone fired.

<SOAPBOX>
I keep getting: “Well, the hardware was made for games, and had to change, so some stuff is leftover”. I know that, I was there and watched it happen. My mistake is expressing a negative opinion about it, but this is a forum and we just LOVE to share opinions, so I’ll share mine again: It is all terrible. It could be better. It should be better. Whole industries can make truly enormous mistakes, and we shouldn’t be averse to calling them out. I didn’t direct my career into becoming a celebrity (and I’m not really looking to do that), so I have virtually no voice here. I tried to be a billionaire, but I have yet to saddle a unicorn.

In practice, replacing specific functionality with something general under the hood has always created new possibilities. We wouldn’t have compute today if “that’s just what the hardware was meant to do, and vendors know best”. Yes, they follow the money. The money doesn’t know where it is going. If it did, I would have retired years ago.

There was already a demand for the feature, albeit for the wrong reasons (IMO), and it became commonplace because enthusiasts were coerced into buying one generation of gimmicky tech after another (budgets, deadlines, and marketability - I know them well). Yes, gimmicky. Especially when the more precocious vendors at the time did things like expose ARB_shader_objects, but not all the required entry points. We all wrote it off as growing pains.

So now, after many agonizing, expensive iterations, everyone can have a halfway decent massively parallel vector processor on the cheap, though a few caveats are that its primary function is chiefly entertainment with a mention of computation, vendors are entirely profit-driven, and research isn’t immediately profitable. In other news: Water is wet. To reach this point, an entire industry had to grow up around one rushed idea after the other, and dump billions of dollars and millions of hours into this problem: “How do we make really big, complicated pictures really fast?”, which looks conspicuously like a trivial subset of this problem: “How can we do lots of structured linear computational work in parallel really fast?” At least some vendors get the idea, but we have yet to see a cross-platform CUDA that doesn’t involve “shaders” and related maintenance hazards tailored for them.

Along comes the programmable pipeline, then soon a language designed in the spirit of the really-meant-for-raytracing Renderman framework. Not surprisingly, everyone immediately wants to implement OIT and reflective surfaces, which are trivial strengths of raytracing (along with a smaller memory footprint), which is what the hardware can’t do (unless you ask very, very nicely and sacrifice every last quantum of your sanity). A big profitable “whoopsies, oh well”.

One development builds compromisingly upon another and here we are with an awkwardly renderman-themed compute language, and even more of the unscrupulous vendor locking we’ve come to know and love. Instead of, at some point, the right people seeing the problem for what it was and addressing it with a general solution that could offer specialized facilities as extensions (and having the money, too), we have mostly specialized libraries designed to offer general solutions as extensions, and very few useful guarantees about anything at all (not a whimsical statement, sadly, and is also ironically why I’m employed). For better or worse, it is already everywhere and spreading into permanence. Yay, job security?

The notion of “embarrassingly parallel” had existed before the Voodoo (I actually still have one somewhere, I think), though I understand it wouldn’t have been practical to offer a “computational nirvana” peripheral to a virgin market wherein one of the most popular activities at the time was getting Doom or Quake to run smoothly on one’s new and barely-stable wintel.

Don’t get me wrong: I love playing games, I love developing games (when the objectives, budget, schedule, and people are agreeable), and I once worked as a graphics engineer for a visual effects company. They just don’t represent an all-inclusive set of requirements against which we should be setting things in silicon, then trying to cast all problems into nails for our trendy new hammer. However, I’m painfully aware there’s no guaranteed money in a-priori generality, either.
</SOAPBOX>

My complaints are about some of the gaping holes regarding device information I mention, and some trivial features that seem to severely confuse or upset some people with their mere mention.

The thing is, our response is essentially:

  1. They’re not “gaping holes”. They’re not “holes” at all; they’re only “holes” from your perspective because your design requires them. Vulkan’s design does not.

  2. What you want are not “trivial features”. Implementing them would have costs for those who don’t need them.

If the hype leading up to this were strictly games, games, games, benchmarks, and movies, in your face all the time - only this and emphatically not research, visualization, or parallel computing - and if OpenGL wasn’t the impossibly unpredictable mess that it is - Vulkan would not have a book sitting on my shelf.

At the end of the day, I can’t help Khronos’s insipid “Graphics and Compute Belong Together” marketing spiel. Vulkan is no more a “parallel computing” API than OpenGL.

But there’s no reason you can’t use Vulkan for visualization applications. Or graphics research applications (though if you’re just inventing rendering techniques, OpenGL would probably be a better tool). Or for AR tools. Or for CAD or simulations. Or for many other things that are not “games, games, games, benchmarks, and movies”.

The world of graphics isn’t sharply divided into “game stuff” and “everything systems”. There’s a lot of graphics applications that can work just fine within the general ideals of the Vulkan graphics system. Yours may or may not be one of them. But just because yours isn’t amenable to Vulkan doesn’t mean that other people’s systems aren’t.

The POD-ness problem applies strictly to images, since they have to be bound per-image. Buffers are trivial, as they are already sub-allocations.

I don’t know what you mean here. The POD problem you previously mentioned had to do with device allocations and rearranging them in memory. I don’t know what that has to do with images.

By the way: On my GTX 1080, I get 3 families: 1 general, 1 transfer-only, and 1 compute-only. No ambiguity, no need to test anything. Family 1 gets selected for all “large” transfers, unless all of its queues are busy, in which case one from the general-purpose family if available, then finally compute, then we wait for anything.

How is that any less “ambiguous” than the queue families provided by a typical Radeon? Are you being confused because AMD explicitly reminds you that a compute queue can do transfers too? Because Vulkan actually requires that behavior; implementations are not required to remind you that compute queues can be used for transfers, but they very much can.

Can you give an example of “ambiguous” queue family setups? The Vulkan DB I linked to has pretty much all hardware and their queue setups.

As for your “soapbox”, there’s not really much to say. You’re seeing the same facts as the rest of us; the market for graphics processors and economic forces and the like. You’ve simply decided that you don’t like the facts. That’s fine, but not liking them doesn’t mean that you can say that something different should have happened, since the facts clearly tell us that conditions for it to have happened the way you want did not exist.

However, I would like to point out that Renderman is not a raytracing framework. While Pixar’s Renderman currently does use a form of ray tracing, the REYES rendering system originally used by it was primarily just a scan converter. Yes, it would use ray tracing, but only under very specific circumstances.

Also, while GLSL borrows terminology from Renderman, that’s pretty much all that it borrows. And at this point, modern GLSL terminology is essentially divorced from Renderman. The only term still left that maps to any Renderman concept is “uniform”.

[QUOTE=Alfonse Reinheart;42997]The thing is, our response is essentially:

  1. They’re not “gaping holes”. They’re not “holes” at all; they’re only “holes” from your perspective because your design requires them. Vulkan’s design does not.

  2. What you want are not “trivial features”. Implementing them would have costs for those who don’t need them.[/QUOTE]

#1 is a managerial technique called “not my problem, not our problem, not our cost”. It is really called a hole in the requirements, and will become a thorn in everyone’s side (both vendors and users) for the same reason Win32 is never going away.

#2: Nobody needs to use a completion port in the same sense that nobody needs to use a pipeline cache (yes, we use them).

[QUOTE=Alfonse Reinheart;42997]
At the end of the day, I can’t help Khronos’s insipid “Graphics and Compute Belong Together” marketing spiel. Vulkan is no more a “parallel computing” API than OpenGL.[/QUOTE]

So it seems. We can’t force our clients to use just NVidia hardware, though.

I know I tend to say a lot. I have to repeat myself in emails quite a bit.

Seeing a 10x improvement in performance with the visualizer, as well as, at long effing last, explicit resource control, means that this application is very much within the domain of Vulkan, until something better comes along.

It has been staring me in the face for a while as: “This can be done, it must be done”, so I thought it might not require too much explanation; that, and I have to be careful.

The whole concept of non-POD image layout is annoying to me, since the spec offers no guarantee, and with a software renderer it’s just a re-arrangement of plain data that can be moved around freely. It is hard to imagine it being otherwise, but I can’t rely on intuition.

As you might draw from my writing, my fried brain is kinda all over the place, and has been for an entire year.

[QUOTE=Alfonse Reinheart;42997]
How is that any less “ambiguous” than the queue families provided by a typical Radeon? Are you being confused because AMD explicitly reminds you that a compute queue can do transfers too? Because Vulkan actually requires that behavior; implementations are not required to remind you that compute queues can be used for transfers, but they very much can.

Can you give an example of “ambiguous” queue family setups? The Vulkan DB I linked to has pretty much all hardware and their queue setups.[/QUOTE]

We have to code against the specification alone, and whatever it leaves out, if that makes sense. Basically, we assume the most asinine device possible (within reason) that still meets the specification. This is part of our stress-testing procedure, called “fuzzing”. Our “absurd device” simulator currently exposes 48 different families, all generated with random capabilities routed into whatever the native hardware supports. Each family is assigned a random latency, which is simulated with a nanosleep() just before and after vkQueueSubmit, and the simulator randomly emits any appropriate error from any entry point capable of producing one, e.g. VK_ERROR_DEVICE_LOST from the very last vkEndCommandBuffer in a thread compiling a display list, from which the engine needs to recover gracefully without human intervention. Command buffers are deliberately polluted with useless operations such as empty draw calls, redundant bindings of already-bound descriptor sets, and so on. The engine must make an optimal selection of families within the purview of the specification to be deemed robust. So, as far as the spec is concerned, we are following convention - “a device has N queue families…” - and the “…” is where it drops into this hole I’m talking about. This is where we’d have to fill it in with information outside the specification. We can’t have an expert sitting on every node, or maintain code tailored for just one client, and then another, and then another, and so on. The product needs to survive on its own in the wild.

Er, I mean VK_ERROR_OUT_OF_DEVICE_MEMORY or VK_ERROR_OUT_OF_HOST_MEMORY. Basically, whatever it is allowed to return.
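A sketch of what the family-generation half of such a fuzzer could look like; the flag constants mirror Vulkan’s queue bits but are defined locally, the ranges are illustrative, and nothing here calls the API:

```cpp
#include <cstdint>
#include <random>
#include <vector>

// Local stand-ins for VK_QUEUE_GRAPHICS/COMPUTE/TRANSFER_BIT.
constexpr uint32_t kGraphics = 0x1, kCompute = 0x2, kTransfer = 0x4;

struct FuzzFamily {
    uint32_t flags;
    uint32_t queueCount;
    uint32_t latencyUs; // injected via nanosleep() around the simulated submit
};

// Generate an "absurd device": count families with random capability sets,
// random queue counts, and random submit latencies, seeded for replay.
std::vector<FuzzFamily> makeAbsurdDevice(uint32_t count, uint64_t seed) {
    std::mt19937_64 rng(seed);
    std::uniform_int_distribution<uint32_t> flagBits(1, 7); // never capability-less
    std::uniform_int_distribution<uint32_t> queues(1, 16);
    std::uniform_int_distribution<uint32_t> latency(0, 5000);
    std::vector<FuzzFamily> fams(count);
    for (auto& f : fams) {
        f.flags = flagBits(rng);
        // Per the spec, graphics or compute capability implies transfer.
        if (f.flags & (kGraphics | kCompute)) f.flags |= kTransfer;
        f.queueCount = queues(rng);
        f.latencyUs = latency(rng);
    }
    return fams;
}
```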

#1 is a managerial technique called “not my problem, not our problem, not our cost”. It is really called a hole in the requirements, and will become a thorn in everyone’s side (both vendors and users) for the same reason Win32 is never going away.

You could use this logic against any API. You can point at any API and say, “well, it can’t do this, which I say is totally important and in the future, the lack of this will make the API a millstone around our necks”.

You’re basically saying that your design, your needs, your application is the future, and any API that isn’t quite doing things the way your design says they should be done is just living in a past that will eventually have to be ditched.

John Carmack famously said once that OpenGL implementations should have no shader limitations; that if a shader blew past limitations, the IHV should break the shader up into multiple shaders and just make it work as if it were one shader.

Now, we can clearly see from a modern perspective that this was a powerfully stupid request. Indeed, Vulkan goes in the exact opposite direction: less IHV code rather than more. But my point is that Carmack was talking about making OpenGL more future-proof. Just like you. You’re saying that the future looks like X, and Vulkan doesn’t look like that, so it better change to match.

But Carmack could not predict the future. Just like SGI back in 1992 could not predict the shader revolution. Just like so many other failed attempts to predict the future of APIs.

Why are your predictions any different?

Predicting the future keeps you from living in the present. If these really are problems, there is nothing that says they can’t be worked out. Vulkan already has built-in backwards-incompatibility (major releases are not required to have any compatibility requirements). If they need to make some serious changes, that can still be done.

#2: Nobody needs to use a completion port in the same sense that nobody needs to use a pipeline cache (yes, we use them).

“Completion ports” are not the only feature you’ve talked about in this thread. You’ve discussed automating layout transitions and queue ownership. You’ve discussed automating semaphore usage. Among other things. These unequivocally have costs for users who can manage these things themselves.

Also, any feature has a cost: the cost of its implementation. Even ignoring that “completion ports” are a platform-specific thing, implementers have to do whatever it takes to implement queues as “completion ports”. I know nothing about these things, but I can’t imagine it would be particularly trivial for them to implement them.

So the time IHVs spend on implementing, debugging, and maintaining them is time not spent on implementing, debugging, and maintaining something else.

It has been staring me in the face for a while as: “This can be done, it must be done”, so I thought it might not require too much explanation; that, and I have to be careful.

The whole concept of non-POD image layout is annoying to me, since the spec offers no guarantee, and with a software renderer it’s just a re-arrangement of plain data that can be moved around freely. It is hard to imagine it being otherwise, but I can’t rely on intuition.

As you might draw from my writing, my fried brain is kinda all over the place, and has been for an entire year.

As you might imagine, it makes it difficult to have a productive discussion when you yourself cannot effectively articulate what you want. You just spent three paragraphs responding to my comment. And yet in all of that… you never answered my question.

I don’t know what a “POD image layout” is. However, from various descriptions, I’m guessing you’re talking about this.

Personally? I don’t see that as an issue of high moment. That is, I agree that it would be nice to be able to shift image data around in memory with simple memcpy’s and create new VkImages from them. But the inability to do this is not really significantly impeding the majority of Vulkan users.

There are a lot more issues of greater importance than solving this one. I mean, it’s not like you can’t copy image data around. You just have to use a slightly inconvenient way to do it, issuing multiple commands and so forth.

As for how to solve it? Well first, you have to acknowledge the possibility that a particular image format may be more than just its data. As such, you have to query whether the implementation considers a particular VkImage “relocatable” (“POD” is really the wrong term). This would be a “format feature”, but one only available for swizzled images.

Next, when you create such an image, you have to create it as a relocatable image (that is, you have the intent to relocate it). After that, it should be a simple matter of binding it to the appropriate location in memory. And of course changing the memory aliasing rules.
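To make that concrete, a sketch of the query/create flow; every name and bit value below is hypothetical and does not exist in Vulkan:

```cpp
#include <cstdint>

// Hypothetical bits, paraphrasing the two-step proposal above; real format
// feature and image create flags live in other bit positions entirely.
constexpr uint32_t HYPOTHETICAL_FORMAT_FEATURE_RELOCATABLE_BIT = 0x80000000u;
constexpr uint32_t HYPOTHETICAL_IMAGE_CREATE_RELOCATABLE_BIT   = 0x40000000u;

// Step 1: query whether the format supports relocation at all.
bool formatIsRelocatable(uint32_t optimalTilingFeatures) {
    return (optimalTilingFeatures & HYPOTHETICAL_FORMAT_FEATURE_RELOCATABLE_BIT) != 0;
}

// Step 2: declare relocation intent at creation time, mirroring how the
// sparse/alias create flags work today; images created without the flag
// keep the current "don't move it" rules.
uint32_t imageCreateFlagsFor(bool wantRelocation) {
    return wantRelocation ? HYPOTHETICAL_IMAGE_CREATE_RELOCATABLE_BIT : 0u;
}
```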

We have to code against the specification alone, and whatever it leaves out, if that makes sense. Basically, we assume the most asinine device possible (and within reason) that still meets the specification.[…]So, as far as the spec is concerned, we are following convention - “a device has N queue families…” and the “…” is where it drops into this hole I’m talking about. This is where we’d have to fill it in with information outside the specification.

That is not a “hole in the specification”. That is your own personal paranoia about some possibility that has not happened and, as far as you know, will never happen.

There is nothing that “makes sense” about a device with “48 different families all generated with random capabilities”.

Vulkan should not be blamed for your choices. Other application developers don’t feel the need to do this; why do you? To be “future proof”? To be adaptive to some unknown and unknowable hardware that doesn’t exist yet?

[QUOTE=differentiable;42996]
Along comes the programmable pipeline, then soon a language designed in the spirit of the really-meant-for-raytracing renderman framework. Not surprisingly, everyone immediately wants to implement OIT and reflective surfaces, which are trivial strengths of raytracing (along with a smaller memory footprint), which is what the hardware can’t do (unless you ask very, very nicely and sacrifice every last quantum of your sanity).[/QUOTE]

There are a ton of GPU-accelerated raytracing frameworks, including an AMD-branded one. They don’t seem to have serious problems with existing. What hardware support could you possibly need? Ray-triangle intersection? This is a register-pressure problem, not a compute problem, and GPUs already have crazy numbers of registers.

I think he was talking historically, around the dawn of programmable hardware.

I really don’t appreciate the position you keep taking with this, and hyperbole like this is why I generally don’t bother with forums. We want a standard alternative to a maintenance hazard called: Buy everything new and test it, or hire more people to fill in the blanks. We don’t have the time for that.

I’ve already explained this has been implemented. It is automated already under what I’ve called a “display list”. The result is almost as efficient as a hand-written analogue. This is tl;dr #2.

I’m tired of this behavior: “Treat OP like an incompetent ass because he wants a feature we don’t understand”. This goes along with just about every other caustic, unproductive response you’ve had to offer.

IOCP and other message systems are so simple it’s laughable that they weren’t already a feature. I thought their absence was just to “get the API out there for feedback”.

The only reason Vulkan needs to do this is that it owns the system objects involved, and either the application needs access to those objects to set it up, or the driver can do it safely and more efficiently.

There is no excuse for not knowing anything about them, unless you are completely new to this field, or have never done any network programming whatsoever.

[QUOTE=Alfonse Reinheart;43000]That is not a “hole in the specification”. That is your own personal paranoia about some possibility that has not happened and, as far as you know, will never happen.

There is nothing that “makes sense” about a device with “48 different families all generated with random capabilities”.

Vulkan should not be blamed for your choices. Other application developers don’t feel the need to do this; why do you? To be “future proof”? To be adaptive to some unknown and unknowable hardware that doesn’t exist yet?[/QUOTE]

So, we should just trust our intuition, then? We should trust vendors to always be consistent and reasonable? Were you born yesterday? In case you didn’t notice, Intel once again undermined security for their customers (incl. the government) for who-knows-how-long with a critical mistake they probably would have found if they designed and tested their hardware more carefully, and you are telling us our policies are paranoid.

Our products work almost anywhere, and are reliable, because we stress test them with the compliantly absurd. This is a selling point, and is why we are competitive. The device could have 1000 families according to the specification, because there is no mention of a restriction, nor any rule about whether a vendor may expose a redundant family. It is absurd, but it is possible within specification. Never hard-code an index unless it is written in the hardware’s manual.

Everyone on this project agreed that this was necessary in the general case, since all of us have very sound reasons for our distrust. We needed a device fuzzer anyway, and this is just one variable among many that can be changed. We still use and maintain files outlining device characteristics. We have worked with other kinds of hardware that required this, and have found it is always a maintenance hazard. You are welcome for being freely offered some very costly insight on the matter.

Device topology is arbitrary, and the API doesn’t reveal anything about it. That is wrong. It cannot be made acceptable by passing that responsibility off to a party that knows nothing about it. It is well within reason for vendors to provide some way to discover topology isomorphic with what an application expects without resorting to an unmaintainable list of devices and driver versions.

After everything I have explained, in excruciating and nearly compromising detail, you again and again insinuate that the solution is wrong and that we are wrong for even thinking about it. Therefore, by extension, every suggestion I’ve made is wrong on every level, even though you don’t fully understand it or have no experience with it.

I appreciate that you have a critical opinion about the complexities we face, and how we face them, and I understand it may be an unsightly complicated mess to you. The purpose of this thread was to explain the “why”. I also blame myself for spouting off about unrelated topics.

I am no longer interested in your input, as I think I get where you are coming from.