Continuation of GitHub #650

Our products work almost anywhere, and are reliable, because we stress test them with the compliantly absurd. This is a selling point, and it is why we are competitive. According to the specification, a device could have 1000 families: there is no mention of a restriction, nor any reason given for why a vendor would or wouldn’t expose a redundant family. It is absurd, but it is possible within the specification. Never hard-code an index unless it is written in the hardware’s manual.
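
Concretely, the only safe pattern is to enumerate and search (a minimal sketch; the helper name is just illustrative):


#include <vulkan/vulkan.h>
#include <vector>

// Ask the driver how many families there are - never assume.
std::vector<VkQueueFamilyProperties> EnumerateQueueFamilies( VkPhysicalDevice gpu )
{
    uint32_t count = 0;
    vkGetPhysicalDeviceQueueFamilyProperties( gpu, &count, nullptr );

    std::vector<VkQueueFamilyProperties> families( count );
    vkGetPhysicalDeviceQueueFamilyProperties( gpu, &count, families.data() );
    return families; // could be 1 family, could be 1000 - the spec doesn't say
}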

The point of Vulkan is that the developer accepts the responsibility to optimize for current GPU architectures (or individual pieces of hardware, if you don’t feel like spending time with your family). When using multiple queues, you tell your GPU that two tasks may overlap, but this will not provide any benefit when both tasks compete for the same resource. When developing a rendering pipeline around some set number of architectures, you can easily profile and analyze which stages of the pipeline can run in parallel for the best performance. A shader that is compute bound on a big GPU may turn out to be bandwidth bound on an APU or mobile GPU. If ROPs on Navi GPUs turn out to be much faster than on prior GCN chips, most optimizations will go right into the trash bin. It is literally impossible to future-proof this from an API to fit the needs of your uber-generalized use case.

The spec guarantees at least one universal queue and one copy queue. Just use those.

“Almost as efficient” in which scenarios? For what kinds of graphics tasks? What exactly is the “hand-written analogue” doing?

Automation is good for any scenario where manual work is either impossible or would require lots of manual inter-thread communication that is easy to screw up. But if that’s not my scenario, then what good is “almost as efficient” when I can get “as efficient” already?

And again, you ignore the fact that implementing this automated system is not cheap, nor is maintaining it free. Remember: we’re talking about people that would routinely screw up basic aspects of OpenGL. Why would you trust them to get layout and synchronization right?

Small drivers are their own reward.

[QUOTE=differentiable;43006]Everyone on this project agreed that this was necessary in the general case, since all of us have very sound reasons for our distrust. We needed a device fuzzer anyway, and this is just one variable among many that can be changed. We still use and maintain files outlining device characteristics. We have worked with other kinds of hardware that required this, and have found it is always a maintenance hazard. You are welcome for being freely offered some very costly insight on the matter.

Device topology is arbitrary, and the API doesn’t reveal anything about it. That is wrong. It cannot be made acceptable by passing that responsibility off to a party that knows nothing about it. It is well within reason for vendors to provide some way to discover topology isomorphic with what an application expects without resorting to an unmaintainable list of devices and driver versions.[/quote]

OK, let’s play this game out:

What would this “way to discover topology isomorphic with what an application expects” actually look like? Not just a handwavy bit of whatever, but an actual, honest-to-God API suggestion. Show me what you would consider to be an ideal VkQueueFamilyProperties data structure.

My point is that everything you’re talking about doing is your choice. We’re not talking about some immutable fact of rendering application design. It’s not something everyone has to do. Nor is it something everyone ought to do. It’s something you do.

You’ve made a deliberate choice to assume that implementations of Vulkan will do nonsensical things and therefore you should test for them. You’ve made a deliberate choice to automate many of the features of Vulkan that it makes explicit. You’ve made a deliberate choice to use platform-specific APIs with a cross-platform API.

And there’s nothing wrong with any of those choices.

But you’re asking for changes to the API based on these choices. You’re saying that your choices ought to be incorporated directly into Vulkan, whether someone wants to use them or not.

But that’s not merely what you’re saying. You’re saying that these choices ought to be incorporated into Vulkan. And therefore, you’re tacitly saying that if anyone isn’t making these choices, then they’re doing it wrong.

I cannot accept that.

[QUOTE=krOoze;42994]Just interpreting @differentiable here.
I imagine it is intended for some kind of a producer-consumer situation.
I.e. a buncha producers are randomly submitted to a queue with a fence, and a buncha consumers are standing by for the results.
So you need some kind of an event system that says there is a product available (i.e. pop on one fence signaled at a time). And you need to be able to add fences as you go (when additional producers are added).[/QUOTE]

Yes, this is a multi-producer, multi-consumer scenario (even though there is just one consumer in that example). Here’s the MSDN page related to completion ports in user space: I/O Completion Ports - Win32 apps | Microsoft Learn

As usual, MS’ API makes the problem look more complicated than it really is. You can make and manage all the same objects through either the related NtXXX or ZwXXX system calls (or DeviceIoControl, if you’re daring) depending on where that part of your driver needs to live. On Linux/Android, you can just use the usual pthread primitives (EDIT: I think; I’ve only ever developed Windows drivers). Conceptually, if the driver can signal a fence, then it can also post to a completion port.
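
For reference, the user-space core of it is only a few calls (a sketch; pTaskWorkspace stands in for whatever cookie you want back):


#include <windows.h>

// A standalone port; no file handle needs to be associated.
HANDLE hPort = CreateIoCompletionPort( INVALID_HANDLE_VALUE, NULL, 0, 0 );

// Producer side: post a cookie (smuggled through the completion key).
PostQueuedCompletionStatus( hPort, 0, (ULONG_PTR)pTaskWorkspace, NULL );

// Consumer side: block until something is posted.
DWORD bytes = 0;
ULONG_PTR cookie = 0;
OVERLAPPED *pOverlapped = NULL;
GetQueuedCompletionStatus( hPort, &bytes, &cookie, &pOverlapped, INFINITE );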

There are a few different ways to go about setting up the whole “send me a cookie/pointer” situation, and they can all be realized on any system. If the VkSubmitInfo is set in stone, so to speak, this would be a good reason to use the pNext member - point it to a VkCompletionInfoEXT or whatever suits the naming convention.

You lost me there… The point of a completion port here (really a message queue) is to avoid making and passing around a multitude of VkFences when working with multiple parallel queue submissions. Fewer system objects to manage is always a better thing.

Also, you aren’t constrained to consume on the port in a dedicated thread. With a slight modification, this can save single-threaded renderers the hassle of needing to pass fences around:


VkResult vkWaitForCompletionEXT(
    VkDevice hDevice,
    VkCompletionPortEXT hPort,
    uint32_t *pInOutCookieCount, // in: capacity of pOutCookies; out: # of cookies popped
    uintptr_t *pOutCookies,
    VkBool32 bWaitForAllCookies, // set to true to pop exactly N cookies
    uint64_t timeout );

Now, a single-thread renderer can make N queue submissions and wait for all N to complete at some later point without carrying around an array of VkFence.
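
With that, the single-threaded case would look like this (a sketch against the proposed entry point, not real Vulkan):


#define N_SUBMITS 4 // however many submissions were made

uintptr_t cookies[N_SUBMITS];
uint32_t cookieCount = N_SUBMITS;

// Block until exactly N_SUBMITS cookies have been posted.
VkResult result = vkWaitForCompletionEXT( hDevice, hPort, &cookieCount,
                                          cookies, VK_TRUE, UINT64_MAX );
HANDLE_VK_ERROR( result );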

[QUOTE=Salabar;43007]The point of Vulkan is that the developer accepts the responsibility to optimize for current GPU architectures (or individual pieces of hardware, if you don’t feel like spending time with your family). When using multiple queues, you tell your GPU that two tasks may overlap, but this will not provide any benefit when both tasks compete for the same resource. When developing a rendering pipeline around some set number of architectures, you can easily profile and analyze which stages of the pipeline can run in parallel for the best performance. A shader that is compute bound on a big GPU may turn out to be bandwidth bound on an APU or mobile GPU. If ROPs on Navi GPUs turn out to be much faster than on prior GCN chips, most optimizations will go right into the trash bin. It is literally impossible to future-proof this from an API to fit the needs of your uber-generalized use case.

The spec guarantees at least one universal queue and one copy queue. Just use those.[/QUOTE]

Come to think of it: Why not require exactly that standard set of families up-front, and then let the driver expose more per extension? You’ll know about the extensions you want, and correspondingly whichever extra queue families appear because of them, if any. This concept worked OK with OpenGL, and it looks to be in the spirit of Vulkan. All it requires is a change of wording - and compliance. That won’t be too much of an issue for us, since we code against the spec, and if the spec says “all families distinct unless more are exposed by an extension”, we can just tell clients to update their drivers if things start going really slow or breaking.

That’s easier to deal with than “I have 7 families, and I don’t know why”. My intuition tells me they’d prefer to keep the standard two families at index 0 and 1, but I can’t allow anyone to hard-code an index.

[QUOTE=Alfonse Reinheart;43008]“Almost as efficient” in which scenarios? For what kinds of graphics tasks? What exactly is the “hand-written analogue” doing?

Automation is good for any scenario where manual work is either impossible or would require lots of manual inter-thread communication that is easy to screw up. But if that’s not my scenario, then what good is “almost as efficient” when I can get “as efficient” already?

And again, you ignore the fact that implementing this automated system is not cheap, nor is maintaining it free. Remember: we’re talking about people that would routinely screw up basic aspects of OpenGL. Why would you trust them to get layout and synchronization right?

Small drivers are their own reward.[/QUOTE]

I’ll make this clear: It already exists. It can be done, and we did it. High-level resource management, high-level rendering that can be optimized on-the-fly, all without having to dance around a driver that is trying to predict what we’re doing.

[QUOTE=Alfonse Reinheart;43008]
OK, let’s play this game out:

What would this “way to discover topology isomorphic with what an application expects” actually look like? Not just a handwavy bit of whatever, but an actual, honest-to-God API suggestion. Show me what you would consider to be an ideal VkQueueFamilyProperties data structure.[/QUOTE]

See my prior post. Why not: “all families distinct unless more are exposed through extensions”? The application will ask for whatever extensions it wants, assuming all the pertinent knowledge. If an extension adds 2 extra families for streaming video, or maybe a special family meant to be used only with a set of audio-oriented entry points (mixing on compute, hmmmmm…), then great!

If we need to, we’ll look at native codegen just to eliminate as much application overhead as possible. So far, “display list” calls are dominated by vkQueueSubmit, to nobody’s surprise.

The use cases, or choices, I talk about just give context; they are here to answer all the “why?”s.

Nobody has to use a completion port.

Nobody has to implement defragmentation, or care about whether or not an image layout is POD in a particular region of memory. I’m throwing this idea out there to hear what others have to say about the problem, as I don’t know everything about every driver, or every format in every layout.

After thinking about this discussion so far, I think the most reasonable approach to the queue family issue is to require all distinct families up front, and expose more through extensions. The reasons for any families with duplicate queueFlags are now entirely extension-specific, controlled by the vendor in a manner that is documented somewhere (you have to know about it somehow), and the request is made explicit where the device is created (or wherever you’re filling out the extensions list).

First, a correction: “The spec guarantees at least one universal queue and one copy queue.” This is false. Indeed, the spec doesn’t even guarantee a graphics queue. The only queue-based requirement in the specification is that, if an implementation supports a graphics queue, then it must expose at least one queue family that is both graphics and compute.

As such, there is no “that standard set of families”.

So instead, we’ll investigate “all families distinct unless more are exposed by an extension”.

First, that’d have to be done as an instance extension rather than a device extension. After all, queues are activated at the same time as features and extensions: device creation time. So unless it is an instance extension, there’d be no way to activate the extension and query the properties of said queue.

Second, I don’t understand how this helps you. What exactly is it that you’re afraid will happen if an implementation offered two queue families with the same properties? Are you scared that your code will pick the “slow queue” of the two? I don’t understand what it is that you’re so concerned about that you feel it is justified to throw a bunch of benchmarking into your application to defend against.

I really don’t understand your thinking around this. You have written complicated benchmarks and tests that exist solely for these multiple queue scenarios. You’ve written these tests and benchmarks despite the fact that no Vulkan implementation has ever offered multiple queue families with the same properties, and despite the fact that the Vulkan specification actively discourages it.

And yet, you’d be willing to forgo those tests and benchmarks just because of words written in a document?

I can understand adopting a defensive coding style that says “if a thing can happen, it will, so prepare for it.” But just because there’s ink on a page forbidding something doesn’t mean it can’t happen (see every OpenGL driver bug ever).

Basically, I don’t understand the particular kind of paranoia you have that is willing to ignore de facto rules like the Vulkan hardware database (which shows no multi-family scenarios), but is totally willing to put its faith in de jure rules.

You keep repeating “it already exists” as though I am unaware of it. I get that it exists, but without specific knowledge of what exactly “it” is, I consider “it already exists” to be essentially a meaningless statement. The existence of your code doesn’t prove:

  1. That high-level resource management can work just as efficiently as a hand-coded solution for any graphics application. It merely proves that it does so for yours.

  2. That high-level resource management doesn’t require lots and lots of code, above and beyond what already exists in Vulkan.

  3. That this lots and lots of code will not lead to lots and lots of driver bugs and behavior variances across the various implementers. Remember: each implementer writes their implementation independently, whereas you write it exactly once.

What you’re talking about sounds like it could be an interesting Vulkan layer. But it shouldn’t be in Vulkan itself.

But every implementation would have to offer it as a possibility and thus code against it. Whether I personally use it or not, the implementers have to spend time writing, testing, and maintaining it. That’s time taken away from writing, testing, and maintaining code that I may actually be using.

But every implementation would have to be written to allow it.

I’m not concerned about what users have to do. I am concerned about what Vulkan implementations have to do.

For me, the biggest Achilles’ heel of OpenGL was the complexity of implementations leading to loads of driver bugs. And thus, the biggest advantage of Vulkan is that drivers are very simple (by contrast). With a few exceptions, Vulkan drivers are essentially translation layers. They don’t do a whole lot of thinking. They don’t keep track of state or anything; they do exactly as they are told.

If you can implement “completion ports” on top of Vulkan queues, then drivers have no business implementing them for you.

Cue Weird Al “Everything You Know is Wrong”.

[QUOTE=Alfonse Reinheart;43013]So instead, we’ll investigate “all families distinct unless more are exposed by an extension”.

First, that’d have to be done as an instance extension rather than a device extension. After all, queues are activated at the same time as features and extensions: device creation time. So unless it is an instance extension, there’d be no way to activate the extension and query the properties of said queue.

Second, I don’t understand how this helps you. What exactly is it that you’re afraid will happen if an implementation offered two queue families with the same properties? Are you scared that your code will pick the “slow queue” of the two? I don’t understand what it is that you’re so concerned about that you feel it is justified to throw a bunch of benchmarking into your application to defend against.
[/QUOTE]

It makes family selection decidable and the appearance of redundant families explicit. Even if there is exactly one family. All decidability requires is a priori uniqueness of queueFlags per family. And I’ll correct myself from the last post: you’d need to specify which extensions expose additional families at instance creation, not device creation. Still, it’s explicit, involves a string/symbol that is documented somewhere, and can participate in error handling. If a device has just one queue, fine. That’s it. If a device has 3 distinct families, as long as their queueFlags are distinct, the problem of choosing a queue is immediately decidable.

[QUOTE=Alfonse Reinheart;43013]
You keep repeating “it already exists” as though I am unaware of it. I get that it exists, but without specific knowledge of what exactly “it” is, I consider “it already exists” to be essentially a meaningless statement. The existence of your code doesn’t prove:

  1. That high-level resource management can work just as efficiently as a hand-coded solution for any graphics application. It merely proves that it does so for yours.

  2. That high-level resource management doesn’t require lots and lots of code, above and beyond what already exists in Vulkan.

  3. That this lots and lots of code will not lead to lots and lots of driver bugs and behavior variances across the various implementers. Remember: each implementer writes their implementation independently, whereas you write it exactly once.[/QUOTE]

I had more to add to that statement, but I ended up copying an older version of the text for some reason.

It was supposed to conclude with something along the lines of: “Case closed. Already done. Nobody needs to do it if they don’t want to.”

[QUOTE=Alfonse Reinheart;43013]
But every implementation would have to offer it [[completion ports]] as a possibility and thus code against it. Whether I personally use it or not, the implementers have to spend time writing, testing, and maintaining it. That’s time taken away from writing, testing, and maintaining code that I may actually be using.[/QUOTE]

Wrong, for the same reason as with swapchains. Nobody needs to support swapchains, in the same way no vendor needed to support multitexturing in OpenGL.

[QUOTE=Alfonse Reinheart;43013]
I’m not concerned about what users have to do. I am concerned about what Vulkan implementations have to do.[/QUOTE]

I don’t have a good response for this. I’m absolutely positive that, at this point, regardless of how either of us expresses our opinions, nothing will happen. You are now openly embracing the status quo, so you can celebrate when nothing happens.

[QUOTE=Alfonse Reinheart;43013]
For me, the biggest Achilles’ heel of OpenGL was the complexity of implementations leading to loads of driver bugs. And thus, the biggest advantage of Vulkan is that drivers are very simple (by contrast). With a few exceptions, Vulkan drivers are essentially translation layers. They don’t do a whole lot of thinking. They don’t keep track of state or anything; they do exactly as they are told.

If you can implement “completion ports” on top of Vulkan queues, then drivers have no business implementing them for you.[/QUOTE]

It can’t be done without a dedicated queue, because Vulkan owns the system objects involved; therefore it needs to be an extension. Just like VK_KHR_external_semaphore, which nobody needs to support.

I get your position. You’ll defend the status quo to any length necessary. Maybe you had some part in creating it. Maybe you’re paid to keep people like me in a “safety-pen”. They wanted community feedback, and now they have it: a couple of very simple feature requests.

Please don’t offer any more defense of the status quo. I would like to hear from other people, and your posts are cluttering up the thread.

Heh, I have a problem keeping up with you guys here. That we discuss several different topics/features does not help, and digressions into the evils of capitalism and the conspiracy of vendors do not help either. :slight_smile:
Though you mostly seemed to have been beating that Paradigm 1 horse.

Images can also be sub-allocations in Vulkan, no?

You need to define the guarantee you are missing in an articulate way. Or provide the sequence of Vulkan commands that won’t work as you want.

You feel the need to repeat this sentiment, and I am sorry. But please don’t confuse scrutiny and prodding for “revulsion or confusion”.

It is not a “settle” situation. An extension is more like the first step in the Peter principle. If there is any revulsion, it is to avoid the last step of the same principle.

IKR. I think I screwed it up when I picked my non-billionaire parents.

That’s nice, but neither a sufficient nor a necessary reason for addition in and of itself. If I were cheeky, I would say that “nobody needs to” jump off a cliff either.

Building an app on an asinine device is like building a skyscraper on sand. You should not waste too much time trying. Nor should the Vulkan specification.

The specification says implementors should (RFC) not do that. It is reserved for some experimental or specialized devices.
You should not waste too much time prematurely optimizing for that case if the chance of encountering such a device is infinitesimally close to zero.

[QUOTE=differentiable;43006]We want a standard alternative to a maintenance hazard called: Buy everything new and test it, or hire more people to fill in the blanks. We don’t have the time for that.
[/QUOTE]

That’s nice (if you can pull it off). Conventionally, it seems to me, premature future-proofing costs more time than reacting to the current situation. Also conventionally, there is another helpful force: newer devices are faster, so even if the old software runs inefficiently, it still runs faster, so nobody complains.

It still may be. We only just got e.g. conservative rasterization; to this day the validation layers are not complete, etc…

Which objects except the fences does it need?
Efficiency is a good way to get around Paradigm 1, though a bit hard to prove. The app-side implementation would require something like a thread pool and one fence-wait per thread. That sounds bad, but I am not sure the driver would not have to do something similar.

Like we have a choice… The alternative is to trust vendors to always be inconsistent and unreasonable; that’s no way to live.

That’s contradictory.
Either it is arbitrary, or the API can reveal something about it. Can’t have both.

What topology description do you suggest? There’s device type, and there are queue families. What else is there common to all contemporary devices?

I like that somewhat better than the bind solution; it’s more explicit when the VkCompletionInfoEXT is accessed.
Can it be made to work without the need for internal synchronization, i.e. with the VkCompletionInfoEXT marked as “externally synchronized”? I mean, it looks like at least the wait operation would have to be an exception to that, which is annoying to introduce into Vulkan, which generally avoids that except for the pipeline cache.

Meh.
Actually, you don’t have to if you don’t want to. It would be similar to the pNext way, except instead of it you would call something like vkAddFenceToCompletionPort() after the submission. As a bonus, there’s even more control over when the VkCompletionInfoEXT object is accessed. It would also allow us to know which task actually completed, via vkWaitForOneCompleteTask(cp, &completed_task_fence).
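
Something like (just a sketch; the names are made up):


// Bind an already-used fence to the port after the submission.
VkResult vkAddFenceToCompletionPortEXT(
    VkDevice device,
    VkCompletionPortEXT port,
    VkFence fence );

// Wait for any one bound fence to signal; report which one it was.
VkResult vkWaitForOneCompleteTaskEXT(
    VkDevice device,
    VkCompletionPortEXT port,
    VkFence *pCompletedFence,
    uint64_t timeout );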

[QUOTE=krOoze;43015]Heh, I have a problem keeping up with you guys here. That we discuss several different topics/features does not help, and digressions into the evils of capitalism and the conspiracy of vendors do not help either. :slight_smile:
Though you mostly seemed to have been beating that Paradigm 1 horse.[/QUOTE]

All my fault! I keep leaving my tinfoil hat on; my head feels lonely without it.

[QUOTE=krOoze;43015]
Images can also be sub-allocations in Vulkan, no?[/QUOTE]

Kinda sorta. If they are all the same size, or within a certain range of sizes, then we can dole out array elements and defrag within an image (if we need arrays in arrays). The other problem is layer-only or whole-mip-level-only transfer granularity from the host. While some queues support sub-image transfers, they tend to be slower than whole-level transfers (setup complexity aside).
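
For the homogeneous case, a minimal sketch (image and layerIndex are placeholders; creation and error handling elided):


// One big VkImage is the "slab"; array layers are the sub-allocations.
VkImageCreateInfo imageInfo = {};
imageInfo.sType       = VK_STRUCTURE_TYPE_IMAGE_CREATE_INFO;
imageInfo.imageType   = VK_IMAGE_TYPE_2D;
imageInfo.format      = VK_FORMAT_R8G8B8A8_UNORM;
imageInfo.extent      = { 1024, 1024, 1 };
imageInfo.mipLevels   = 1;
imageInfo.arrayLayers = 64; // dole these out one at a time
imageInfo.samples     = VK_SAMPLE_COUNT_1_BIT;
imageInfo.tiling      = VK_IMAGE_TILING_OPTIMAL;
imageInfo.usage       = VK_IMAGE_USAGE_SAMPLED_BIT | VK_IMAGE_USAGE_TRANSFER_DST_BIT;

// A client sees only its single-layer view.
VkImageViewCreateInfo viewInfo = {};
viewInfo.sType    = VK_STRUCTURE_TYPE_IMAGE_VIEW_CREATE_INFO;
viewInfo.image    = image; // created from imageInfo above
viewInfo.viewType = VK_IMAGE_VIEW_TYPE_2D;
viewInfo.format   = imageInfo.format;
viewInfo.subresourceRange = { VK_IMAGE_ASPECT_COLOR_BIT, 0, 1, layerIndex, 1 };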

Also, we run into a related problem with slab allocators (like jemalloc) - how much do we allocate up-front? Some runs need only one image of a particular size, with everything else much smaller or larger, and the rest of the slab would be wasted space. It works well for a game where all the assets are organized in homogeneous categories known before startup, but this is a world in which anything can happen (within reason). Analogously, where jemalloc is suitable for a server with a somewhat predictable workload of mostly large but transient allocations, it’s a bad idea for CAD software, anything that needs arbitrary precision, etc…

Besides, if we can just create a whole image anyway, the client code can decide how to use it through view configs. A plugin may actually do this if it is needed (while the concept permits arbitrary nesting - you can have sub-allocations within sub-allocations, etc… - it is nice to keep the already large overhead as small as possible).

[QUOTE=krOoze;43015]
You need to define the guarantee you are missing in an articulate way. Or provide the sequence of Vulkan commands that won’t work as you want.[/QUOTE]

Alias an image with a buffer. The validation layer will complain (as required), but this allows moving image data with buffer transfers. Intuition suggests that, since the data is all the same size, and nothing is compressed, it should be just as POD as it was when it was transferred in. As long as it moves around respecting the granularity, we can just plant a new image on the relocated data and pretend it has the old layout. It’s worked so far; it would just be nice to have a way to know when this is actually OK. I’m not sure if the query would apply to the memory space, or just to a layout, or a combination, etc… I’m still not settled on what the ideal would be.
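
The trick boiled down (handle creation elided; device, aliasBuffer, image, memory, offsets, and cmd are placeholders - this is exactly the sequence the validation layer flags):


// Bind a buffer and an image to the same memory at the same offset.
vkBindBufferMemory( device, aliasBuffer, memory, offset );
vkBindImageMemory( device, image, memory, offset );

// Relocate the raw bytes with a plain buffer copy (regions must not overlap)...
VkBufferCopy region = { srcOffset, dstOffset, size };
vkCmdCopyBuffer( cmd, aliasBuffer, aliasBuffer, 1, &region );

// ...then create a new VkImage over dstOffset and pretend it has the old layout.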

[QUOTE=krOoze;43015]
You feel the need to repeat this sentiment, and I am sorry. But please don’t confuse scrutiny and prodding for “revulsion or confusion”.[/QUOTE]

It comes from a long history of interacting with the snooty ends of many different communities. They have two modes of operation: 1) They are right. 2) You are wrong. Anything else is a long-winded short-circuit.

[QUOTE=krOoze;43015]
It is not a “settle” situation. An extension is more like the first step in the Peter principle. If there is any revulsion, it is to avoid the last step of the same principle.[/QUOTE]

I was trying to evoke an air of accommodation, or to emphasize that this isn’t a personal mission. Probably not effective there.

[QUOTE=krOoze;43015]
That’s nice, but neither a sufficient nor a necessary reason for addition in and of itself. If I were cheeky, I would say that “nobody needs to” jump off a cliff either.[/QUOTE]

That was in response to “but, then everybody has to support it” and other blatant appeals to complexity. I was surprised to read that nobody has to support swapchains or compute. Then we have this weird animal: KHR_external_semaphore. Why not KHR_external_fence, too? Same problem with either extension (more platform code everywhere), but such is life.

I was thinking of using KHR_external_semaphore for an atomicity guarantee on transactions, except that there does not appear to be a way to cancel a submission (block first submits until all dependent submits have been made). So, if a submit errors out, I’d still have to wait for all prior submits to complete anyway. From the dispatcher’s view, it turns out to be useless, since it already knows about submission order up-front, and all applicable barriers exist in every command stream already, along with semaphores controlling that dispatch between queue submissions.

[QUOTE=krOoze;43015]
Building an app on an asinine device is like building a skyscraper on sand. You should not waste too much time trying. Nor should the Vulkan specification.

The specification says implementors should (RFC) not do that. It is reserved for some experimental or specialized devices.
You should not waste too much time prematurely optimizing for that case if the chance of encountering such a device is infinitesimally close to zero.[/QUOTE]

Fuzz testing doesn’t require the underlying test to make sense, it just needs to be compliant. Crazy tests tend to find bugs in unexpected places. Vulkan layering has been great so far, in that we can just build a device fuzzer on top of the same interface without having to make our own fake ICD (as with OpenGL, and OpenGL is hard to fuzz). This was one of the intended purposes, I believe.

As far as queue families go: I imagine it would be annoying to have to make an extension for every specialized device out there, just for extra families. Maybe “the first N families must be distinct, anything extra may follow”.

Say, order families such that some simple example algorithm always makes the “best” choice in the general case:



// Member function of "Device"
bool ChooseQueueFamily( VkQueueFlags includeAll, VkQueueFlags excludeAll, uint32_t &indexOut ) const noexcept
{
    ASSERT( 0 == (includeAll & excludeAll) );

    for( uint32_t i = 0; i < this->queueFamilies.size(); ++i )
    {
        // note: families were pre-processed to include VK_QUEUE_TRANSFER_BIT wherever compute or graphics are available

        const auto &family = this->queueFamilies[i];
        if( includeAll == (family.queueFlags & includeAll)
        &&           0 == (family.queueFlags & excludeAll) )
        {
            indexOut = i;
            return true;
        }
    }
    return false;
}


This is so a naive application won’t accidentally select an experimental feature.
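
For example, preferring a dedicated transfer family and falling back to any transfer-capable one (device being an instance of the class above):


uint32_t family = 0;
if( !device.ChooseQueueFamily( VK_QUEUE_TRANSFER_BIT,
                               VK_QUEUE_GRAPHICS_BIT | VK_QUEUE_COMPUTE_BIT, family )
&&  !device.ChooseQueueFamily( VK_QUEUE_TRANSFER_BIT, 0, family ) )
{
    // no transfer-capable family at all; bail out
}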

About "being reasonable": You never know, sadly.

It could be company A paying off a vendor to screw over company B’s new product with a few “bugs”, the vendor takes it and writes in some “bugs” that happen with products of B, and randomly elsewhere.

A inc. was spying on B’s development, and even provided enough source/details to the vendor so they could leave no trace of foul play. Since A obviously knows about the problems they wanted, they are prepared long before release.

Release comes, everyone updates their drivers, ushering in an endless stream of complaints in B’s forums. Then, the rest of us have the pleasure of finding said “bugs” randomly because our software happened to trigger them.

A inc. continues to pay said vendor to wait just long enough for B’s stock to start dropping (or their kickstarter to go sour), then allows the silent release of a fix. Customers of A probably won’t notice or care. Company A doesn’t make money caring about anything but itself, and they most likely won’t be implicated anyway.

Corporate espionage and sabotage are still very real things today, and can involve many entities outside the parties in direct competition.

Laws on the books come with a tacit “applicable only if caught” clause.

[QUOTE=krOoze;43015]
That’s nice (if you can pull it off). Conventionally, it seems to me, premature future-proofing costs more time than reacting to the current situation.[/QUOTE]

Fuzz layer: About 1 person-week. It’s infinitely tweakable!
Benchmarks: About 3 person-weeks. Everyone should have these.
Scheduler that profiles queues using said benchmarks: An afternoon?

[QUOTE=krOoze;43015]
Which objects except the fences does it need?
Efficiency is a good way to get around Paradigm 1, though a bit hard to prove. The app-side implementation would require something like a thread pool and one fence-wait per thread. That sounds bad, but I am not sure the driver would not have to do something similar.[/QUOTE]

Completion ports do not need fences. They are complementary to one another in our case: fences are used for backing out of a transaction after an error, and if we had a completion port, that is what the “pop thread” would be waiting for.

In concurrent programming there is a recurring pattern in task execution: Branch -> Join -> Branch -> Join, etc… A “completion port”, really just a special kind of message queue, is a way to implement a join. Vulkan expects applications to know about all submissions that will occur at any given time, hence vkWaitForFences.

When you have an unknown number of transactions completing at unknown times (concurrent resource uploads from different plugins), you have to either expose all of the respective clients to one another by having them all share some kind of transaction object (a source of nasty sync problems), or you can just let them submit whatever whenever and deal with completion as it happens. Like a server, for instance: You don’t know when someone will connect, or how many will connect within any given interval of time. A device will finish any submission at any time (of course respecting semaphores, etc…), and we don’t want to keep task workspaces sitting around any longer than they have to.
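
To show the shape of the join I mean, here is a toy sketch in plain standard C++ (bounded, multiple concurrent pushes, batched pops - an illustration of the pattern, not our implementation):


#include <condition_variable>
#include <cstdint>
#include <deque>
#include <mutex>

class CookieQueue
{
public:
    explicit CookieQueue( size_t maxPending ) : maxPending_( maxPending ) {}

    // Producer side: blocks while full, which is the "bounded submit" rule.
    void Push( uintptr_t cookie )
    {
        std::unique_lock<std::mutex> lock( mutex_ );
        notFull_.wait( lock, [this]{ return cookies_.size() < maxPending_; } );
        cookies_.push_back( cookie );
        notEmpty_.notify_one();
    }

    // Consumer side: blocks until at least one cookie arrives, then pops a batch.
    size_t Pop( uintptr_t *pOut, size_t maxCount )
    {
        std::unique_lock<std::mutex> lock( mutex_ );
        notEmpty_.wait( lock, [this]{ return !cookies_.empty(); } );

        size_t n = 0;
        while( n < maxCount && !cookies_.empty() )
        {
            pOut[n++] = cookies_.front();
            cookies_.pop_front();
        }
        notFull_.notify_all();
        return n;
    }

private:
    std::mutex mutex_;
    std::condition_variable notFull_;
    std::condition_variable notEmpty_;
    std::deque<uintptr_t> cookies_;
    const size_t maxPending_;
};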

[QUOTE=krOoze;43015]
That’s contradictory.
Either it is arbitrary, or the API can reveal something about it. Can’t have both.

What topology description do you suggest? There’s device type, and there are queue families. What else is there common to all contemporary devices?[/QUOTE]

I had in my mind that there might be a device with queue families that can only touch resources in a certain kind of memory. Not sure if this is mentioned anywhere, but transfer from any family A to any family B doesn’t seem to be prohibited in general (for exclusive mode, anyway). Some of my rebuttals were admittedly more frustrated than informed.

[QUOTE=krOoze;43015]
I like that somewhat better than the bind solution; it’s more explicit when the VkCompletionInfoEXT is accessed.
Can it be made to work without the need for internal synchronization, i.e. with the VkCompletionInfoEXT marked as “externally synchronized”? I mean, it looks like at least the wait operation would have to be an exception to that, which is annoying to introduce into Vulkan, which generally avoids that except for the pipeline cache.[/QUOTE]

More annoying, and confusing, is that VkQueue needs external synchronization. If they’re using any kind of linked list internally, a simple application mistake could lead to a very hard-to-diagnose “lost submission” bug or similar. This is why we went through all the trouble of hiding queue handles behind a scheduler interface.

The reason I suggested the bind-based solution is that I imagined drivers would use fds for queues internally, and on Windows, an fd can only be bound once to a completion port (however, since Win 8, it’s been possible to change the binding, but it looks hacky).

I think there is a way to make a userland-only “CookieQueue” that allows multiple concurrent pushes and batched pops. This would work for drivers that keep Vulkan in userland (all of them?) and employ what look like “push” and “pop” threads (NVidia drivers appear to do this).

I’ve made one for a single-consumer scenario before, and I’ll post a simplified version of it when I can get around to it. There is a way to avoid the “unbounded submit” problem by placing an upper limit on the number of in-flight cookies, which means an affected vkQueueSubmit will block until a cookie slot is available.

In the meantime, here’s a rehash:



// Chained through VkSubmitInfo::pNext
struct VkCompletionInfoEXT
{
    VkStructureType sType; // VK_STRUCTURE_TYPE_COMPLETION_INFO_EXT
    const void *pNext;
    VkCompletionPortEXT port;
    uintptr_t cookie;
    uint64_t timeout; // max time to wait for an available cookie slot (ns)
};

struct VkCompletionPortCreateInfoEXT
{
    VkStructureType sType; // VK_STRUCTURE_TYPE_COMPLETION_PORT_CREATE_INFO_EXT
    const void *pNext;
    VkFlags flags; // not used, but here because reasons
    uint32_t maxPendingCompletions; // max # of cookie slots
    
    // Since a completion port is a one-time allocation, the max # of
    // concurrent submits supported must be specified up-front.
    //
    // A reasonable starting point:
    //
    // maxPendingCompletions =
    // 4 * std::thread::hardware_concurrency() * queueFamilyCount;
    //
    // Otherwise, an allocator would need to be invoked with every
    // submit, which is costly.
};

VkResult vkCreateCompletionPortEXT(
    VkDevice device,
    const VkCompletionPortCreateInfoEXT *pCreateInfo,
    const VkAllocationCallbacks *pAllocator,
    VkCompletionPortEXT *pCompletionPort );
    
void vkDestroyCompletionPortEXT(
    VkDevice device,
    VkCompletionPortEXT port,
    const VkAllocationCallbacks *pAllocator );
    
VkResult vkWaitForCompletionEXT(
    VkDevice device,
    VkCompletionPortEXT port,
    uint32_t *pInOutCookieCount,
    uintptr_t *pCookies,
    VkBool32 bWaitForAll,
    uint64_t timeout );
    
// For thread control
VkResult vkPostCompletionEXT(
    VkDevice device,
    VkCompletionPortEXT port,
    uintptr_t cookie,
    uint64_t timeout );


Queue submission:



VkCompletionInfoEXT completionInfo;
ZeroFill(completionInfo);
completionInfo.sType = VK_STRUCTURE_TYPE_COMPLETION_INFO_EXT;
completionInfo.port = hCompletionPort;
completionInfo.cookie = reinterpret_cast<uintptr_t>(pTaskWorkspace);
completionInfo.timeout = 1000000000u; // wait for at most 1 second.

VkSubmitInfo submitInfo;
//
// blah...
//
submitInfo.pNext = &completionInfo;

// If, after 1 second, an internal cookie slot doesn't become
// available, this will return VK_TIMEOUT.
auto result = vkQueueSubmit( hQueue, 1u, &submitInfo, VK_NULL_HANDLE );
HANDLE_VK_ERROR( result );

// ^^^ last submit in chain of submits, no need for fence (you can still use one, though).


Waiting for cookies:



#define TERMINATION_COOKIE_VALUE ((uintptr_t)1)

#define POP_BATCH_SIZE 16

uintptr_t cookies[POP_BATCH_SIZE];
do
{
    uint32_t cookieCount = POP_BATCH_SIZE;

    // on input: specify max # of cookies to grab in one call,
    // or exactly that # if bWaitForAll is VK_TRUE
    //
    // on output: # of cookies popped
    //
    auto result = vkWaitForCompletionEXT( hDevice, hPort, &cookieCount, cookies, VK_FALSE, UINT64_MAX );
    HANDLE_VK_ERROR( result );
    
    ASSERT( cookieCount <= POP_BATCH_SIZE );
    
    for( uint32_t i = 0; i < cookieCount; ++i )
    {
        if( TERMINATION_COOKIE_VALUE == cookies[i] )
        {
            return; // task cleanup in joining thread
        }

        // cast to pointer
        // remove from pending list
        // unlock a mutex
        // recycle workspace
        // etc...
    }
}
while(true);