[QUOTE=krOoze;43015]Heh, I have problem keeping up here with you guys. That we discuss several different topics/features does not help; and digressions into evils of capitalism and conspiracy of vendors do not help either.
Though you mostly seemed to have been beating that Paradigm 1 horse.[/QUOTE]
All my fault! I keep leaving my tinfoil hat on; my head feels lonely without it.
[QUOTE=krOoze;43015]
Images can also be sub-allocations in Vulkan, no?[/QUOTE]
Kinda sorta. If they are all the same size, or within a certain size range, then we can dole out array elements and defragment within an image (if we need arrays within arrays). The other problem is layer-only or whole-mip-level-only transfer granularity from the host: while some queues support sub-image transfers, they tend to be slower than whole-level transfers (setup complexity aside).
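To make "dole out array elements" concrete, here's a rough userland sketch (all names invented): a free list hands out layer indices of one big pool image, and the caller builds a VkImageView over whichever layer it gets.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical pool: one image with `layerCount` array layers,
// each layer handed out as a "sub-allocation".
class LayerPool
{
public:
    explicit LayerPool( uint32_t layerCount )
    {
        // Free list of layer indices, lowest index handed out first.
        for( uint32_t i = layerCount; i-- > 0; )
            freeLayers.push_back( i );
    }

    // Grab a free layer; the caller builds an image view over it.
    std::optional<uint32_t> Acquire()
    {
        if( freeLayers.empty() )
            return std::nullopt;
        const uint32_t layer = freeLayers.back();
        freeLayers.pop_back();
        return layer;
    }

    void Release( uint32_t layer ) { freeLayers.push_back( layer ); }

private:
    std::vector<uint32_t> freeLayers;
};
```

Defragging within the image then amounts to copying a layer and re-issuing the index, which is exactly where the fixed-size restriction bites.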
Also, we run into a related problem with slab allocators (like jemalloc): how much do we allocate up-front? Some runs need only one image of a particular size, with everything else much smaller or larger, so the rest of the slab is wasted space. That works well for a game where all the assets are organized into homogeneous categories known before startup, but this is a world in which anything can happen (within reason). Analogously, where jemalloc is suitable for a server with a somewhat predictable workload of mostly large but transient allocations, it's a bad idea for CAD software, anything that needs arbitrary precision, etc…
Besides, if we can just create a whole image anyway, the client code can decide how it uses it through view configs. A plugin may actually do this if it is needed (while the concept permits arbitrary nesting - you can have sub-allocations within sub-allocations, etc… it is nice to keep the already large overhead as small as possible).
[QUOTE=krOoze;43015]
You need to define the guarantee you are missing in an articulate way. Or provide the sequence of Vulkan commands that won’t work as you want.[/QUOTE]
Alias an image with a buffer. The validation layer will complain (as required), but this allows moving image data with buffer transfers. Intuition suggests that, since the data is all the same size and nothing is compressed, it should be just as POD as it was when it was transferred in. As long as it moves around respecting the granularity, we can just plant a new image on the relocated data and pretend it has the old layout. It's worked so far; it would just be nice to have a way to know when this is actually OK. I'm not sure whether the query would apply to the memory space, or just to a layout, or a combination, etc… I'm still not settled on what the ideal would be.
[QUOTE=krOoze;43015]
You feel the need to repeat this sentiment, and I am sorry. But please don’t confuse scrutiny and prodding for “revulsion or confusion”.[/QUOTE]
It comes from a long history of interacting with the snooty ends of many different communities. They have two modes of operation: 1) They are right. 2) You are wrong. Anything else is a long-winded short-circuit.
[QUOTE=krOoze;43015]
It is not a “settle” situation. Extension is more like a first step in the Peter principle. If there is any revulsion, it is to avoid the last step in the same principle.[/QUOTE]
I was trying to evoke an air of accommodation, or to emphasize that this isn't a personal mission. Probably not effective there.
[QUOTE=krOoze;43015]
That’s nice, but not sufficient nor necessary reason for addition in of itself. If I was cheeky, I would say that “nobody needs to” jump off of a cliff either.[/QUOTE]
That was in response to “but, then everybody has to support it” and other blatant appeals to complexity. I was surprised to read that nobody has to support swapchains or compute. Then we have this weird animal: KHR_external_semaphore. Why not KHR_external_fence, too? Same problem with either extension (more platform code everywhere), but such is life.
I was thinking of using KHR_external_semaphore for an atomicity guarantee on transactions, except that there does not appear to be a way to cancel a submission (block the first submits until all dependent submits have been made). So, if a submit errors out, I'd still have to wait for all prior submits to complete anyway. From the dispatcher's view it turns out to be useless, since it already knows about submission order up-front, and all applicable barriers already exist in every command stream, along with semaphores controlling that dispatch between queue submissions.
[QUOTE=krOoze;43015]
Building an app on asinine device is like building a skyscraper on sand. You should not waste too much time trying. Nor should Vulkan specification.
The specification says implementors should(RFC) not do that. It is reserved for some experimental or specialized devices.
You should not waste too much time prematurely optimizing for that case if the chance of encountering such device infinitesimaly reaches zero.[/QUOTE]
Fuzz testing doesn’t require the underlying test to make sense, it just needs to be compliant. Crazy tests tend to find bugs in unexpected places. Vulkan layering has been great so far, in that we can just build a device fuzzer on top of the same interface without having to make our own fake ICD (as with OpenGL, and OpenGL is hard to fuzz). This was one of the intended purposes, I believe.
As far as queue families go: I imagine it would be annoying to have to make an extension for every specialized device out there, just for extra families. Maybe “the first N families must be distinct, anything extra may follow”.
Say, order the families such that some simple example algorithm always makes the "best" choice in the general case:
// Member function of "Device"
bool ChooseQueueFamily( VkQueueFlags includeAll, VkQueueFlags excludeAll, uint32_t &indexOut ) const noexcept
{
    ASSERT( 0 == (includeAll & excludeAll) );
    for( uint32_t i = 0; i < this->queueFamilies.size(); ++i )
    {
        // note: families were pre-processed to include VK_QUEUE_TRANSFER_BIT wherever compute or graphics are available
        const auto &family = this->queueFamilies[i];
        if( includeAll == (family.queueFlags & includeAll)
            && 0 == (family.queueFlags & excludeAll) )
        {
            indexOut = i;
            return true;
        }
    }
    return false;
}
This is so a naive application won't accidentally select an experimental feature.
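To show that ordering guarantee doing its job, here's a self-contained mock (plain integers stand in for the Vulkan flag bits; the experimental bit and the family layout are invented): with the general-purpose family listed first, the first-match loop above never lands on the exotic family unless explicitly asked to.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Stand-ins for the Vulkan flag bits, so this compiles without the SDK.
using VkQueueFlags = uint32_t;
constexpr VkQueueFlags VK_QUEUE_GRAPHICS_BIT = 0x1;
constexpr VkQueueFlags VK_QUEUE_COMPUTE_BIT  = 0x2;
constexpr VkQueueFlags VK_QUEUE_TRANSFER_BIT = 0x4;
constexpr VkQueueFlags FAKE_EXPERIMENTAL_BIT = 0x100; // hypothetical

struct Family { VkQueueFlags queueFlags; };

// Same first-match loop as ChooseQueueFamily above, free-standing.
bool Choose( const std::vector<Family> &families,
             VkQueueFlags includeAll, VkQueueFlags excludeAll, uint32_t &out )
{
    for( uint32_t i = 0; i < families.size(); ++i )
    {
        if( includeAll == (families[i].queueFlags & includeAll)
            && 0 == (families[i].queueFlags & excludeAll) )
        {
            out = i;
            return true;
        }
    }
    return false;
}
```

With family 0 = graphics|compute|transfer and family 1 = transfer|experimental, asking for plain transfer yields family 0; the experimental family is only reachable by excluding graphics (or including its bit).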
About "being reasonable": You never know, sadly.
It could be company A paying off a vendor to screw over company B's new product with a few "bugs"; the vendor takes the deal and writes in some "bugs" that trigger with B's products, and randomly elsewhere.
A inc. was spying on B’s development, and even provided enough source/details to the vendor so they could leave no trace of foul play. Since A obviously knows about the problems they wanted, they are prepared long before release.
Release comes, everyone updates their drivers, ushering in an endless stream of complaints in B’s forums. Then, the rest of us have the pleasure of finding said “bugs” randomly because our software happened to trigger them.
A inc. continues to pay said vendor to wait just long enough for B’s stock to start dropping (or their kickstarter to go sour), then allows the silent release of a fix. Customers of A probably won’t notice or care. Company A doesn’t make money caring about anything but itself, and they most likely won’t be implicated anyway.
Corporate espionage and sabotage are still very real things today, and can involve many entities outside the parties in direct competition.
Laws on the books come with a tacit “applicable only if caught” clause.
[QUOTE=krOoze;43015]
That’s nice (if you can pull it off). Conventionally, it seems to me, premature future-proofing costs more time than reacting to current situation.[/QUOTE]
Fuzz layer: About 1 person-week. It's infinitely tweakable!
Benchmarks: About 3 person-weeks. Everyone should have these.
Scheduler that profiles queues using said benchmarks: An afternoon?
[QUOTE=krOoze;43015]
Which objects except the fences does it need?
Efficiency is a good way to get around Paradigm 1, though bit hard to prove. The app side implementation would require something like a thread pool and a one fence-wait per thread thing. That sounds bad, but I am not sure driver would not have to do something similar.[/QUOTE]
Completion ports do not need fences. They are complementary to one another in our case: fences are used for backing out of a transaction after an error, and if we had a completion port, that is what the "pop thread" would be waiting on.
In concurrent programming there is a recurring pattern in task execution: branch -> join -> branch -> join, etc… A "completion port", really just a special kind of message queue, is a way to implement a join. Vulkan expects applications to know about all submissions that will occur at any given time, hence vkWaitForFences.
When you have an unknown number of transactions completing at unknown times (concurrent resource uploads from different plugins), you have to either expose all of the respective clients to one another by having them all share some kind of transaction object (a source of nasty sync problems), or you can just let them submit whatever whenever and deal with completion as it happens. Like a server, for instance: You don’t know when someone will connect, or how many will connect within any given interval of time. A device will finish any submission at any time (of course respecting semaphores, etc…), and we don’t want to keep task workspaces sitting around any longer than they have to.
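Stripped of Vulkan, that join looks like this (all names invented; a thread-safe cookie queue stands in for the port): producers finish at unpredictable times and post a cookie, and one pop thread joins on them as they arrive without ever knowing who submitted what.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// A "completion port" reduced to its essence: a thread-safe queue of cookies.
class CompletionQueue
{
public:
    void Post( uintptr_t cookie )
    {
        { std::lock_guard<std::mutex> lock( m ); q.push( cookie ); }
        cv.notify_one();
    }

    uintptr_t Wait() // blocks until a cookie is available
    {
        std::unique_lock<std::mutex> lock( m );
        cv.wait( lock, [this]{ return !q.empty(); } );
        const uintptr_t cookie = q.front();
        q.pop();
        return cookie;
    }

private:
    std::mutex m;
    std::condition_variable cv;
    std::queue<uintptr_t> q;
};
```

The pop thread calls Wait in a loop and never cares how many producers exist or when they run, which is exactly the server-like shape described above.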
[QUOTE=krOoze;43015]
That’s contradictory.
Either it is arbitrary, or the API can reveal something about it. Can’t have both.
What topology description you suggest? There’s device type, and there are Queue families. What else is there common to all contemporary devices?[/QUOTE]
I had in my mind that there might be a device with queue families that can only touch resources in a certain kind of memory. Not sure if this is mentioned anywhere, but transfer from any family A to any family B doesn’t seem to be prohibited in general (for exclusive mode, anyway). Some of my rebuttals were admittedly more frustrated than informed.
[QUOTE=krOoze;43015]
I like that somewhat better than the bind solution; more explicit when the VkCompletionInfoEXT is accessed.
Can it be made without the need for internal synchronization, i.e. the VkCompletionInfoEXT marked as "externally synchronized"? I mean, it looks like at least the wait operation would have to be an exception to that, which is annoying to introduce into Vulkan, which generally avoids that except for the Pipeline cache.[/QUOTE]
More annoying, and confusing, is that VkQueue needs external synchronization. If drivers are using any kind of linked list internally, a simple application mistake could lead to a very hard-to-diagnose "lost submission" bug or similar. This is why we went through all the trouble of hiding queue handles behind a scheduler interface.
The reason I suggested the bind-based solution is that I imagined drivers would use fds for queues internally, and on Windows an fd can only be bound once to a completion port (since Windows 8 it's been possible to change the binding, but it looks hacky).
I think there is a way to make a userland-only "CookieQueue" that allows multiple concurrent pushes and batched pops. This would work for drivers that keep Vulkan there (all of them?) and employ what look like "push" and "pop" threads (NVIDIA drivers appear to do this).
I've made one for a single-consumer scenario before, and I'll post a simplified version of it when I can get around to it. There is a way to avoid the "unbounded submit" problem by placing an upper limit on the number of in-flight cookies, which means the applicable vkQueueSubmit will block until a cookie slot is available.
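Until I dig up the real thing, here's a rough single-consumer sketch of that bounded idea (all names invented, not the version mentioned above): push blocks while every cookie slot is in flight, which is what the submit timeout would surface as VK_TIMEOUT.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <cstdint>
#include <deque>
#include <mutex>

// Bounded multi-producer, single-consumer cookie queue.
// Push blocks (up to a timeout) while all cookie slots are in flight.
class CookieQueue
{
public:
    explicit CookieQueue( size_t maxPending ) : maxPending( maxPending ) {}

    // Returns false if no slot frees up within the timeout.
    bool TryPush( uintptr_t cookie, std::chrono::nanoseconds timeout )
    {
        std::unique_lock<std::mutex> lock( m );
        if( !slotFree.wait_for( lock, timeout,
                                [this]{ return q.size() < maxPending; } ) )
            return false;
        q.push_back( cookie );
        available.notify_one();
        return true;
    }

    uintptr_t Pop() // single consumer; blocks until a cookie arrives
    {
        std::unique_lock<std::mutex> lock( m );
        available.wait( lock, [this]{ return !q.empty(); } );
        const uintptr_t cookie = q.front();
        q.pop_front();
        slotFree.notify_one(); // a pusher may now take the freed slot
        return cookie;
    }

private:
    const size_t maxPending;
    std::mutex m;
    std::condition_variable slotFree, available;
    std::deque<uintptr_t> q;
};
```

The fixed maxPending plays the same role as maxPendingCompletions in the rehash below: capacity is decided once, so no per-submit allocation.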
In the meantime, here’s a rehash:
// Chained through VkSubmitInfo::pNext
struct VkCompletionInfoEXT
{
    VkStructureType     sType;  // VK_STRUCTURE_TYPE_COMPLETION_INFO_EXT
    void               *pNext;
    VkCompletionPortEXT port;
    uintptr_t           cookie;
    uint64_t            timeout; // max time to wait for an available cookie slot (ns)
};

struct VkCompletionPortCreateInfoEXT
{
    VkStructureType sType;  // VK_STRUCTURE_TYPE_COMPLETION_PORT_CREATE_INFO_EXT
    void           *pNext;
    VkFlags         flags;  // not used, but here because reasons
    uint32_t        maxPendingCompletions; // max # of cookie slots
    // Since a completion port is a one-time allocation, the max # of
    // concurrent submits supported must be specified up-front.
    //
    // A reasonable starting point:
    //
    //   maxPendingCompletions =
    //       4 * std::thread::hardware_concurrency() * queueFamilyCount;
    //
    // Otherwise, an allocator would need to be invoked with every
    // submit, which is costly.
};
VkResult vkCreateCompletionPortEXT(
    VkDevice device,
    const VkCompletionPortCreateInfoEXT *pCreateInfo,
    const VkAllocationCallbacks *pAllocator,
    VkCompletionPortEXT *pCompletionPort );

void vkDestroyCompletionPortEXT(
    VkDevice device,
    VkCompletionPortEXT port,
    const VkAllocationCallbacks *pAllocator );

VkResult vkWaitForCompletionEXT(
    VkDevice device,
    VkCompletionPortEXT port,
    uint32_t *pInOutCookieCount,
    uintptr_t *pCookies,
    VkBool32 bWaitForAll,
    uint64_t timeout );

// For thread control
VkResult vkPostCompletionEXT(
    VkDevice device,
    VkCompletionPortEXT port,
    uintptr_t cookie,
    uint64_t timeout );
Queue submission:
VkCompletionInfoEXT completionInfo;
ZeroFill(completionInfo);
completionInfo.sType = VK_STRUCTURE_TYPE_COMPLETION_INFO_EXT;
completionInfo.port = hCompletionPort;
completionInfo.cookie = reinterpret_cast<uintptr_t>(pTaskWorkspace);
completionInfo.timeout = 1000000000u; // wait for at most 1 second.
VkSubmitInfo submitInfo;
//
// blah...
//
submitInfo.pNext = &completionInfo;
// If, after 1 second, an internal cookie slot doesn't become
// available, this will return VK_TIMEOUT.
auto result = vkQueueSubmit( hQueue, 1u, &submitInfo, VK_NULL_HANDLE );
HANDLE_VK_ERROR( result );
// ^^^ last submit in chain of submits, no need for fence (you can still use one, though).
Waiting for cookies:
#define TERMINATION_COOKIE_VALUE ((uintptr_t)1)
#define POP_BATCH_SIZE 16

uintptr_t cookies[POP_BATCH_SIZE];
do
{
    uint32_t cookieCount = POP_BATCH_SIZE;
    // on input: specify max # of cookies to grab in one call,
    // or exactly that # if bWaitForAll is VK_TRUE
    //
    // on output: # of cookies popped
    //
    auto result = vkWaitForCompletionEXT( hDevice, hPort, &cookieCount, cookies, VK_FALSE, UINT64_MAX );
    HANDLE_VK_ERROR( result );
    ASSERT( cookieCount <= POP_BATCH_SIZE );
    for( uint32_t i = 0; i < cookieCount; ++i )
    {
        if( TERMINATION_COOKIE_VALUE == cookies[i] )
        {
            return; // task cleanup in joining thread
        }
        // cast to pointer
        // remove from pending list
        // unlock a mutex
        // recycle workspace
        // etc...
    }
}
while( true );