Official Vulkan Feedback: API for High-efficiency Graphics and Compute on GPUs

Hi Y’all,

My request to the committee working on Vulkan:

Please consider releasing some kind of information on Vulkan resource management.

For those of us who are (1) looking to port from GL/GLES to Vulkan and (2) aren’t already running on consoles, resource management is going to be a totally new feature for our app. I for one would like more situational awareness, e.g. what does the abstraction look like, what problems does the app have to solve that the driver used to manage, etc.

I understand the desire to not thrash us with “un-baked” stuff, but right now if you don’t have Mantle under NDA, resource management is just a big question mark.

cheers
Ben

While we don’t know everything about memory, we know quite a bit. From the slides and presentation, we know:

  1. Memory is handled separately from buffers and textures. These use memory, but they are not created owning any particular piece.

  2. GPUs expose explicit memory pools, which have different (and explicit) limitations.

  3. Memory pools have explicit size limits, i.e. you can run out of allocated memory.

  4. You can ask a texture how much memory it needs, which is based on its sizing information and internal format.

  5. There is some kind of DMA queue. Probably for DMAing :wink:

  6. There is a notion of memory residency, which associates memory with a particular queue.

From this, we can surmise the following:

  1. Textures/buffers can have their current memory removed and reassigned to other textures/buffers.

  2. You can’t execute operations that use certain memory through a particular queue unless that memory is resident in that queue. So you need to assign/unassign memory residency to queues before executing commands through them. It is the application’s responsibility to ensure this.

  3. The residency function for a queue doesn’t appear to take a memory range, so residency appears to be all-or-nothing for an entire allocation region.
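To make the surmised model a bit more concrete, here is a toy sketch in C++. Every name in it (MemoryPool, allocateFrom, bindMemory, addResidency, canExecute) is made up for illustration and is not part of Vulkan. It only mirrors the points above: pools have a hard size limit, resources are bound to memory rather than owning it, and residency is tracked per queue for a whole allocation.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Hypothetical stand-ins; none of these names come from the Vulkan API.
struct MemoryPool {
    std::size_t capacity;  // explicit size limit of the pool
    std::size_t used;      // bytes handed out so far
};

struct Allocation {
    MemoryPool* pool;
    std::size_t size;
};

// Allocate from a pool; returns false when the pool cannot satisfy the request.
inline bool allocateFrom(MemoryPool& pool, std::size_t size, Allocation& out) {
    assert(pool.used <= pool.capacity);
    if (size > pool.capacity - pool.used) return false;
    pool.used += size;
    out.pool = &pool;
    out.size = size;
    return true;
}

// A resource does not own memory; it is merely bound to an allocation.
struct Texture {
    std::size_t requiredBytes;  // what "asking the texture" would report
    const Allocation* backing;  // nullptr until memory is bound
};

inline bool bindMemory(Texture& tex, const Allocation& mem) {
    if (mem.size < tex.requiredBytes) return false;
    tex.backing = &mem;
    return true;
}

// Residency is tracked per queue and per whole allocation (all-or-nothing).
struct Queue {
    std::set<const Allocation*> resident;
};

inline void addResidency(Queue& q, const Allocation& mem) { q.resident.insert(&mem); }
inline void removeResidency(Queue& q, const Allocation& mem) { q.resident.erase(&mem); }

// A command using `tex` may only run on `q` while its memory is resident there.
inline bool canExecute(const Queue& q, const Texture& tex) {
    return tex.backing != nullptr && q.resident.count(tex.backing) > 0;
}
```

In this sketch, a second allocateFrom call that exceeds the remaining capacity simply fails, and canExecute returns false until addResidency has been called for the texture's backing allocation on that queue.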

Can you be more specific about what you’re looking for?

How will the API’s bindings for higher-level languages handle higher-level language features?
I mean function overloading in particular, and also method/member-function notation versus function-style notation
(Including features such as UFCS http://ddili.org/ders/d.en/ufcs.html)
(Other features such as template programming?)

Good naming and function-overloading conventions for higher-level language bindings can increase API uniformity. They would keep every programming language from adding its own function-overloading naming style for the API, which would almost amount to each language creating its own API from an application programmer’s standpoint.

Regarding API design:
Also, please make sure Vulkan can be used in a functional or procedural programming style.

It’s a C interface, so there won’t be any overloading. Or member functions.

While it is a C API (no overloaded functions), every first parameter is the object it works on and the name starts with vk followed by the type it works on and then the function name itself. There will be some exceptions but that’s the general gist.

So wrapping with C++ (or any OO language) will be very easy.
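As a rough illustration of why that convention wraps easily, here is a small C++ sketch. The demoBuffer* functions are a mock C-style API invented for this example (they are not Vulkan functions); they follow the described pattern of prefix, then type, then operation, with the object as the first parameter.

```cpp
#include <cassert>
#include <cstddef>

// Mock C-style API (hypothetical names): prefix + type + operation,
// with the object it works on as the first parameter.
struct DemoBuffer {
    std::size_t size;
    bool mapped;
};

inline void demoBufferInit(DemoBuffer* buffer, std::size_t size) {
    assert(buffer != nullptr);
    buffer->size = size;
    buffer->mapped = false;
}

inline std::size_t demoBufferGetSize(const DemoBuffer* buffer) { return buffer->size; }
inline void demoBufferMap(DemoBuffer* buffer) { buffer->mapped = true; }
inline void demoBufferUnmap(DemoBuffer* buffer) { buffer->mapped = false; }

// Thin wrapper: each member function forwards its own handle as the first
// argument, so the mapping from C function to method is mechanical.
class Buffer {
public:
    explicit Buffer(std::size_t size) { demoBufferInit(&handle_, size); }

    std::size_t size() const { return demoBufferGetSize(&handle_); }
    bool mapped() const { return handle_.mapped; }

    void map() { demoBufferMap(&handle_); }
    void unmap() { demoBufferUnmap(&handle_); }

private:
    DemoBuffer handle_;
};
```

With this, `Buffer b(64); b.map();` reads like a method call while doing nothing more than calling demoBufferMap on the wrapped handle.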

Oh, and some other things about memory that we know:

  1. You have to make sure that memory objects are no longer being used by any outstanding commands when you go to delete them. Otherwise badness will occur.

  2. The same goes for textures/buffers (and any other Vulkan objects too).

the name starts with vk followed by the type it works on and then the function name itself. There will be some exceptions but that’s the general gist.

There are rather more than “some”. I went through the presentation slides, trying to develop an algorithm to detect whether a function should be considered a “member” of a particular type based solely on its name.

There were certain words that could denote this, but it wasn’t consistent. The Create/Destroy functions only mention the object types they create/destroy, not which object those functions are a member of. You create a buffer from a device, but the function doesn’t mention “Device” anywhere.

That being said, would you really want to call the function vkDeviceCreateBuffer? Or vkMemoryMapMemory? It certainly is consistent, but that doesn’t make it necessarily good.

I would therefore suggest that Khronos publish something like gl.xml for Vulkan’s API. Each function should state in its metadata which object it is conceptually a member of. Also, there should be some notion for each function of whether it is conceptually a “const” member function (i.e. doesn’t modify the object).
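For what it’s worth, the kind of name-based detection described above can be sketched in a few lines of C++. This is only a guess at such an algorithm, not the one actually tried: ownerFromName is a hypothetical helper, and the list of type names passed in is an assumption. It shows why the approach breaks down for Create-style names.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Returns the known type name that directly follows the "vk" prefix, or an
// empty string when the function name does not start with "vk" + a known type.
inline std::string ownerFromName(const std::string& function,
                                 const std::vector<std::string>& types) {
    assert(!types.empty());
    const std::string prefix = "vk";
    if (function.compare(0, prefix.size(), prefix) != 0) return "";

    std::string best;
    for (const std::string& type : types) {
        // Prefer the longest type name that matches right after the prefix.
        if (function.compare(prefix.size(), type.size(), type) == 0 &&
            type.size() > best.size()) {
            best = type;
        }
    }
    return best;
}
```

With the types Queue, Device and Memory, a name like vkQueueSubmit resolves to Queue, while a name like vkCreateBuffer (assumed here as an example of a Create function) resolves to nothing, even though the buffer is conceptually created from a device. That gap is what explicit metadata would close.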

I wouldn’t mind DeviceCreateBuffer or MemoryMap.

I hope they will remain consistent, if only to avoid the mess that results two major versions from now, after half again as many functions have been added, as happened in OpenGL.

Hi Alfonse,

Thanks for the good write-up on what is known about memory so far. I re-watched the talk; it looks like:

  • Chunks of memory are explicitly allocated - so we either know we got the 1 GB of texture memory we need, or we didn’t - and we can know that early.
  • Sub-regions of GPU memory are used to back a resource using some kind of bind call (e.g. vkBindObjectMemory).
  • GPU memory is only guaranteed to be resident on the GPU between calls to vkQueueAddMemReference and vkQueueRemoveMemReference.

I thought I saw a recent slide describing the types of memory available, but now I can’t find it. :frowning:

So I think my questions are:

  • What is the expensive operation? Binding an object to a GPU memory region, adding the memory region’s reference to a queue, or both? When am I actually paying for the cost of getting my texture over the bus?
  • If memory has to be re-used (e.g. I have bus bandwidth but not a lot of GPU memory) how do I evict? By swapping which memory objects are ref’d by the queue or by binding different objects into the same memory area?
  • Which of these operations are synchronous and which are asynchronous? (It looks like vkBindObjectMemory must be synchronous because there’s no queue or command buffer installed.)

I also didn’t see any indication of what the “command” is for buffer copies/DMA operations. Naively I would expect DMA to work by scheduling a command that copies from one buffer to another, where one buffer is backed by system memory and one is backed by GPU memory; the buffer that is the target of the DMA would be usable after the command buffer containing the DMA is known to have finished. But I don’t think we’ve seen slides or sample code yet.

cheers
Ben

  • What is the expensive operation? Binding an object to a GPU memory region, adding the memory region’s reference to a queue, or both? When am I actually paying for the cost of getting my texture over the bus?

Note: What follows is, at best, semi-educated guess work. Take it for what you will.

I always assumed that when you allocate memory from a particular pool, you were allocating memory from that pool. So if that pool is “across the bus”, then that’s where the memory is. Which means that the “expensive operation” would be doing the DMA or map-and-write or whatever operations there are to write to that memory.

I guess that the whole queue memory reference thing is primarily about virtual GPU memory. The memory may have been paged out by whatever processes exist on that system for controlling the virtual GPU address space. Thus, when you reference a piece of memory, you’re saying to page that memory back in if it’s out. Of course, that would require some notion of a difference between physical memory limitations and virtual memory limits.

Again, all very speculative, but something to think about.

However, adding to the questions about memory:

  • If the above is true, does memory allocation happen along page-aligned sizes? If so, do we get to query what the alignment size for each pool is?

  • If the above is true, do we get to query if a particular memory pool is virtualized or not?
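If allocations do turn out to be page-aligned, the arithmetic an application would need is small. A minimal sketch, assuming the alignment is a power of two (the 4096-byte figure in the note below is just an example value, not something from the slides):

```cpp
#include <cassert>
#include <cstddef>

// Round `size` up to the next multiple of `alignment`.
// `alignment` must be a non-zero power of two.
inline std::size_t alignUp(std::size_t size, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (size + alignment - 1) & ~(alignment - 1);
}

// Bytes lost to padding when a request is rounded up to the pool's alignment.
inline std::size_t paddingFor(std::size_t size, std::size_t alignment) {
    return alignUp(size, alignment) - size;
}
```

With a 4096-byte alignment, a 5000-byte request would occupy 8192 bytes, wasting 3192. This is why being able to query the alignment per pool would matter for anyone sub-allocating many small resources.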

The GDC slides mention that command queues can signal and wait for semaphores (page 30, vkQueueSignalSemaphore and vkQueueWaitSemaphore). Will these be operating system semaphores, like HANDLEs on Windows and file descriptors (eventfds) on Linux? Or will they be semaphore objects private to Vulkan? And will DMA queues also have this ability?

Having operating system semaphores would make it possible to submit a bunch of stuff (texture uploads, command buffers, etc.) and signal semaphores once that’s done. Then the application can react to the signaled semaphore and decide what to do next. But it doesn’t have to wait during that time and can react to other application-specific events while the GPU is busy. And with operating system semaphores you can do pretty much anything during that time: react to other threads that deliver their finished work, shuffle around incoming audio or video buffers, distribute pending network packets, etc. None of that is CPU-intensive work, just stuff that needs to be coordinated. Such an event loop can avoid quite a bit of synchronization overhead and make it easier to properly prioritize different tasks.

In OpenCL I can use the clSetEventCallback function to signal OS semaphores when an event is signaled. But in OpenGL the closest thing I could find was Issue 18. B6 of the GL_ARB_sync extension. I found some talk about a wglConvertSyncToEvent function that could convert a sync object to something you could use in the OS synchronization functions (WaitForSingleObject, MsgWaitForMultipleObjects, poll, …). But with sync objects being quite fine-grained, it would be very costly or impossible to have them signal the CPU and go through the whole OS stack. So OpenGL got its own thread, in my case not because it needs CPU time but because it uses a different kind of “waiting” than the operating system. As far as I understand, synchronization with Vulkan command queues is much more coarse-grained than events within a command buffer. So it would be very nice if they worked well together with OS synchronization functions.

Right now I’ve mostly done video applications where I use OpenGL to do some image processing and composite many different video sources into one output video. There OpenGL is just one building block and the program needs to react to many different things going on around it. I’ve also done some simple games and 2D GUI stuff with OpenGL and there it’s not much of a problem. So I guess it’s not a common problem. But it sure would make my life easier if I could build an event driven graphics library on top of Vulkan.

Oh, and it would be really nice if Vulkan offered a way to signal a semaphore at each vblank. Then I could simply swap buffers when the vblank is at hand and still do useful stuff (process other events) until then. Kind of adaptive vsync for free. But I’m not sure if Vulkan or its Window System Interface is the right place for that.

The signals, events, fences and barriers should not be limited to just the render queues or only inside a render pass (that would be a grave mistake no matter which way you look at it).

User code can wait on and signal events. I hope you can set user callbacks for signaled events (but which thread runs them?), or wait on multiple events, or hook them into the OS-specific event queue/signaling mechanics (POSIX select or Win32 WaitForMultipleObjects). I think at least one of these is needed to avoid a thread per waited-on event. I have found no confirmation of that, though.

They shouldn’t all be converted to OS events in a typical application; some events will either be polled irregularly (for example to free the user-space buffer for a queued DMA) or be used for inter-queue synchronization (only starting a render step after the uniform data has been uploaded).

Waiting on vSync/vBlank will probably happen. So you can start the next renderPass to the window right as it happens and do the previous rendering to textures (shadow maps, forward/deferred renders etc.) before that.

I’m not so sure about that. Limiting events to within command buffers or command queues might allow the GPU’s hardware scheduler to do most synchronization without talking back to the CPU, reducing overhead. And for most things I don’t even want to babysit this from the CPU. I just want to submit a bunch of commands to the command queue and insert a sync point. Then the program can do something else, and each time it looks for the next job (via WaitForMultipleObjects, poll, select, …) it can also monitor the sync point to see if the GPU is done. Allowing each fine-grained event to talk back to the CPU and OS sounds like a driver’s nightmare. I would be perfectly happy with a simple sync command I can submit to the command queue. When it is reached, the GPU can poke the CPU and tell the OS to signal an event object, eventfd, whatever: simply something I can put into the WaitForMultipleObjects, select or poll functions to monitor it alongside other event sources.

To me it seems like all the pieces are already there in Vulkan. Fine grained events within a command buffer for the GPU scheduler, coarse grained events with possible GPU to CPU signaling in the command queue. The only thing that’s missing is to expose an event object the rest of the OS APIs can work with. That’s why I voiced it here.

Event callbacks also sound kind of difficult to me. If I were a driver programmer, I would not like to execute a callback without knowing how long it takes and what it does to my environment. And from an application programmer’s perspective, a callback pretty much always requires synchronization in itself, even if it’s executed within your own thread. You just don’t know what state your program is in when the callback is executed. It’s pretty much the same thing as with signal handlers on UNIX/Linux. I’m really happy they introduced signalfds in Linux so you can monitor for pending events like for everything else. Then when you react to the pending event/signal you know what state your program is actually in.

Waiting for vBlank never gave me much trouble (apart from some broken Linux desktops). But all the ways I know of block the calling thread. You can use timers to do some work until shortly before the vBlank, but when you miss it you’ll wait for an entire frame. What I would like to see is a way to monitor for a vBlank event with OS mechanisms like WaitForMultipleObjects or poll. Then I can simply swap buffers during vBlank without blocking everything. If I miss the timing it’s my fault and the user might see a tear line at the top of the screen. But then it has always been the responsibility of the application itself to not block an event loop with long-running event handlers. I wouldn’t be surprised if that is what NVIDIA’s adaptive vSync actually does under the hood.

Then perhaps there could be different types of events, where a given kind can be waited on either by the CPU or only in a GPU queue (perhaps chosen by a parameter at construction that specifies who is going to signal and wait on it).

True. The only way I can see that working is if Vulkan exposes a vkHandleCallBacks function, but that may as well be implemented entirely in user code:


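// events[0] is a wake-up event owned by the loop; the remaining slots hold the
// application's events. vkEventWaitForAny is a hypothetical function.
while (true) {
    vkEventWaitForAny(events, eventCount, &signaledEvent);

    if (signaledEvent != 0) {
        callbacks[signaledEvent](userData[signaledEvent]);

        // Remove the handled slot by moving the last entry into its place.
        // Decrement first: events[eventCount] is one past the last valid slot.
        --eventCount;
        swap(events[signaledEvent], events[eventCount]);
        swap(callbacks[signaledEvent], callbacks[eventCount]);
        swap(userData[signaledEvent], userData[eventCount]);
    } else {
        // Woken up because the event list changed; re-arm the wake-up event.
        vkEventReset(events[0]);
    }
    // TODO: add thread safety for adding events, and trigger events[0] when one is added
}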
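
The removal step in that loop is the usual swap-with-last trick. Here is a self-contained C++ sketch of just that part, with a plain vector standing in for the event arrays. PendingEvent and dispatchAndRemove are hypothetical names and nothing here touches a real Vulkan object.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-in for a waitable event plus its callback.
struct PendingEvent {
    int id;
    std::function<void()> callback;
};

// Runs the callback of the entry at `index`, then removes that entry in O(1)
// by swapping it with the last element and shrinking the vector.
inline void dispatchAndRemove(std::vector<PendingEvent>& pending, std::size_t index) {
    assert(index < pending.size());
    pending[index].callback();
    if (index + 1 != pending.size()) {
        std::swap(pending[index], pending.back());
    }
    pending.pop_back();
}
```

Note that this does not preserve the order of the remaining entries, which is fine for a set of events that can complete in any order.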

I meant waiting on vBlank in the queue. That way you can queue up a buffer swap, and as soon as the event happens the buffer swap happens.

Good luck with “Vulkan”…LOL

That’s pretty much what I thought the difference between events and semaphores is about in Vulkan. Events = fine-grained GPU-only synchronization. Semaphores = coarse-grained synchronization the OS can work with. But that’s only hopeful guesswork on my part. I hoped someone from the working group could clarify this.

This is just a way to make an event loop execute a callback for each incoming event. And like you said, this can be done by the programmers themselves; no need to have such a thing in Vulkan.

But such a function would defeat the point. It would only wait for events from Vulkan. It wouldn’t be possible to include other HANDLEs or file descriptors in the waiting process. I couldn’t use that code to also wait for incoming video or audio buffers or network packets. So even such a simple example would require isolating Vulkan in its own thread, shuffling buffers to and from that thread with some queues, and doing some polling there to see if new buffers or Vulkan events are pending, just because I can’t integrate it into an OS event loop that uses MsgWaitForMultipleObjects(…) or poll(…). That’s pretty much like it is today with OpenGL.

But all that would evaporate if I could get a waitable HANDLE or file descriptor out of a Vulkan semaphore. Something like this pseudocode:

VK_SEMAPHOR sem = { ... };
vkQueueSubmit(queue, 1, commandBuffers, fence);
vkQueueSignalSemaphore(queue, &sem);

// Linux
int fd = 0;
vkSemaphoreOSHandle(&sem, &fd, sizeof(fd));
// use fd in a poll(...), select(...), epoll(...) event loop

// Windows
HANDLE event;
vkSemaphoreOSHandle(&sem, &event, sizeof(event));
// use event in a WaitForMultipleObjects(...), MsgWaitForMultipleObjects(...), etc. event loop

The vkSemaphoreOSHandle(…) function would return the underlying OS object for the semaphore. That OS object could then be used in the operating system’s event multiplexing functions. Of course the vkQueueSubmit(…) and vkQueueSignalSemaphore(…) calls would usually be within the event loop, not before it. I hope this clarifies what I’m looking for.
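
To show what the Linux half of that would buy, here is a small C++ sketch using only real Linux calls (eventfd, poll, read, write). It does not use Vulkan at all: a plain eventfd stands in for the file descriptor that the hypothetical vkSemaphoreOSHandle would hand out, and signalFd plays the role of the driver signaling completion. The helper names are made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <poll.h>
#include <sys/eventfd.h>
#include <unistd.h>
#include <vector>

// Creates a Linux eventfd that stands in for "the fd behind a semaphore".
inline int createWaitableFd() {
    return eventfd(0, EFD_NONBLOCK);
}

// In this sketch, what a driver would do on completion: make the fd readable.
inline bool signalFd(int fd) {
    assert(fd >= 0);
    const std::uint64_t one = 1;
    return write(fd, &one, sizeof(one)) == static_cast<ssize_t>(sizeof(one));
}

// Waits until any of the fds is readable. Returns the index of the first
// readable fd, or -1 on timeout or error.
inline int waitForAny(const std::vector<int>& fds, int timeoutMs) {
    std::vector<pollfd> entries;
    for (int fd : fds) {
        pollfd entry;
        entry.fd = fd;
        entry.events = POLLIN;
        entry.revents = 0;
        entries.push_back(entry);
    }
    const int ready = poll(entries.data(), entries.size(), timeoutMs);
    if (ready <= 0) return -1;
    for (std::size_t i = 0; i < entries.size(); ++i) {
        if (entries[i].revents & POLLIN) return static_cast<int>(i);
    }
    return -1;
}

// Consumes the eventfd counter so the fd stops being readable.
inline bool acknowledgeFd(int fd) {
    std::uint64_t value = 0;
    return read(fd, &value, sizeof(value)) == static_cast<ssize_t>(sizeof(value));
}
```

The point of the design is that the same waitForAny call can watch sockets, audio or video device descriptors, and the "GPU done" descriptor side by side, so no dedicated waiting thread is needed.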

You are going to have at least 3 variations of events:

  • signaled by the CPU and waited on by the GPU (waiting on the CPU-side SSBO buffer to be filled before DMA)
  • signaled by the GPU and waited on by the CPU (waiting on a finished DMA to (re)use the CPU-side buffer)
  • signaled by the GPU and waited on by the GPU (waiting on a render pass to complete and the target texture to become readable, so the resulting shadow map can be used)

That is assuming no hybrid events are allowed.

The callback in my code could be used to signal the OS-level event passed in via the void* userdata. That would introduce some extra delay and tie up a mostly idle thread, but it keeps the OS-specific event handling out of the Vulkan spec. I can see that going either way.

Frankly, I don’t see how any more guesswork will help the working group. I posted my feedback on Vulkan and tried to clarify it. As Vulkan is finalized, more details will be announced, or someone from the working group will write about it. Until then there’s no need for further discussion.

While Vulkan is not Mantle, it does derive from there. So I found this presentation on Battlefield 4’s Mantle implementation that mentions Semaphores: They seem to regard semaphores as being ways to synchronize activity between multiple queues.

Also, using this list of Mantle’s API functions as a guide, there does not appear to be a way to get the status of a semaphore to see if it’s signaled. Whereas there is such an API for Mantle events.

So I don’t think the idea is that you use semaphores for GPU/CPU synchronization.

We will probably learn more about Vulkan when AMD publicly releases their programming guide to Mantle later this month. Assuming they go ahead with that.

And what happens if you want your code to actually be cross-platform? Maybe you’re using C11 standard library threads or C++11 standard library threads. Or whatever.

Vulkan is supposed to be cross-platform. The main API shouldn’t have these platform-specific function calls in it.

Why would you need to do that? Wouldn’t you just refrain from submitting the DMA request until the buffer was filled?

AMD already released the Mantle guide + whitepaper.

True, and if you really want to hold a queue up inside a command buffer, you can submit a buffer with just a GPU-to-GPU event to another queue.

From the synchronization section of the Mantle spec, I can see 3 explicit synchronization objects:

Fences: only for an end-of-command-buffer signal to the CPU. Can be inserted with a dummy null buffer if needed.

Events: for setting/resetting inside command buffers; more expensive to test against.

Semaphores: for inter-queue synchronization. The application is responsible for avoiding deadlock; the debug layer can detect a long-waiting queue.

Thanks for the reply and links! The presentation and function list were quite interesting. There’s a grSignalQueueSemaphore() function, but the presentation mentions only queue synchronization with semaphores. I’ll just wait for the Mantle programming guide to show up.

I agree with you about platform independence. But then I usually work in areas where platform independence is little more than a pipe dream (at least for more complex things). Maybe someday as an extension, maybe allowing Vulkan to be used outside of a typical render-loop architecture.

Edit: Just looked through the Mantle API programming guide and it doesn’t say anything about OS handles. That pretty much settles it for me. Thanks for the replies.