Official Vulkan Feedback: API for High-efficiency Graphics and Compute on GPUs

Conversation continued here.

[QUOTE=ratchet freak;31094]multithreading and its pitfalls is not for beginners, Vulkan apps can still be single threaded but buffer interaction must be synchronized with the GPU explicitly.

telling that to beginners that the upload happens asynchronously and the pointer must stay alive until the fence has been triggered will lead to confusion unless they have previous experience with async IO.[/QUOTE]

I’ve dealt with it to an extent in my classes, but I am by no means an expert on it, so I call myself a beginner. I do know what you mean, though; this is an advanced topic.

Wow, thanks for the long response. I didn’t mean to sound argumentative or anything. I do indeed work step by step like everyone else, but I like to look ahead and trace down to the bottom when I need to, if that makes sense. If anything, it’s the challenge aspect and my stubbornness that keep me focused and the engine turning. If I get bored I will just wander off and do something else, heh.

But yeah, I do agree Khronos shouldn’t be spending time catering to beginners; I might have misread to an extent. They should be focusing on what’s important at this stage, I would imagine, and, as you said, stuff like input libraries isn’t all that relevant or useful for what Vulkan is trying to accomplish (from what I understand).

That’s kind of why I wanted to just jump into Vulkan. I don’t see a reason to become an expert in OpenGL when Vulkan has an equivalent for everything. I mean, if I can’t get access to the API then I guess I’ll just work with OpenGL. Not much I can do about that.

[QUOTE=Alfonse Reinheart;31085]Furthermore, if forward compatibility is your concern, then detailed specs aren’t helpful to you. Consider a world where TBRs never existed. Then suddenly, someone comes out with one. Well, Vulkan’s API would have no way to tell you that it’s a TBR, and therefore you will assume there’s a problem because you see terrible write bandwidth. But TBRs don’t need huge write bandwidth, by their very nature. To fully understand the value, you would need to interpret the specification differently. But there’s no way to codify the notion of TBR in that API; you’d need some kind of extension, and you would need to radically change every application that uses this spec data.

At least by doing it Vulkan’s way, they have a single, extensible value that represents a particular kind of renderer. If a new one shows up, then it uses a new value, and developers will use a fall-back case until they learn how to do the right thing.

Remember: premature optimization is the root of all evil. And the only possible use for the kind of information you’re talking about is premature optimization. So I would say that the best thing you can do is continue to write code based on empirical evidence.[/QUOTE]

Alfonse, I appreciate your feedback but I think I implied too much. I am in no way suggesting that this would completely remove the need for specific hardware work and imply implicit trust of the provided specs. I am not suggesting exact stall times, bandwidth, flops, etc, be provided. Simply broad strokes that remove the need for some of the most obvious optimization techniques either driver devs or us have been doing by hand for years.

A driver could provide a general order of preference for state changes. It could provide a general cost comparison of different types of sampling on the same chip. This isn’t specs, it’s “best behavior.”

One of the advantages that OpenGL does have over Vulkan is it allows the driver to make more informed decisions from time to time since, typically, hardware vendors know the most about the metal. By providing “behavioral preferences” in Vulkan, a driver team can pass on some of this implicit knowledge to the developer.

Consider this scenario: internally, a GPU sets up its pipeline by configuring vertex stream behavior and then applying those stream characteristics to the stream format used by whichever vertex shader is currently on chip. In that case, it would always be advantageous to apply the shader code before the vertex stream formats, to avoid the stream specs being applied twice. This is special knowledge: a driver team knows it, a game dev does not. A simple behavioral preference exposed through the Vulkan API would be an easy, simple, and efficient way to pass it on.

Response is here.

That means please stop replying in this thread, which is for feedback on Vulkan :wink:

What I hope will happen is that AAA game devs will actually comply with the spec; that rant on gamedev.net and prior research tell me that games sometimes ship barely bothering to comply with it.

It reeks of “works on my machine”: give everyone in the shop the same high-powered machine, leave it to the driver guys to pick up the slack, and bloat the driver beyond what’s needed. Having a proper validation layer will hopefully put more pressure on them to stay within spec. TBH, if I were a producer or lead I would enforce that no game ships until it passes the entire validation suite.

From:
https://drive.google.com/file/d/0B-MryQk4ewrRMVlVdjRGeWFiZXM/view?pli=1
Page 33.

Bindless

  • Debatable need – descriptor sets can be of arbitrary size
  • Explicit memory residency already in API

You are implying you can make the API faster and are not going to do it because it’s already fast enough?!?
If the API can be made faster and unnecessary CPU-GPU round trips can be avoided, then implement this functionality from the start where applicable, or provide something equivalent or better.

And a DSA-style API too, where applicable, or something equivalent or better.

Also, don’t forget to have API functionality for applications to query whether a specific version of Vulkan and SPIR-V is supported by a driver, and/or to ask which versions of Vulkan and SPIR-V the driver supports.
Over time this will become more important, as new things are added to the API 5, 10, or even 25 years from now.

From: https://drive.google.com/file/d/0B-MryQk4ewrRMVlVdjRGeWFiZXM/view?pli=1
Page 33

Submit the same command buffer many times

  • Amortized cost of building command buffer literally approaches zero

Almost there. Can’t I just keep a copy of the command buffer in the GPU’s memory and have it executed once every frame?
(Without having to resubmit the whole command buffer.)
If I don’t need to change it, I can just reuse that data, right?

First, that is an awesome link. I was thinking about writing a post on a blog somewhere about why we need APIs like Vulkan, but that post summed it up. It even made arguments based on information I didn’t know about.

Second, while Vulkan’s validation layer ought to help, I don’t think it’s going to stop developers, per se. When a project is running late and you’ve been working 16-hour days for 7 weeks, are you really going to take the time to test your code under the slow Vulkan validation layer?

It’s not that developers don’t know what they’re doing. It’s about priorities. And when you’ve been awake for the last 23 hours, your biggest priority is shipping the game, not following the letter of some specification.

That being said, the validation layer ought to go a long way toward catching at least some spec violations (it almost certainly won’t be able to catch errors like poking at memory via a mapped pointer while it’s being read). Beyond this, the best chance for conformance is the increasing engine-ification of the industry. A dedicated team of graphics developers can keep their engine valid by frequently testing it with the validation layer.

The other thing is that a Vulkan implementation simply won’t have the kind of information needed to make some of the behind-the-scenes fixes that often get made in modern D3D/OpenGL drivers. Since Vulkan command buffers are designed to be used in highly threaded scenarios, generating commands can’t start locking mutexes and the like. So unless a validation layer or something similar is in use, the bare Vulkan driver can’t have inter-thread communication about which memory is in use, and therefore can’t correct for misuse of the API.

Well, unless Vulkan implementations are willing to slow down multi-threaded command buffer creation. And 60% of the point of Vulkan is to not do that.
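
For illustration, validation in this design is an opt-in at instance creation, so shipping builds pay nothing for it. A hypothetical sketch (the layer name and exact fields are assumptions on my part, not anything announced):

```c
#include <vulkan/vulkan.h>

/* Hedged sketch: enabling a validation layer at instance creation.
 * Treat the layer name as illustrative. Because validation is
 * opt-in, a release build can simply request zero layers. */
VkInstance create_debug_instance(void)
{
    const char *layers[] = { "VK_LAYER_KHRONOS_validation" };

    VkApplicationInfo app = {
        .sType = VK_STRUCTURE_TYPE_APPLICATION_INFO,
        .pApplicationName = "validation-demo",
        .apiVersion = VK_API_VERSION_1_0,
    };

    VkInstanceCreateInfo info = {
        .sType = VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO,
        .pApplicationInfo = &app,
#ifndef NDEBUG
        .enabledLayerCount = 1,          /* debug builds: full validation */
        .ppEnabledLayerNames = layers,
#endif
    };

    VkInstance instance = VK_NULL_HANDLE;
    vkCreateInstance(&info, NULL, &instance);
    return instance;
}
```

The point of the opt-in design is exactly the trade-off above: the multi-threaded fast path stays lock-free, and the cross-thread bookkeeping only exists when you ask for it.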

Hi,

Since so few details are available, it is somewhat hard to give concrete feedback. Instead, I wanted to get in early, as they say, and ask what provisions there are for efficiently managing sparse or virtual textures. With the current OpenGL API for this, there is a need to call a function for each and every square region that is to be committed or uncommitted. My feedback point is that I think this is a major flaw that incurs a lot of needless overhead.

With the direction of Vulkan being to move away from API overhead, I’m hoping this is not going to be repeated! I think the ideal situation would be to let the GPU itself build the data that feeds the mapping and unmapping calls, in a similar way to how indirect draw call APIs work today. The reason I care is my virtual shadow map algorithm for many lights, but I think this use case will come up again and again as people start to realize the power of sparse data. One more use case is simply hardware-supported resolution-matched shadow maps, which I think have genuine potential to finally replace the horrid cascaded shadow maps that are in fashion at the moment. In the end, anything that generates sparse data on the GPU could be a candidate.
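
To make the overhead concrete, committing N pages with today’s ARB_sparse_texture looks roughly like this: one driver call per rectangular region, with the region list necessarily computed on the CPU. The `Region` struct and the helper are made up for illustration; `glTexPageCommitmentARB` is the real extension entry point (its prototype lives in glext.h and is fetched via the extension mechanism in practice).

```c
#include <GL/gl.h>

/* Illustrative only: every page region the GPU decided it needs must
 * be committed one API call at a time from the CPU. */
typedef struct { GLint x, y; GLsizei w, h; } Region;

void commit_regions(GLuint tex, const Region *regions, int count,
                    GLsizei page_w, GLsizei page_h)
{
    glBindTexture(GL_TEXTURE_2D, tex);
    for (int i = 0; i < count; ++i) {
        /* one driver call, with all its validation overhead, per region */
        glTexPageCommitmentARB(GL_TEXTURE_2D, 0,
                               regions[i].x * page_w,
                               regions[i].y * page_h, 0,
                               regions[i].w * page_w,
                               regions[i].h * page_h, 1,
                               GL_TRUE);
    }
}
```

What I am asking for is the ability to replace that CPU loop with a GPU-generated buffer of regions, the way indirect draws replace CPU-issued draw calls.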

So, hopefully this is something that’s already taken care of and then you’re welcome to drop some reassuring hints right here :slight_smile:

Cheers

[QUOTE=Gedolo2;31107]From:
https://drive.google.com/file/d/0B-MryQk4ewrRMVlVdjRGeWFiZXM/view?pli=1
Page 33.

You are implying you can make the API faster and are not going to do it because it’s already fast enough?!?[/quote]

… OK, let me just skip to the end of this conversation and explain what they mean by “bindless”.

They are referring specifically to ARB_bindless_texture. This is the ability to take a texture object, convert it into a number, pass that number to the shader as though it were normal data, and then the shader takes that number and converts it back into a texture. Thus, the shader can use the texture without having to “bind” the texture. And therefore, the shader is using the texture bindlessly.

The Vulkan “equivalent” to binding a texture is loading up a new descriptor set. The point the slide is making is that, because descriptor sets have arbitrary sizes, you can make your descriptor have a gigantic array of textures. Indeed, it could have every texture you could ever use.

And since your shader can pick any texture in the descriptor set to use… you can just pass an array index (ie: a number) to your shader. And it can convert that array index into a texture to use. Therefore having the effect of bindless.

All you have to do is load up a descriptor set of all applicable textures. You do this once, and then every object you render uses that descriptor set. And thus, since the Vulkan “equivalent” to texture binding is loading up new descriptor sets, you’re able to have different objects use different textures without having to “bind” them. Thus “bindlessly”.

So you have all of the advantages of bindless without having a special API for it or throwing around texture handles and so forth.
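
As a sketch (the array size and binding number here are my own arbitrary choices, not anything from a spec), the layout side of this “big descriptor set” approach might look like:

```c
#include <vulkan/vulkan.h>

/* Sketch of "bindless by big descriptor set": declare one binding
 * that is a large array of combined image/samplers, fill it once,
 * and let every shader index into it. 4096 is illustrative. */
VkDescriptorSetLayout make_texture_table_layout(VkDevice device)
{
    VkDescriptorSetLayoutBinding table = {
        .binding         = 0,
        .descriptorType  = VK_DESCRIPTOR_TYPE_COMBINED_IMAGE_SAMPLER,
        .descriptorCount = 4096,            /* the whole texture "table" */
        .stageFlags      = VK_SHADER_STAGE_FRAGMENT_BIT,
    };

    VkDescriptorSetLayoutCreateInfo info = {
        .sType = VK_STRUCTURE_TYPE_DESCRIPTOR_SET_LAYOUT_CREATE_INFO,
        .bindingCount = 1,
        .pBindings = &table,
    };

    VkDescriptorSetLayout layout = VK_NULL_HANDLE;
    vkCreateDescriptorSetLayout(device, &info, NULL, &layout);
    return layout;
}

/* Shader side, in GLSL terms (illustrative):
 *   layout(set = 0, binding = 0) uniform sampler2D textures[4096];
 *   ...
 *   color = texture(textures[materialIndex], uv);
 */
```

Per-object “texture binding” then degenerates to passing `materialIndex` along with the other per-object data.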

[QUOTE=Gedolo2;31109]From: https://drive.google.com/file/d/0B-MryQk4ewrRMVlVdjRGeWFiZXM/view?pli=1
Page 33

Almost there. Can’t I just keep a copy of the command buffer in the GPU’s memory and have it executed once every frame?[/quote]

Yes. That’s exactly what “Submit the same command buffer many times” means.

@ Alfonse-Reinheart
Thank you for clarifying bindless in the context of Vulkan.
Great to know Vulkan already has something better and will perform well.

Thanks for clarifying the question about submitting the same command buffer many times, too.
“Submitting the same command buffer” seems to say to me that the command buffer gets copied from CPU to GPU memory every time.
You would have to add that you don’t send the command buffer itself every time, or phrase it as something like “execute the same command buffer in GPU memory many times.”

The text seemed to imply descriptors are an addition to bindless, not a replacement.

I’m still a bit worried about the “Submit the same command buffer many times” slide.
What if you are misinterpreting it, Alfonse?

[QUOTE=Gedolo2;31114]Thanks for clarifying the question about submitting the same command buffer many times, too.
“Submitting the same command buffer” seems to say to me that the command buffer gets copied from CPU to GPU memory every time.[/quote]

That’s an implementation detail. If command buffers live in CPU memory and are copied into the GPU’s queue, that’s probably because it’s the fastest way to operate on that hardware. If command buffers live in GPU memory and are DMA’d into the GPU’s queue, that’s probably because it’s the fastest way to operate on that hardware. If command buffers live in GPU memory and are referenced by the GPU’s queue, that’s probably because it’s the fastest way to operate on that hardware.

Don’t second guess the IHV; let them do the tiny job Vulkan still gives them :wink:

So glad you are sticking with a C interface and not C++.

There are other languages out there (Delphi). A C++ interface would leave me out of the Vulkan revolution.

I’m not sure whether this counts as SPIR-V or Vulkan. I’ll put it here, as it relates specifically to graphics stuff.

Khronos is working on a GLSL-to-SPIR-V compiler. However, there are two features of GLSL that are not available in SPIR-V (presumably because Vulkan won’t support them either).

The first is the ability to define uniforms outside of uniform blocks. The second is shader subroutines.

So… what will the Khronos compiler do if it encounters a GLSL construct that SPIR-V has no direct analog for? Shader subroutines could theoretically be emulated. The problem is that emulating shader subroutines would require having “subroutine uniform” values at global scope. So that requires a solution to the naked uniform problem.

So for naked uniforms, would they be aggregated into a uniform block of some kind? In what order? What about the layout of the block?

Or would the compiler just fail, not being able to handle such uniforms?

The reference compiler is here: Reference Compiler

With naked uniform, do you mean this?

precision mediump float;

uniform vec4 color;

void main() {
    gl_FragColor = color;
}

I ran it through as a fragment shader. It produced this:

cheery@ruttunen:~/spirthon$ ./glslangValidator sample.shader.frag -H
sample.shader.frag

Linked fragment stage:
// Module Version 99
// Generated by (magic number): 51a00bb
// Id's are bound by 14

                              Source GLSL 100
               1:             ExtInstImport  "GLSL.std.450"
                              MemoryModel Logical GLSL450
                              EntryPoint Fragment 4
                              Name 4  "main"
                              Name 10  "gl_FragColor"
                              Name 12  "color"
                              Decorate 10(gl_FragColor) PrecisionMedium 
                              Decorate 10(gl_FragColor) Built-In FragColor
                              Decorate 12(color) PrecisionMedium 
               2:             TypeVoid
               3:             TypeFunction 2 
               7:             TypeFloat 32
               8:             TypeVector 7(float) 4
               9:             TypePointer Output 8(fvec4)
10(gl_FragColor):      9(ptr) Variable Output 
              11:             TypePointer UniformConstant 8(fvec4)
       12(color):     11(ptr) Variable UniformConstant 
         4(main):           2 Function NoControl 3
               5:             Label
              13:    8(fvec4) Load 12(color) 
                              Store 10(gl_FragColor) 13 
                              Branch 6
               6:             Label
                              Return
                              FunctionEnd

but the reference compiler doesn’t support subroutines yet.

I tried to follow the presentation and also looked at the slides afterwards. I am still not sure how I am supposed to imagine the command queue working. Currently I understand it this way: there can be multiple command queues that can be kicked off at different times from different threads and get handled by the GPU right away. Each would contain commands to, for example, draw a bunch of objects. Is it correct to assume that they would basically be separate small state machines, as compared to the one big monolithic state machine present in the OpenGL 3.2+ Core Profile? How are things like blend states handled otherwise?

I am really looking forward to the first Hello-World-Coloured-Triangle Demo using Vulkan.

Also, what about the possibility of combining OpenGL and Vulkan rendering inside one application? I am a developer on a small open-source library project (CEGUI), and I wonder if it is possible to render to the framebuffer (of an OpenGL context) with a Vulkan renderer while the rest of the application uses OpenGL. Can these be combined at all, or should the entire application use either OpenGL or Vulkan?

[QUOTE=cheery;31141]The reference compiler is here: Reference Compiler

With naked uniform, do you mean this?

precision mediump float;

uniform vec4 color;

void main() {
    gl_FragColor = color;
}

[/quote]

Yes.

Hmm… Well, that’s what it ought to look like. But the spec is unclear about whether this should be allowed.

I guess we’ll need them to resolve bug #1299 before we can know whether this is intended to be supported.

It’s important to understand the difference between a command queue and a command buffer.

Command buffers are what you put commands into. This is where you send your rendering pass, pipeline state, descriptor sets, and rendering commands. These are intended to be threaded on the CPU (though if you want to build them on the same thread, that’s fine too).

Command queues represent specific GPU FIFO processors for commands. Command buffers just store commands; to execute them, you have to send the command buffer to a command queue.

Generally speaking, executing a command buffer through a particular queue is probably not something that you should do concurrently. While issuing a command buffer may internally lock a mutex, the order in which command buffers are submitted is very important and is something you will want to control.

So after your multiple threads build their various command buffers, one thread will go through and issue them to a particular GPU queue.

That’s probably a solid analogy, but like all analogies, it breaks down if you examine it too closely. It’s good enough for an overview.

They’re part of the pipeline state.

Nothing has been said about any form of interoperability of this kind. Given the very fundamental differences between the two APIs, interop would probably require some kind of hard synchronization point, like with OpenCL/OpenGL interop.

That being said, NVIDIA absolutely loves both their current OpenGL implementation and supporting older code. So if there was to be some kind of non-official interop for Vulkan, it would probably come from them.