Memory Management: One buffer

Hello

In this article, https://developer.nvidia.com/vulkan-memory-management , NVIDIA advises us to use a single buffer for vertices, indices, and uniforms (or other data).

But I don’t know how to use this approach, because I don’t understand how it can make efficient use of the cache.

It would be really inefficient to interleave one vertex, one index, and one uniform value.
So, I could interleave “some of them”.
For example:

U, V1,V2,V3,…VN,I1,I2,I3,…,IN, VN+1, …

Or Buffer1: U, V1,V2,V3,…VN,I1,I2,I3,…,IN,
Buffer2: U, V1,…VM,I1,…,IM (but AFAIK, that is not a good solution, because we want to avoid multiple buffers)

It seems good to me, but I don’t see how it could be “more cache friendly” than having 3 buffers.

Thanks for your clarifications :).

Antoine

Another “interpret the NVIDIA blog article” thread :rolleyes:. Actually, this one was answered on Stack Overflow (but unfortunately deleted recently).

I am not entirely convinced myself. It should be useful when you have lots of (small) buffer objects, which sounds a bit like a different problem with the app architecture (but I don’t feel I have the experience to judge, so just ignore that…).

Now, they do not propose any interleaving. To understand, just TL;DR the article and scroll to the end.

So consider the situation of having, I dunno, 10,000,000 small buffers.

Now, “the :doh:” figure shows the naive implementation. You would allocate memory for each buffer, create a Buffer object for each one, and bind the memory to it. Then you would bind each one to the command buffer, one by one. Obviously that’s not in any way cache friendly.
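Purely as a sketch (this code is not from the article; bufferInfos, buffers, memories, and memoryTypeIndex are hypothetical names), the naive pattern looks like:

[CODE]
/* Naive ("the :doh:") pattern: one VkBuffer + one VkDeviceMemory per mesh. */
for (uint32_t i = 0; i < meshCount; ++i) {
    vkCreateBuffer(device, &bufferInfos[i], NULL, &buffers[i]);

    VkMemoryRequirements reqs;
    vkGetBufferMemoryRequirements(device, buffers[i], &reqs);

    VkMemoryAllocateInfo allocInfo = {
        .sType = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO,
        .allocationSize = reqs.size,
        .memoryTypeIndex = memoryTypeIndex /* assumed chosen elsewhere */
    };
    vkAllocateMemory(device, &allocInfo, NULL, &memories[i]); /* one allocation... */
    vkBindBufferMemory(device, buffers[i], memories[i], 0);   /* ...per buffer */
}
[/CODE]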

See “the bad” figure. Now, to fix the cache friendliness of the data itself (likely in GPU memory), we manage the device memory allocation so that we allocate one big sequential chunk of memory just once, large enough to accommodate the data of all the buffers. That could help cache efficiency on the GPU side.
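A sketch of that idea (hypothetical names again; it also assumes all the buffers can share one memory type, i.e. their memoryTypeBits overlap):

[CODE]
/* "The bad" fix: allocate one big chunk once, bind every buffer at an offset. */
VkDeviceSize offsets[MAX_BUFFERS]; /* hypothetical bookkeeping array */
VkDeviceSize offset = 0;
for (uint32_t i = 0; i < bufferCount; ++i) {
    VkMemoryRequirements reqs;
    vkGetBufferMemoryRequirements(device, buffers[i], &reqs);
    offset = (offset + reqs.alignment - 1) & ~(reqs.alignment - 1); /* align up */
    offsets[i] = offset;
    offset += reqs.size;
}

VkMemoryAllocateInfo allocInfo = {
    .sType = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO,
    .allocationSize = offset, /* total size of the chunk */
    .memoryTypeIndex = memoryTypeIndex
};
VkDeviceMemory chunk;
vkAllocateMemory(device, &allocInfo, NULL, &chunk); /* the single allocation */

for (uint32_t i = 0; i < bufferCount; ++i)
    vkBindBufferMemory(device, buffers[i], chunk, offsets[i]);
[/CODE]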

See “the good” figure. We still have all these VkBuffer objects (assumably on the CPU side, with meta-stuff like the size and usage flags and whatever). It turns out we can just make one “virtual” VkBuffer for all of them. They would have only one set of meta-info (saving CPU side memory space and so helping cache). You would access the “real” buffers when binding to the command buffer by using offset into that virtual buffer. (You still need to bind them one by one though, but you can have all the offsets in an array or struct also possibly helping CPU cache efficiency.)
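Roughly like this when recording the command buffer (the per-mesh offsets are your own hypothetical bookkeeping):

[CODE]
/* "The good": one big VkBuffer; sub-ranges are selected by offset at bind time. */
VkDeviceSize vertexOffset = mesh->vertexOffset; /* your own bookkeeping */
vkCmdBindVertexBuffers(cmd, 0, 1, &bigBuffer, &vertexOffset);
vkCmdBindIndexBuffer(cmd, bigBuffer, mesh->indexOffset, VK_INDEX_TYPE_UINT32);
vkCmdDrawIndexed(cmd, mesh->indexCount, 1, 0, 0, 0);
[/CODE]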

Anyway, that’s my interpretation of the article. Only the author can give an authoritative answer.

Why do you want to put uniform data directly in front of vertex data? Uniforms are usually per-instance data, while vertices are per-object. Your way would make it difficult to render more than one instance of any particular mesh.

Furthermore, UBOs have specific alignment requirements, which are usually more restrictive than those for vertex data.

When NVIDIA says to use one buffer, what they’re saying is that you have your vertex data in one region of the buffer, your index data in a separate region, and your uniform/SSBO data in a third region.
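A sketch of such a layout, assuming the region sizes are known up front; the uniform region must honor minUniformBufferOffsetAlignment from VkPhysicalDeviceLimits, and index buffer offsets must be a multiple of the index type size:

[CODE]
/* Round x up to a power-of-two alignment. */
static VkDeviceSize alignUp(VkDeviceSize x, VkDeviceSize a) {
    return (x + a - 1) & ~(a - 1);
}

/* Three regions inside one buffer: vertex | index | uniform. */
VkDeviceSize vertexRegion  = 0;
VkDeviceSize indexRegion   = alignUp(vertexRegion + vertexBytes, sizeof(uint32_t));
VkDeviceSize uniformRegion = alignUp(indexRegion + indexBytes,
                                     limits.minUniformBufferOffsetAlignment);
VkDeviceSize totalSize     = uniformRegion + uniformBytes;
[/CODE]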

[QUOTE]I don’t see how it could be “more cache friendly” than to have 3 buffers.[/QUOTE]

To be honest, NVIDIA is really reaching with that justification. But the cache they are referring to is the CPU cache, not the GPU’s caches.

What they’re talking about is the CPU-side data that a VkBuffer object represents. If you only have one VkBuffer, then your rendering process will be using it all the time. So it’ll be loaded into the CPU cache early and stay there throughout the rendering process. Whereas if you have one VkBuffer per object, then you’ll have a lot of CPU cache misses, as each VkBuffer’s data lies in a different place.

Personally, I think a more reasonable bit of advice would be to not have the number of VkBuffer objects be based on the number of meshes. That is, you don’t give each mesh its own buffer. You might have a VkBuffer for all vertex/index data, one for uniforms, etc. You might have one VkBuffer for frequently changing vertex data if you’re doing some kind of streaming computation from the CPU. And so forth.

But the number of buffers should not grow with the size of the scene. The amount of memory behind those buffers can, but not the number of buffers themselves.
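In other words, something shaped like this (the names are just illustrative):

[CODE]
/* Buffer count driven by usage pattern, not by scene contents. */
typedef struct RendererBuffers {
    VkBuffer staticGeometry; /* all static vertex + index data */
    VkBuffer uniforms;       /* uniform/SSBO data */
    VkBuffer streaming;      /* frequently CPU-updated vertex data */
} RendererBuffers;           /* grows in bytes, not in object count */
[/CODE]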

Hello,
Thanks for all your answers :).

I didn’t understand that it was about CPU cache and not GPU cache…

So, what you are advising is to have “two big buffers”, like the “bad” situation, which own the vertices and indices for static geometry.
But for geometry that is dynamic on the CPU, it is better to have one big buffer (maybe in “host visible” memory?), like the “good” situation, and bind it several times with different offsets.

Am I correct?

Thank you both :).

Antoine

I would advise getting on with creating your app and only reading such articles when you really, really need that last 2% of performance. :lol: Too bad the article does not list any measurements though…

But some things fall naturally to that technique without even thinking about performance. Having index + vertex + uniform data in a single object called e.g. ball might look nicer than three objects like ball_indices, ball_vertices and whatnot. Or maybe having many small meshes there, like plants, with each mesh being a different kind of plant and having like 100 different kinds.

No. I’m explaining what they’re talking about. I’m telling you what their specific concerns are, and the ways in which they manifest.

I’m telling you that because I want you to think for yourself. Not merely to read something and follow what it says, but to understand why it says what it says and to decide if that matters to you. Think about whether it’s helpful to you and your code structure or performance.

Dynamic geometry needs a different VkBuffer from static geometry because it will almost certainly need a different memory allocation from the one you use for static geometry.

Static geometry will generally be used with DEVICE_LOCAL memory. That’s the fastest memory available, but it’s usually the memory type that’s most difficult and costly to access from the CPU. But static geometry is static, so you DMA to it once and you’re done.

Dynamic geometry needs semi-frequent CPU updating. So it would work best with a memory type that has faster CPU access. Which is often not a type that is DEVICE_LOCAL.

And even that advice is merely a first-pass oversimplification of real-world hardware. Intel hardware has exactly 1 memory type, for example. So on their hardware, static and dynamic geometry could live in the same allocation and thus the same VkBuffer. Some hardware may have fast-enough access to DEVICE_LOCAL memory (despite providing multiple types), such that static and dynamic allocations could indeed come from the same place.
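The portable way to handle this is to query the implementation and pick a memory type per use case; a common sketch (reqs is the VkMemoryRequirements of the resource in question):

[CODE]
/* Pick a memory type allowed by 'typeBits' that has all 'wanted' properties. */
uint32_t findMemoryType(VkPhysicalDevice gpu, uint32_t typeBits,
                        VkMemoryPropertyFlags wanted) {
    VkPhysicalDeviceMemoryProperties props;
    vkGetPhysicalDeviceMemoryProperties(gpu, &props);
    for (uint32_t i = 0; i < props.memoryTypeCount; ++i)
        if ((typeBits & (1u << i)) &&
            (props.memoryTypes[i].propertyFlags & wanted) == wanted)
            return i;
    return UINT32_MAX; /* not found: relax 'wanted' and retry */
}

/* Static geometry: device-local. Dynamic: host-visible for CPU writes. */
uint32_t staticType  = findMemoryType(gpu, reqs.memoryTypeBits,
                                      VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT);
uint32_t dynamicType = findMemoryType(gpu, reqs.memoryTypeBits,
                                      VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT |
                                      VK_MEMORY_PROPERTY_HOST_COHERENT_BIT);
[/CODE]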

The most important thing for you to keep in mind is this:

NVIDIA does not care if your program runs fast on other people’s hardware.

Maybe you don’t care either. Maybe you’re fine with writing NVIDIA-only or NVIDIA-optimized programs. But at the end of the day, you should not treat NVIDIA’s word on optimization as the final word for all implementations.

Getting maximum performance across hardware is never going to be as simple as following a couple of rules posted in some document.

Yes I understand…

I guess I will someday use my Galaxy S7 (which supports Vulkan) to test my renderer.
With that, I will try to achieve better performance on both my NVIDIA card and my smartphone. (Profiling is the best way, I guess :D)
Unfortunately, I don’t have any AMD hardware. Maybe later, when I have a job, I will solve this problem ^^.

Thanks for your advice, Alfonse.

Antoine :).

[QUOTE=Alfonse Reinheart;41199]
Personally, I think a more reasonable bit of advice would be to not have the number of VkBuffer objects be based on the number of meshes. That is, you don’t give each mesh its own buffer. You might have a VkBuffer for all vertex/index data, one for uniforms, etc. You might have one VkBuffer for frequently changing vertex data if you’re doing some kind of streaming computation from the CPU. And so forth.

But the number of buffers should not grow with the size of the scene. The amount of memory behind those buffers can, but not the number of buffers themselves.[/QUOTE]

As the author of the article, I want to confirm that the above was the intention.
Don’t throw tons of API “objects” around (memory, buffers…). The image was not supposed to be taken literally as “only one memory/buffer of each kind”; just be smart about it.

Even though Vulkan is “cheaper” on the CPU side, it is still good to be conservative with API objects where it makes sense, similar to how people use their own malloc/free wrappers.

The advice was focused on CPU costs, which are universal across implementations. Memory objects pretty much map to OS-managed resources, whose costs are vendor independent.

Again, as Alfonse mentioned, don’t take things literally, but keep in mind what makes sense for your use-case.
And sorry for failing to make it clearer that the point was about saving thousands of API objects, not just a few.

When reading the article Summary, with The Good, Bad, and Expletive, should ‘Index’, ‘Vertex’, and ‘Uniform’ be read as ‘Indices’, ‘Vertices’, and ‘Uniforms’ in all three cases (or perhaps none)?

(I hope this is not considered necro-bumping - I still believe this information is valid)

Yea, something like “Index Data”, “Vertex Data”, “Uniform Data”. “Buffer” means one VkBuffer object. “Memory Allocation” means one VkDeviceMemory object.

They want to imply that you not only can pool memory; you can also just create fewer VkBuffer objects and instead use a byte offset into the buffer when doing vkCmdBind*Buffers.

Appreciated. Thank you for the explanation.

Makes sense. I imagine in “most” cases, offsets into one large buffer would be advisable.

Probably does nothing in most cases. But it is free and mostly effortless to do.

Arguably, it might be troublesome on some implementations though, as it requires more usage flags on the VkBuffer; that violates the principle of least requirement in Vulkan. A better organization might be to put all vertex data of everything in one buffer, all index data of everything in another buffer, etc. It depends on your architectural needs.
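For example (sketch): a single do-everything buffer needs the union of all the usage flags, while per-purpose buffers each request only what they need:

[CODE]
/* One buffer for everything: union of all usage flags. */
VkBufferCreateInfo everything = {
    .sType = VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO,
    .size  = totalSize,
    .usage = VK_BUFFER_USAGE_VERTEX_BUFFER_BIT |
             VK_BUFFER_USAGE_INDEX_BUFFER_BIT |
             VK_BUFFER_USAGE_UNIFORM_BUFFER_BIT |
             VK_BUFFER_USAGE_TRANSFER_DST_BIT,
    .sharingMode = VK_SHARING_MODE_EXCLUSIVE
};

/* Per-purpose buffer: only the flags it actually needs. */
VkBufferCreateInfo vertexOnly = {
    .sType = VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO,
    .size  = vertexBytes,
    .usage = VK_BUFFER_USAGE_VERTEX_BUFFER_BIT |
             VK_BUFFER_USAGE_TRANSFER_DST_BIT,
    .sharingMode = VK_SHARING_MODE_EXCLUSIVE
};
[/CODE]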

The general lesson is simply not to create objects unless you really need to. It is usually not a problem if you have fewer than 100 Vulkan objects.
