Memory type practice for an MVP uniform buffer

I found that there are typically three types of memory:

I don’t have a device to test all of them, so I can’t profile to tell which is the best for a uniform buffer.
My guess is that 1 > 2 > 3, since device-local memory is good for the GPU to read, and memory type 2 needs to be flushed manually, which may be slower than memory type 1?

Is my guess right?

Lots of non-integrated GPUs have device-local memory that is not host-visible.

This is the wrong question to ask at the wrong time. The first question to ask is whether that memory type even supports a UBO at all. Because it doesn’t have to. The second question is how frequently you’re going to be changing its data, and therefore how you want to go about doing so (staging vs. mapping, etc.).

Only once you’ve answered those questions can you start to ask whether a particular memory type is going to be good for your use case.

If there were a memory type best for uniform buffers, there would be a VK_MEMORY_TYPE_BEST_FOR_UNIFORM_BUFFERS flag. Everything depends on actual usage, and on developer discretion and needs.

non-DEVICE_LOCAL memory resides on the host. It is basically RAM abstracted away, allocated the way Vulkan likes it. For access on the GPU, the driver has to stream it over the bus. Or you can use it to copy things into actual DEVICE_LOCAL memory, which is recommended for device accesses.

DEVICE_LOCAL | HOST_VISIBLE is recommended by vendors for CPU→GPU data flow. This memory type is optional (it may not exist), and it is also more limited in size.

HOST_CACHED is recommended for GPU→CPU flow. It enables caching on the host, improving host reads.

  1. If a memory type doesn’t support UBOs at all, of course I won’t use it. Let’s say there’s a device with these 3 types, and all of them support UBOs.
  2. The UBO is used for the MVP matrix, so it is updated every frame. Let’s say we don’t use push constants here.

One key piece of information I missed is that the UBO is used for MVP updates every frame. Let’s say we don’t use push constants here.

Hi, let’s say you are writing a renderer and you don’t know what device your code will run on. You need to find all the memory types on that device that support a UBO for MVP updates every frame (again, no push constants), and you need a ranking to choose which one to use from all the supported types.

That’s the real problem I ran into and why I asked. I collect lots of memory types from the device, and I need to decide which one to use (this may just be a guess, because we can’t profile) when all three memory types are supported. If you were me, how would you make this decision? Or would you need more information to make it?

Performance decisions cannot reasonably be made in a vacuum. If performance is going to matter, you’re going to have to make those decisions based on the totality of what you’re rendering, not just small fragments of it.

In a vacuum, if you’re doing streaming work for transformation matrices, I would suggest prioritizing host-visible memory. That is, first check whether there is host-visible and device-local memory usable for a UBO; if not, look for memory that is only host-visible. And if that’s not available, then move on to staging through device-local memory.

However, that is only in a vacuum. See, if you’re doing a lot of streaming of vertex data or skinned mesh matrices, then you’ve got a problem. Namely that many GPUs only have a relatively small amount of memory that is both host-visible and device-local. Vertex data is more likely to be the bottleneck in rendering applications compared to a UBO, so streaming vertex data should be given priority access to the limited streaming memory storage.

But even that is only to the extent that you’re actually likely to run out of this memory. The smallest you see with these streaming memory types is around 250MB. That’s a lot to be generating on the CPU and shoving at the GPU. So it may not be a problem.

But it may actually be a problem, depending on what you’re doing. Which is the point: if there was one right answer, there wouldn’t be a point in letting you choose what works for you.


It is not really just a simple choice of memory. Whichever memory you choose also impacts how you need to treat it afterwards, so it is not as simple as assigning a rank and being done with it. Non-HOST_VISIBLE memory needs copies or command buffer writes. Non-coherent memory needs explicit coherency handling. UMA memory you should treat as zero-copy to get the benefits. Non-UMA DEVICE_LOCAL + HOST_VISIBLE memory needs to be treated as an exhaustible resource.

Likely you need a codepath that works everywhere, so you can use that as a baseline for measuring against. And then do the case-by-case optimizations.


DEVICE_LOCAL means the memory is likely faster to use by the GPU, for example as a uniform buffer.

The lack of HOST_CACHED usually means that the memory is uncached and write-combined, which means you should only write to it sequentially or via memcpy, because random accesses and any reads from it are extremely slow. This is very important.

I would choose 1 or 2. Hard to say which one is faster without profiling on a specific platform.
