Memory import size truncated on Windows

Hi!
I have posted on NVIDIA forums about this, but wanted to ask here as well: is the memory size for vkAllocateMemory forcibly truncated to 32 bits on Windows, and if so, why?

We use Vulkan to process large amounts of data for scientific applications. The data sizes are usually much larger than the available VRAM. We typically work with systems that have 64-128 GB of RAM and 8-12 GB of VRAM, and the data volume to be processed is usually around 32-64 GB. Some algorithms require all of the data to be accessible to the GPU at the same time. Obviously, such access goes over the PCIe bus, which is not very fast, but it is acceptable for our purposes.

To achieve such behavior in CUDA we formerly used pinned memory and everything worked fine. Now we have moved to Vulkan and tried more or less the same approach - allocating aligned memory in RAM with OS functions and then using the VK_KHR_external_memory extension to make it accessible to the GPU via the PCIe bus. To access this memory from the compute shader we use uint64_t addresses (provided by the VK_KHR_buffer_device_address extension), so the buffer range limitations are not a problem for us.
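
For reference, the import path looks roughly like the sketch below. This is a simplified version assuming the host-pointer import route from VK_EXT_external_memory_host (VkImportMemoryHostPointerInfoEXT); memory type selection via vkGetMemoryHostPointerPropertiesEXT and all error handling are omitted.

```cpp
#include <vulkan/vulkan.h>

// Simplified sketch of importing a host allocation; real code queries
// vkGetMemoryHostPointerPropertiesEXT and picks a compatible memory type.
VkDeviceMemory importHostAllocation(VkDevice device, void* hostPtr,
                                    VkDeviceSize size, uint32_t memoryTypeIndex)
{
    VkImportMemoryHostPointerInfoEXT importInfo{};
    importInfo.sType = VK_STRUCTURE_TYPE_IMPORT_MEMORY_HOST_POINTER_INFO_EXT;
    importInfo.handleType =
        VK_EXTERNAL_MEMORY_HANDLE_TYPE_HOST_ALLOCATION_BIT_EXT;
    importInfo.pHostPointer = hostPtr;  // aligned to minImportedHostPointerAlignment

    VkMemoryAllocateFlagsInfo flagsInfo{};
    flagsInfo.sType = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_FLAGS_INFO;
    flagsInfo.pNext = &importInfo;
    flagsInfo.flags = VK_MEMORY_ALLOCATE_DEVICE_ADDRESS_BIT;  // for buffer device address

    VkMemoryAllocateInfo allocInfo{};
    allocInfo.sType = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO;
    allocInfo.pNext = &flagsInfo;
    allocInfo.allocationSize = size;  // VkDeviceSize (uint64_t), e.g. 7 GiB
    allocInfo.memoryTypeIndex = memoryTypeIndex;

    VkDeviceMemory memory = VK_NULL_HANDLE;
    vkAllocateMemory(device, &allocInfo, nullptr, &memory);
    return memory;
}
```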

On Linux this approach actually works fine and we can access any amount of memory up to the total amount of RAM available, although the validation layers give the following message:

vkAllocateMemory(): pAllocateInfo->allocationSize (7516192768) is larger than maxMemoryAllocationSize (4292870144). While this might work locally on your machine, there are many external factors each platform has that is used to determine this limit. You should receive VK_ERROR_OUT_OF_DEVICE_MEMORY from this call, but even if you do not, it is highly advised from all hardware vendors to not ignore this limit.

It is our opinion that, logically, the allocation size limit should not apply in this case (that is, the validation layers should not report anything), since we have already allocated the memory and merely want to map it for access from the GPU.

On Windows we observed rather strange behavior. A maximum of 4 GB of memory can be imported, and the allocation size seems to be truncated to 32 bits. For example, if 7 GB of memory is allocated and then imported, the vkAllocateMemory call that performs the import returns no error, but only 3 GB can then be accessed from the GPU. If 5 GB of memory is allocated, only 1 GB can be accessed, and so on. That makes us think that only the lower 32 bits of the requested size are used as the import size and the upper 32 bits are ignored, despite allocationSize being a 64-bit (uint64_t) value. In our opinion this is unrelated to maxMemoryAllocationSize (although maxMemoryAllocationSize is also 4 GB in our case). We are aware that on Windows only 50% of RAM can be mapped to the GPU, and the amount of memory we're trying to allocate in our tests does not exceed that limit, so the 50% limitation is not related to our problem either.
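
To illustrate our suspicion (this is only our hypothesis about what the driver might be doing internally, not anything we can actually see in its code):

```cpp
#include <cstdint>

// Hypothetical 32-bit truncation of the requested size:
uint64_t requested = 7ull << 30;                        // 7 GiB = 0x1C0000000
uint32_t truncated = static_cast<uint32_t>(requested);  // 0xC0000000 = 3 GiB
// Similarly, 5 GiB (0x140000000) truncates to 0x40000000 = 1 GiB,
// which matches exactly what we observe.
```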

On Linux (Ubuntu 22.04) we use the NVIDIA 570 proprietary drivers. On Windows we use the latest Game drivers installed by NVIDIA Center (as of 29.04.2025, version 576.02). Do you know the reason for this behavior? Can anything be done about it?

That limitation has no exceptions for imported memory. It’s effectively saying that the device is not capable of dealing with a single, contiguous, GPU-accessible region of memory beyond this size.

If you violate that limit, then you’re in undefined behavior land.

If the implementation doesn’t allow you to make a single contiguous, GPU-accessible allocation of that size, then you can’t do it. You’ll have to use multiple import allocations that are smaller and rearrange your code accordingly.
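
Something along these lines, for example (a rough sketch: importHostRange stands in for whatever vkAllocateMemory-based import you are already doing, and each chunk still has to respect the host-pointer alignment requirements):

```cpp
#include <vulkan/vulkan.h>
#include <algorithm>
#include <vector>

// Hypothetical helper: whatever vkAllocateMemory-based import you already use.
VkDeviceMemory importHostRange(VkDevice device, void* hostPtr,
                               VkDeviceSize size, uint32_t memoryTypeIndex);

// Split one large pinned host region into several imports, each no larger than
// maxMemoryAllocationSize. Chunk boundaries must still satisfy
// minImportedHostPointerAlignment.
std::vector<VkDeviceMemory> importInChunks(VkDevice device, char* hostBase,
                                           VkDeviceSize totalSize,
                                           VkDeviceSize maxAllocSize,
                                           uint32_t memoryTypeIndex)
{
    std::vector<VkDeviceMemory> chunks;
    for (VkDeviceSize offset = 0; offset < totalSize; )
    {
        VkDeviceSize chunkSize = std::min(maxAllocSize, totalSize - offset);
        chunks.push_back(importHostRange(device, hostBase + offset,
                                         chunkSize, memoryTypeIndex));
        offset += chunkSize;
    }
    return chunks;
}
```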

I’m also interested in this question. The limitation seems to be a synthetic one. Maybe it would be worthwhile to get rid of it?

From the topic it follows that this technically works on the same hardware on Linux (it is only the validation layer that is unhappy). From my personal experience I know that such an approach works really well in CUDA, both on Linux and Windows. Also, the 4 GB limitation on buffer capacity is a legacy of the OpenGL/GLSL world with its 32-bit indexing, whereas here we have modern 64-bit memory addressing. So the limitation has no technical background underneath…

What do you mean by “synthetic”? The limitation exists because some implementations and/or platforms won’t let you do it. Deciding that the limitation is “synthetic” won’t suddenly allow those implementations/platforms to let you do it.

For example, the limitation on the number of memory allocations exists mostly because Windows will not let an application make more than 4k concurrent GPU allocations. The platform has decided that making that many allocations represents pathological behavior and the platform won’t let you do it. But other platforms will allow for more allocations.
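
That particular limit is at least reported directly in the core device limits; a minimal query looks something like this:

```cpp
#include <vulkan/vulkan.h>
#include <cstdio>

// Print the per-process allocation-count limit (commonly 4096 on Windows).
void printAllocationLimit(VkPhysicalDevice physicalDevice)
{
    VkPhysicalDeviceProperties props{};
    vkGetPhysicalDeviceProperties(physicalDevice, &props);
    std::printf("maxMemoryAllocationCount = %u\n",
                props.limits.maxMemoryAllocationCount);
}
```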

The Vulkan standard, and even the implementations themselves, have no power to change what platforms will and will not permit. The most they can do is expose those limitations to you where possible.

So whether the limitation is “synthetic” by some definition is irrelevant; what matters is whether the limit exists. And clearly, it does.

That’s not to say that this limitation necessarily comes from Windows. I don’t know where it comes from. But it does in fact exist and therefore must be respected.

By “synthetic” limitations, I mean limitations that have no hardware/physical/logical background. They exist only because of legacy software solutions, bugs, developer laziness, etc. I really doubt that it is a Windows limitation. I even doubt that it is a Vulkan SDK limitation. I will try to explain.

I have run tests on my system, which has an Intel UHD 630 integrated card and an NVIDIA RTX 2070 dedicated card. Let’s see how the dedicated NVIDIA card behaves under Linux and Windows using CUDA and Vulkan. I use the NVIDIA proprietary 570 driver on Linux and the latest NVIDIA Game driver on Windows.

  1. Linux & CUDA. You can map any amount of RAM to the GPU, even 10 TB :slight_smile: Physical memory is actually allocated only at the moment a memory page is accessed from the GPU/CPU side. Very convenient.

  2. Linux & Vulkan. There is a maxMemoryAllocationSize of about 4 GB, but in reality you can easily map all of the available RAM (though not more) to the GPU and work with it using uint64_t addressing in the shader. Only the validation layer complains.

  3. Windows & CUDA. You can map at most 0.5 * (available RAM). I had a long discussion with Microsoft and NVIDIA about this issue. The result: Microsoft will never give NVIDIA access to certain kernel functions, so this is purely a Microsoft limitation, and NVIDIA can do nothing about it.

  4. Now we get to our topic. From the examples above, it’s clear that the dedicated NVIDIA card hardware is able to work with any amount of RAM mapped to the GPU. So it is time to understand who is responsible for the 4 GB limitation - the Vulkan SDK or the NVIDIA driver. I have dumped maxMemoryAllocationSize for both the integrated and the dedicated card (a sketch of how I queried it follows below). Here are the results:

  • NVIDIA GeForce RTX 2070 with Max-Q Design: 3.99805 GB
  • Intel(R) UHD Graphics 630: 23.3927 GB

So the integrated card can allocate much more than 4 GB, meaning that the Vulkan SDK does not restrict it. The 4 GB limitation therefore comes solely from the NVIDIA driver. Am I right here?
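
For reference, this is roughly how I dumped the values above (the sketch mentioned earlier), via VkPhysicalDeviceMaintenance3Properties; instance and device enumeration is omitted:

```cpp
#include <vulkan/vulkan.h>
#include <cstdio>

// Print maxMemoryAllocationSize for an already-selected physical device.
void printMaxAllocationSize(VkPhysicalDevice physicalDevice)
{
    VkPhysicalDeviceMaintenance3Properties maint3{};
    maint3.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_MAINTENANCE_3_PROPERTIES;

    VkPhysicalDeviceProperties2 props2{};
    props2.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_PROPERTIES_2;
    props2.pNext = &maint3;

    vkGetPhysicalDeviceProperties2(physicalDevice, &props2);
    std::printf("%s: %.5f GB\n", props2.properties.deviceName,
                maint3.maxMemoryAllocationSize / double(1ull << 30));
}
```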

From this kind of behavior (the size wraps at 4 GB) it seems that somewhere in the driver the buffer size variable is simply declared as uint32_t rather than uint64_t, and it was then easier for them to set maxMemoryAllocationSize to 4 GB than to fix the code. Keep in mind that in the Linux proprietary driver they managed to get this right.

I see that @CoffeeExterminator has already reported this limitation (I would even call it a bug) to NVIDIA. If I’m right, I will also join that discussion. If Khronos has any influence on NVIDIA, it would be perfect to push them.

P.S. My team and I have been trying to move from CUDA to Vulkan for more than two years already. We have succeeded in a lot, but it is a pity to see quite a few synthetic limitations in Vulkan compared to CUDA on the same hardware. I could even write an article about that. :slight_smile: Probably I will do so later.

Oh, okay. I thought it was a limit on the allocation itself, not on dealing with the memory in general. Thanks for the explanation.

Our tests match what you have described, and I agree with your conclusions: it seems like an NVIDIA driver issue (both the limiting of maxMemoryAllocationSize to 4 GB and the truncation of the allocation size). I haven’t received any reply on the NVIDIA forums so far (unfortunately I can’t include a link here), so I guess now we wait and see what they have to say.

And yet, from the fact that the driver explicitly said “I can’t handle allocations larger than 4 GB”, it’s clear that this wasn’t an accident. It’s not a typo, and it’s not the result of using the wrong type internally. Some programmer at NVIDIA (likely several) explicitly made the decision to impose this limitation.

The only people who know why it’s there are at NVIDIA. But it’s clearly not an accident or a bug; it’s deliberate design. Maybe not good design, but it was a choice.

Hello again!
I’m not sure if it is a different problem (and therefore should be its own topic) or something related.

I decided to do some more research and test memory allocation and import on my integrated GPU, which has a maxMemoryAllocationSize of 12771291136 bytes (so, a little less than 12 GB). I got the following results:

On Windows, only up to a little less than 4 GB of memory can be allocated or imported; allocating 4 GB or more results in an out-of-memory error.

On Linux, any amount of memory up to maxMemoryAllocationSize can be allocated and accessed normally. However, if I allocate memory with an OS function and then import it, I get the size truncation problem again: if I import 5 GB of memory, only 1 GB can be accessed from the shader, and so on.

I’ve run all tests on Intel(R) Iris(R) Xe Graphics card. On Windows the driver version is 30.0.101.1960. On Linux I have the standard Intel drivers for Ubuntu 22.04 (not sure if they have their own versioning).

I’m really at a loss; any ideas as to what’s going on here? We do not normally work with integrated cards, so I could be missing something.