VBOs have 2x the memory footprint of DLs

So I started looking into using VBOs and discovered that they have twice the memory footprint of equivalent display lists. When I say “memory footprint”, I’m talking about an explicit reduction in the total available virtual address space for the process. My test was very simple, but realistic nonetheless:

  1. Observe virtual address usage ~40MB
  2. glGenLists(32768)
  3. fill each list with 1024 3D GL_POINTS
  4. Observe virtual address usage ~450MB
  5. glGenBuffers(32768, aVBO)
  6. bind and fill each buffer with 1024 x 3 GLfloats
  7. Observe virtual address usage ~1.2GB

The DL usage makes sense. The VBO usage is ~2x what it should be. Has anyone else seen this problem?
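
As a rough sanity check on the numbers above, here is a minimal sketch of the expected payload size. It assumes a GLfloat is 4 bytes and ignores any per-object overhead; the function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes of raw vertex data for `count` objects, each holding
   `points_per_object` 3D points stored as 4-byte floats (GLfloat). */
uint64_t payload_bytes(uint64_t count, uint64_t points_per_object)
{
    const uint64_t bytes_per_point = 3u * 4u; /* x, y, z */
    assert(count > 0 && points_per_object > 0);
    return count * points_per_object * bytes_per_point;
}

/* Convert a byte count to whole mebibytes, rounding down. */
uint64_t to_mib(uint64_t bytes)
{
    return bytes / (1024u * 1024u);
}
```

`payload_bytes(32768, 1024)` comes to 402,653,184 bytes, or 384 MiB. The display list step added roughly 410MB (about one copy of the data), while the VBO step added roughly 750MB (about two copies).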

Quadro FX 3450 256MB, current driver, Windows Vista.

Did you actually delete the display lists?

Yes, well, if you have 32768 buffers I can understand why this is. However, VBOs are not meant to be used this way. It could be that VBOs carry more per-buffer overhead at low polycounts, but that rarely matters because you usually put several objects in a single VBO.
Secondly, I seem to remember that the driver keeps a backup copy in RAM of everything you put in VRAM (textures and VBOs), so it could be, as Korval said, that you just didn’t delete something.

I think this is what happens:

  1. myVertices = new float[3 * vertexcount]; //memory from address A to B is being allocated by application
  2. send vertex data to driver //memory from address B to C is being allocated by driver
  3. delete[] myVertices; //memory from address A to B is released

Outcome: upper bound of virtual address space equals C, but only second half of that address space is used now.

It doesn’t matter. If I skip creating the DLs, the VBOs still take up twice as much memory.

There is no allocation of vertices in my test application. I’m simply copying a single static buffer into each VBO. The increase in memory is directly a result of that copy.

Is the usage parameter set to static?

My application has thousands of complex components, and yes, I am indeed looking at per-VBO overhead vs per-DL overhead because that was mentioned in another thread. Regardless, I would not expect a single VBO to have 12K of overhead (i.e., 1024x3xsizeof(GLfloat)).

While there is a user space backup of both VBO data and DL data, I’m seeing what would amount to two copies of the VBO data.

Did you try to swap steps (2,3) with (5,6) and then measure again?

Right, I’ve just tested this myself by switching my renderer to dlist mode. I can confirm that the VBOs are now taking double the memory of dlists. This didn’t use to be the case: I used to get less memory usage with static VBOs.
ForceWare version: 162.65
I’m not happy. This is going to kill us if it’s not fixed pronto.
(Edit) BTW, the scene consists of 287 batches, a total of 50,000 triangles spread over a single 2MB static_draw VBO and a single 4MB static_draw IBO.

I started looking at larger buffer sizes and decided to skip display lists and focus only on VBOs. After running through several iterations of various buffer sizes, I found that if you create VBOs that are less than 65536 bytes in size (i.e., 65535 bytes or smaller), your process will take a 2x hit in committed memory relative to the actual size of the data. If your buffers are greater than or equal to 65536 bytes, the committed memory matches expectations. For vertices made of three floats (12 bytes each), the breakpoint falls at 5461 vertices (i.e., if your buffers have 5462 vertices or more, you won’t get penalized; anything less and you are). For example:

Given:

 
#define MAX_BUFFERS	4096

char p[65536] = {0};

This block of code:


GLuint ab[MAX_BUFFERS];   // buffer object names
glGenBuffers( MAX_BUFFERS, ab );
for (int i=0;i<MAX_BUFFERS;i++)
{
   glBindBuffer( GL_ARRAY_BUFFER_ARB, ab[i] );
   // one byte under 64K
   glBufferData( GL_ARRAY_BUFFER_ARB, 65535, p, GL_STATIC_DRAW_ARB );
}

results in twice as much committed memory as:

 
GLuint bb[MAX_BUFFERS];   // buffer object names
glGenBuffers( MAX_BUFFERS, bb );
for (int i=0;i<MAX_BUFFERS;i++)
{
   glBindBuffer( GL_ARRAY_BUFFER_ARB, bb[i] );
   // exactly 64K
   glBufferData( GL_ARRAY_BUFFER_ARB, 65536, p, GL_STATIC_DRAW_ARB );
}

The only difference is the size of the buffer being created (65535 vs 65536). I measured the memory consumption in line with the execution of the code and it is precise and repeatable. Also note that it’s not that each VBO is padded to 64K, because creating the same number of VBOs with a smaller size results in approximately twice as much committed memory as the composite size of the buffers. FWIW, the per-VBO overhead appears to be approximately 700 bytes.
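
To make the arithmetic concrete, here is a small sketch of the breakpoint and of the doubling I measured. It is only a model of the observation, not of what the driver does internally; it assumes 4-byte floats, ignores the roughly 700 bytes of per-VBO overhead, and the function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define SMALL_VBO_THRESHOLD 65536u /* bytes; breakpoint seen in the test */

/* Size in bytes of a buffer holding `vertices` vertices of three floats. */
size_t vbo_bytes(size_t vertices)
{
    return vertices * 3u * sizeof(float);
}

/* Smallest vertex count whose buffer reaches the threshold (ceiling division). */
size_t min_unpenalized_vertices(void)
{
    size_t per_vertex = 3u * sizeof(float);
    return (SMALL_VBO_THRESHOLD + per_vertex - 1u) / per_vertex;
}

/* Empirical model: committed memory is 2x the data below the threshold
   and 1x at or above it. */
size_t modeled_committed_bytes(size_t buffer_count, size_t buffer_bytes)
{
    size_t factor = (buffer_bytes < SMALL_VBO_THRESHOLD) ? 2u : 1u;
    assert(buffer_count > 0);
    return buffer_count * buffer_bytes * factor;
}
```

With 4096 buffers, the model gives about 512MB for 65535-byte buffers and 256MB for 65536-byte buffers, which is the drop described above.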

I haven’t tested this on ATI systems yet.

Thanks for confirming this.

There is a definite jump when VBOs are less than 64K and I’m also seeing a rise in memory consumption when the VBOs are greater than 512K and less than 3MB. Some of this can be expected over time but not for a clean initialization. I haven’t run any numbers to find the exact breakpoints or look for any subsequent anomalies but your 2MB buffer would probably be affected.

This reminds me of the performance gap between 16-bit addressable and 32-bit addressable VBOs on NVIDIA cards, but I guess I’m just stating the obvious here :)

N.

Maybe this helps…

glBufferDataARB()

This function is an abstraction layer between the memory and the application. But behind each buffer object is a complex memory management system. Basically, the function does the following:

-Checks whether the size or usage type of the data store has changed.

-If the size is zero, frees the memory attached to this buffer object.

-If the size and storage type didn’t change and if this buffer isn’t used by the GPU, we’ll use it. Everything is already set up for use.

-On the other hand, if the GPU is using it or is about to use it, or if the storage type changed, we’ll have to request another chunk of memory for this buffer to be ready.

-If the data pointer isn’t NULL, we’ll copy the data into this new memory area.

We can see that the memory we had before a second call to BufferDataARB isn’t necessarily the same exact memory we had afterward. However, it’s still the same from the application’s point of view (same buffer object). But on the driver’s side, we’re optimizing and allowing the application to not wait for the GPU.

Internally, we’ve allocated a large pool of memory that we suballocate from. When we call BufferDataARB, we reserve a chunk of it for the current buffer object. Then we fill it with data and draw with it, and we mark that memory as being used (similar to the glFence function).

If we call BufferDataARB again before the GPU is done, we can simply assign the buffer object a new chunk of the large pool. This is possible because BufferDataARB says we’re going to re-specify all the data in the buffer (as opposed to BufferSubDataARB).
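
The steps above can be sketched as a toy model. This is not the driver's real allocator: the `Pool` and `Buffer` types, the fixed chunk count, and every function name here are invented for illustration, and the copy of the data pointer is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define POOL_CHUNKS 8

/* Hypothetical driver-side pool of fixed chunks. */
typedef struct {
    bool allocated[POOL_CHUNKS]; /* chunk is reserved */
    bool gpu_busy[POOL_CHUNKS];  /* pending GPU work still reads the chunk */
    bool orphaned[POOL_CHUNKS];  /* no buffer owns it; free when GPU is done */
} Pool;

typedef struct {
    int chunk;   /* pool index, or -1 when the buffer has no storage */
    size_t size;
    int usage;
} Buffer;

static int pool_take(Pool *p)
{
    for (int i = 0; i < POOL_CHUNKS; i++)
        if (!p->allocated[i]) { p->allocated[i] = true; return i; }
    return -1; /* pool exhausted */
}

static void pool_release(Pool *p, int c)
{
    if (c < 0) return;
    if (p->gpu_busy[c]) p->orphaned[c] = true; /* defer until GPU finishes */
    else p->allocated[c] = false;
}

/* Called when the GPU has finished reading chunk c. */
void gpu_done(Pool *p, int c)
{
    p->gpu_busy[c] = false;
    if (p->orphaned[c]) { p->orphaned[c] = false; p->allocated[c] = false; }
}

/* Toy version of the BufferDataARB steps; returns the chunk now backing b. */
int buffer_data(Pool *p, Buffer *b, size_t size, int usage)
{
    if (size == 0) {                      /* size zero: drop the storage */
        pool_release(p, b->chunk);
        b->chunk = -1;
        b->size = 0;
        return -1;
    }
    bool same = b->chunk >= 0 && b->size == size && b->usage == usage;
    if (same && !p->gpu_busy[b->chunk])
        return b->chunk;                  /* idle and unchanged: reuse */
    int fresh = pool_take(p);             /* busy or changed: new chunk */
    if (fresh < 0) return -1;
    pool_release(p, b->chunk);
    b->chunk = fresh;
    b->size = size;
    b->usage = usage;
    return fresh;
}

/* Number of chunks currently reserved in the pool. */
int chunks_in_use(const Pool *p)
{
    int n = 0;
    for (int i = 0; i < POOL_CHUNKS; i++)
        if (p->allocated[i]) n++;
    return n;
}
```

In this model, re-specifying a buffer while the GPU is still reading it leaves two chunks reserved until `gpu_done` is called, so the footprint of that buffer is temporarily doubled.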

N.

I don’t think that’s the problem, as I allocate using BufferData with a null pointer, then upload the static data using BufferSubData.

Could this simply be attributed to the fact that NT has 64KB granularity (not “page”-sized) for many things? Try VirtualAlloc, for example, and you’ll likely get a 64KB-aligned pointer back.

I suspect the same behaviour applies to the HAL if you try to allocate physical memory, or map a section: you get 64KB alignment.

To verify this, it could be interesting to actually map all of those (relatively small) VBOs at the same time and check their base addresses. Perhaps it’s as simple as granularity overhead, much like a file of e.g. 1000 bytes on a 4KB-cluster filesystem occupies 4KB?
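
A minimal sketch of the rounding this hypothesis implies, with the granularity passed in as a parameter (64KB is the value suggested above for NT; the function names are made up for illustration):

```c
#include <assert.h>
#include <stddef.h>

/* Round `size` up to the next multiple of `granularity`. */
size_t round_up(size_t size, size_t granularity)
{
    assert(granularity > 0);
    return (size + granularity - 1) / granularity * granularity;
}

/* Bytes lost to padding when `size` is rounded up to `granularity`. */
size_t padding_overhead(size_t size, size_t granularity)
{
    return round_up(size, granularity) - size;
}
```

A 1000-byte file on 4096-byte clusters rounds up to 4096 bytes. If every VBO were rounded to 64KB on its own, a 12,288-byte buffer would occupy 65,536 bytes, which is more than 5x its size rather than 2x, and a 65535-byte buffer would cost the same as a 65536-byte one.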

Then why doesn’t it affect the dlist memory footprint?

The individual data buffers are allocated out of a pool of much larger buffers (in this case a pool of what appear to be 4MB buffers), so the allocation granularity only applies to these larger buffers. The addresses returned by glMapBuffer(GL_ARRAY_BUFFER_ARB,GL_READ_ONLY_ARB) are aligned on 16-byte boundaries (which makes sense) and are not spaced out at twice the size of the buffer. There is padding between the mapped addresses, but it tends to run in the neighborhood of 304 to 320 bytes. Regardless, you shouldn’t see a DECREASE in committed memory when you go from 65535 to 65536 no matter what the alignment.

I think the memory allocator is seriously flawed for smaller size buffers.

“Regardless, you shouldn’t see a DECREASE in committed memory when you go from 65535 to 65536 no matter what the alignment.”

Actually, that may not be correct.

Imagine they use different allocators (quite likely) and use (the equivalent of) VirtualAlloc for 64KB+, but some other algorithm for anything smaller. Now imagine they assume the majority of allocations are 64KB+. (We all know what assumption is the mother of.)

I’m not saying this is the reason, but it is a plausible theory.