Map range flags / Buffer usage - Nvidia drivers

Elurahu · September 21, 2011, 10:02am

I’m experiencing some rather unusual behavior which I hope someone here can help me out with.

The setup:

A VBO containing some vertex positions and color.
A VAO which is used for rendering the VBO.

Creating buffer:

glBindBuffer to GL_ARRAY_BUFFER
An ordinary glBufferData with no data supplied.

Filling buffer data:

glBindBuffer to GL_ARRAY_BUFFER
Full range mapped using glMapBufferRange (used for all mapping) with access flag GL_MAP_WRITE_BIT.
glUnmapBuffer

Now if I just render using this setup, everything runs at about 0.001ms, which is just fine. That is what I would expect.

If, however, I map the buffer again, update the data in it exactly as before, and then render again, I get a dramatically higher frame time: around 0.003ms, and it stays like that from then on. If I try to update the data each frame (by mapping / unmapping) it runs at about 0.01ms, which is understandable.

For the love of god, I cannot figure out why the OpenGL drivers are behaving like that. I’ve tried a number of things, including mapping with the MF_INVALIDATE_RANGE_BIT / MF_INVALIDATE_BUFFER_BIT flags, which have yet to have any effect on ANYTHING. I also tried orphaning the data buffer by calling glBufferData before mapping, but nothing changes the performance.

What I find really weird is that the data is uploaded the first time in exactly the same way as in the following updates.

Does anyone have any clue / hint / debugging step I could try out?

On another note: has anyone ever seen the buffer usage hints have any effect with Nvidia drivers? I have yet to see any performance increase / decrease from using the various flags.
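
For readers following along, here is a minimal sketch of how the access flags for such a write-only update might be chosen. It assumes the MF_ names above are wrappers around the standard GL_MAP_INVALIDATE_RANGE_BIT / GL_MAP_INVALIDATE_BUFFER_BIT flags; write_map_flags is a hypothetical helper, not part of OpenGL, and the constants are repeated only so the snippet compiles without a GL header.

```c
#include <stddef.h>

/* Values as defined by OpenGL 3.0 / ARB_map_buffer_range. They are repeated
   here only so this sketch compiles without including a GL header. */
#ifndef GL_MAP_WRITE_BIT
#define GL_MAP_WRITE_BIT             0x0002
#define GL_MAP_INVALIDATE_RANGE_BIT  0x0004
#define GL_MAP_INVALIDATE_BUFFER_BIT 0x0008
#endif

/* Hypothetical helper: choose glMapBufferRange access flags for a
   write-only update of [offset, offset + length) in a buffer of
   buffer_size bytes. */
static unsigned int write_map_flags(size_t offset, size_t length,
                                    size_t buffer_size)
{
    unsigned int flags = GL_MAP_WRITE_BIT;

    if (offset == 0 && length == buffer_size) {
        /* The whole buffer is rewritten, so its old contents can be dropped. */
        flags |= GL_MAP_INVALIDATE_BUFFER_BIT;
    } else {
        /* Only the mapped range is rewritten. */
        flags |= GL_MAP_INVALIDATE_RANGE_BIT;
    }
    return flags;
}

/* Intended use (needs a current GL context and a bound GL_ARRAY_BUFFER):
 *
 *   void *p = glMapBufferRange(GL_ARRAY_BUFFER, offset, length,
 *                              write_map_flags(offset, length, size));
 *   ...write the vertex data through p...
 *   glUnmapBuffer(GL_ARRAY_BUFFER);
 */
```

Whether the driver acts on these flags is up to the implementation, which matches the observation above that they made no measurable difference.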

Aleksandar · September 21, 2011, 11:42am

Did you try to use glBufferSubData() instead of that Map/Unmap stuff?

The buffer usage hint is only a hint. I haven’t seen any differences by varying it.

Elurahu · September 21, 2011, 1:47pm

Very, very weird. I have now tried just a plain glBufferSubData and it runs as I would expect. Again, I can understand why it would run slower WHILE mapped (as the buffer is in use), but I have no idea why a buffer which has been mapped before runs slower afterwards.

Thanks anyhow, Aleksandar.

AlexN · September 26, 2011, 5:07pm

The NVIDIA driver most likely sees that you are mapping the buffer more than once and moves it to a different type of memory, which has lower performance for drawing, but supposedly faster performance for updating. I’d say the usage hint is being completely ignored, and the driver is instead looking at your usage over time to determine its own usage hint.

The first time you map the buffer, you don’t trigger this, because many applications map a buffer once to fill it and never touch it again. Applications that map a buffer more than once trigger the more “dynamic” usage hint. I’ve seen this happen and it is pretty annoying - after mapping a buffer a few hundred times, it gets relocated to a different type of memory and never goes back. Sorry that off the top of my head I don’t remember anything more specific that you can do to avoid it.
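
To make the idea concrete, here is a toy model of that kind of heuristic. It is purely illustrative: the real driver logic is not documented, and every name here (toy_buffer_t, relocate_after, the placement enum) is made up for this sketch. The threshold is a parameter because nothing above pins down an exact number.

```c
/* Toy model only: this is NOT how any real driver is known to work. */
typedef enum { PLACEMENT_STATIC, PLACEMENT_DYNAMIC } placement_t;

typedef struct {
    unsigned    map_count;       /* how many times the buffer has been mapped */
    unsigned    relocate_after;  /* hypothetical threshold, in number of maps */
    placement_t placement;       /* where the toy "driver" keeps the buffer   */
} toy_buffer_t;

/* Create a buffer that starts out in the "static" placement. */
static toy_buffer_t toy_buffer_create(unsigned relocate_after)
{
    toy_buffer_t b = { 0u, relocate_after, PLACEMENT_STATIC };
    return b;
}

/* Record one map call. Once the map count exceeds the threshold, the buffer
   moves to the "dynamic" placement and never moves back. */
static placement_t toy_buffer_map(toy_buffer_t *b)
{
    b->map_count++;
    if (b->map_count > b->relocate_after) {
        b->placement = PLACEMENT_DYNAMIC;
    }
    return b->placement;
}
```

With relocate_after set to 1, the first map leaves the buffer where it is and the second one moves it for good, which is the pattern reported at the top of the thread.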

mhagain · September 26, 2011, 5:23pm

My honest assessment is that for truly dynamic data you’re better off just letting the driver stream it via regular old-school vertex arrays. Yes, that sucks if you want to create a 3.x+ core context, but I’ve never seen any performance improvement for this kind of use case from using a VBO. Often things get worse. The VBO API is too abstracted, the use flags are too confusing and don’t behave as you would expect, and drivers have too much freedom to just ignore what you’re trying to do and make a guess themselves. With old school vertex arrays the driver at least has a better chance of getting reasonably close to doing the right thing.

Seems I’m not the only one either: http://www.stevestreeting.com/2007/03/16/glmapbuffer-how-i-mock-thee/

Dark_Photon · September 26, 2011, 6:47pm

Since where a buffer lives is unfortunately not a spec concept AFAIK and the hints are ignored (more annoying “buffer Ouija board” vagaries), there can be no official way to do this. However…

I don’t know if it’s still the case, but NV mentioned in reply to a bindless issue report a while back that (as an implementation-specific detail) if you query the GPU address of the buffer or make it resident via bindless, that will lock it in vidmem or sysmem (whichever it happens to be in at the time).

So this for instance should lock it in vidmem or sysmem, depending on whether you define LOCK_VBO_IN_SYSMEM:


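  GLuint      handle;
  GLuint64EXT gpu_addr;

  // Create the buffer and allocate its storage without supplying data.
  glGenBuffers            ( 1, &handle );
  glNamedBufferDataEXT    ( handle, size, 0, GL_DYNAMIC_DRAW );
#ifdef LOCK_VBO_IN_SYSMEM
  // Map and unmap once first, presumably so the buffer sits in sysmem when it gets locked.
  glMapNamedBufferRangeEXT( handle, 0, size, GL_MAP_WRITE_BIT );
  glUnmapNamedBufferEXT   ( handle );
#endif
  // Querying the GPU address locks the buffer wherever it currently lives (implementation detail).
  glGetNamedBufferParameterui64vNV( handle, GL_BUFFER_GPU_ADDRESS_NV, &gpu_addr );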

Of course, you don’t need to use the DSA APIs like I did. Instead if you prefer, bind at-will and use the normal bind point APIs instead (glBindBuffer, glMapBufferRange, etc.).

Warning: if you lock it in vidmem, then it is (intuitively) more expensive to update from the CPU. So if you do this at all, only do it if you know that reuse is going to be much more common than update (or perhaps you are updating it on the GPU). I used this with Streaming VBOs with good success: with bindless, I redispatched batches that were already uploaded to the GPU buffer locked in vidmem (the 99% case).

Dark_Photon · September 26, 2011, 6:57pm

Agreed.

It wasn’t until NV bindless came out that using VBOs over vertex arrays made any performance sense for us at all. It wasn’t even close. With bindless VBOs, you can get virtually display list performance, which is very sweet.

Still waiting for the ARB to promote this, or some form of it that captures the same speed-up, to EXT/ARB form. Sometimes you just need lots of varied data, and though the streaming VBO approach helps some, without bindless, vertex arrays are the way to go.

Re “streaming VBO approach helps some”, I think part of it is that binding is so darn expensive. Seriously, if you’re dead set on using VBOs without bindless, and performance is important, limit yourself to one VBO.
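
As a rough sketch of what streaming through a single VBO can look like, here is the offset bookkeeping on its own. It assumes that "streaming VBO" refers to the common pattern of appending batches to one buffer and orphaning it when it fills up; stream_vbo_t and stream_vbo_reserve are hypothetical names, and the GL calls are left as comments so the snippet stays self-contained.

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    size_t capacity;  /* total size of the single VBO, in bytes */
    size_t head;      /* first byte that has not been handed out yet */
} stream_vbo_t;

/* Reserve `bytes` for the next batch and return the byte offset to write at.
   *orphan is set to true when the buffer was full and writing restarts at
   offset 0; the caller should then orphan the buffer before mapping it.
   Returns (size_t)-1 if the batch can never fit. */
static size_t stream_vbo_reserve(stream_vbo_t *s, size_t bytes, bool *orphan)
{
    *orphan = false;
    if (bytes > s->capacity) {
        return (size_t)-1;
    }
    if (bytes > s->capacity - s->head) {
        /* Not enough room left: wrap around and ask for an orphan. */
        s->head = 0;
        *orphan = true;
    }
    size_t offset = s->head;
    s->head += bytes;
    return offset;
}

/* Intended use (needs a current GL context, with the VBO bound once):
 *
 *   bool   orphan;
 *   size_t off = stream_vbo_reserve(&s, batch_bytes, &orphan);
 *   if (orphan)
 *       glBufferData(GL_ARRAY_BUFFER, s.capacity, NULL, GL_STREAM_DRAW);
 *   void *p = glMapBufferRange(GL_ARRAY_BUFFER, off, batch_bytes,
 *                              GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT);
 *   ...write the batch, glUnmapBuffer, draw starting at byte offset off...
 */
```

The unsynchronized map is only safe here because each batch lands in a region that has not been handed out since the last orphan. Keeping everything in one buffer also avoids rebinding between batches.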