To Map() or not to Map()?

I know that this question has come up many times (mostly under different topics, though), and I know that most people say to use BufferData() instead of Map(), but I haven’t been convinced yet, and still cannot see why that approach would be better or faster.

Obviously, the main reason for Map() is to write dynamic data straight into host memory. If it was not dynamic, I would upload it only once as a static buffer, and then it wouldn’t really matter if I used Map() or BufferData().

So, dynamic data means it’s created on the fly. Using BufferData(), I first have to create my data (which might be huge) in local memory, then call BufferData(), which blocks until it copies everything over to host memory. On the other hand, Map() lets me write straight into host memory, skipping the expensive copy altogether. Sure, if I try to Map() a buffer that is currently in use by the GPU, then it blocks, whereas BufferData() could just find some free space somewhere else and start copying right away! But that’s why one should double-buffer the VBs (or even keep n buffers in a ring); this way, waiting for the GPU can be eliminated.
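The double-buffered Map() pattern described above would look roughly like this (a C-style pseudocode sketch, not a complete program; the buffer names, the ring size, and the surrounding setup are illustrative):

```
/* Created once at startup with glGenBuffers(). */
GLuint vbo[2];          /* pre-allocated pair (or ring of n) */
int    frame = 0;

/* Each frame: fill one buffer while the GPU may still be
   drawing from the other one. */
glBindBuffer(GL_ARRAY_BUFFER, vbo[frame & 1]);
void* ptr = glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY);
/* ... generate the dynamic vertices straight into ptr ... */
glUnmapBuffer(GL_ARRAY_BUFFER);
/* ... set pointers and draw from this buffer ... */
frame++;
```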

Creating this ring of buffers brings up another question. I’ve been told not to pre-allocate buffers, but to use BufferData() with NULL instead, so the driver can allocate space without copying anything. This way I could have the best of both worlds: Map() gives me direct access, and BufferData() eliminates blocking without the need to manage a ring of buffers.

While this is all beautiful in theory, I’ve found that manually switching between pre-allocated buffers is much faster than asking BufferData() for a new buffer every time. I guess this is similar to calling “new” in C++ every time, instead of using pre-allocated buffers.

Please share your thoughts on this subject!

Not sure what you mean by “pre-allocate”; calling BufferData with NULL does reserve memory somehow.

AFAIR, the procedure to follow to Map() without too many problems is to first call BufferData( NULL ), then call Map( WRITE_ONLY ), so that the driver knows that you don’t care about what’s in the buffer and you’ll overwrite everything.
It should then behave like a BufferData call, that is, it doesn’t block, and it uses another memory area should the buffer be ‘busy’.
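In code, the idiom described above (often called buffer “orphaning”) would be something like this (C-style pseudocode sketch; the size and usage hint are illustrative):

```
glBindBuffer(GL_ARRAY_BUFFER, vbo);

/* Re-specify the store with a NULL pointer: the driver may hand out a
   fresh block of memory and "orphan" the old one if the GPU is still
   reading from it, so the following Map() need not stall. */
glBufferData(GL_ARRAY_BUFFER, size, NULL, GL_STREAM_DRAW);

void* ptr = glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY);
/* ... write the entire buffer ... */
glUnmapBuffer(GL_ARRAY_BUFFER);
```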

Just checked my docs, here’s something you might want to read (nVidia):
http://www3.uji.es/~jromero/documentos/Using-VBOs.pdf

p10+ are about the different functions and their use.

glBufferData seems to be the only one to avoid syncing, the only alternative being BufferData( NULL ) + Map( WRITE_ONLY ) as I said earlier.
[That behavior should be Vendor agnostic, not sure though.]

-BufferData() is the vendor-independent preferred way of updating a VBO.
-BufferSubData() syncs on nVidia’s GPUs, but not on ATI’s GPUs.
-Map() syncs no matter the vendor.

For static data on nvidia & 3dlabs cards use display lists - VBO gives nowhere near the performance of display lists on static data.
Also, it is still the case that you get a significant increase in draw speed if you compile the display lists using immediate mode.
I’ve given up asking why, but it’s probably due to the fact that most people benchmark on old apps which use display lists, and vendors are only interested in looking good in benchmarks.

Not sure what you mean by “pre-allocate”; calling BufferData with NULL does reserve memory somehow.
What I mean by pre-allocating is to actually create multiple VBOs during initialization, and then just switch between them. It has the same effect as calling BufferData(NULL), but it’s much faster.
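The switching itself is trivial; here is a minimal, compilable sketch of just the rotation logic (the buffer names are placeholder numbers; in a real program they would come from glGenBuffers(), and the Map/fill/draw calls are elided):

```c
#include <assert.h>

#define RING_SIZE 3

/* Placeholder buffer "names"; a real program would fill these
   from glGenBuffers() at initialization. */
static unsigned ring[RING_SIZE] = { 1, 2, 3 };
static int current = 0;

/* Advance to the next pre-allocated buffer and return its name.
   Bind this buffer, Map() it, fill it, and draw from it. */
unsigned next_buffer(void)
{
    current = (current + 1) % RING_SIZE;
    return ring[current];
}
```

With two buffers this degenerates to simple double buffering; with n buffers the GPU gets even more frames of slack before a Map() could stall.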

AFAIR, the procedure to follow to Map() w/o too many problems, is to first call BufferData( NULL ), then call Map( WRITE_ONLY ), so that the driver knows that you don’t care about what’s in the buffer and you’ll overwrite everything.
It should behave as a BufferData call, that is it’s not blocking and using another memory area should the buffer be ‘busy’.

This is exactly what I said, but last time I checked it was a lot slower than the switching I’ve just described. I’m guessing it’s because of the complex memory management the driver has to do every time I ask for some space.

Hm, interesting that pre-allocated buffers are faster. I just tried adding glBufferData(NULL) to my particle system and it doubled the framerate. I can’t imagine that I could gain much more by pre-allocating the buffers, since the memory management shouldn’t be processing-intensive; the real slowdown comes when CPU and GPU have to synchronize because you want to write to mapped memory.

Also, it should be quite a complex task to manage pre-allocated buffers in a way that doesn’t waste memory. If I have two buffers and simply switch between them, then I always use twice the memory. With glBufferData(NULL) I only use twice the memory as long as one buffer is still being rendered from. So, if every dynamic thing I do uses twice the memory it actually needs, then I waste a lot of my precious VRAM.

Therefore I wouldn’t do that, and would simply rely on the driver to do that management for me, which is in fact the whole purpose of VBO. If we do our own memory management anyway, we could have stuck with VAR.

Jan.

Originally posted by knackered:
For static data on nvidia & 3dlabs cards use display lists - VBO gives nowhere near the performance of display lists on static data.

What do you mean here? That we should put all static stuff in display lists on nv and 3Dlabs cards, but keep it inside VBOs on other cards? Is that really a good thing to do?

Hm, interesting that pre-allocated buffers are faster. I just tried adding glBufferData(NULL) to my particle system and it doubled the framerate. I can’t imagine that I could gain much more by pre-allocating the buffers, since the memory management shouldn’t be processing-intensive; the real slowdown comes when CPU and GPU have to synchronize because you want to write to mapped memory.
Yes, you are correct (in theory). But I’ve just re-done the test with the latest 77.72 drivers on my GF6600, and while I get 200+ FPS with VB double buffering, I got ~120 FPS when requesting new memory every time by calling BufferData(NULL)…

Also, it should be quite a complex task to manage pre-allocated buffers in a way that doesn’t waste memory. If I have two buffers and simply switch between them, then I always use twice the memory. With glBufferData(NULL) I only use twice the memory as long as one buffer is still being rendered from. So, if every dynamic thing I do uses twice the memory it actually needs, then I waste a lot of my precious VRAM.
Again, you are correct with regards to wasting memory, although you might not be able to tell the exact size, just an upper bound, even when you request fresh memory all the time.

Therefore I wouldn’t do that, and would simply rely on the driver to do that management for me, which is in fact the whole purpose of VBO. If we do our own memory management anyway, we could have stuck with VAR.
Well, I would love to have a VB that works fast the way we’re supposed to use it, but the hard fact remains that you need to manage things yourself, and VAR is way more powerful in this sense.

Trust me: I really wish I didn’t have to deal with managing the buffers myself!! I would love to do it the right way, and trust the drivers to do a good job!!

Andras

Have you tried BufferSubData()/BufferData() with a non-NULL pointer?

Have you tried BufferSubData()/BufferData() with a non-NULL pointer?
Nope. BufferSubData() stalls by definition, and both need a separate buffer in client memory, which could be pretty big (if that’s not a waste of memory, then what is?), and there’s also a full stalling copy involved. I cannot see how this could be faster or more efficient…

Originally posted by andras:
[quote]Have you tried BufferSubData()/BufferData() with a non-NULL pointer?
Nope. BufferSubData() stalls by definition, and both need a separate buffer in client memory, which could be pretty big (if that’s not a waste of memory, then what is?), and there’s also a full stalling copy involved. I cannot see how this could be faster or more efficient…
[/QUOTE]Well, you may use a ‘cache’ in RAM to avoid creating/deleting memory. You would then fill it up when needed and send it to the VBO using BufferData().
That way you wouldn’t be “wasting” much RAM, and you would end up using the preferred IHV method of updating VBOs…
(That means you replace the whole buffer, of course; you can’t just update a part of it.)
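Sketched out, that staging approach would look something like this (C-style pseudocode; the Vertex type, the generate_particles() helper, and the STREAM_DRAW hint are made up for illustration):

```
/* Allocated once and reused every frame -- no per-frame malloc/free. */
static Vertex staging[MAX_VERTICES];

/* Fill the RAM cache... */
int count = generate_particles(staging, MAX_VERTICES);

/* ...then replace the entire VBO store in a single call. */
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glBufferData(GL_ARRAY_BUFFER, count * sizeof(Vertex),
             staging, GL_STREAM_DRAW);
```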

Originally posted by jide:
[quote]Originally posted by knackered:
For static data on nvidia & 3dlabs cards use display lists - VBO gives nowhere near the performance of display lists on static data.

What do you mean here? That we should put all static stuff in display lists on nv and 3Dlabs cards, but keep it inside VBOs on other cards? Is that really a good thing to do?
[/QUOTE]Hey, don’t shoot the messenger.

Originally posted by andras:
Please share your thoughts on this subject!
Map() and copy is 15% faster than BufferSubData() in my benchmarks, using Catalyst 5.1 on an X800 XT.

Map() provides a pointer that can be used to fill a VBO in a thread lacking a GL context. I use a high priority thread to draw and a low priority thread to fill the VBO. I see this as a great advantage that becomes even greater on a multicore CPU.
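The division of labor described above is roughly this (C-style pseudocode; the hand-off and signalling helpers are placeholders for whatever synchronization primitive you use, and note that only the mapped pointer, never a GL call, crosses the thread boundary):

```
/* GL thread (owns the context): */
glBindBuffer(GL_ARRAY_BUFFER, vbo);
void* ptr = glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY);
hand_off_to_fill_thread(ptr);     /* e.g. signal an event        */

/* Fill thread (no GL context needed): */
write_vertices(ptr);              /* plain memory writes         */
signal_fill_done();

/* GL thread, once the fill is done: */
glUnmapBuffer(GL_ARRAY_BUFFER);   /* Map/Unmap stay on the       */
/* ... draw ...                      context-owning thread       */
```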

Map() provides a pointer that can be used to fill a VBO in a thread lacking a GL context. I use a high priority thread to draw and a low priority thread to fill the VBO. I see this as a great advantage that becomes even greater on a multicore CPU.
That’s exactly what I do! I believe that Map() is fundamentally more powerful than BufferData() & friends; the only problem is that I have to double-buffer manually if I want to get the best performance out of it.

Again, could some driver guy look into why BufferData(NULL) seems to be much slower than just switching between pre-allocated buffer objects? It would make life so much easier if we could just use that! Thanks!

Just a stab in the dark here, but could it make a performance difference to disable the client state before calling BufferData(NULL)?

I don’t know, I’ll have to try that! But this brings me to another interesting question. I was re-reading this nVidia document on VBO usage, and there is a section called “Avoid calling glVertexPointer() more than once per VBO”, where they say that all the actual setup happens on the glVertexPointer() call.

Now how exactly does this work in the shader era, when we have to bind attribute arrays to locations? For example, I have lots of different shaders, each shader has multiple attributes, and different attributes are stored in different VBOs. So for each attribute, I have to bind the corresponding VBO, and then call VertexAttribPointer(location…) to attach the buffer to a location. And I’ll have to do this every time I change shaders, right? And of course every time I request new memory with glBufferData(NULL)! Or am I missing something? I have to admit that I feel a bit lost here. If someone could shed some light on how this works, it would be really appreciated. Thanks!
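For reference, the per-shader setup being described is something like this (C-style pseudocode; the VBO and location names are made up). The key point is that the buffer bound to GL_ARRAY_BUFFER at the moment *AttribPointer() is called is the one the attribute will source from:

```
/* After switching shaders, re-attach each attribute to its VBO. */
glBindBuffer(GL_ARRAY_BUFFER, positionVBO);
glVertexAttribPointer(posLoc, 3, GL_FLOAT, GL_FALSE, 0, 0);

glBindBuffer(GL_ARRAY_BUFFER, normalVBO);
glVertexAttribPointer(nrmLoc, 3, GL_FLOAT, GL_FALSE, 0, 0);

glEnableVertexAttribArray(posLoc);
glEnableVertexAttribArray(nrmLoc);
```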

Originally posted by macarter:
I use a high priority thread to draw and a low priority thread to fill the VBO.
Just some friendly advice, learned the hard way: you shouldn’t use thread priorities at all on Win32.
Quick summary: a high priority thread cannot sleep. It will only yield the CPU when it waits on a waitable object, or when another thread of equally high priority is ready to run.

The result is:
1) disastrous performance on single-core machines, if you rely on Sleep, SwitchToThread, Yield, whatever;
2) nothing on multi-core or “HyperThreading” CPUs. You don’t lose anything, and neither do you gain anything, by fiddling with priorities.

Just don’t do it. Please.

[b]

Just some friendly advice, learned the hard way: you shouldn’t use thread priorities at all on Win32.
Quick summary: a high priority thread cannot sleep. It will only yield the CPU when it waits on a waitable object, or when another thread of equally high priority is ready to run.
[/b]
Hmm, I dunno, it works like a charm here… Our main thread doesn’t use a lot of CPU (we make the GPU sweat instead ;P), but it has to be super responsive! So actually, our idle thread uses 90% of the CPU, it’s kinda funny… :slight_smile:

Originally posted by macarter:
Map() provides a pointer that can be used to fill a VBO in a thread lacking a GL context. I use a high priority thread to draw and a low priority thread to fill the VBO. I see this as a great advantage that becomes even greater on a multicore CPU.[/QB]
Is that well-defined behaviour or just blind luck? Can we depend on that being the case? I would have imagined that AGP-mapped memory would be thread-specific, which would break this behaviour. Doesn’t seem to be the case, but is that true for other OSes, too?

I’ve learned to be very careful about OpenGL and multiple threads, so before I make my design depend on it, I’d really like to have a serious answer, as I couldn’t find anything about it in the spec.

Thanks

Dirk

Originally posted by dirk:
[quote]Originally posted by macarter:
Map() provides a pointer that can be used to fill a VBO in a thread lacking a GL context. I use a high priority thread to draw and a low priority thread to fill the VBO. I see this as a great advantage that becomes even greater on a multicore CPU.

Is that well-defined behaviour or just blind luck? Can we depend on that being the case? …[/QB][/QUOTE]Threads by definition share memory mappings.

Originally posted by andras:
Hmm, I dunno, it works like a charm here…
Okay, I’m going out on a limb here, but …
a) you’re working on a HyperThreaded P4, and
b) you’ll get the exact same performance with default priorities anyway.

Our main thread doesn’t use a lot of CPU (we make the GPU sweat instead ;P), but it has to be super responsive!
And you’re assuming assigning it a high priority will make your thread “super responsive”?
Well, yes, in some twisted way it will do that. A high priority thread will starve all other threads, it will basically run all the time unless it Waits on some object or its message queue.
But then again, if a thread waits on an object, it will be resumed immediately anyway, as long as it has the same priority as all other currently ready threads.

So there’s your responsiveness. Giving a thread high priority will not increase its responsiveness. It will instead make all lower-priority threads unresponsive.

Please try your software at least once with HT disabled. I’m sure you’ll see what I mean.

So actually, our idle thread uses 90% of the CPU, it’s kinda funny… :slight_smile:
Please don’t tell me you’ve written your own idle thread :eek: