VBO Performance Strategy

Carmack didn’t create MiniGL; 3Dfx did, to cover the functions he was using in glQuake.

Here’s a quote from his .plan of 11/23/96:

“GLQuake: 3DFX is writing a mini-gl driver that will run glquake. We expect very high performance. 3Dlabs is also working with me on improving some aspects of their open-GL performance.”

The implementation details of the MiniGL driver are not the issue. Nobody appreciates Carmack’s contributions more than me. Just because I make a simple factual correction, don’t assume I’m taking a position on other things you post.

dorbie, i am sorry. i had wrong information.
thanks for correcting me.

As a side note, there’s something I’ve been wondering recently. When writing to mapped VBO memory residing on the graphics card, is there likely to be a major performance hit for scattered writes, as opposed to nice tidy writes to sequential memory locations? I’d imagine so, but don’t know a lot about bus architectures.

If nobody knows offhand, I’ll test it and see; I’m only asking 'cos I’m lazy…

If you do test this, could you please post the results. I am also curious. Thanks.

When writing to mapped VBO memory residing on the graphics card, is there likely to be a major performance hit for scattered writes, as opposed to nice tidy writes to sequential memory locations?

It is implementation-dependent. However, the fastest implementations of mapping will definitely have this limitation. So, you should probably assume that sequential writes are the way to go.
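
For what it’s worth, here is a minimal sketch of the “tidy” pattern being described: map the buffer write-only and fill it front to back in a single pass, never reading through the mapped pointer. The vertex layout and the function name are placeholders of mine, and it assumes the ARB_vertex_buffer_object entry points have already been loaded.

[code]
/* Sketch: sequential, write-only fill of a mapped VBO.
   Assumes the GL_ARB_vertex_buffer_object entry points are loaded. */
typedef struct { float pos[3]; float nrm[3]; float uv[2]; } Vertex;  /* hypothetical layout */

void UpdateVertices(GLuint vbo, const Vertex* src, int count)
{
    glBindBufferARB(GL_ARRAY_BUFFER_ARB, vbo);

    /* WRITE_ONLY: never read back through this pointer; reads over the bus are very slow. */
    Vertex* dst = (Vertex*)glMapBufferARB(GL_ARRAY_BUFFER_ARB, GL_WRITE_ONLY_ARB);
    if (!dst)
        return;

    /* One forward pass, no jumping around, so writes can be combined into full bursts. */
    for (int i = 0; i < count; ++i)
        dst[i] = src[i];

    glUnmapBufferARB(GL_ARRAY_BUFFER_ARB);
}
[/code]

A scattered update would mean touching dst[] in some arbitrary order; on write-combined memory that presumably breaks up the bursts, which is where I’d expect the hit to come from.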

We are using all the vertex buffer implementations in our engine: VBO, VAR and VAO. While at the beginning VBO was somewhat slower than VAR and VAO, over time both NVIDIA and ATI improved their drivers and now VBO is somewhere around 5-10% faster. Due to the kinds of games we make, we have many vertex buffers (thousands in a scene and hundreds visible per frame). The VBO design makes the update and rendering calls less expensive when you have many objects. Also, the client code path for VBO is far less complex than, for example, the memory management for VAR. Furthermore, even NVIDIA tells developers to use VBO rather than VAR (which is a rare thing considering their position regarding their own extensions vs. ARB extensions).

I expect VBO to become faster and faster in the future, while custom vendor extensions are kept in the driver only for compatibility with older applications. New hardware will be designed to be fast with VBO, and the presence of this standard extension will make different vendors’ hardware very similar in usage. Without this extension there would be no hope for unified geometry data management at all.

Originally posted by Korval:
[b] Well, I read that slightly differently. I don’t read “lightweight” as “free”, but as relatively “low-cost”. A bind (specifically, a glBindBuffer followed by a gl*Pointer call) could require an upload of a vertex buffer object that has been paged out back into video memory. This will require going through the cache. Now, unless you are constantly thrashing, this operation should be virtually non-existent if the buffer is resident.

I do agree that, in general, the correct usage pattern is one VBO per object.[/b]

Binding buffers and setting pointers is so lightweight that you will never have to worry about it! I have tested it with about 12000 small objects (fewer than 100 vertices each), each having a separate VBO for vertices, normals and texture coordinates. The binding took about 1/100th of the time that the glDrawElements calls took. Maybe I misunderstood the result of the profiling, or don’t know exactly what is behind these function calls, but I think VBO’s driver-side management is well developed (I could even achieve the maximum vertex throughput with 180 MB of model data loaded this way!). (I don’t know if NVIDIA’s solution is as good as ATI’s, but it seemed that NVIDIA supports VBO more than VAR)…
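
To make the binding pattern concrete, this is roughly what a per-object draw looks like with separate buffers for each attribute. The struct and field names below are mine, purely for illustration; note that with a VBO bound, the pointer argument of the gl*Pointer calls is a byte offset into the buffer, and the client states are assumed to be enabled elsewhere.

[code]
/* Sketch: one set of VBOs per object, separate buffers per attribute.
   glEnableClientState(GL_VERTEX_ARRAY / GL_NORMAL_ARRAY / GL_TEXTURE_COORD_ARRAY)
   is assumed to have been called already. */
typedef struct {
    GLuint posVbo, nrmVbo, uvVbo, idxVbo;   /* hypothetical per-object handles */
    GLsizei indexCount;
} ObjectBuffers;

void DrawObject(const ObjectBuffers* obj)
{
    glBindBufferARB(GL_ARRAY_BUFFER_ARB, obj->posVbo);
    glVertexPointer(3, GL_FLOAT, 0, (const GLvoid*)0);     /* offset 0 into the bound VBO */

    glBindBufferARB(GL_ARRAY_BUFFER_ARB, obj->nrmVbo);
    glNormalPointer(GL_FLOAT, 0, (const GLvoid*)0);

    glBindBufferARB(GL_ARRAY_BUFFER_ARB, obj->uvVbo);
    glTexCoordPointer(2, GL_FLOAT, 0, (const GLvoid*)0);

    glBindBufferARB(GL_ELEMENT_ARRAY_BUFFER_ARB, obj->idxVbo);
    glDrawElements(GL_TRIANGLES, obj->indexCount, GL_UNSIGNED_SHORT, (const GLvoid*)0);
}
[/code]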

Originally posted by licu:
Due to the kinds of games we make, we have many vertex buffers (thousands in a scene and hundreds visible per frame). The VBO design makes the update and rendering calls less expensive when you have many objects.

Are these results for one big buffer, or one buffer per object?

I’ve never quite understood how VBO is better than display lists for the latter case.

Originally posted by MikeC:
[b] Are these results for one big buffer, or one buffer per object?

I’ve never quite understood how VBO is better than display lists for the latter case.[/b]

You should read the specs of VBO and display lists…

Originally posted by orbano:
You should read the specs of VBO and display lists…

I have. In the general case, and for dynamic data in particular, sure. But for static data, with one buffer per object, I can’t see any win over DLs except maybe a slightly faster setup, and I’d imagine that the DL will compile to something very like a static VBO behind the scenes. I suppose it boils down to a tradeoff between elegance (consistent use of VBOs throughout) and compatibility with old drivers.
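
For comparison, here is roughly what the two setups look like for purely static data (this is just my own sketch; whether a driver really compiles the display list into something VBO-like is anyone’s guess):

[code]
/* Sketch: the same static geometry as (a) a display list and (b) a static VBO. */

/* (a) Display list: vertex data is dereferenced at compile time and the driver
   decides where and how it lives afterwards. */
GLuint BuildList(const float* verts, int vertexCount)
{
    GLuint list = glGenLists(1);
    glEnableClientState(GL_VERTEX_ARRAY);
    glVertexPointer(3, GL_FLOAT, 0, verts);
    glNewList(list, GL_COMPILE);
    glDrawArrays(GL_TRIANGLES, 0, vertexCount);
    glEndList();
    return list;                      /* draw with glCallList(list) */
}

/* (b) Static VBO: you keep the handle and issue the draw call yourself. */
GLuint BuildVbo(const float* verts, int vertexCount)
{
    GLuint vbo;
    glGenBuffersARB(1, &vbo);
    glBindBufferARB(GL_ARRAY_BUFFER_ARB, vbo);
    glBufferDataARB(GL_ARRAY_BUFFER_ARB, vertexCount * 3 * sizeof(float),
                    verts, GL_STATIC_DRAW_ARB);
    return vbo;   /* bind, set glVertexPointer to offset 0, then glDrawArrays to draw */
}
[/code]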

If I’m missing something (entirely possible) feel free to point at me and laugh, but I’d appreciate it if some kind soul could put me out of my ignorance.

Originally posted by MikeC:
If I’m missing something (entirely possible) feel free to point at me and laugh, but I’d appreciate it if some kind soul could put me out of my ignorance.

If you’re using LODs for your geometry where the different levels share the vertices but use different indices it may make a big difference in terms of memory usage.
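
In code, that setup might look something like the sketch below: one vertex buffer shared by every level, plus one small element-array buffer per LOD. The struct and names are mine, just to make the idea concrete.

[code]
/* Sketch: one shared vertex VBO, one index VBO per LOD level. */
#define MAX_LODS 4

typedef struct {
    GLuint  vertexVbo;                /* shared by all LODs */
    GLuint  indexVbo[MAX_LODS];       /* one element-array buffer per LOD */
    GLsizei indexCount[MAX_LODS];
} LodMesh;

void DrawLod(const LodMesh* mesh, int lod)
{
    glBindBufferARB(GL_ARRAY_BUFFER_ARB, mesh->vertexVbo);
    glVertexPointer(3, GL_FLOAT, 0, (const GLvoid*)0);

    glBindBufferARB(GL_ELEMENT_ARRAY_BUFFER_ARB, mesh->indexVbo[lod]);
    glDrawElements(GL_TRIANGLES, mesh->indexCount[lod], GL_UNSIGNED_SHORT, (const GLvoid*)0);
}
[/code]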

Originally posted by stefan:
If you’re using LODs for your geometry where the different levels share the vertices but use different indices it may make a big difference in terms of memory usage.

Hmm, good point. Not sure I’d want to do LOD that way - it costs the footprint of the highest-resolution LOD even if you’re only using the lowest-resolution one, and doesn’t sound very cache-friendly - but it’s an interesting approach.

Thanks,
Mike

Hmm, good point. Not sure I’d want to do LOD that way - it costs the footprint of the highest-resolution LOD even if you’re only using the lowest-resolution one, and doesn’t sound very cache-friendly - but it’s an interesting approach.

Well, the footprint problem is certainly liveable, as you don’t want to make the driver upload a new VBO (if it was paged out) just for a new LOD.

And no, it isn’t cache friendly, so your lower LODs won’t be as fast as they could be. But you save memory by not having to keep multiple VBOs around, so it can definitely be worth it.

I don’t know if this example is any good.
If smaller geometrical LODs are small enough™, the memory usage is well bounded and won’t be much of a problem. Compare this with mipmapping, which is absolutely bounded at about 33% more memory, regardless of how big the base texture is.

I’m a bit fuzzy on the math right now, so I don’t know whether LOD n+1 needs to be strictly a quarter of LOD n in size, or if any exponential decay is fine.

Originally posted by Korval:
you don’t want to make the driver do an upload of a new VBO (if it was paged out) just for a new LOD.

Really? In an ideal world, no. But if you aren’t currently using an LOD, and it’s hogging vidmem needed by things you are using, paging it out until it’s needed sounds like a perfectly reasonable thing to do.

Yes, any exponential decay will do:
a_n = 1/c^n, where c > 1
(yeah, I know it’s not a function; it’s a geometric sequence, I just didn’t know its name in English.)
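
Spelled out, it is just the standard geometric-series bound; a_0 below stands for the vertex count of the full-detail level, which is an assumption of mine rather than something from the post:

[code]
% Total memory across all LODs, assuming LOD n holds a_0 / c^n vertices with c > 1:
\sum_{n=0}^{\infty} \frac{a_0}{c^n} = a_0 \cdot \frac{c}{c-1}
% For c = 4 (each level a quarter of the previous, as with mipmaps) this gives
% (4/3) a_0, i.e. roughly 33% overhead over the base level alone.
[/code]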
And about DLs and VBOs: AFAIK, DLs don’t have to be in video/AGP memory; they are just compiled into OpenGL’s own memory. Please kick me if I’m wrong, but that is how I understood the OpenGL specs.

Well, I have tried both techniques for my landscape (using the same VB with different LOD IBs, and geomipmapped VBs too) and the results are totally the same.
With the first method there was 5 MB of VBOs and with the second only 350 KB (50k tris rendered with both methods). Note that if a chunk of land changes its LOD, it deletes its VBO and creates a new one. All VBOs are static.

Tested on an NVIDIA GeForce4.
A screenshot: http://esotech.free.fr/Clipboard01.jpg

It’s not the size of the VBO that matters but how much you draw from it.
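
If it helps, the “delete the old VBO and create a new one on LOD change” part is just something like this (the chunk structure and names are invented for the example; error handling omitted):

[code]
/* Sketch: swap a terrain chunk's static VBO when its LOD level changes. */
typedef struct {
    GLuint vbo;          /* 0 if not created yet */
    int    currentLod;
} TerrainChunk;          /* hypothetical */

void SetChunkLod(TerrainChunk* chunk, int newLod,
                 const float* lodVerts, int lodVertexCount)
{
    if (chunk->vbo && chunk->currentLod == newLod)
        return;

    if (chunk->vbo)
        glDeleteBuffersARB(1, &chunk->vbo);

    glGenBuffersARB(1, &chunk->vbo);
    glBindBufferARB(GL_ARRAY_BUFFER_ARB, chunk->vbo);
    glBufferDataARB(GL_ARRAY_BUFFER_ARB, lodVertexCount * 3 * sizeof(float),
                    lodVerts, GL_STATIC_DRAW_ARB);
    chunk->currentLod = newLod;
}
[/code]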

Originally posted by MikeC:
[b]As a side note, there’s something I’ve been wondering recently. When writing to mapped VBO memory residing on the graphics card, is there likely to be a major performance hit for scattered writes, as opposed to nice tidy writes to sequential memory locations? I’d imagine so, but don’t know a lot about bus architectures.

If nobody knows offhand, I’ll test it and see; I’m only asking 'cos I’m lazy…[/b]

On PCs, this is not really possible. If you wish to write to such a VBO, the object will be brought into system memory, because on PCs there is a risk that the buffer may be lost.

Read the part of the spec that says:
“What happens to a mapped buffer when a screen resolution change or other such window-system-specific system event occurs?”

I’m not sure though. It could be system mem or AGP.

Originally posted by Korval:
nobody has released a product that uses VBOs professionally.

I know of at least one application shipping that uses VBOs. We used them on Homeworld2 as the preferred method of storing geometry. Homeworld2 shipped four months ago in September 2003.

Homeworld2 made a few magazine covers and won its share of game-of-the-month awards, so I would think it is popular enough to count for something, but I won’t compare its popularity/performance influence to what I expect from Doom3.

A bit about HW2 for those interested, because I haven’t really posted here much. HW2 uses VBOs, and if they aren’t supported HW2 falls back to display lists. We detect the renderer and driver version and look up in a list of known buggy drivers whether we should or should not use display lists. Quite often we also disable display lists and fall back on (compiled) vertex arrays. Ugh… I think with VBOs we may never see bug-free display list support from certain vendors.
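
The detection side of that fallback chain is simple enough; a minimal version might look like the sketch below. The known-buggy-driver lookup is of course the hard part and is only hinted at here, and the quick strstr check should really match whole extension tokens.

[code]
/* Sketch: pick a geometry path - VBO, then display lists, then (compiled) vertex arrays. */
#include <string.h>

typedef enum { PATH_VBO, PATH_DISPLAY_LIST, PATH_VERTEX_ARRAY } GeometryPath;

GeometryPath ChooseGeometryPath(int driverHasBuggyDisplayLists)
{
    const char* ext = (const char*)glGetString(GL_EXTENSIONS);

    /* Quick substring check; a careful version matches complete extension names. */
    if (ext && strstr(ext, "GL_ARB_vertex_buffer_object"))
        return PATH_VBO;

    /* driverHasBuggyDisplayLists comes from a renderer/driver-version table kept elsewhere. */
    if (!driverHasBuggyDisplayLists)
        return PATH_DISPLAY_LIST;

    return PATH_VERTEX_ARRAY;   /* plain or compiled vertex arrays as the last resort */
}
[/code]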

Homeworld2 uses fragment programs for all rendering on advanced cards and makes a bit of use of vertex programs too. Shadows are done with shadow maps. We don’t support VAR or VAO, only VBO.

We used one VBO per object, which I assume is the way one should try to use them. VBOs should be getting pretty stable, as they are part of the new core and the standard is a year old.

We used one VBO per object, which I assume is the way one should try to use them.

OK, that’s one shipping game that uses the one-VBO-per-object pattern. Good.

I seem to recall that ATi’s drivers at the time of HW2’s release had some issues with the game. Did this have something to do with their VBO implementation at the time, and did ATi correct the problem?