VBOs strangely slow?

I have here two short programs doing the same thing, namely drawing random points on the screen.

The first one uses VBOs; the second, OpenGL 1.1 drawArrays. There are a couple of parameters to play with (the number of VBOs to use in round-robin fashion, etc.), but it doesn’t really matter which parameters are used; the results are approximately the same whatever you do.

That is, the second, VBO-less program runs roughly ten times faster than the first in all cases, except if you comment out the VBO update entirely and only initialize the buffers at startup, in which case the two are the same speed.

Now, this is hardly the ideal use for VBOs - being used only once, and all - but it still seems odd to me that the difference would be that dramatic. Thus, my question: Am I doing something obviously and horribly wrong?
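The paste itself isn’t reproduced in the thread, but the pattern being benchmarked sounds roughly like the following. This is a hedged sketch, not the poster’s actual code: `NUM_VBOS`, `NUM_POINTS`, and the function names are mine, and context creation and extension-pointer loading are omitted.

```c
/* Hedged sketch of the benchmark as described: a pool of VBOs used
 * round-robin, each refilled with point data every frame and drawn
 * once.  All names and sizes here are illustrative. */
#include <GL/gl.h>

#define NUM_VBOS   3            /* round-robin pool size (tweakable) */
#define NUM_POINTS (1 << 20)    /* number of random points */

static GLuint  vbos[NUM_VBOS];
static GLfloat points[NUM_POINTS * 2];   /* x,y pairs */

void init_buffers(void)
{
    glGenBuffers(NUM_VBOS, vbos);
    for (int i = 0; i < NUM_VBOS; ++i) {
        glBindBuffer(GL_ARRAY_BUFFER, vbos[i]);
        glBufferData(GL_ARRAY_BUFFER, sizeof(points), NULL, GL_STREAM_DRAW);
    }
    glEnableClientState(GL_VERTEX_ARRAY);
}

void draw_frame(int frame)
{
    glBindBuffer(GL_ARRAY_BUFFER, vbos[frame % NUM_VBOS]);
    /* The contested step: re-uploading the whole array each frame. */
    glBufferData(GL_ARRAY_BUFFER, sizeof(points), points, GL_STREAM_DRAW);
    glVertexPointer(2, GL_FLOAT, 0, (const void *)0);
    glDrawArrays(GL_POINTS, 0, NUM_POINTS);
}
```

The non-VBO version would skip the buffer binding and upload, and pass `points` directly to glVertexPointer.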


I recall several discussions already claiming VBOs are slow. It almost doesn’t make sense, because they are the only choice in pure GL3. So unless you’re not aiming at the future, or are doing something wrong, you can expect them to work at least as fast as vertex arrays.

Anyway, did you try playing with the VBO usage hint parameter?

Yes, all _DRAW permutations have been tested; they make no apparent difference. (STATIC_DRAW was no slower, even… you’d think it would be)

I also tested bufferSubData vs. bufferData in those permutations.

Or it could be because the lowermost program actually transfers far less data. Um. Yeah.

Though even with that fixed, VBOs are /still/ 40% slower.

Why are you creating the buffer object with GL_DYNAMIC_DRAW in one case, and GL_STREAM_DRAW in another?

Also, what hardware are you running this on?

I was testing various combinations to see what, if anything, had an effect. So far nothing’s produced any change at all.

Fixing the non-VBO case so the entire array is actually used, I’m looking at ~6,000ms for non-VBO, ~10,000ms for VBO (absolutely regardless of that setting), and ~8,000ms to upload it as a texture which I then proceed not to use.

My preference would be to have a way to do this via DMA. Hand the drivers a pointer which I can tell them I won’t be changing until some later point (at which point I presumably need a sync call of some sort), and let the DMA engine do the memory-copy. I don’t suppose that’s possible?

So, what hardware are you running this on?

Also, why are you using the ARB extension function pointers rather than the core function pointers? I don’t imagine that this will have any effect, but it does seem highly unnecessary.

I don’t suppose that’s possible?

Have you tried mapping the buffers?

In the end, glMapBuffer was (much) faster; preceding it with a glBufferData with a null pointer (discarding its contents), 10% faster than the vertex array.
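In code, that winning pattern looks roughly like this. It’s a fragment rather than a complete program; `vbo`, `points`, and `size` are placeholders for whatever the real program uses, and `memcpy` needs `<string.h>`.

```c
/* "Orphaning" pattern: glBufferData with a NULL pointer tells the
 * driver to discard the buffer's old storage (which may still be in
 * flight), so glMapBuffer can hand back fresh memory immediately. */
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glBufferData(GL_ARRAY_BUFFER, size, NULL, GL_STREAM_DRAW); /* discard */
GLvoid *dst = glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY);
if (dst != NULL) {
    memcpy(dst, points, size);   /* or generate vertices in place */
    glUnmapBuffer(GL_ARRAY_BUFFER);
}
```

Without the NULL glBufferData, the map may stall waiting for any pending draw from the same buffer to finish.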

All’s well that ends well? I guess, but it’s still not obvious to me why the other way of using them is in this case /slower/.

EDIT: Well, yes, MapBuffer should be able to use DMA transfers… that makes perfect sense. That’s still missing a way to do a DMA transfer from an array that needs to stay readable in main memory, but I suppose it doesn’t matter; I don’t need that ability right now.

And I was using the ARB versions because they have the exact same API, and thus are supported everywhere the core versions are (as far as I’ve seen); the reverse is not quite the case, though I haven’t personally seen that either.

Try the glBufferSubData approach as well.
YMMV :frowning:

I tried it (I think?), just uncomment the glBufferSubData line in the paste… no difference.

Try not using VBOs then :wink:

The question is: why do they push the use of a new feature if it isn’t implemented well?

Conclusion: even with traditional glBegin/End you can get outstanding performance, as long as you optimize algorithmically rather than at the instruction/pixel/hardware level.

Personally, I only believe in the hardware rasterizer as a fast alternative to software rasterization. Other than that, try using a pure shader path and see whether it’s slower or faster :smiley:

Try not using VBOs then

The question is: why do they push the use of a new feature if it isn’t implemented well?

Conclusion: even with traditional glBegin/End you can get outstanding performance, as long as you optimize algorithmically rather than at the instruction/pixel/hardware level.

Did you read the thread? He said, “In the end, glMapBuffer was (much) faster; preceding it with a glBufferData with a null pointer (discarding its contents), 10% faster than the vertex array.” In short, VBOs worked better for him, once he was using the correct API. So your “conclusion” is errant nonsense.

On topic, you should use glMapBufferRange if that extension is available. Using the invalidation flag, you don’t even need the glBufferData(NULL) part.
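For comparison with the glBufferData(NULL) version above, the glMapBufferRange form of the same upload looks roughly like this (again a fragment; `vbo`, `points`, and `size` are placeholders):

```c
/* Same upload via GL_ARB_map_buffer_range: the invalidate bit
 * replaces the separate glBufferData(..., NULL, ...) orphaning call. */
glBindBuffer(GL_ARRAY_BUFFER, vbo);
GLvoid *dst = glMapBufferRange(GL_ARRAY_BUFFER, 0, size,
                               GL_MAP_WRITE_BIT |
                               GL_MAP_INVALIDATE_BUFFER_BIT);
if (dst != NULL) {
    memcpy(dst, points, size);
    glUnmapBuffer(GL_ARRAY_BUFFER);
}
```

For partial updates, GL_MAP_INVALIDATE_RANGE_BIT discards only the mapped range instead of the whole buffer.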

“Try not using VBOs then”

If it’s confusing, that’s because sometimes you change the order of commands and get different/unexpected results on mid-range hardware.

“The question is: why do they push the use of a new feature if it isn’t implemented well?”

Because the people talking about them expected a huge performance gain when using VBOs.

“Conclusion: even with traditional glBegin/End you can get outstanding performance, as long as you optimize algorithmically rather than at the instruction/pixel/hardware level.”

A better way to optimize software :wink:


Try shoving 10 million tris to the gpu per frame at 60fps without VBOs, while needing flexibility that display-lists do not give (and you’d like to not waste VRAM for the different permutations required otherwise with DLs).
:slight_smile:

Use GL_STATIC_DRAW and don’t update your data pointer (via glBufferData or glBufferSubData) in your for loop. I would think these calls cause the geometry to be sent over to the graphics adapter every time.

That’s pretty much the idea. I don’t actually update the array here (because I just want to benchmark transfer speed, not random-number generation), but the target program writes new data every frame.

Well, mapping the buffer works very nicely.

That’s interesting. When I’ve tried Map vs. Sub, Sub was faster (with invalidate of course, so multiple buffers are in-flight in the driver [allegedly], and fixing the VBO max size – no resizing).

But yeah, pure VBOs are odd. You’d think they’d always be faster, but some of the time they’re slower (most of the time on pre-SM4 cards). Unless you play the “Ouija board” correctly per card per driver rev.
  • Map vs. Sub
  • Invalidate vs. not
  • Sync vs. not
  • Static vs. stream vs. dynamic
  • Dynamic max VBO size or not
  • Interleaved attributes vs. separate
  • Multiple batches per VBO vs. not
  • Mixing index and vertex arrays in one buffer or not
  • Max VBO size X or Y
  • 32-byte-aligned verts or not
  • Ring of N buffers or one
  • Vtx fmts X or Y for colors, normals, texcoords, etc.
  • Latency between upload and use X or Y
  • Call glVertexPointer first, last, or in between

Heck, one of our devs even found it can be faster to use a CPU-side index list with VBO vertex attributes on some cards, when the index list changes frequently.
On pre-SM4 cards VBO perf used to be a total crapshoot, with it more likely to be slower than client arrays than faster, and that’s without any dynamic VBO updates (you laugh, but we still have customers in the field with these and thus have to support them; these cards are only ~3yrs old and our customers use lots of GPUs). For recent gen cards, it’s getting easier to be faster with VBOs, though still possible to find cases where VBOs lag. Batch setup seems more expensive with them than client arrays.

VBO updates aside though, I will say I am pleased with VBO performance on recent cards particularly using NVidia’s bindless batch data extension. With that, I can get very near to the performance of their legendary display lists (it’s ~2X slower without bindless). So no doubt NVidia display lists use bindless internally (of course). VBOs+bindless is definitely the future (unless they come up with something even faster :cool:)

The question is, why they push the use of a new feature if it’s not implemented well?

That’s a very good question. VBOs would have been a much easier sell if they hadn’t positively sucked when they were first introduced, which lasted for several generations of cards. They’re still a Ouija board, but the Ouija board has gotten much smaller on recent cards.

Another reason VBOs weren’t such a slam-dunk sell is the vendors did not provide guidance to say specifically “this is how you get the fastest VBO performance on our cards: use permutation A,B,C,F,M,P,R”. And when there was a tip dropped, if you tried it, half the time it was worse performance.

Could you post your exact map code example in a follow-up? I think it’d be useful/informative for a number of folks to run all 3, and let you/us verify that everyone’s seeing similar results on varying GPUs/vendors/drivers on exactly the same code.

Not sure if someone said it before, but what are your GL_MAX_ELEMENTS_VERTICES and GL_MAX_ELEMENTS_INDICES?

Exceeding those limits will give bad results.

Just for kicks, and to come to some VBO upload performance conclusions on modern hardware (at least with this 8MB/upload example), I thought I’d take the original two permutations (same VBO sizes/contents/rendering), and try a few variations for comparison:

  1. 2.163s - Client arrays
  2. 2.801s - BufferData load/reload
  3. 2.876s - BufferData NULL, BufferSubData load
  4. 1.985s - BufferData NULL, MapBuffer load, UnmapBuffer
  5. 2.013s - glMapBufferRange MAP_INVALIDATE load, UnmapBuffer
  6. 2.078s - glMapBufferRange MAP_INVALIDATE load, UnmapBuffer with buffer load/use lag of 2

Test setup:

  • NVidia GTX 285 GPU, Core i7 920 CPU, PCIe 2.0
  • NVidia 190.32 Linux drivers

Option #3 used to be the fastest. But on modern hardware/drivers it’s now the dead slowest :stuck_out_tongue: , at least with this example.

Also, options #4 and #5 can be made ~60ms faster (3%) merely by using fewer buffers (e.g. 1 instead of 3).

It’s interesting to note this nets an upload rate of ~2 GB/sec (6.4 GB/sec practical max PCIe2, 8.0 GB/sec theoretical max, 8.3 GB/sec theoretical max CPU mem).