Batching and VBOs

Originally posted by Zulfiqar Malik:
I have tried doing glVertexPointer(…) just once at the start and not doing it each frame, but that doesn’t give any speedup at all!
This won’t change anything, since you use a single VBO with a single array. You can really use several VBOs, each with several arrays, without noticing any performance issues (though this can depend on several factors).

Originally posted by Zulfiqar Malik:
Tried with a couple of earlier versions of nVidia’s display drivers, but to no effect. Getting almost the same results :frowning: .

This scenario is driving me nuts, I’m gonna post it on the nvdeveloper forums.
Exactly what is the size of the VBO you’re using?
When I did my benchmarking, I found there was a limit to the size of VBO before performance started to suffer. Split up your 1 million verts into several buffers and do the test again.

Finally got it fixed!
Luckily I found a very small reference regarding VBO performance in the GPU Programming Guide on nVidia’s website, which mentions that VBO performance on nVidia hardware is optimal when batches of 64K are used, and that bigger batches can hurt performance. I did just that, and the triangle throughput increased from 40 MTris/s to 110 MTris/s at 640x480. Still a little less than I would have liked, but good enough :slight_smile: . I wonder if there is a similar limit (a limit in terms of performance) on ATi’s hardware, because I couldn’t find any references on ATi’s website (Humus?). Nonetheless, the problem has been fixed.
These graphics companies keep saying that bigger batches are better, but they never mention the upper limit for optimal performance, except for that one tiny reference in the GPU Programming Guide. I think they should be more vocal about this!
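
For the record, here is a minimal sketch of what the fix looks like in code, assuming a GL 1.5 style header or extension loader; MAX_BATCH_VERTS, Vec3 and uploadInBatches are names made up for this example:

#include <GL/gl.h>
#include <stddef.h>

#define MAX_BATCH_VERTS 65536 // the ~64K sweet spot from the GPU Programming Guide

typedef struct { float x, y, z; } Vec3;

// Upload numVerts vertices as a series of static VBOs holding at most
// MAX_BATCH_VERTS vertices each; returns the number of batches created.
// 'vbos' must have room for one buffer id per batch.
static int uploadInBatches(const Vec3 *verts, size_t numVerts, GLuint *vbos)
{
    int numBatches = (int)((numVerts + MAX_BATCH_VERTS - 1) / MAX_BATCH_VERTS);
    glGenBuffers(numBatches, vbos);

    for (int i = 0; i < numBatches; ++i) {
        size_t first = (size_t)i * MAX_BATCH_VERTS;
        size_t count = numVerts - first;
        if (count > MAX_BATCH_VERTS)
            count = MAX_BATCH_VERTS;

        glBindBuffer(GL_ARRAY_BUFFER, vbos[i]);
        glBufferData(GL_ARRAY_BUFFER, (GLsizeiptr)(count * sizeof(Vec3)),
                     verts + first, GL_STATIC_DRAW);
    }
    glBindBuffer(GL_ARRAY_BUFFER, 0);
    return numBatches;
}

Drawing then binds each batch in turn and issues one draw call per VBO, so indices (if used) always stay below 64K.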

Yep, that’s exactly what I found when doing a terrain engine. I’m pretty sure nVidia mention it in their various PDFs.
You’ll find the same with texture sizes: there’s a sweet spot of texture size when uploading texture data dynamically.
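
One way to exploit such a sweet spot when streaming texture data is to update in strips of a tunable size with glTexSubImage2D; a rough sketch (the strip approach and the names are illustrative, not exactly what I used):

#include <GL/gl.h>
#include <stddef.h>

#define STRIP_ROWS 64 // rows per upload; this is the tunable "sweet spot"

// Stream a w x h RGBA image into an existing texture in horizontal
// strips instead of one big glTexSubImage2D call.
static void uploadInStrips(GLuint tex, GLsizei w, GLsizei h,
                           const unsigned char *pixels)
{
    glBindTexture(GL_TEXTURE_2D, tex);
    for (GLsizei y = 0; y < h; y += STRIP_ROWS) {
        GLsizei rows = (h - y < STRIP_ROWS) ? (h - y) : STRIP_ROWS;
        glTexSubImage2D(GL_TEXTURE_2D, 0, 0, y, w, rows,
                        GL_RGBA, GL_UNSIGNED_BYTE,
                        pixels + (size_t)y * (size_t)w * 4);
    }
}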

If you gradually increase the batch size from 64k, does performance suddenly drop or is it a gradual thing?

I wonder if VAR had this issue. I suspect not.

Out of curiosity, what is the AGP speed of your motherboard? 4x or 8x?

@Zulfiqar: did you try to align your vertex data on 32-byte boundaries, e.g. by padding each vertex with unused components:

struct Vertex {
    float x, y, z; // position
    float w;       // always == 1.0f
    float pad[4];  // unused; pads sizeof(Vertex) to 32 bytes
};

@knackered: what is your sweet spot w.r.t. texture sizes and texture formats (e.g. precompressed S3TC)? :slight_smile:

Originally posted by Adrian:
If you gradually increase the batch size from 64k, does performance suddenly drop or is it a gradual thing?

I wonder if VAR had this issue. I suspect not.

Out of curiosity, what is the AGP speed of your motherboard? 4x or 8x?
Since (if I remember correctly) he uses static arrays, this won’t change anything except the upload time, which won’t be noticeable anyway.

Originally posted by jide:
Since (if I remember correctly) he uses static arrays, this won’t change anything except the upload time, which won’t be noticeable anyway.
And yet his last post, which is four posts above the one I’m quoting now btw, proves the opposite :rolleyes:

Edit for clarification:
This is the proven-wrong part: “won’t change anything except the upload time”.

@Adrian: Yes, performance dropped suddenly. In one instance I was using batches of slightly less than 64K and getting close to 100 MTris/s; when I increased the batch size to around 66K, performance dropped to around 40 MTris/s.

@Hampel: Yes, I did align the data on a 32-byte boundary, with no performance improvement. Secondly, I used three short vectors (a total of 18 bytes per vertex), and wasting 14 bytes per vertex just to get the alignment right wasn’t much of an option for me :slight_smile: . Memory usage in my case is critical! I don’t know whether the driver will internally align it to 32 bytes or not, but I think it should not!
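
Roughly, such a layout looks like this; the exact attributes are beside the point, the arithmetic is what matters: 18 bytes per vertex, so reaching a 32-byte boundary would waste 14 bytes per vertex:

typedef struct {
    short pos[3];    // position,        6 bytes
    short normal[3]; // normal,          6 bytes
    short aux[3];    // third attribute, 6 bytes
} PackedVertex;      // sizeof == 18: every member is a short,
                     // so the compiler inserts no padding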

@zeckensack: Jide is right; I specified the batch size at compile time, and my vertex buffers were always static :wink: .

Originally posted by jide:
Since (if I remember correctly) he uses static arrays, this won’t change anything except the upload time, which won’t be noticeable anyway.
What if the static arrays are being stored in AGP memory and not video memory? Then AGP speed does matter, since the vertices will be uploaded every time they are drawn. What I think is happening is that batches over 64k are put in AGP memory, and smaller ones in video memory. Zulfiqar, is your motherboard AGP 4x or 8x? If it is 4x then I think my hunch is right; if it’s 8x then it probably isn’t.
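
For what it’s worth, the only portable knob for that placement is the usage hint passed to glBufferData; something like this (the helper name is made up):

#include <GL/gl.h>

// GL_STATIC_DRAW asks for placement optimized for drawing (typically
// video memory); GL_DYNAMIC_DRAW or GL_STREAM_DRAW make AGP placement
// more likely. It is only a hint, though: whether a >64K batch gets
// demoted to AGP memory is entirely up to the driver.
static void uploadStatic(GLuint vbo, GLsizeiptr size, const void *data)
{
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glBufferData(GL_ARRAY_BUFFER, size, data, GL_STATIC_DRAW);
}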

What you say is pertinent, Adrian. However, I can’t see why the data would be stored in AGP memory, especially as he is not using any texturing at all, so the graphics card is using very little memory.

Generally (from what I know), the AGP aperture is only used when there isn’t enough memory left on the graphics card. Trying to limit (or better, disable) AGP memory might help to find out…

Why would memory be allocated in AGP instead of on the graphics card if batches are larger than 64KB? Does anyone have information about this?

@Adrian: I have tried these scenarios on two machines: a GeForce FX 5700 Ultra (AGP 4x) on an AGP 4x motherboard, and a GeForce FX 5900 XT (AGP 8x) on an AGP 8x motherboard. On both machines I get exactly the same throughput (in terms of raw triangle count) depending on batch size.

I was just wondering: maybe someone can test this on an ATi card, or is there something documented online about the best batch size for ATi cards? I do have a 9700 Pro machine in the office and will test it out on that, but perhaps in a couple of days from now.

I can’t help but agree with jide: with the video memory being almost free, the driver should use it whether the batch size is optimal or not, and resort to AGP memory only when video memory is exhausted, or at least when the maximum amount of memory dedicated to vertex data is exhausted (if there is such a thing as a maximum memory for vertex buffers!).

Maybe some engineer from nVidia’s driver team can answer this question, but I haven’t gotten any response so far on the nvdeveloper forum.

That information suggests it’s not to do with AGP/video memory, since I would have expected a difference in performance between the AGP 8x and AGP 4x motherboards.

The days of nVidia driver writers frequenting these forums are long since gone. We used to have Cass Everitt and Matt Craighead at our beck and call, but not anymore… those were the days, my friend, when register combiners ruled.
The very most you can hope for nowadays is that Humus (who now works for ATI, in what capacity I don’t know) might pop upstairs to the driver writers and ask them what they think.
Or you can all just carry on guessing why there’d be a 64K limit on a VBO batch on nVidia hardware.

I think it’s as simple as this: you can’t address more than 64K vertices using ushort indices, and the hardware is only optimized for that case.
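
One way to check that theory: GL 1.2’s glDrawRangeElements comes with companion queries where the driver advertises its preferred per-draw sizes. A sketch, assuming a GL 1.2+ header:

#include <GL/gl.h>
#include <stdio.h>

// Query the limits the driver recommends for glDrawRangeElements. If the
// ushort theory is right, maxVerts should come back around 65536 here.
static void printRangeLimits(void)
{
    GLint maxVerts = 0, maxIndices = 0;
    glGetIntegerv(GL_MAX_ELEMENTS_VERTICES, &maxVerts);
    glGetIntegerv(GL_MAX_ELEMENTS_INDICES, &maxIndices);
    printf("recommended per-draw range: %d vertices, %d indices\n",
           (int)maxVerts, (int)maxIndices);
}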

@tamlin: I tried all sorts of variations, my friend: ints, shorts, and similarly for the vertex data (floats, integers, shorts, etc.). I tried quite a few permutations, and each time I got the same results. I would still have been lost had I not found that GPU Programming Guide excerpt. Do get it from nVidia’s site and search for VBOs; there are just a few lines on them, and one mentions 64K as being the most optimal (NOT a LIMIT!) batch size for nVidia hardware.

Originally posted by Zulfiqar Malik:
I would still have been lost had I not found that GPU Programming Guide excerpt.
Err, well no, you wouldn’t have been lost because I told you the problem a couple of posts ago. It’s actually a fairly well known fact, and a quick search in these forums would have given you the answer in minutes.

hahaha, sorry knackered, totally forgot about you. Thanks for reminding me.

Originally posted by Zulfiqar Malik:
I tried all sorts of variations, my friend
Seems I wasn’t clear enough: I was referring to the 64K-vertex threshold. I can easily imagine the hardware having hard-coded lookup tables for popular vertex sizes with unsigned short indices, but for more unusual vertex sizes, or (what I was thinking of) more than 64K vertices, it would take a not-so-well-optimized path in the hardware.

Unless it’s already in the pipe(s), I think this is something hardware vendors should start to seriously consider, especially now that they have 1/4 – 1/2 GB of RAM on board and more seems likely to come. For some cases today, and likely many more in the future, 64K vertices per VBO will not be enough. It therefore makes sense (to me) to raise that 16-bit index limit to at least the next level, 24 bits (that should last a while, no? :slight_smile: ), even if the indices by then must be 32 bits (2^n is a bitch at times).
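
Of course, the API itself already accepts 32-bit indices; it’s only the hardware fast path that’s in question. A sketch of the obvious fallback (names made up, and the index buffer must actually contain data of the matching type):

#include <GL/gl.h>
#include <stddef.h>

// Draw one batch with 16-bit indices while it fits, falling back to
// 32-bit indices above 64K vertices. numVerts, numIndices and the bound
// index buffer are assumed to be set up elsewhere.
static void drawBatch(size_t numVerts, GLsizei numIndices)
{
    GLenum type = (numVerts <= 65536) ? GL_UNSIGNED_SHORT : GL_UNSIGNED_INT;
    glDrawRangeElements(GL_TRIANGLES, 0, (GLuint)(numVerts - 1),
                        numIndices, type, (const void *)0);
}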