What is the optimal size for vertex arrays passed to glDrawElements?

I know that with Direct3D, nVidia did testing with their GeForce2 cards and found that vertex buffers with around 4096 vertices were the fastest; using fewer but larger vbuffers was slower, as was using smaller but more numerous vbuffers.

Source (Excel): http://developer.nvidia.com/docs/IO/1319/ATT/vb+stats2.xls

I’m wondering if it matters what the vertex array size is when passed to glDrawElements. Is it alright just passing arrays of 65536 vertices, or might it be faster to split the array into a few smaller arrays?

There are two implementation-dependent GL variables that give this information:

1- the maximum recommended number of glDrawElements() indices:

int maxI;
glGetIntegerv(GL_MAX_ELEMENTS_INDICES, &maxI);

2- the maximum recommended number of glDrawElements() vertices:

int maxV;
glGetIntegerv(GL_MAX_ELEMENTS_VERTICES, &maxV);

I think this is what you’re looking for.

[This message has been edited by MPech (edited 04-27-2002).]

Thanks! It said 4096 for both, with my GeForce2. Perfect. I had to use GL_MAX_ELEMENTS_VERTICES_WIN and GL_MAX_ELEMENTS_INDICES_WIN though, as the two you said were not defined. Does WIN mean for the Windows platform, or for windowed mode?

Strange… I use these constants on both Linux and Windows platforms. Never heard of *_WIN, but it may refer to the platform and not the mode.

Where are your GL include files coming from?
I use the ones that ship with Visual C++.

The Red Book says GL_MAX_ELEMENTS_INDICES and GL_MAX_ELEMENTS_VERTICES are only for glDrawRangeElements(), not glDrawElements().

But I could be misinterpreting.

I know those two constants can be defined as *_WIN. I remember having the same problem.

I believe I downloaded them from microsoft.com relatively recently. Anyway it doesn’t matter what they’re called, only what they are: 0x80E8 and 0x80E9. What are the values of the non-_WIN ones?
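For headers that only ship the _WIN names (or neither), here is a minimal sketch of a fallback. It assumes the two values quoted above, 0x80E8 for the vertex limit and 0x80E9 for the index limit, and that the _WIN names are just aliases for the same enums. The helper check_element_limit_enums() is a hypothetical name used only for illustration.

```c
#include <assert.h>

/* Prefer the standard names; fall back to the _WIN aliases if the header
   only provides those, and to the raw enum values otherwise. */
#ifndef GL_MAX_ELEMENTS_VERTICES
#  ifdef GL_MAX_ELEMENTS_VERTICES_WIN
#    define GL_MAX_ELEMENTS_VERTICES GL_MAX_ELEMENTS_VERTICES_WIN
#  else
#    define GL_MAX_ELEMENTS_VERTICES 0x80E8
#  endif
#endif

#ifndef GL_MAX_ELEMENTS_INDICES
#  ifdef GL_MAX_ELEMENTS_INDICES_WIN
#    define GL_MAX_ELEMENTS_INDICES GL_MAX_ELEMENTS_INDICES_WIN
#  else
#    define GL_MAX_ELEMENTS_INDICES 0x80E9
#  endif
#endif

/* Hypothetical sanity check: whichever branch was taken above, the names
   should resolve to the values quoted in this thread. */
void check_element_limit_enums(void)
{
    assert(GL_MAX_ELEMENTS_VERTICES == 0x80E8);
    assert(GL_MAX_ELEMENTS_INDICES == 0x80E9);
}
```

With these in place, the glGetIntegerv() queries compile the same way against either set of headers.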

I have the Red Book, and those constants are under the description of glDrawRangeElements; you’re right, GPSnoopy.

I think the reason they didn’t print it under glDrawElements() is that they assume you’ll use that for an entire unbroken array of data and that you wouldn’t be splitting the array into smaller ones beforehand. If the code assumes that, maybe it would be better using arrays of 65536 vertices and rendering 4096 at a time with glDrawRangeElements…
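To make that idea concrete, here is a rough sketch of splitting one big index array into batches for glDrawRangeElements. It is only an illustration under some assumptions: the indices are GL_UNSIGNED_SHORT triangles, max_indices is whatever the GL_MAX_ELEMENTS_INDICES query returned, and Batch and split_batches() are made-up names. It only bounds the index count per batch; it does not try to keep the start..end vertex span under the vertex limit.

```c
#include <assert.h>
#include <stddef.h>

/* One batch: a slice of the index array plus the smallest and largest
   vertex index that slice references (the "start" and "end" arguments
   of glDrawRangeElements). */
typedef struct {
    size_t first;          /* offset of the batch's first index */
    size_t count;          /* number of indices in the batch */
    unsigned short start;  /* minimum vertex index in the batch */
    unsigned short end;    /* maximum vertex index in the batch */
} Batch;

/* Split n triangle indices into batches of at most max_indices indices,
   rounded down to a multiple of 3 so no triangle straddles two batches.
   Writes at most out_cap batches and returns how many were written. */
size_t split_batches(const unsigned short *indices, size_t n,
                     size_t max_indices, Batch *out, size_t out_cap)
{
    size_t step = max_indices - max_indices % 3;
    size_t written = 0;
    size_t first;

    assert(step >= 3);

    for (first = 0; first < n && written < out_cap; first += step) {
        size_t count = (n - first < step) ? n - first : step;
        unsigned short lo = indices[first];
        unsigned short hi = indices[first];
        size_t i;

        for (i = 1; i < count; ++i) {
            unsigned short v = indices[first + i];
            if (v < lo) lo = v;
            if (v > hi) hi = v;
        }

        out[written].first = first;
        out[written].count = count;
        out[written].start = lo;
        out[written].end = hi;
        ++written;
    }
    return written;
}
```

Each batch b would then be drawn with something like glDrawRangeElements(GL_TRIANGLES, b.start, b.end, (GLsizei)b.count, GL_UNSIGNED_SHORT, indices + b.first). Note that a limit of 4096 indices becomes 4095 per batch here, since 4096 is not a multiple of 3.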

How does rendering more than 4096 tris work out slower? Do you just mean it ties up the CPU more, so you lose parallelism between CPU and GPU?

I fail to see how the GPU would process polys slower just because they are in large batches.

Anyone care to enlighten me as to how?



Suppose there’s a limit to the size of the currently active scatter/gather table for the card’s DMA engine? Another way to think of it is that the card may implement its own MMU, and there’s a limit to the number of TLB entries.

Perhaps there’s also some limitation on some counter/register somewhere that can’t go higher than 12 bits in one go, so any count greater than that has to be split in two, meaning the driver has to wait for an interrupt, or at least queue a second command, for the second half of a buffer with 4097 items in it.

I’m sure we could come up with more plausible explanations if we thought a little more about it.

These numbers are not particularly meaningful.

If you’re spooling out dynamic geometry, the main issue is the CPU cache size. (VAR, i.e. NV_vertex_array_range, solves this problem by using uncached memory.) How you lay things out, and how that collides in the cache, is pretty much beyond our control.

VAR has its own max # of vertices – you can query the max index, which is 2^16-1 on NV1x and 2^20-1 on NV2x.

Then there’s all the caching of vertices post-T&L, which is of course <<4096 vertices.

And for CVA (compiled vertex arrays) and DRE (glDrawRangeElements), our buffers to copy the vertices into are of limited size, but we can’t meaningfully expose that because different vertex formats use different numbers of bytes! Think of it as a VAR implementation internal to the driver.

For the record, I do want to remind people that if you’re using UNSIGNED_INT indices with VAR, DRE can be a definite win. In short (bad pun), the reason is the 2^20-1 index limit. If your “end” value is <=65535, then we know that your unsigned int indices can really be copied as shorts, which saves memory bandwidth. (If you use UNSIGNED_SHORT, we already know that by default; and UNSIGNED_SHORT, of course, already saves bandwidth by virtue of being smaller.)
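The narrowing described above can be pictured with a small sketch. This is not the driver's actual code, just an illustration of the idea: if the caller's "end" value is at most 65535, every 32-bit index is known to fit in 16 bits and can be copied as a short. The function name narrow_indices() is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copy 32-bit indices into a 16-bit buffer when the promised upper bound
   "end" allows it. Returns 1 if the copy was done, 0 if "end" is too
   large and the indices must stay 32-bit. */
int narrow_indices(const uint32_t *src, size_t n, uint32_t end,
                   uint16_t *dst)
{
    size_t i;

    if (end > 65535u)
        return 0;

    for (i = 0; i < n; ++i) {
        /* The caller promised this range, as with the "end" argument
           of glDrawRangeElements. */
        assert(src[i] <= end);
        dst[i] = (uint16_t)src[i];
    }
    return 1;
}
```

Without a trustworthy "end" value, the same decision would need a full scan of the index array first, which is the cost the range hint avoids.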

  • Matt