I see batching as among the biggest problems in the future of graphics. As worlds and characters get ever more complex, issuing all the draw commands will weigh heavily on performance.
You seem to misunderstand my point.
Let’s say you have a program with plenty of CPU time to spare, so you decide to change your stripping: wherever possible, you split every 1 strip into 5. Hence, you will need to call glDraw* 5x more often than before.
Assuming that the program was vertex-transfer limited to begin with, performance will drop primarily because of caching behavior around the vertex data. That is, the card works best when it reads a long, unbroken string of indices. You can mitigate this easily enough by putting the indices into one contiguous array (see the sketch after this list). At that point, the performance penalty can come from only 3 potential places:
1: Function call overhead on glDraw*. In our case, we have plenty of CPU time, so this is negligible.
2: Driver marshalling of GPU commands. The boneheaded way of implementing glDraw* is to immediately put the commands into the GPU’s FIFO, which could require a switch to Ring0 on the CPU (a slow operation). Few GL drivers do it this way; they marshal GPU commands pretty efficiently these days.
3: Some oddball GPU problem. For whatever reason, the GPU has some significant delay between primitive batches. I have no factual, or even speculative, reason why a significant delay would exist.
1 is trivial, 3 doesn’t exist, and drivers are pretty good at 2. Where’s the batching problem?
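To make the thought experiment concrete, here is a minimal sketch of the mitigated case: one strip’s indices kept in a single contiguous array, but issued as five glDrawElements calls instead of one. The function and names are made up for illustration, and the index overlap the pieces would need at their boundaries is glossed over.

```c
#include <GL/gl.h>

#define PIECES 5

/* Draw one strip's worth of indices as PIECES separate calls.
 * Each call reads a consecutive run of the same client-side array,
 * so the card still sees long, unbroken strings of indices.
 * (The 2-index overlap each piece would need to reproduce exactly
 * the same triangles, and remainder handling, are omitted.) */
void draw_split_strip(const GLushort *stripIndices, GLsizei indexCount)
{
    GLsizei per = indexCount / PIECES;
    for (int i = 0; i < PIECES; ++i) {
        glDrawElements(GL_TRIANGLE_STRIP, per, GL_UNSIGNED_SHORT,
                       stripIndices + i * per);
    }
}
```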
Now, you might have read a PDF on nVidia’s site about the importance of batching primitives. It suggests taking drastic measures to get large batches of primitives, because a 1GHz CPU only gets you something like 10,000 batches per second. That PDF refers only to D3D, because D3D can’t do #2 well at all; it has to use the “boneheaded” method, because of how the D3D driver model works. GL drivers can, and do, perform appropriate marshalling of GL commands.
None of this is to say that you can send a mesh as a sequence of 1-triangle glDraw* calls. While #3 may not be significant, it is still there, and when rendering large numbers of polygons it can add up quickly. But at realistic batch counts, it is quite negligible.
Note that this assumes the use of VBO index buffers as well as ATi hardware. I’m not sure about FX hardware, but I do recall that nVidia hardware through the GeForce 4 definitely had issues with the concept of index buffers. While they support VBO index buffers well enough, it is clearly stated that the buffer object containing the indices should be a different object from the one containing the actual mesh data, as this allows for implementations that can’t handle indices in video/AGP memory. The general assumption about this level of nVidia hardware was that the driver, upon receiving a glDraw* command, had to copy the given indices directly into the FIFO/marshal queue, which obviously doesn’t work well if they live in video/AGP memory.
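In practice, that recommendation just means two buffer objects instead of one. A minimal sketch, assuming the GL 1.5 VBO entry points are available; the buffer names and the upload_mesh function are invented for the example.

```c
#define GL_GLEXT_PROTOTYPES 1
#include <GL/gl.h>
#include <GL/glext.h>

GLuint meshBuf, indexBuf;

void upload_mesh(const GLfloat *verts, GLsizeiptr vertBytes,
                 const GLushort *indices, GLsizeiptr indexBytes)
{
    /* Mesh data goes in its own buffer object... */
    glGenBuffers(1, &meshBuf);
    glBindBuffer(GL_ARRAY_BUFFER, meshBuf);
    glBufferData(GL_ARRAY_BUFFER, vertBytes, verts, GL_STATIC_DRAW);

    /* ...and the indices in a second one, so implementations that
     * can't read indices out of video/AGP memory are free to keep
     * this buffer in system memory. */
    glGenBuffers(1, &indexBuf);
    glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, indexBuf);
    glBufferData(GL_ELEMENT_ARRAY_BUFFER, indexBytes, indices,
                 GL_STATIC_DRAW);
}
```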
ATi hardware of R300 caliber or better (if not R200 hardware as well) doesn’t have this limitation. As such, all the driver needs to do for each glDraw* operation is copy a 16-32 byte opcode sequence into the FIFO, telling the GPU where the index buffer is, how long it is, and what format it uses.
It is likely that NV30 fixed the nVidia issue, since NV30 supports primitive restart, which presupposes a better command processor/primitive unit.
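For reference, here is a hedged sketch of what primitive restart looks like through the NV_primitive_restart extension: one reserved index value splits a single glDrawElements call into many strips, so the break between strips is handled by the command processor rather than by extra CPU-side calls. This assumes the extension is present and the entry point has already been fetched (e.g. via wglGetProcAddress); the restart value chosen here is arbitrary.

```c
#include <GL/gl.h>
#include <GL/glext.h>

/* Reserved index value; any occurrence of it in the index array
 * starts a new strip without a new draw call. Arbitrary choice. */
#define RESTART_INDEX 0xFFFF

/* Assumed fetched at startup via the platform's GetProcAddress. */
extern PFNGLPRIMITIVERESTARTINDEXNVPROC glPrimitiveRestartIndexNV;

void draw_strips_with_restart(const GLushort *indices, GLsizei count)
{
    glEnableClientState(GL_PRIMITIVE_RESTART_NV);
    glPrimitiveRestartIndexNV(RESTART_INDEX);

    /* Many strips, one call. */
    glDrawElements(GL_TRIANGLE_STRIP, count, GL_UNSIGNED_SHORT, indices);

    glDisableClientState(GL_PRIMITIVE_RESTART_NV);
}
```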
The batching issue has to be addressed by someone.
The batching issue is, to my mind, resolved by joining strips with degenerate triangles (sketched below), with one exception: triangle fans. I would dearly love to fan my terrain, but I can’t, due to the performance impact.
As such, the only times I make multiple glDraw* calls are for particles or for state changes. My batches tend to be broken up by state changes far more than by anything else.
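For what it’s worth, the degenerate-triangle stitching mentioned above amounts to this; stitch_strips and its parameters are made-up names for the sketch.

```c
#include <GL/gl.h>

/* Appends stripB onto stripA inside 'out' and returns the new count:
 * repeating the last index of A and the first index of B produces
 * zero-area triangles that the GPU culls for free. Assumes stripA
 * has an even number of indices, so the winding order is preserved. */
GLsizei stitch_strips(GLushort *out,
                      const GLushort *stripA, GLsizei countA,
                      const GLushort *stripB, GLsizei countB)
{
    GLsizei n = 0;
    for (GLsizei i = 0; i < countA; ++i) out[n++] = stripA[i];

    out[n++] = stripA[countA - 1];   /* degenerate: repeat last of A  */
    out[n++] = stripB[0];            /* degenerate: repeat first of B */

    for (GLsizei i = 0; i < countB; ++i) out[n++] = stripB[i];
    return n;  /* draw with one glDrawElements(GL_TRIANGLE_STRIP, ...) */
}
```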