This is true for D3D, but not OpenGL. Draw calls are expensive in D3D because they are done in kernel mode. GL calls are in user mode, and therefore relatively lightweight.
I had always assumed that the paper was discussing a hardware problem, not a software/API one. Granted, until I read that paper, I had assumed that a glDrawElements call, without indices in AGP/video memory, would only do a quick index copy to AGP, add a few tokens to the card's command stream, and return. Which, of course, didn't account for the problems with D3D's calls. I figured that, for reasons that would require intimate hardware knowledge, there needed to be some explicit synchronization event or something of that nature.
Hmmm… this changes much…
Specifically, it is an implementation detail that is invisible to the user, and would therefore never be specified.
What I would like to have is consistent performance. Regular vertex arrays do give consistent performance… consistently slow.
I would much rather see the driver throw an error or something than have it page VBOs out to system memory. Why? Because that does me little good.
One of the primary purposes behind extensions like VBO is to prevent that system-to-AGP/video memory copy that takes place with regular vertex arrays. Now, you’re basically saying that VBO may, or may not, prevent that copy. It all depends.
It would be very nice if there were an explicit way to let the driver know not to page VBOs out to system memory.
In any case, you want to give the driver the opportunity to lay these things out in memory the best possible way.
The best possible way to lay things out is to put all static VBOs into video memory and all non-static ones into AGP. Putting either into system memory does precious little for performance.
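For what it's worth, the usage hints in ARB_vertex_buffer_object are the mechanism we have for expressing that intent. A minimal sketch, using the ARB entry points (the placement behavior in the comments is what a good driver *ought* to do, not anything the spec guarantees):

```c
#include <GL/gl.h>
#include <GL/glext.h>  /* ARB_vertex_buffer_object tokens and typedefs */

/* Sketch: create one static and one per-frame VBO with usage hints.
   Assumes a current GL context and that the ARB_vertex_buffer_object
   entry points have been loaded (e.g. via wglGetProcAddress).
   Where the driver actually places each buffer (video/AGP/system
   memory) is unspecified -- the hints are only a request. */
void create_buffers(const void *staticVerts, GLsizeiptrARB staticSize,
                    GLsizeiptrARB streamSize, GLuint out[2])
{
    glGenBuffersARB(2, out);

    /* Static geometry: specify once, draw many times.
       A good driver would put this in video memory. */
    glBindBufferARB(GL_ARRAY_BUFFER_ARB, out[0]);
    glBufferDataARB(GL_ARRAY_BUFFER_ARB, staticSize, staticVerts,
                    GL_STATIC_DRAW_ARB);

    /* Per-frame data: respecified every frame.
       A good driver would put this in AGP memory. */
    glBindBufferARB(GL_ARRAY_BUFFER_ARB, out[1]);
    glBufferDataARB(GL_ARRAY_BUFFER_ARB, streamSize, NULL,
                    GL_STREAM_DRAW_ARB);
}
```

But, again, the hints are only hints; nothing stops a driver from ignoring them and paging things out anyway, which is exactly the complaint.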
If you can render the model multiple times while the L&H vectors don't change, you'd better compute them on the CPU once and then reuse the values.
I don’t know about that. By computing them on the GPU:
- you save the bandwidth of sending them. That's six fewer floats, or 24 fewer bytes, per vertex. This bandwidth could go to more texture fetches.
- you get more consistent performance. The worst case of the CPU approach is (likely) worse than the worst case of the GPU approach. Obviously, the best-case CPU is better than the best-case GPU (for vertex T&L, not transfer). Indeed, the worst-case GPU is the same as the best-case GPU. So, while you may be getting less performance, you're at least getting consistent performance per frame, which is often better than sometimes-good/sometimes-bad performance.
- you don't have to create dynamic or streaming VBOs. They can all be purely static data. And therein lies the possibility for greater vertex throughput (or, at least, more vertex bandwidth).
- you get more time on the CPU for those sorts of tasks.
- GPUs get faster more quickly than CPUs do. As such, relying on shader performance now makes things easier in the future.
Will anything be done about the 32 MB limit that one runs into? Even with VBO I still cannot create a single vertex array of more than 32 MB or several arrays whose total is greater than 32 MB.
That, I find completely unacceptable. I can live with only being able to allocate, at most, a few thousand VBOs, but not being able to allocate more than 32 MB of memory in total? No, that is just unacceptable and must be rectified.
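At the very least, the limit ought to be detectable at runtime. A hedged sketch of my own for probing it (nothing in the spec promises this behavior; glBufferDataARB may report GL_OUT_OF_MEMORY, or the driver may silently fall back to system memory, in which case this loop never breaks):

```c
#include <GL/gl.h>
#include <GL/glext.h>  /* ARB_vertex_buffer_object tokens */

/* Probe sketch: keep allocating fixed-size static VBOs until the
   driver reports GL_OUT_OF_MEMORY. Assumes a current GL context and
   loaded ARB_vertex_buffer_object entry points. Returns the number
   of 1 MB chunks that were successfully allocated. */
int probe_vbo_limit_mb(void)
{
    enum { CHUNK = 1 << 20, MAX_CHUNKS = 256 };  /* 1 MB chunks */
    GLuint bufs[MAX_CHUNKS];
    int i;

    glGenBuffersARB(MAX_CHUNKS, bufs);
    for (i = 0; i < MAX_CHUNKS; ++i) {
        glBindBufferARB(GL_ARRAY_BUFFER_ARB, bufs[i]);
        glBufferDataARB(GL_ARRAY_BUFFER_ARB, CHUNK, NULL,
                        GL_STATIC_DRAW_ARB);
        if (glGetError() == GL_OUT_OF_MEMORY)
            break;  /* hit the driver's limit at i MB */
    }
    glDeleteBuffersARB(MAX_CHUNKS, bufs);
    return i;
}
```

On the hardware described above, a loop like this would presumably stop at around 32 chunks, which is exactly the behavior being complained about.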
[This message has been edited by Korval (edited 05-20-2003).]