Best draw method for small multipass scene?


I am working on a simple fps engine that uses a map scheme similar to the original doom (or marathon, for those that have played it), so that the levels can be created and edited in a very simple and real-time manner. Basically a map consists of a set of 2d polygons that share boundries; each poly has a ceiling and floor height for the 3d view. I’ll be doing multipass rendering with per-fragment lighting and stencil shadows etc - so basically I’m doing very taxing and repeated operations on a small geometry set.

My question is this: what is the best way to pass this geometry to the GPU? Because my visibilty algorithm is very aggresive, almost no triangles will be drawn that are not onscreen or even occluded (this is a natural benefit of using my simple map scheme), but it means that the geometry set is for all practical purposes changing in size every frame. Another issue is that a given “map polygon” (i.e. one of the 2d polys i mentioned above), may have different materials for the floor, ceiling, and each wall, making minimizing state changes at render time challenging. So to simplify my problem statement, I have a whole bunch of small batches (<100 or so usually), each of which consist of something like 2-10 triangles. Many share materials, so it would seem advantageous to group them somehow each frame.

I’ve considered a number of options of course, but none of them seem like good solutions in all respects,

  • VBO’s are fast and great for multipass rendering, but using them for anything but large-ish static objects is not beneficial in my understanding

  • Simple vertex arrays [i.e. glDrawArrays or the like] seem the best bet sofar, but organizing all my polygon data into arrays by material every frame seems prohibitive. I could do a very large number of small arrays, but this too seems wasteful. Would indexed arrays perhaps be better somehow?

  • Good ol’ immediate mode is certainly handy for dynamic stuff, but I fear to do so because of the tremendous API overhead incurred with so many calls.

I am also debating what primitive type to use; triangles, triangle fans, tri strips, polygons? quads?? All of my batches will be flat and convex, if that helps.

Also, should I even be worrying about this? If my complex lighting and stenciling is acting as a large bottleneck, should I even care if I’m sending 100-1000 triangles in a semi inefficient manner for each light every frame?

So I welcome your advice and suggestions.

If the primitive count is low, the hands-down winner is triangle lists; it’s easiest to manage and nothing else performs better.

Depending on the size of your level, you could stick ALL vertices in a single array, and then build a new index list for each frame. The run-time data for a polygon is then [material pointer, index list], and the vertex set is static and assumed. Traversing visible geometry then means first clearing out the list of indices per material, then traversing all visible floor polys (using portals), and adding the index list for each poly to the material pointed at by that poly. In the end, you traverse the materials, and draw those one at a time, which takes care of sorting.

If your level is really Marathon-sized (8 verts per floor poly, can’t see across more than 8 floor polys at a time, etc) then you can probably go ahead and just issue geometry with glVertex() as you traverse it, or copy pointers to vertices into a list per material, and then traverse the materials, issuing geometry with glVertex(). If you stay under 5,000 vertices per frame on high-end hardware, or 1,000 vertices per frame on middling hardware (GF2 / 1.0 GHz CPU), this is probably not going to cost you anything, and might actually be among the faster. At that point, you may be better off using triangle fans for the floor/ceiling polys, and quads for the walls.

Thanks for your advice;

I had considered the index array strategy you mentioned, but my concern was that when you send the array of indices and the array of coordinates to opengl that it sends ALL the coordinates through the vertex engine and then afterword forms triangles based on the indices (hence providing the speedup of not transforming the same vertex twice). This would be bad for me obviously if i had all my geometry coordinates in a single array, since they would all be transferred and transformed regardless of the indices actually used.

Does anyone know one way or another how opengl (or a gl driver) handles this?

i, for one, wouldn’t want to depend on a particular implementation’s details. jwatte’s advice is sound for any implementation. that’s what i would shoot for: something that’s good across the board. suppose you know the details, what then? what happens in 2 years when the details are different? like the cpu l2 cache being moved into the architecture, and an added l3… whoda figured?

i don’t know the anwer to your question though. i’m reasonably sure it’s safe to assume cards have have some kind of vertex cache. but the size and organization are unkown to me. if the api is designed well, then those details should be managed neatly by the interface. that is, it should be possible to make intelligent use of the api to stream vertices efficiently, without any prior knowledge of the implementaiton. if not, then it’s a poor design, imho.

Originally posted by Jeff Russell:
my concern was that when you send the array of indices and the array of coordinates to opengl that it sends ALL the coordinates through the vertex engine and then afterword forms triangles based on the indices (hence providing the speedup of not transforming the same vertex twice).
That’s not quite how it works. The hardware should not transform vertices that aren’t actually referenced. Duplicate transformations are avoided to some extent by use of a small vertex cache (IIRC it was 24 vertices on GeForce3, not sure about newer cards). What this means is that if you specify the same index twice in a row, the vertex will be in the cache and won’t have to be retransformed.

Obviously, if the two references of the same index are too far apart, the vertex may have disappeared from the cache and will need to be retransformed. Also, because indices are compared instead of actual vertices, you have to use indexed triangle lists/strips/… in order to benefit from the vertex cache.

– Tom

If you use LockArrays() or DrawRangeElements() on a software-transform architecture, it may first transform all the vertices, and then pick them apart using the index list.

However, with any modern HT&L, this isn’t happening anymore. For GeForces, with regular vertex arrays, we’re told that they will walk the index list, expand the referenced data into DrawArrays-style arrays, and then send it to the card. For the LockArrays or DrawRangeElement case, as well as VBO and VAR, they probably copy the indicated range to a buffer, and actually send indices to the card.

I have little insight into what the ATI story is, but I wouldn’t be surprised if it was somewhat similar.

I have little insight into what the ATI story is, but I wouldn’t be surprised if it was somewhat similar.
ATi cards of R300 calibur or better (possibly R200, but I’m not sure) do not need to have array indices copied into the FIFO directly the way older GeForces (not sure about FX’s) needed to. The FIFO, in this case, would contain a pointer to the index buffer and the number of indices to use.

One problem with efficient geometry is that it pushes the bottleneck to the CPU, particularly if the CPU is busy doing other stuff. Sophisticated triangle culling might be overkill in this case; a simpler, less aggressive triangle culling method might give you better performance by balancing the work between the GPU and CPU.


if you’re keen on thumb-twiddling, a precomputed visibility matrix is awfully hard to beat for trivial culling, particularly for doom type worlds. it’s likely this would be faster than portal rendering, as it only requires a runtime LUT to determine leaf-to-leaf or leaf-to-cluster visiblity. having tried both cluster-portal rendering and precomputed vis, i can personally attest to the latter’s efficiency. afaik, doom3 is still using precomputed vis, while unreal opts for cluster(zone) portals. i guess it’s a trade-off in designs, preprocessing time vs. runtime performance.

Visibility look up tables are a good fit with my project - I have finalized the decision to use them. I was going to do a sort of portal method as mentioned above, but really there is no reason for this, other than it would result in a few less frustum visibility checks, which really are pretty cheap anyway. The pre-processing time for a LUT will be trivial for these types of maps, in fact I may even do it at load time to simplify my file format.

As for actual drawing however, I am still stuck between vertex arrays, indexed vertex arrays, and immediate mode. I guess I will just have to try test cases for all 3 and see what is best.

personally, i’d go with the indexed vertex arrays. if you’re going with the vis table, then you’ll likely be rendering in leaf order. this is really nice for BSP based worlds, since you can zfill without a ztest in a post-order traversal. you can pre-sort the materials in each leaf, and strip/fan all surfaces by material (if possible). state changes are a factor too, and this could obviate quite a few of them. finally, if you don’t care about the draw order, you can merge each leaf’s material list, and batch fans/strips globally by material.