How expensive are several passes, when rendering?

HFAFiend · April 7, 2001, 10:07am

How expensive are several passes when rendering? I know they will take some time, but is the cost small enough not to worry about? I ask because there are cases in my engine where textures could be combined on the CPU and then copied back to the video card once every frame (for only a handful of textures, each of which is used on many faces). In other words, would it be more expensive to do 2-3 passes per polygon, or to blend the textures on the CPU (if each texture is reused at least 10-15 times)?

DFrey · April 7, 2001, 11:02am

I would likely choose the multipass route myself.

zed · April 7, 2001, 5:34pm

I can’t really see it being faster to do the blending yourself in software, except maybe if you have a 1 GHz CPU and a pre-1998 graphics card. Sure, it’s going to hurt doing multiple passes, but the card will only have to transform the vertices once. 2-3 passes per polygon means 1-2 passes if you use multitexture.

Won · April 7, 2001, 11:56pm

Actually, in a multipass implementation, the vertices will have to be transformed for each pass. Of course, if you can collapse multiple passes with multi-texturing then it’s a different story. Anyway, this is probably not a big deal unless you are severely geometry-bound.

One case where the CPU blending would be better is if the blended textures were used very often and the blending relationship between the textures changed infrequently. That way, you can blend once (on the CPU) and simply draw the pre-blended texture with a single pass.
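As a rough illustration of the "blend once, reuse many times" idea, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: textures are lists of 8-bit `(r, g, b)` tuples, the blend is a simple modulate (per-channel multiply), and `version` is a hypothetical counter the engine would bump whenever the blend relationship changes. It is not taken from any poster's engine.

```python
def modulate(base, detail):
    """Per-channel multiply of two same-sized 8-bit RGB textures."""
    if len(base) != len(detail):
        raise ValueError("textures must have the same number of texels")
    # (a * b + 127) // 255 is a rounded 8-bit multiply.
    return [tuple((a * b + 127) // 255 for a, b in zip(t0, t1))
            for t0, t1 in zip(base, detail)]


class BlendCache:
    """Re-run the CPU blend only when the inputs change; otherwise reuse it."""

    def __init__(self):
        self._version = None
        self._texels = None
        self.blend_count = 0  # how many times the CPU blend actually ran

    def get(self, base, detail, version):
        # `version` is hypothetical: the caller bumps it when either source
        # texture or the blend relationship changes.
        if self._version != version:
            self._texels = modulate(base, detail)
            self._version = version
            self.blend_count += 1
        return self._texels
```

In a real renderer the cached result would be uploaded to the card once per change, so the per-frame cost drops to a single textured pass as long as `version` stays the same.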

Even in this case, however, you could probably accelerate the blending computation by performing it in hardware and using CopyTex(Sub)Image.

–Won

Don_t_Disturb · April 8, 2001, 7:42am

Depends on your fill rate and the number of pixels you’re drawing.
My terrain renderer (3 passes) flies at 640x480 with a GeForce256, but the frame rate drops massively once I go full-screen (1280x960).
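The arithmetic behind that drop can be sketched as follows. This is only a back-of-the-envelope estimate, assuming the terrain covers the whole screen and each pass touches every pixel exactly once; `overdraw` is a hypothetical fudge factor, not a measured value.

```python
def pixels_filled_per_frame(width, height, passes, overdraw=1.0):
    """Estimate how many pixels are written per frame.

    Assumes every pass covers the full screen; `overdraw` scales the
    result for scenes that draw some pixels more than once per pass.
    """
    return int(width * height * passes * overdraw)


windowed = pixels_filled_per_frame(640, 480, passes=3)
fullscreen = pixels_filled_per_frame(1280, 960, passes=3)
print(windowed, fullscreen, fullscreen / windowed)
```

Doubling both dimensions quadruples the fill work, and each extra pass multiplies it again, which is why a fill-limited card feels multipass much more at high resolution.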

imported_jwatte · April 8, 2001, 10:15am

> Actually, in a multipass implementation,
> the vertices will have to be transformed
> for each pass.

That is not true if you’re using the compiled vertex array extension (LockArrays). The whole point of LockArrays is that you tell GL you won’t change the data, so it can go ahead and transform the vertices once, and all subsequent calls to Draw{Range}Elements will use the same cached copy of the transformed vertices. If you do multi-pass, this is crucial.

I would think that you have to use multitexturing if you want competitive frame rates. There is a noticeable difference between two passes of two textures each and four passes of one texture each (at least on my GeForce2 GTS, and on my P.O.C. Vanta). I’ve also gotten in the habit of changing the depth func to EQUAL and turning off the depth mask when doing the consecutive passes; I believe the former is paranoia and the latter speeds things up.
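The pass counts mentioned in this thread follow from a simple ceiling division. The sketch below assumes, as a simplification, that any `texture_units` consecutive layers can be combined in a single pass; in practice the blend modes between layers may not all be expressible in the texture environment, so treat the result as a lower bound.

```python
def passes_needed(layers, texture_units):
    """Minimum number of passes to apply `layers` textures when each pass
    can apply up to `texture_units` of them (ceiling division)."""
    if layers < 1 or texture_units < 1:
        raise ValueError("layers and texture_units must be positive")
    return -(-layers // texture_units)


for layers in (2, 3, 4):
    print(layers, passes_needed(layers, 1), passes_needed(layers, 2))
```

With two units, four layers take two passes instead of four, and the 2-3 passes from the original question collapse to 1-2.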

HFAFiend · April 8, 2001, 3:19pm

jwatte is right, but actually the video card usually stores the results of the transformations for something like the last 8 vertices, meaning that the transformation part of it will be almost inconsequential. I think I’ll make it do several passes, and try doing it on the CPU only if it is slow enough that I think I’ll get some advantage.

Won · April 8, 2001, 4:47pm

Actually, I don’t think that jwatte is entirely correct. True, cards like the GeForce have a post-transform vertex cache, but this is for when you use indexed vertex arrays and address the same indices frequently. CVAs (compiled vertex arrays) do not store the post-transformed vertices (at least in no implementation that I am aware of). The point of CVAs is to “lock” regions of vertex arrays into video/AGP memory so you don’t have to do a driver copy for each call to DrawElements.

–Won

imported_jwatte · April 8, 2001, 6:58pm

Actually, I think I’m right. Locking user memory for DMA and building scatter/gather tables is done with NV_vertex_array_range, and allocating special AGP memory to make that transfer more efficient is done with AllocateMemoryNV.

Think about it:

LockArraysEXT showed up WAY before there were any hardware T&L cards on the market. And still, HT&L is much less common than you’d hope (*). For cards that don’t have HT&L, locking/mapping the user’s vertex buffer is useless, but they still get sped up by LockArraysEXT. Indeed, for cards which do have HT&L, I would assume LockArraysEXT is a no-op, or possibly just copies the array data into some pre-allocated AGP buffer for faster transfer whenever the indexes do get drawn. The big win with LockArraysEXT comes when you do multi-pass rendering on non-HT&L cards, which seems to indicate it allows the driver to pre-process and cache the transformed vertex values (unless you change GL transform state after calling LockArraysEXT, which would be a bad idea).
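To make the claimed win concrete, here is a toy model of the behaviour suggested above for a software T&L path. It is purely illustrative and not how any real driver is known to work: the class `SoftwareTnL`, its methods, and the uniform `scale` standing in for the modelview/projection transform are all hypothetical, and the unlocked path naively transforms once per index.

```python
class SoftwareTnL:
    """Toy software transform stage that can cache results while 'locked'."""

    def __init__(self, vertices):
        self.vertices = vertices
        self._locked = None        # cached transformed vertices, if any
        self.transform_count = 0   # number of per-vertex transforms performed

    def _transform(self, v, scale):
        self.transform_count += 1
        return tuple(c * scale for c in v)

    def lock(self, scale):
        # Hypothetical: transform every vertex once and keep the results.
        self._locked = [self._transform(v, scale) for v in self.vertices]

    def unlock(self):
        self._locked = None

    def draw(self, indices, scale):
        if self._locked is not None:
            return [self._locked[i] for i in indices]
        return [self._transform(self.vertices[i], scale) for i in indices]


quad = [(0, 0), (1, 0), (0, 1), (1, 1)]
indices = [0, 1, 2, 2, 1, 3]   # two triangles sharing an edge

unlocked = SoftwareTnL(quad)
for _ in range(3):             # three passes, no caching
    unlocked.draw(indices, 2)

locked = SoftwareTnL(quad)
locked.lock(2)
for _ in range(3):             # three passes reuse the cached transforms
    locked.draw(indices, 2)
locked.unlock()

print(unlocked.transform_count, locked.transform_count)
```

Under these assumptions the locked path does one transform per vertex regardless of the number of passes, which is the kind of saving that would matter most when the CPU is doing the transforms.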

Maybe someone from nVidia or ATI can bring less murky light on the matter?

(*) I was at Best Buy today. Of the desktop machines they had on display, 8 had built-in i81x graphics, 3 had built-in TNT2 Pro, and 1 had a GeForce2 MX. Sad.

Won · April 8, 2001, 7:30pm

Well, I guess my facts were coming from the NVidia implementation (as outlined in their now-outdated performance FAQ), which basically says that locked CVAs are copied to AGP memory and subsequent calls are pulled from there. Reading the actual spec of the extension, it seems that you could be correct. It does seem rather plausible for software T&L pipelines, as you pointed out.

Naturally, VAR is far more flexible and apparently higher-performance than CVA (without considering the alleged post-transform caching).

–Won

harsman · April 9, 2001, 12:14am

I think CVAs will make vertices be transformed only once, provided you’re using SW T&L. If it’s in hardware it’s probably pipelined anyway, so I doubt there would be any point in only doing it once. Besides, there’s probably nowhere to store the vertices once they’re transformed (to device coordinates or whatever).

imported_jwatte · April 9, 2001, 8:04pm

There is no “alleged” post-transform caching for HT&L cards (except for the documented 10-vertex-deep transformed vertex cache FIFO on GeForce hardware).
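A small simulation shows what such a FIFO can and cannot do. This is a sketch under simple assumptions (a strict FIFO keyed by vertex index, where a hit does not refresh the entry); the function name and the default of 10 entries are chosen for illustration only.

```python
from collections import deque


def count_transforms(indices, cache_size=10):
    """Count vertex transforms for an index stream, given a FIFO cache of
    recently transformed vertices. A hit does not reorder the FIFO."""
    cache = deque(maxlen=cache_size)  # append() evicts the oldest when full
    transforms = 0
    for i in indices:
        if i not in cache:
            transforms += 1
            cache.append(i)
    return transforms


# Two triangles sharing an edge: 6 indices, only 4 transforms.
print(count_transforms([0, 1, 2, 2, 1, 3]))

# Two passes over a 20-vertex mesh: nothing from pass one survives.
print(count_transforms(list(range(20)) * 2))
```

The cache helps when nearby triangles share indices, but a mesh with more vertices than cache entries gets fully re-transformed on every pass.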

If you have the choice between CVA and VAR, go for VAR (and make sure you write to the memory correctly). They are different extensions for different situations (software vs hardware T&L pipelines). They clearly won’t work optimally if used under conditions other than those they were designed for.

The fact that CVA copies data into AGP memory for HT&L cards seems like a nice trick to get a few extra FPS out of non-HT&L-aware games. If you’re aware of VAR, and it exists, that’s the way to go.

[This message has been edited by jwatte (edited 04-10-2001).]