Someone has the data; I really don’t care who.
That data is gone. Note that the rendering pipeline goes one way: vertex shader/fixed-function feeds the rasterizer, which feeds the fragment pipeline, which feeds the pixel blender.
In order to do what you’re suggesting (having post-T&L data lying around), the vertex shader would have to be done in software and the data stored on the CPU. This is, quite simply, completely unacceptable from a performance standpoint.
It is currently impossible to simply read back data from a vertex shader and store it to be multipassed over again. And even if you could, as I explained before, it wouldn’t help: you still need to run the other portions of the shader on each individual set of data.
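To picture why, here’s a toy sketch of the flow in C. Every type and stage name here is made up purely for illustration; the point is just that the data moves in one direction and never comes back to the application:

#include <stddef.h>

/* Made-up types and stage names; a mental model, not a real API. */
typedef struct { float obj_pos[4];  float uv[8][2]; } AppVertex;
typedef struct { float clip_pos[4]; float uv[8][2]; } PostTnLVertex;

static PostTnLVertex vertex_stage(AppVertex v)        /* T&L / vertex shader */
{ PostTnLVertex out = {{0.0f}}; (void)v; return out; }

static void rasterize(PostTnLVertex v) { (void)v; }   /* feeds the fragment
                                                         pipe, then blender */

void draw(const AppVertex *in, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        rasterize(vertex_stage(in[i]));   /* post-T&L data is consumed right
                                             here and never handed back */
}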
This means, on the first pass, the HW would interpolate the first 4 UVs, then on the last pass, interpolate the last 4 UVs. Not all that complicated.
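For what it’s worth, the naive version of that split looks something like this in plain multitexture OpenGL (assuming 4 texture units and ARB_multitexture, with the extension entry points already fetched; verts and uv[] are hypothetical application arrays):

#include <GL/gl.h>
#include <GL/glext.h>

/* Draw one pass, feeding texture units 0-3 from UV sets
   first_set..first_set+3. Call once with first_set = 0 and once
   with first_set = 4 to cover all 8 UV sets in two passes. */
void draw_pass(const float *verts, const float *uv[8],
               int vert_count, int first_set)
{
    for (int unit = 0; unit < 4; ++unit) {
        glClientActiveTextureARB(GL_TEXTURE0_ARB + unit);
        glEnableClientState(GL_TEXTURE_COORD_ARRAY);
        glTexCoordPointer(2, GL_FLOAT, 0, uv[first_set + unit]);
    }
    glEnableClientState(GL_VERTEX_ARRAY);
    glVertexPointer(3, GL_FLOAT, 0, verts);
    glDrawArrays(GL_TRIANGLES, 0, vert_count);
}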
Then let me complicate it for you.
Let’s assume your equation is the following:
T1 * T2 + T1 * T3 + T1 * T4 + T1 * T5 + T2 * T3 + T2 * T4 + T2 * T5 + T3 * T4 + T3 * T5 + T4 * T5 = output color.
Oh, and the output color will be alpha blended with the framebuffer.
Given a Radeon 8500 (with more blend ops, perhaps), this is trivial; no need to multipass. Given a GeForce3, this is most certainly a non-trivial task in reducing the equation into a set of passes the hardware can actually execute.
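To make the bookkeeping concrete, here’s roughly what the brute-force reduction looks like on 2-TMU hardware. This is only a sketch: it assumes the render target was cleared to black first, GL_MODULATE texenv on both units, and a white vertex color, so each pass writes exactly one Ti * Tj term; tex[] and draw_geometry() are hypothetical.

/* Ten passes, one per Ti*Tj term, summed with additive blending. */
static const int pair[10][2] = {
    {0,1},{0,2},{0,3},{0,4},{1,2},
    {1,3},{1,4},{2,3},{2,4},{3,4}
};

extern void draw_geometry(void);

void draw_sum_of_products(const GLuint tex[5])
{
    glEnable(GL_BLEND);
    glBlendFunc(GL_ONE, GL_ONE);              /* accumulate the partial sums */
    for (int p = 0; p < 10; ++p) {
        glActiveTextureARB(GL_TEXTURE0_ARB);
        glBindTexture(GL_TEXTURE_2D, tex[pair[p][0]]);
        glActiveTextureARB(GL_TEXTURE1_ARB);
        glBindTexture(GL_TEXTURE_2D, tex[pair[p][1]]);
        draw_geometry();                      /* 10x the vertex and fill cost */
    }
    /* The catch: the blender is already spent summing passes, so the
       final alpha blend with the framebuffer has to happen somewhere
       else (an off-screen buffer, say) -- exactly the kind of reduction
       headache I'm talking about. */
}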
Note that each texture coordinate came from a vertex shader program that may have performed similar operations.
It just might be a little slow, worst case.
But it means less work for me, and it probably isn’t going to be any slower than any fallback case I would need to write to support that effect anyhow. I don’t see how it would be slower, since a lot of the work isn’t being duplicated anymore.
Multipass == slow. It is far slower than a single-pass hack. I, for one, refuse to use any multipass algorithm unless it produces a particularly good effect (and even then, it had better only require 2 passes).
Not only that: if you’re building your shader relatively dynamically (say, based on slider-bar values or a configuration screen), then the shader ‘compiler’ has to run dynamically as well. Splitting a vertex shader into two passes is a non-trivial algorithm, and worse, it can make the vertex shader even slower.
On top of that, verifying that a shader fits within the resource limitations of the hardware isn’t a trivial task.
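To give a feel for the resource-check half of that, here’s a toy sketch with numbers roughly in the spirit of DX8-era vertex shader limits (128 instructions, 96 constants, 12 temporaries); none of this is any driver’s actual code:

typedef struct {
    int instructions;
    int constants;
    int temp_regs;
} ShaderUsage;

/* Hypothetical per-pass hardware limits. */
static const ShaderUsage hw_limits = { 128, 96, 12 };

int fits_in_hardware(const ShaderUsage *s)
{
    return s->instructions <= hw_limits.instructions
        && s->constants    <= hw_limits.constants
        && s->temp_regs    <= hw_limits.temp_regs;
}

/* The nasty part: if this fails, you have to split the shader into
   passes, and the split itself changes the usage counts (values must
   be carried between passes somehow), so the check and the split have
   to iterate -- at runtime, every time a slider moves. */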
Saying that this kind of thing is a relatively easy task that will not impair the performance of the hardware is simply erroneous. Besides, I’m more inclined to believe Matt than John Carmack about the potential nightmares of implementing such a system in drivers. Carmack’s job is to get people like Matt to do his work for him.
I think you’re on the right track. While both the NV20/25 and R200 support some kind of loopback to extend their texture stages, one could still argue that it’s just ‘pipe combining’ and you can’t go over your total limit of physical TMUs (of all pipes combined).
That’s the easy way to do it, I believe, but what do I know about these chips, really …
The reason there is a limit to what can be done in a single pass is that there is a limit to how much texture-state information can be stored on-chip. Take the original TNT, for example. It had only one texture unit, but it had register space for two active textures, which were accessed via loopback. Most of the time, it is more efficient to store additional register state for active texture objects than to actually add more texture units.
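As a loose mental model of what ‘register space for an active texture’ means (every field here is invented for illustration; these are not actual TNT registers):

/* Everything the chip must keep on hand to sample one active texture. */
struct texture_state {
    unsigned base_addr;              /* where the texels live          */
    unsigned width, height, levels;  /* dimensions and mip count       */
    unsigned format;                 /* RGB565, RGBA8888, ...          */
    unsigned wrap_s, wrap_t;         /* repeat / clamp                 */
    unsigned min_filter, mag_filter; /* point / bilinear / trilinear   */
};

/* TNT-style loopback: one physical texture unit, but two of these
   state blocks on-chip, so the single unit can be fed either texture's
   state on alternate cycles. Doubling the state blocks is cheaper
   than doubling the texture units. */
struct texture_state active_textures[2];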
As to why there isn’t more loopback? Simple: register space isn’t cheap. Because the Kyro was a tile-based renderer, it could probably get away with having lots of register space for per-polygon texture state. I don’t know enough about the specifics to say why, but given the unorthodox nature of tile-based renderers, I’m willing to believe it.