How many texture units does the Radeon 9700 actually have? On their specification page they talk about an 8 pixel pipeline architecture, but when you read the smartshader description they talk about 16 textures per pass.
Where is the catch? Does it run at half speed when you have more than 8 textures (some sort of dual pass, running through the pipeline twice)?
It really depends on what you mean by texture units. First of all, pixel pipes are orthogonal to (read: irrelevant to the definition of) texture units. They simply define how many pixels are being textured/coloured at once. Secondly, when defining a modern texture unit, you need to take into account multiple linked issues. The 3 obvious ones are:
The number of interpolated texture coordinates available to the fragment program.
The number of texture objects (including filter/wrap/lodbias state) which can be bound and looked up by the fragment program.
The number of actual texture lookups which can be performed within a single fragment program.
An example of reasonable values for each is:
8 tex coords (including texgen/matrix support), 16 bindable textures, 32 texture lookup operations.
Core OGL 1.4 has no way of discriminating between these, so I suspect the 8 units exposed by the 9700 are the lowest common denominator…
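To make the three separate limits concrete, here is a minimal sketch that treats them as independent numbers and checks a shader's needs against each one. The class and function names are made up for illustration, the limit values are just the example figures quoted above (not measured 9700 values), and the two sample shaders are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextureLimits:
    tex_coords: int      # interpolated texture coordinate sets
    bound_textures: int  # texture objects that can be bound at once
    lookups: int         # texture lookup operations in one fragment program


def exceeded_limits(need, limits):
    """Return the names of the limits that `need` goes over."""
    names = ("tex_coords", "bound_textures", "lookups")
    return [n for n in names if getattr(need, n) > getattr(limits, n)]


# Example values from the post above, not queried from real hardware.
EXAMPLE_LIMITS = TextureLimits(tex_coords=8, bound_textures=16, lookups=32)

# Hypothetical shader that fits within all three limits.
print(exceeded_limits(TextureLimits(4, 12, 20), EXAMPLE_LIMITS))

# Hypothetical shader that needs too many coordinate sets and lookups,
# even though it binds fewer than 16 textures.
print(exceeded_limits(TextureLimits(10, 12, 40), EXAMPLE_LIMITS))
```

The point of keeping three fields is that a single "number of texture units" cannot describe a shader that, say, binds 12 textures but only needs 4 coordinate sets.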
It’s quite simple, really.
R9700 has 8 pipelines with 1 TMU on each pipe.
It can handle 16 textures per pass, a requirement of the DX9 specifications.
The 8 by 1 description is correct. This means that if you are sampling 16 textures with a linear filter, you will run at 1/16 of the single-texture speed, ignoring bandwidth and instruction counts.
This really isn’t as bad as it might sound, as bandwidth will often dominate your speed even if the engine could do it in less time.
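The 1/16 figure is plain arithmetic, which can be sketched as below. This assumes one filtered lookup per TMU per clock and ignores bandwidth and ALU instructions, exactly as the post does. The function name is invented for illustration, and the 300 MHz clock is a placeholder for the example, not a quoted spec.

```python
def texture_bound_pixel_rate(pipes, tmus_per_pipe, clock_hz, lookups_per_pixel):
    """Upper bound on pixels/second when texture sampling is the only limit.

    Assumes each TMU completes one filtered lookup per clock.
    """
    if lookups_per_pixel < 1:
        raise ValueError("need at least one lookup per pixel")
    lookups_per_second = pipes * tmus_per_pipe * clock_hz
    return lookups_per_second / lookups_per_pixel


CLOCK_HZ = 300e6  # placeholder clock, an assumption for this example

one_texture = texture_bound_pixel_rate(8, 1, CLOCK_HZ, 1)
sixteen_textures = texture_bound_pixel_rate(8, 1, CLOCK_HZ, 16)

print(f"1 texture:   {one_texture / 1e6:.0f} Mpixels/s")
print(f"16 textures: {sixteen_textures / 1e6:.0f} Mpixels/s")
print(f"ratio: 1/{one_texture / sixteen_textures:.0f}")
```

Note that the cost scales with every lookup, so under this model there is no special cliff at 8 textures: 9 textures costs 9 clocks per pixel, not a doubling.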
I’m assuming there’s some close caching per TMU, too? I.e., if I stay within some small piece of the texture (8x8 square?) only the first read will eat bandwidth.
Some follow-up questions:
What’s the actual size of the square? I’m assuming this works differently for render-to-texture?
Is there more than one “close cache” location per TMU? I would assume there are at least 16 of them, if you’re supposed to use 16 textures per fragment. I don’t yet see whether more would help.
Supposing I stay within a texture of, say, 1 row of 8 pixels, is that substantially more efficient than your typical “wide swath” texture?
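Why the shape of the cached region matters can be shown with a toy model. This is only a sketch of the guess above and says nothing about what the real hardware does: it assumes texels are fetched a whole tile at a time, that the cache never evicts anything, and that the tile size is a free parameter (8x8 is the guess from the post, 8x1 is an alternative added for comparison). The function name is made up.

```python
def tile_fetches(coords, tile_w=8, tile_h=8):
    """Count distinct tiles touched by a list of (x, y) texel coordinates.

    With an unbounded cache, this equals the number of memory fetches.
    """
    return len({(x // tile_w, y // tile_h) for x, y in coords})


block = [(x, y) for y in range(8) for x in range(8)]  # stay inside one 8x8 square
row = [(x, 0) for x in range(64)]                     # 64 texels along one row
column = [(0, y) for y in range(64)]                  # 64 texels down one column

for name, pattern in (("block", block), ("row", row), ("column", column)):
    square = tile_fetches(pattern, 8, 8)
    line = tile_fetches(pattern, 8, 1)
    print(f"{name:6s} 8x8 tiles: {square:2d} fetches   8x1 tiles: {line:2d} fetches")
```

Under this model, square tiles treat rows and columns the same, while row-shaped tiles make a column walk miss on every texel, which is why the actual size and shape of the square is worth knowing.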
Though running at “1/16 the speed” sounds really bad, I’d like to point out that there is some performance penalty associated with each additional opcode (either vertex or fragment) that is applied to a shader. I wouldn’t look at performance in terms of “how many hardware texture units are there?” but in terms of “how many opcodes does the shader take?”
I’d like to bump my questions about the behavior of “close” texture caching on these cards. I’ve gotten some insights by just measuring different carefully constructed cases, but I’d be much happier with authoritative answers…