Fragment program performance

Hi,
I’m trying my fragment program on my brand new GeForce FX 5600. I’m a bit disappointed with the performance I get, since my program is quite simple.

For example, here is an extract of my code. Running it costs me about 10 Hz.
TEMP textOut0;
TEX textOut0, fragment.texcoord[0], texture[0], 2D;
TEMP textOut1;
TEX textOut1, fragment.texcoord[1], texture[1], 2D;
ADD textOut1, textOut1, -0.5;
TEX textOut0, fragment.texcoord[0], texture[0], 2D;
ADD textOut0, textOut0, textOut1;

If I replace the last line with:
ADD textOut1, textOut0, textOut1;

then there is almost no performance drop (it’s cheap, but unfortunately it doesn’t do what I want).

The difference probably has something to do with register usage, but I don’t understand exactly how. Nailing that one down would probably help me set up all of my fragment and vertex programs properly.

Thanks for the help in understanding this issue.

What does your first texture fetch do? As far as I can tell, nothing…

Also, you probably have a write-after-write hazard dependency in that last ADD line. I don’t see you moving to the fragment output color at all – where does that happen?

But the driver should be optimizing the code, shouldn’t it? If that weren’t possible, then there isn’t much you can do.

Having the entire code and a few numbers would be better. What’s the difference between case 1 and case 2?

I would not count on the driver removing any of your instructions.

As I said, the difference between case 1 and case 2 is a write-after-write hazard.

You still haven’t shown how the color goes into the output register.

Most shaders you see will TEX as early as they can, as much as they can, and then start doing math on it. The tex results needed first should be TEX-loaded first. This may give you a little bit of extra latency hiding.

Originally posted by jwatte:
Most shaders you see will TEX as early as they can, as much as they can, and then start doing math on it. The tex results needed first should be TEX-loaded first. This may give you a little bit of extra latency hiding.

Well, there are exceptions. ATI’s Radeon 9500 and above can execute a texture lookup and a vec3/scalar ALU instruction pair in a single clock cycle. So for ATI cards, at least, pairing up texture lookups with an ALU instruction is actually a very good idea.

Of course, Vince is using a GeForce FX, so this doesn’t help him...

The driver could transform the original into this (notice textOut2):

TEMP textOut0, textOut2;
TEX textOut0, fragment.texcoord[0], texture[0], 2D;
TEMP textOut1;
TEX textOut2, fragment.texcoord[1], texture[1], 2D;
ADD textOut1, textOut2, -0.5;
TEX textOut2, fragment.texcoord[0], texture[0], 2D;
ADD textOut0, textOut2, textOut1;

and you wouldn’t notice a difference.
I think it’s a pretty obvious situation.

The OP should post some FPS for case 1 and case 2 for completeness.

[This message has been edited by V-man (edited 05-24-2003).]

V-man: Adding more temporaries is very expensive in fragment shaders. I think every 2 temporaries is another step down in performance (only the first 2 are “free”) on the GeForce FX, at 32-bit precision. If the driver does this transform, perhaps that’s where the slowdown happens.

Anyway, the first TEX is still not used, even in your “optimized” shader. And I still don’t see any code that writes to the output color.

Even if the R300 architecture can multi-issue texture fetches and ALU instructions, that doesn’t mean that the texture fetch instruction will run in a single cycle.

Perhaps it will if it’s a cache hit? Any good data on this? How big’s the cache line and what’s the geometry? For what filtering modes?

Even ATI’s sample shaders put tex loads first. They try to interleave vector and scalar operations, though; probably because they still optimize “dual issue” from DX8.1.

Back to the original question: what you really want is (texture[1]-0.5)+texture[0], right? How about this shader:

PARAM bias = { -0.5, 0, 0, 0 };
TEMP temp1, temp2;
TEX temp1, fragment.texcoord[0], texture[0], 2D;
TEX temp2, fragment.texcoord[1], texture[1], 2D;
ADD temp1, temp1, temp2;
ADD result.color, temp1, bias.x;

Even if the R300 architecture can multi-issue texture fetches and ALU instructions, that doesn’t mean that the texture fetch instruction will run in a single cycle.

Obviously, the speed of the texture fetch opcodes depends on the speed of memory accesses. The assumption in looking at fragment shader performance is that this operation takes the bare minimum time. At bilinear, on most hardware, that’s 1 cycle. Trilinear probably pushes this up to 2. Anisotropic, especially ATi’s adaptive, could change depending on the fragment.

As for fragment program performance on GeForce FX hardware, you will need to use NV_fragment_program coupled with switching to 16-bit floats in order to get better-than-ATi speeds out of the thing.

Well, I was sent on a field trip shortly after I posted, so I’m kind of late to the topic now.

OK, first things first. The problem I originally described happened because I wasn’t reusing textOut1 at all. Apparently, if we compute a register and never use the result, none of the lines that use that register are executed (that’s just an assumption, but it seems reasonable given the tests I did). This is probably a driver optimisation.
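To make that assumption concrete, here is a minimal Python sketch of dead-code elimination applied to the excerpt from the first post. It is only a model of what a driver might do; nothing in this thread confirms that the GeForce FX driver works this way. The `Instr` type, the `eliminate_dead` helper and the final `MOV result.color, textOut0` line are hypothetical additions for illustration (the excerpt never showed how the output color was written), and write masks are ignored.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    op: str       # opcode name, e.g. "TEX" or "ADD"
    dst: str      # register written by the instruction
    srcs: tuple   # registers read by the instruction


def eliminate_dead(program, outputs):
    """Walk the program backwards and keep only instructions whose
    destination is still needed by a later instruction or an output."""
    live = set(outputs)
    kept = []
    for ins in reversed(program):
        if ins.dst in live:
            kept.append(ins)
            live.discard(ins.dst)   # this write satisfies the later reads
            live.update(ins.srcs)   # its own inputs are needed earlier
    kept.reverse()
    return kept


def excerpt(last_add_dst):
    """The excerpt from the first post, with the last ADD writing to
    either textOut0 (case 1) or textOut1 (case 2)."""
    return [
        Instr("TEX", "textOut0", ("fragment.texcoord[0]",)),
        Instr("TEX", "textOut1", ("fragment.texcoord[1]",)),
        Instr("ADD", "textOut1", ("textOut1",)),
        Instr("TEX", "textOut0", ("fragment.texcoord[0]",)),
        Instr("ADD", last_add_dst, ("textOut0", "textOut1")),
        # ASSUMPTION: the output write is not in the original excerpt.
        Instr("MOV", "result.color", ("textOut0",)),
    ]


case1 = eliminate_dead(excerpt("textOut0"), {"result.color"})
case2 = eliminate_dead(excerpt("textOut1"), {"result.color"})
print("case 1 keeps:", [i.op for i in case1])
print("case 2 keeps:", [i.op for i in case2])
```

Under these assumptions, case 1 keeps five instructions (only the first, overwritten TEX is dropped), while case 2 collapses to a single TEX plus the output write, which would be consistent with case 2 costing almost nothing.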

I did some more benchmarking. With no shader, my scene runs at 142 Hz. The same scene rendered with a shader that does the same thing (same expected result as without the shader) drops to 113 Hz with one texture and to 69 Hz with two textures. A drop from 142 to 69 Hz seems very significant for such a simple program.

Here is the code I’m using for the fragment program. I’d be curious to know why I get such a big performance penalty, given that the GeForce can fetch 2 textures at the same time (is that also true in shaders?).
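Frame rates are easier to compare as time per frame, since equal drops in Hz do not correspond to equal costs. The short Python sketch below converts the three rates quoted above to milliseconds per frame. The `frame_ms` helper is a hypothetical name, and the sketch assumes the rates are steady-state averages and that the whole difference is due to the shader.

```python
def frame_ms(hz):
    """Convert a frame rate in Hz to a frame time in milliseconds."""
    return 1000.0 / hz


# Frame rates reported for the GeForce FX 5600 in this post.
rates = {
    "no shader": 142.0,
    "shader, 1 texture": 113.0,
    "shader, 2 textures": 69.0,
}

baseline = frame_ms(rates["no shader"])
for label, hz in rates.items():
    ms = frame_ms(hz)
    print(f"{label}: {ms:.2f} ms/frame (+{ms - baseline:.2f} ms)")
```

This works out to roughly 7.0, 8.8 and 14.5 ms per frame: the one-texture shader adds about 1.8 ms, and the second texture adds a further 5.6 ms or so, about three times as much as the first step.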

!!ARBfp1.0
ATTRIB col0 = fragment.color;
TEMP OutColTmp;
TEMP textOut0;
TEMP textOut1;

TEX textOut0, fragment.texcoord[0], texture[0], 2D;
TEX textOut1, fragment.texcoord[1], texture[1], 2D;
ADD textOut0, textOut1, textOut0;
ADD textOut0, textOut0, -0.5;

MUL result.color, col0, textOut0;
END

FYI, I just tried the same thing on a Radeon 7700 and I get 188 Hz with no shaders and 184 Hz with the same program as above.

Vincent

Simple, maybe stupid suggestion: what size are your textures? Try reducing them to a very small resolution (like 32x32) to see if it makes a difference in your performance drop…

Y.