Shader performance

Does anybody know whether the GeForce driver gives different performance for an equivalent shader depending on which of these three shader paths I use:

  • ARB extension
  • NV extension
  • Cg

I have only tried the ARB extension and I get crappy performance. Do you have any idea why?

Thanks

The ARB path forces the GeForce FX to use fp32 in all calculations, while the NV extension lets you specify the precision you need for a given task: fp16, fp32 or int. Cg is just a HANDY high-level language that compiles high-level code down to GPU assembly, and it can target either the ARB or the NV extensions. Sometimes Cg gives you better performance than handwritten shaders because it tries to optimize the code for the given extension (instruction pairing, register usage, etc.). But the only fast shader path on the GeForce FX is NV, and even then you must specify which precision to use in which calculations.
Radeons are faster than GeForces on the ARB path (fp24 vs. fp32).
Radeons on ARB vs. GeForces on NV: that depends on your shader.
Ints are fast and fp16 is fast, but a lot of fp32 is slow. Switch a calculation such as 2+3 from fp32 to int and you’ll feel the difference.

The GeForce FX is very sensitive to register usage. If you use more than two 16-bit fp registers or one 32-bit fp register, performance tends to suffer. I’m not totally sure about those exact numbers, as it has been a while since I looked at the performance figures, but in any case fewer registers were much better. Using 16-bit fp with the NV extensions thus gives you twice the register space, which helps performance.

Have you tried setting the ARB_fragment_program precision hint to fastest (OPTION ARB_precision_hint_fastest)? I’m not sure it actually does anything, but it might. Of course, you can also move calculations into lookup tables/textures to reduce the amount of arithmetic that needs to be done.
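
For reference, a minimal sketch of what the precision hint looks like in practice. ARB_fragment_program accepts OPTION lines directly after the !!ARBfp1.0 header, and ARB_precision_hint_fastest is one of the options the extension defines. The helper name with_precision_hint and the tiny sample shader are made up for illustration, and whether the driver actually drops to a lower precision is up to the driver.

```python
# Hypothetical helper: inserts a precision hint into ARB fragment program text.
# The sample shader below is illustrative only, not taken from this thread.

BASE_PROGRAM = """!!ARBfp1.0
TEMP color;
TEX color, fragment.texcoord[0], texture[0], 2D;
MUL result.color, color, fragment.color;
END
"""


def with_precision_hint(source, hint="fastest"):
    """Return the program text with an ARB_precision_hint_* option added.

    OPTION lines have to come right after the header, before any
    other statement, so the option is spliced in after the first line.
    """
    if hint not in ("fastest", "nicest"):
        raise ValueError("hint must be 'fastest' or 'nicest'")
    header, rest = source.split("\n", 1)
    if header.strip() != "!!ARBfp1.0":
        raise ValueError("not an ARB fragment program")
    return f"{header}\nOPTION ARB_precision_hint_{hint};\n{rest}"


hinted = with_precision_hint(BASE_PROGRAM)
print(hinted)
```

The resulting string is what would be passed to glProgramStringARB in place of the unhinted program text.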

Last I heard, for every two FP32 registers or four FP16 registers you use, there is a performance hit. The hit going from 2 FP32 to 4 (or 4 FP16 to 8) is relatively small, but beyond that it gets quite significant.

Cg, by default, compiles code to use a minimal number of registers. The resulting shader code is somewhat longer than what one could write using more registers, but on the GeForce FX the performance hit for those extra instructions is smaller than the hit from using more registers.
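
To make the rule of thumb above concrete, here is a rough sketch that counts how many "two FP32 / four FP16" blocks a shader's temporaries occupy. It assumes, as suggested earlier in the thread, that two FP16 registers take the space of one FP32 register. The function name register_tiers is invented here, and the result is only a tier count, not a measured cost.

```python
import math


def register_tiers(fp32_regs, fp16_regs):
    """Number of 'two FP32 / four FP16' blocks occupied by a shader.

    Assumption: two FP16 registers fit in the space of one FP32
    register, and every started block of two FP32-sized slots counts
    as one tier. Tier 1 means no register-related hit under this rule.
    """
    fp32_slots = fp32_regs + fp16_regs / 2
    return max(1, math.ceil(fp32_slots / 2))


# The same shader with four temporaries, at two different precisions.
print("4 x fp32:", register_tiers(4, 0))
print("4 x fp16:", register_tiers(0, 4))
```

Under these assumptions the fp16 version stays in the first tier while the fp32 version already sits in the second, which is the argument for picking the precision per calculation on the NV path.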

In other words, using Cg would improve performance on my GeForce FX, but the same shader would perform worse on a Radeon because it doesn’t have the register limitation (or has it at a different scale). Do I get it right?

@M/\dm/…:

>>cg you can compile to ARB/NV exts.

Is it possible to get the “compiled code” out of the Cg compiler (after compiling to the desired target shader platform)?

Yes, that’s possible. You can use the command line version of the Cg compiler or even the Cg runtime to get access to the generated shader assembly code.
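
As a rough sketch of the command-line route: cgc takes a -profile option selecting the target (arbfp1 for ARB_fragment_program, fp30 for NV_fragment_program), an -entry option naming the entry function, and -o for the output file. The file names shader.cg, shader_arb.fp and shader_nv.fp, the entry name main, and the two Python helper names are placeholders invented for this example. It also assumes cgc is on the PATH and prints the assembly to standard output when no -o is given.

```python
import subprocess


def cgc_command(source_file, profile, entry="main", output_file=None):
    """Build the cgc argument list for one target profile."""
    cmd = ["cgc", "-profile", profile, "-entry", entry]
    if output_file is not None:
        cmd += ["-o", output_file]
    cmd.append(source_file)
    return cmd


def compile_cg(source_file, profile, entry="main"):
    """Run cgc and return the generated assembly text.

    Only works where the Cg toolkit is installed; raises
    CalledProcessError if the compiler reports an error.
    """
    result = subprocess.run(
        cgc_command(source_file, profile, entry),
        capture_output=True, text=True, check=True,
    )
    return result.stdout


# One source file, two render paths (file names are placeholders).
arb_cmd = cgc_command("shader.cg", "arbfp1", output_file="shader_arb.fp")
nv_cmd = cgc_command("shader.cg", "fp30", output_file="shader_nv.fp")
print(" ".join(arb_cmd))
print(" ".join(nv_cmd))
```

Compiling the same source for both profiles and diffing the output is a quick way to see what the compiler does differently for each path.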

Originally posted by vince:
In other words, using Cg would improve performance on my GeForce FX, but the same shader would perform worse on a Radeon because it doesn’t have the register limitation (or has it at a different scale). Do I get it right?

If a shorter shader can be written using more registers: yes. If there’s some way to get the Cg compiler to generate shorter code at the expense of more registers, you could do that and use it as a separate render path meant for ATI cards.

Something else that is likely to cause problems on the Radeon with GeForce FX-optimal code is that reusing the same registers over and over can lead to what look like dependent texture lookups even though they aren’t. The Radeon has a maximum dependency depth of four, so it might refuse to run the shader if this is the case. This might be fixed with newer drivers, though; I haven’t done any tests lately.

The Radeon doesn’t handle arbitrary swizzles natively either, so if the Cg compiler gets a little swizzle-crazy you might end up with the Radeon driver emulating those with multiple instructions.
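
To illustrate how register reuse can look like a dependent lookup, here is a simplified model of the indirection counting rule as I remember it from the ARB_fragment_program spec: a texture instruction starts a new indirection node if its coordinate is a temporary already written in the current node, or if its result is a temporary already read or written in the current node. The instruction tuples, register names and function name are invented for this sketch, and real drivers may count differently.

```python
TEX_OPS = {"TEX", "TXP", "TXB"}


def count_indirections(program, temps):
    """Count indirection nodes for a list of (op, dst, srcs) tuples.

    Simplified model (assumption): only the first source of a texture
    instruction is its coordinate, and only names in `temps` count as
    temporaries.
    """
    nodes = 1
    read, written = set(), set()
    for op, dst, srcs in program:
        if op in TEX_OPS:
            coord = srcs[0]
            if coord in written or dst in read or dst in written:
                nodes += 1                      # start a new node
                read, written = set(), set()
        read.update(s for s in srcs if s in temps)
        if dst in temps:
            written.add(dst)
    return nodes


TEMPS = {"r0", "r1", "r2"}

# Register-frugal version: r0 is reused as the second texture result.
reused = [
    ("TEX", "r0", ["fragment.texcoord[0]"]),
    ("MUL", "r1", ["r0", "fragment.color"]),
    ("TEX", "r0", ["fragment.texcoord[1]"]),
    ("ADD", "result.color", ["r0", "r1"]),
]

# Same math, but the second lookup gets a fresh register.
fresh = [
    ("TEX", "r0", ["fragment.texcoord[0]"]),
    ("MUL", "r1", ["r0", "fragment.color"]),
    ("TEX", "r2", ["fragment.texcoord[1]"]),
    ("ADD", "result.color", ["r2", "r1"]),
]

print("reused registers:", count_indirections(reused, TEMPS))
print("fresh registers: ", count_indirections(fresh, TEMPS))
```

Neither shader samples with a computed coordinate, yet in this model the register-frugal version costs one more indirection purely because r0 is recycled.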

Originally posted by harsman:
The Radeon doesn’t handle arbitrary swizzles natively either, so if the Cg compiler gets a little swizzle-crazy you might end up with the Radeon driver emulating those with multiple instructions.

Texture sampling too: the Cg compiler keeps emitting tex instructions followed by several arithmetic instructions and then ANOTHER set of tex instructions, which of course doesn’t work well on the 9700 (and kind of makes Cg pointless). I know saving registers is important, but so is texture sampling! Gah.

Well, at least the ATI driver can probably do some instruction reordering to prevent stalls from happening when you get an unbalanced mix of arithmetic and texture instructions. I haven’t tested this but theoretically it shouldn’t be hard.
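
A toy version of that kind of reordering, just to show why it shouldn't be hard in the simple cases: hoist every texture instruction whose coordinate is not a temporary and whose destination has not been touched by any earlier instruction. This is not what any actual driver is known to do. The tuple format, register names and function name are invented for the sketch, and it ignores the extra register pressure that grouping the lookups causes.

```python
TEX_OPS = {"TEX", "TXP", "TXB"}


def hoist_independent_tex(program, temps):
    """Move independent texture lookups to the front of the program.

    A lookup is hoisted only if its coordinate is not a temporary and
    its destination was neither read nor written by any earlier
    instruction, so moving it cannot change the result.
    """
    hoisted, rest = [], []
    touched = set()
    for instr in program:
        op, dst, srcs = instr
        independent = (
            op in TEX_OPS
            and srcs[0] not in temps
            and dst not in touched
        )
        (hoisted if independent else rest).append(instr)
        touched.add(dst)
        touched.update(srcs)
    return hoisted + rest


TEMPS = {"r0", "r1"}

# tex / arithmetic / tex / arithmetic, the pattern complained about above.
interleaved = [
    ("TEX", "r0", ["fragment.texcoord[0]"]),
    ("MUL", "r0", ["r0", "fragment.color"]),
    ("TEX", "r1", ["fragment.texcoord[1]"]),
    ("ADD", "result.color", ["r0", "r1"]),
]

for op, dst, srcs in hoist_independent_tex(interleaved, TEMPS):
    print(op, dst, ", ".join(srcs))
```

Note the trade-off: with both lookups up front, r0 and r1 are live at the same time, which is exactly the kind of register use the GeForce FX dislikes.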