ARB_vp/fp optimization

So, I’ve written an ARB_vp/ARB_fp rendering path for the lights that does everything in 1 pass, but it’s a bit slower than the NV20/register combiners path (which uses 2 passes). The renderer I’m working on is sort of Doom 3-ish, i.e. lots of per-pixel operations and not that high-poly.

My question is: what has the biggest impact on performance in ARB_vp/fp? Is it the instruction count?
Or the instruction type (for example, is DP3 slower than MOV)? Is it bad to use too many TEMPs, even if they are within the native TEMP limit?

Any performance hints appreciated.

Chris

The biggest thing, I’d say, is using more registers than you really need to. Since you are using the ARB version of fp, I can’t tell you to use the half-precision operations rather than full float like I would if you were using NV_fp. But as far as NVIDIA cards go when using ARB_fp, the number of registers used plays a big role in performance. You could also try the hint that asks for faster performance. I forget the exact name and I don’t even know if it’s in the spec, although one would think it would be. I don’t know if that will help though. From what I have heard (I have not benchmarked it yet), with NVIDIA’s newest drivers 32-bit and 16-bit floats make pretty much no difference in speed. That’s just what I have heard, so don’t hold me to it.

Also try to eliminate as many instructions as you can, obviously. For example, if you can write MUL <out_register>, blah, blah2 directly, do that rather than MUL <register>, blah, blah2; MOV <out_register>, <register>

Of course you probably know that.
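
To illustrate the MUL/MOV point, here is a minimal sketch. The texture unit and the use of fragment.color as a light color are assumptions made up for the example, not taken from the renderer being discussed:

```
!!ARBfp1.0
# Hypothetical inputs: texture unit 0 holds a diffuse map,
# fragment.color carries the light color.
TEMP diffuse;
TEX diffuse, fragment.texcoord[0], texture[0], 2D;

# Longer form (one extra instruction and one extra temporary):
#   TEMP lit;
#   MUL lit, diffuse, fragment.color;
#   MOV result.color, lit;

# Shorter form: write straight to the output register.
MUL result.color, diffuse, fragment.color;
END
```

Whether the driver already removes the extra MOV on its own probably depends on the driver version, so it is worth measuring both forms.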

As for vertex programs, those run great on any card, so again the main thing is to limit the register count.

Even though some would rather not do this, I would still make separate code paths for NVIDIA and ATI. That way I can squeeze the most out of each card, which is what the hardware vendors want. Sure, it may take a little more time, but it’s worth it to me.

EDIT: I went ahead and looked at the ARB_fp spec to find those hints, and here they are: “ARB_precision_hint_fastest” and “ARB_precision_hint_nicest”.
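
For reference, a hint is requested with an OPTION line placed right after the program header, before any other statement. This is only a sketch; the body of the program is a made-up placeholder:

```
!!ARBfp1.0
# Ask the implementation to favor speed over precision.
# Only one of the two precision hints may appear in a program.
OPTION ARB_precision_hint_fastest;

# Placeholder body: modulate a texture on unit 0 by the interpolated color.
TEMP c;
TEX c, fragment.texcoord[0], texture[0], 2D;
MUL result.color, c, fragment.color;
END
```

It is only a hint, so the implementation is free to ignore it.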

-SirKnight

[This message has been edited by SirKnight (edited 10-31-2003).]

Oh I just thought of something else, lol.

NVIDIA cards favor long frag programs, they love the hell out of them. Yet ATI favors short frag programs, but hates long ones. Maybe a good thing to know.

-SirKnight

Originally posted by SirKnight:
The biggest thing, I’d say, is using more registers than you really need to. Since you are using the ARB version of fp, I can’t tell you to use the half-precision operations rather than full float like I would if you were using NV_fp. But as far as NVIDIA cards go when using ARB_fp, the number of registers used plays a big role in performance.

It doesn’t matter anymore. 52.16 does a hell of a job optimizing register usage. I did several tests. For example, I had an fp with 20 temp regs. With the earlier driver the performance was … hm - it wasn’t really performance at all. With 52.16 it runs at 120 fps - 30 more than the hand-optimised shader. So you don’t need to worry about register usage on NVIDIA anymore.
A very important hint is SIMD. For example, if you have to compute several scalars with more or less the same algorithm (like in Schlick’s model), you can do it in parallel with a few vector ops.
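
A rough sketch of the SIMD idea follows. It does not use Schlick’s model; instead it assumes a simple distance falloff for four lights, and the register layout (light vectors in texcoord[0..3], 1/radius^2 per light in program.local[0]) is invented for the example:

```
!!ARBfp1.0
# Hypothetical layout: texcoord[0..3] carry the fragment-to-light vectors
# for four lights; program.local[0] holds 1/radius^2 for each light
# in x, y, z, w.
PARAM invRadiusSq = program.local[0];
PARAM one = { 1.0, 1.0, 1.0, 1.0 };
TEMP distSq, atten;

# Pack one squared distance per component.
DP3 distSq.x, fragment.texcoord[0], fragment.texcoord[0];
DP3 distSq.y, fragment.texcoord[1], fragment.texcoord[1];
DP3 distSq.z, fragment.texcoord[2], fragment.texcoord[2];
DP3 distSq.w, fragment.texcoord[3], fragment.texcoord[3];

# One vector op computes clamp(1 - d^2 / r^2, 0, 1) for all four lights,
# instead of four separate scalar MADs.
MAD_SAT atten, -distSq, invRadiusSq, one;

# Sum the four attenuation terms and output the total as a grey level.
DP4 result.color, atten, one;
END
```

The same packing trick applies to any per-light or per-channel scalar math that follows an identical instruction sequence.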

Maybe I’m going to write a FAQ: writing fast shaders 8) someday

My personal advice: use NV3x extensions if you’ve got an Nvidia card.

As SirKnight said, there is a big difference between FP on NV and ATI - for complicated code you need separate programs for the two architectures :-(.
I agree with Zengar - multipath ARB_fp for ATI & NV_fp for NV…

And with Det 52.xx there’s almost no difference between fp16 and fp32 (sure, it depends on the program).