CGSL speed issues

Hello everyone!
Please help me with this one, since I’ve been racking my brain over it for quite a while now… :slight_smile:
I tried on gamedev, but didn't get much help there…

I am working on a game right now, and I recently added per-pixel lighting using CGSL.

It looks great, but I have one big problem: speed…

Before I give my specs, I should say that Doom3 runs pretty smoothly on my computer.

My specs are:
CPU - 1.5 GHz
Memory - 384 MB
GPU - nVidia GeForce FX 5900

What I do is much like Doom3:
I have a diffuse map, a normal map and a specular map.

Currently I am rendering a checkerboard model and 26 Quake3 models (with auto-generated normal and specular maps).

My FPS is ~30 right now.

If I change my fragment program to do only the most basic thing (just display the diffuse texture), I get ~50 FPS.

I followed this diagram:
and it seems that I am fragment program (FP) limited.

Here are the FP / VP I use

Does anyone have any idea how to change/optimize the FP to make it faster??

I tried using halfs instead of floats inside the FP and it didn’t change anything.

Thanks for your time. :slight_smile:

PS - if I render everything without CGSL (simple base textures), I get >100 FPS.

Have you tried the same shader using ARB_fragment_program or NV_fragment_program? That way, you can adjust the assembly to see what’s actually costing you. I suppose you could do a bit of that with the CGSL, too: take out one line, then the next line, …

Bah, I was afraid the solution would be something like that, lol :slight_smile:

Thanks a lot for your reply!

If anyone here knows other good ways, please assist me on this one :wink:

On the GeForce FX hardware it helps a lot to normalize your vectors in the fragment shader using a cubemap lookup rather than math. This isn't true for every other GPU out there that supports fragment shaders (Radeons, GeForce 6 series), as they do better with math. Of course you will lose a little bit of quality, but it’s not that bad.
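
For anyone who hasn't built one of these, here is a rough sketch of how the texels of a normalization cubemap could be generated on the CPU. It assumes the standard OpenGL cube map face orientation and an RGB8 encoding (each component mapped from [-1, 1] to 0..255); the names `faceDirection`, `packComponent` and `makeNormalizationFace` are made up for this example and are not from the original poster's code.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Vec3 { float x, y, z; };

// Direction through a texel of the given face, with s and t in [-1, 1].
// Face order matches the OpenGL cube map targets: 0:+X 1:-X 2:+Y 3:-Y 4:+Z 5:-Z.
Vec3 faceDirection(int face, float s, float t) {
    switch (face) {
        case 0:  return {  1.0f, -t, -s };
        case 1:  return { -1.0f, -t,  s };
        case 2:  return {  s,  1.0f,  t };
        case 3:  return {  s, -1.0f, -t };
        case 4:  return {  s, -t,  1.0f };
        default: return { -s, -t, -1.0f };
    }
}

// Map a component from [-1, 1] to an unsigned byte in 0..255.
std::uint8_t packComponent(float v) {
    return static_cast<std::uint8_t>(std::lround((v * 0.5f + 0.5f) * 255.0f));
}

// Build one size x size RGB8 face. Each texel stores the unit vector that
// points at it, so a cubemap lookup with any vector returns it normalized.
std::vector<std::uint8_t> makeNormalizationFace(int face, int size) {
    assert(face >= 0 && face < 6 && size > 0);
    std::vector<std::uint8_t> texels(static_cast<std::size_t>(size) * size * 3);
    for (int y = 0; y < size; ++y) {
        for (int x = 0; x < size; ++x) {
            // Sample at texel centres.
            float s = 2.0f * (x + 0.5f) / size - 1.0f;
            float t = 2.0f * (y + 0.5f) / size - 1.0f;
            Vec3 d = faceDirection(face, s, t);
            float invLen = 1.0f / std::sqrt(d.x * d.x + d.y * d.y + d.z * d.z);
            std::size_t i = (static_cast<std::size_t>(y) * size + x) * 3;
            texels[i + 0] = packComponent(d.x * invLen);
            texels[i + 1] = packComponent(d.y * invLen);
            texels[i + 2] = packComponent(d.z * invLen);
        }
    }
    return texels;
}
```

Each face would then be uploaded with `glTexImage2D` to the matching cube map target, and the fragment program would sample it with the unnormalized vector and expand the result back to [-1, 1] (multiply by 2, subtract 1). The 8-bit storage is where the small quality loss comes from.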

Remember to sort from front to back too. :slight_smile:
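
A minimal sketch of the sort, assuming each object carries a world-space bounding centre. `DrawItem` and `sortFrontToBack` are hypothetical names for illustration; a real engine would store a mesh handle and probably sort per batch rather than per object.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct Vec3 { float x, y, z; };

// Hypothetical per-object record used only for this example.
struct DrawItem {
    int id;
    Vec3 center; // world-space centre of the object's bounds
};

// Squared distance is enough for ordering and avoids the sqrt.
float distanceSquared(const Vec3& a, const Vec3& b) {
    float dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    return dx * dx + dy * dy + dz * dz;
}

// Order opaque objects nearest-first so the depth test can reject
// hidden fragments before the expensive fragment program runs.
void sortFrontToBack(std::vector<DrawItem>& items, const Vec3& cameraPos) {
    auto closer = [&cameraPos](const DrawItem& a, const DrawItem& b) {
        return distanceSquared(a.center, cameraPos) <
               distanceSquared(b.center, cameraPos);
    };
    std::sort(items.begin(), items.end(), closer);
    assert(std::is_sorted(items.begin(), items.end(), closer));
}
```

This only applies to opaque geometry; anything alpha-blended still has to go back to front.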

One thing I think is kinda cool is that you can implement a debug util that will show you how much overdraw you’re getting by using the stencil buffer and alpha blending. Having a lot of overdraw using these fragment shaders will kill you. I don’t know if this is a problem with your app but it’s a good thing to keep in mind.


Here are a few suggestions I have for optimisation:

Firstly, use the command line Cg compiler (cgc). Compile your code and look at the assembly it generates. You can then start playing around with your code to try to reduce the number of instructions, which is generally a good thing. You will find that being clever with swizzles, or avoiding swizzles altogether, can actually reduce the number of instructions.

If you want to take advantage of half and fixed types, you have to compile to the NV profiles. There is now a precision option for ARB_fragment_program, but you will have to look into this to see how useful it is (the latest Cg compiler supports this). Compiling to the NV profiles allows you to see each instruction and whether fixed, half or float precision is being used. Sometimes you have to do an explicit cast to convince the compiler to use the desired precision. Your extraction of the bump normal would be an example where you need to cast. In my experience the speed difference between halfs and floats is roughly a factor of two.

You can also try the NVShaderPerf utility to give you an indication of how efficient your shader is. Unfortunately, I haven’t found any useful guides for diagnosing and improving shader efficiency. I do know that nVidia cards like texture and math instructions to alternate in some way, but that’s about the best guidance I have.

As SirKnight mentioned, using normalisation maps is more efficient on FX cards than maths, with one caveat: if adjacent fragments can have wildly different vectors (a very bumpy surface, for example), it may still be better to normalise with maths due to texture cache misses.

If you have a lot of overdraw, you will probably benefit from laying down the Z buffer in a depth-only pass before doing any shading. Also, this depth-only pass is rendered at double speed on NV cards if certain conditions are met. After you lay down the buffer, remember that you can turn off Z writing.
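
To show why this helps, here is a small CPU-side sketch that counts how often the expensive fragment program would run with and without a depth pre-pass. It is a toy model, not GL code: every `Layer` is assumed to cover all pixels at a constant depth, and the hardware is assumed to reject depth-failing fragments before shading. All names are invented for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// One full-coverage layer at a constant depth; stands in for a mesh.
struct Layer { float depth; };

// Single pass with a LESS depth test: every fragment that passes the
// test is shaded, even if a nearer layer later overwrites it.
std::size_t shadeWithoutPrepass(const std::vector<Layer>& layers, std::size_t pixels) {
    std::vector<float> zbuf(pixels, std::numeric_limits<float>::infinity());
    std::size_t shaded = 0;
    for (const Layer& l : layers)
        for (std::size_t p = 0; p < pixels; ++p)
            if (l.depth < zbuf[p]) { zbuf[p] = l.depth; ++shaded; }
    return shaded;
}

// Two passes: depth only first, then shading with depth writes off and
// an EQUAL depth test, so only the visible fragment of each pixel is shaded.
std::size_t shadeWithPrepass(const std::vector<Layer>& layers, std::size_t pixels) {
    assert(pixels > 0);
    std::vector<float> zbuf(pixels, std::numeric_limits<float>::infinity());
    // Pass 1: lay down depth; no colour writes, no fragment program cost.
    for (const Layer& l : layers)
        for (std::size_t p = 0; p < pixels; ++p)
            if (l.depth < zbuf[p]) zbuf[p] = l.depth;
    // Pass 2: Z writing is off; shade only where depth matches exactly.
    std::size_t shaded = 0;
    for (const Layer& l : layers)
        for (std::size_t p = 0; p < pixels; ++p)
            if (l.depth == zbuf[p]) ++shaded;
    return shaded;
}
```

With three layers drawn back to front over 4 pixels, the single-pass version shades 12 fragments and the pre-pass version shades 4. In OpenGL terms, the first pass would roughly correspond to `glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE)` with a trivial shader, and the second to `glDepthMask(GL_FALSE)` plus `glDepthFunc(GL_EQUAL)` (or `GL_LEQUAL`).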

Lastly, a warning about Cg. I have found that the Cg runtime is really quite inefficient when it comes to CPU usage (I have no idea if this has been addressed in the latest release). You need to reduce the number of calls into Cg as much as possible. If this isn’t currently a problem then cool, but I thought it better to know early.


Speaking of which, what is the current wisdom on whether Cg will be around for the long haul? It seems with HLSL and GLSlang, it’s getting quite squeezed. And it went a looong time without any updates.

(And if anyone has written a GLSlang equivalent to ID3DXEffect, please let me know :slight_smile:)