One thing I’ve found, which is specific to NVidia, is that I can’t noticeably beat the GLSL compiler. I can tie with it, but that’s all I’ve managed to do. Here’s what I’m doing, so you can get a better picture of how the GPU is being used.
Usually, I have 8 textures bound for normal light rendering: 4 textures from the material (bump, normal, diffuse, specular), 1 shadow buffer, 1 light mask, 1 projected texture, and 1 normalization cubemap. Most lighting models are fairly simple, but my most complex one compiles to about 80 instructions. Most of my tests used the more complex models, and each test compared an assembly fragment program using ARB_precision_hint_fastest against a GLSL shader using the half types exposed by the NVidia GLSL compiler. After some pretty exhaustive testing, the best I did was tie the GLSL compiler. I will say that when I dumped out the assembly, the GLSL shader would sometimes have more instructions. However, more instructions don’t translate directly into worse performance. My guess is that the output was simply tailored to the instructions that perform best on NVidia hardware.
I have yet to do any performance tests on ATI hardware, but they seem very confident in their GLSL compiler, so I expect to see similar results on ATI cards.
All this said, I now believe GLSL is the better option. The reason is that the assembly shaders I’ve written were hand-optimized for the card that was in my machine at the time (NVidia). I would bet that an ATI card would run those shaders slower than it would run shaders hand-tuned for ATI. GLSL takes care of this and compiles down to the best shader for the given architecture (at least in theory, and also in practice from what I’ve seen so far). So I’m a firm believer in GLSL now.
One thing to note: I generally prototype shaders in assembly. This lets me see where my clock cycles are going and define how a shader should work in a way that is friendly to the target hardware. Then I do a straightforward port to GLSL, dump out the assembly, and compare it with my prototype. This catches performance issues that wouldn’t have been obvious otherwise. For instance, a matrix-vector multiply translates into either 4 DP4s or a transpose (several MOVs) followed by 4 DP4s. Seeing the generated assembly is a great way to catch these problems (I’ve solved this one by transposing the matrix on the CPU, then reversing the order of the operands passed to the multiply in the shader, which is mul() in Cg and the * operator in standard GLSL).
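The transpose trick works because multiplying a matrix by a vector gives the same result as multiplying the vector by the transposed matrix with the operands swapped. Here is a rough sketch of that identity in plain Python rather than shader code. The helper names (dot4, mat_times_vec, vec_times_mat) and the sample values are made up for illustration, and which of the two forms costs the extra MOVs depends on how a given compiler lays the matrix out in registers, so treat this as a check of the math only.

```python
# Sketch only: plain lists stand in for a 4x4 matrix and a 4-component vector.
# All names here are hypothetical and not part of any shader API.

def dot4(a, b):
    """One DP4: a 4-component dot product."""
    return sum(x * y for x, y in zip(a, b))

def transpose(m):
    """Swap rows and columns of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]

def mat_times_vec(m, v):
    """M * v: each output component is a DP4 of one row of M with v."""
    return [dot4(row, v) for row in m]

def vec_times_mat(v, m):
    """v * M: each output component is a DP4 of v with one column of M."""
    return [dot4(v, col) for col in zip(*m)]

M = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
v = [1, 0, 2, 1]

direct = mat_times_vec(M, v)               # original operand order
swapped = vec_times_mat(v, transpose(M))   # transposed on the "CPU", operands reversed

print(direct)
print(swapped)
```

Both calls produce the same four values, so doing the transpose once on the CPU and flipping the operand order in the shader leaves the result unchanged while letting the compiler emit just the four dot products.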