Fragment Shader Optimization


I have a fragment shader I’m trying to squeeze better performance from. I have tried adding an if statement around a large block of code, but oddly enough it has no effect. The if statement works from a visual point of view, but in terms of performance, whether the statement is true or false makes no difference.


In my experience, the best way to optimize fragment shaders is to offload whatever you can to units that aren’t heavily loaded. This means moving linear calculations to the vertex shader, so they are interpolated across the face of the triangle for immediate use in your fragment shader. It also means using a normalization cube map instead of an RSQ/MUL sequence. I’ve found a normalization cube map lookup takes about 3-5 clock cycles and happens in parallel with ALU operations (as long as they don’t depend on the result of the lookup). You’ll pay a price if the read is dependent, though. For that reason, I’d recommend only using a normalization cube map on vectors you can interpolate across the face of each triangle (light vector, eye vector, half vector, etc). Have fun and good luck! =)

Kevin B
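To illustrate the point about linear calculations, here is a small CPU-side sketch in Python (not shader code). It checks that an affine operation gives the same result whether it runs per vertex and is then interpolated, or runs per fragment on the interpolated input. It also shows why interpolated vectors still need renormalizing in the fragment shader. The vectors, the `scale_bias` operation, and the helper names are made up for the example.

```python
import math

def lerp3(a, b, t):
    """Linearly interpolate between two 3-component vectors."""
    return tuple(x + (y - x) * t for x, y in zip(a, b))

def length(v):
    return math.sqrt(sum(x * x for x in v))

def normalize(v):
    n = length(v)
    return tuple(x / n for x in v)

def scale_bias(v, scale=0.5, bias=0.25):
    """A hypothetical affine per-component operation (stand-in for 'linear' shader math)."""
    return tuple(x * scale + bias for x in v)

# Two unit-length per-vertex vectors (hypothetical values) and a midpoint fragment.
v0 = (1.0, 0.0, 0.0)
v1 = (0.0, 1.0, 0.0)
t = 0.5

# Affine work commutes with interpolation, so it can move to the vertex shader.
per_vertex = lerp3(scale_bias(v0), scale_bias(v1), t)
per_fragment = scale_bias(lerp3(v0, v1, t))
print(per_vertex == per_fragment)  # True

# Normalization does not commute: the interpolated vector is shorter than 1,
# which is why the fragment stage still has to renormalize it somehow.
interp = lerp3(v0, v1, t)
print(round(length(interp), 3))    # 0.707
```

The second result is the reason the renormalization (by cube map lookup or by arithmetic) stays in the fragment shader even after everything linear has been moved out.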

Also, about the if statement, what card are you working on? On NVIDIA, the card effectively renders pixels four at a time in 2x2 pixel squares. In dynamic control flow situations, all four pixels will take as long as the longest pixel. If one of those four pixels takes a branch and the other three don’t, then the other three will simply NOP until the branch completes. They then begin to execute in parallel again. So if you’re using dynamic control flow for something such as stippling, you’ll see little to no gain, and possibly even a loss. It’s best to only use control flow when you know a large number of fragments in close proximity to each other will be rejected at the same time (branching on attenuated+shadowed light color for multipass lighting comes to mind).

Kevin B
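A toy cost model of the lockstep behaviour described above, written in Python. It is only a sketch: the cost numbers are arbitrary, and the assumption that a whole group pays for the branch whenever any one fragment takes it is a simplification of real hardware. The `group` parameter is the edge length of the lockstep block (2 for the 2x2 case).

```python
def group_cost(taken_flags, branch_cost, base_cost=1):
    """One lockstep group: everyone waits if any fragment takes the branch."""
    return base_cost + (branch_cost if any(taken_flags) else 0)

def frame_cost(mask, group=2, branch_cost=100):
    """mask[y][x] is True where the fragment takes the expensive branch."""
    h, w = len(mask), len(mask[0])
    total = 0
    for y in range(0, h, group):
        for x in range(0, w, group):
            flags = [mask[yy][xx]
                     for yy in range(y, min(y + group, h))
                     for xx in range(x, min(x + group, w))]
            total += group_cost(flags, branch_cost)
    return total

SIZE = 8
# Stipple pattern: a checkerboard, so every 2x2 group contains a fragment that branches.
stipple = [[(x + y) % 2 == 0 for x in range(SIZE)] for y in range(SIZE)]
# Coherent pattern: the same number of branching fragments, but all in the left half.
coherent = [[x < SIZE // 2 for x in range(SIZE)] for y in range(SIZE)]
# Reference: every fragment runs the expensive block (no branch at all).
everything = [[True] * SIZE for _ in range(SIZE)]

print(frame_cost(everything))  # 1616
print(frame_cost(stipple))     # 1616, no gain from the branch
print(frame_cost(coherent))    # 816, half the groups skip the block
```

Both masks reject exactly half of the fragments, yet only the spatially coherent one gets cheaper in this model.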

Another hint is for ATI cards - use the ‘early z-out’ feature.
In other words: render polygons with the most complex shaders last. You may also render polygons starting from the nearest and ending with those farthest away - this will also let early z-out do its job.
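A rough Python sketch of why draw order matters for early z rejection. It assumes opaque layers that each cover every pixel, a "less than" depth test, and a fragment shader that is skipped whenever the depth test fails. The depth values are invented for the example.

```python
def shaded_fragments(draw_order_depths, pixels=4):
    """Count fragment-shader runs when full-coverage layers are drawn in the given order."""
    zbuffer = [float("inf")] * pixels
    runs = 0
    for depth in draw_order_depths:
        for p in range(pixels):
            if depth < zbuffer[p]:   # depth test passes, so the fragment is shaded
                runs += 1
                zbuffer[p] = depth
            # otherwise the fragment is rejected before any shading work
    return runs

layers = [0.9, 0.2, 0.5, 0.7]  # hypothetical depths, smaller means nearer

print(shaded_fragments(sorted(layers)))                # 4  (front to back)
print(shaded_fragments(layers))                        # 8  (unsorted)
print(shaded_fragments(sorted(layers, reverse=True)))  # 16 (back to front)
```

Drawn front to back, only the nearest layer is ever shaded; drawn back to front, every layer is shaded at every pixel.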

And about moving code to the vertex shader - it’s a good approach, but be careful not to run out of interpolators. This happened to me on an ATI card - I had 8 varying variables passed from the vertex to the fragment shader, and gl_FragCoord used in the fragment shader. The shader compiled and linked with no errors/warnings but it didn’t work (I assume it’s a driver bug).
Try not to use more than 8 interpolators - you never know what combination of card/driver the end user will have.
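One way to keep an eye on this is a small lint step over the shader source before it ships. The Python sketch below is only that: it counts names declared with `varying` using a simple regex (no arrays, no matrices, no precision qualifiers), and it assumes that gl_FragCoord costs one extra interpolator, which is a guess based on the failure described above rather than a documented rule. The budget of 8 is the figure from the post; the shader text and function names are hypothetical.

```python
import re

MAX_INTERPOLATORS = 8  # budget suggested above; real limits depend on card and driver

def count_varyings(glsl_source):
    """Count names declared with the `varying` qualifier (simple declarations only)."""
    count = 0
    pattern = r"^\s*varying\s+\w+\s+([^;]+);"
    for names in re.findall(pattern, glsl_source, flags=re.MULTILINE):
        count += len([n for n in names.split(",") if n.strip()])
    return count

def estimated_interpolators(glsl_source):
    """Varyings plus one for gl_FragCoord (the extra slot is an assumption)."""
    extra = 1 if "gl_FragCoord" in glsl_source else 0
    return count_varyings(glsl_source) + extra

# Hypothetical fragment shader used only to exercise the check.
sample = """
varying vec3 lightVec;
varying vec3 eyeVec, halfVec;
varying vec2 uv;
void main() {
    gl_FragColor = vec4(gl_FragCoord.xy * 0.001, uv);
}
"""

used = estimated_interpolators(sample)
print(used)                       # 5
print(used <= MAX_INTERPOLATORS)  # True
```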

Look at this topic if you want to know what exactly happened with those interpolators on ATI:;f=11;t=001042

What card are you working on? NVIDIA cards can end up ignoring ifs completely (especially if the block of code is not too large), since the thread size is big on these cards (about 30x30 on GF6 and 10x10 on GF7). ATI X1K cards, on the other hand, tend to use if() instructions even with small blocks of code to exclude, since they have a 4x4 thread size (ATI X1800). And I have indeed seen a performance gain from this (I was doing a ray-triangle intersection and used an if() block to provide an early-out; this improved the performance of the entire shader by 10%).

However, recently I have encountered a phenomenon I can hardly explain. There was a large (hundreds of lines) block of code in the shader, and an if() around it. By all common sense, the compiler should’ve inserted that if() to avoid such a large block of code whenever possible. However, with the latest 6.4 Catalyst release, it hasn’t done so. I wondered why such a horrible performance loss occurred, and, by commenting out the ifs, I understood that the compiler simply ignored them (or mapped them to predication, which is also possible). Something like #pragma optimize(off) did not help, and I’ve written a letter to ATI (the shader code - and that is some 400 lines - included). I haven’t got any answer :(

Thanks everyone for your help and suggestions. I have spent a fair amount of time replacing if statements with preprocessor #ifdefs, or with mathematical logic. It turns out that I was able to remove every single if statement from both my vertex and fragment shaders, and I have seen huge speed improvements. The trick to making this effective is recompiling the shader any time a key option is changed, which is pretty easy to do in my case.
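As an illustration of what "mathematical logic" in place of an if can look like, here is a Python sketch that mimics the GLSL built-ins step() and mix(). The actual replacements used in the shaders above are not shown in the thread, so this is only one common pattern, with made-up values. Note that both inputs are always evaluated, so it pays off only when both sides are cheap to compute.

```python
def step(edge, x):
    """Mimics GLSL step(): 0.0 if x < edge, otherwise 1.0."""
    return 0.0 if x < edge else 1.0

def mix(a, b, t):
    """Mimics GLSL mix(): linear blend a*(1-t) + b*t."""
    return a * (1.0 - t) + b * t

def shade_branchy(n_dot_l, lit, shadowed):
    """Reference version with an explicit branch."""
    if n_dot_l >= 0.0:
        return lit
    return shadowed

def shade_branchless(n_dot_l, lit, shadowed):
    """Same selection expressed as arithmetic: step() yields the 0/1 weight for mix()."""
    return mix(shadowed, lit, step(0.0, n_dot_l))

for n_dot_l in (-1.0, -0.25, 0.0, 0.5, 1.0):
    assert shade_branchy(n_dot_l, 0.8, 0.1) == shade_branchless(n_dot_l, 0.8, 0.1)

print(shade_branchless(0.5, 0.8, 0.1))   # 0.8
print(shade_branchless(-0.5, 0.8, 0.1))  # 0.1
```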

One of the biggest ways this helped was in setting the max number of lights for the light calculation loop. Previously I was passing this value in as a uniform, looping through a fixed number, and testing the light visibility for each. It is -so much- more effective to simply define this value as a constant using the preprocessor.
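A minimal host-side sketch of that approach in Python: build the shader source with the options baked in as #defines, and cache one compiled variant per option set so a recompile happens only when a key option changes. The shader body, the `USE_FOG` flag, and the function names are invented for the example, and `compile_fn` is a stand-in for whatever actually compiles and links the shader in your application. If the real shader starts with a #version line, the defines would have to go after it.

```python
# Hypothetical shader body; MAX_LIGHTS and USE_FOG are supplied by the generated header.
SHADER_BODY = """\
uniform vec3 lightColor[MAX_LIGHTS];
void main() {
    vec3 sum = vec3(0.0);
    for (int i = 0; i < MAX_LIGHTS; ++i) {
        sum += lightColor[i];
    }
#ifdef USE_FOG
    sum *= 0.5;
#endif
    gl_FragColor = vec4(sum, 1.0);
}
"""

def build_source(max_lights, flags=()):
    """Prepend the compile-time options to the shader body."""
    header = ["#define MAX_LIGHTS %d" % int(max_lights)]
    header += ["#define %s" % flag for flag in sorted(flags)]
    return "\n".join(header) + "\n" + SHADER_BODY

_variant_cache = {}

def get_variant(max_lights, flags=(), compile_fn=lambda src: src):
    """Return the compiled variant for these options, compiling only on first use."""
    key = (int(max_lights), frozenset(flags))
    if key not in _variant_cache:
        _variant_cache[key] = compile_fn(build_source(max_lights, flags))
    return _variant_cache[key]

source = get_variant(4, ["USE_FOG"])
print(source.splitlines()[0])  # #define MAX_LIGHTS 4
print(source.splitlines()[1])  # #define USE_FOG
```

With the light count fixed at compile time, the loop bound is a constant the compiler can unroll, and the #ifdef blocks disappear entirely from variants that don't use them.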

I have a Quadro FX 3450, for those that still wish to know.
