Fragment shader optimization tips needed!

This doesn’t mean anything on its own, because what a given change in FPS costs depends on what your base FPS is. That’s one reason game developers do not talk in terms of FPS, but rather milliseconds per frame. There are many blog posts and articles on this topic.
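To put numbers on the FPS-versus-milliseconds point, here is a tiny sketch. The function names are mine and purely illustrative; it is nothing more than the 1000 / FPS conversion.

```cpp
#include <cassert>

// Frame time in milliseconds for a given frame rate.
double frame_time_ms(double fps) {
    assert(fps > 0.0);
    return 1000.0 / fps;
}

// Extra milliseconds per frame incurred when the frame rate drops
// from fps_before to fps_after.
double added_cost_ms(double fps_before, double fps_after) {
    return frame_time_ms(fps_after) - frame_time_ms(fps_before);
}
```

Dropping from 900 to 450 FPS adds about 1.1 ms per frame, while dropping from 60 to 30 FPS adds about 16.7 ms. Both are a "50% FPS loss", but the second one is fifteen times more expensive.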

It’s going to depend on your specific game and what its primary bottleneck is.

Add some switches to your app to switch on/off specific stages or features in your pipeline (textures, complex shaders, state change groups, whole database layers, etc.) and see which one makes the biggest change in your frame time (not FPS!). Go after that one. Optimize it. And rinse/repeat until satisfied.

The one thing you won’t easily be able to test with this is the potential benefit of better batching (fewer draw calls), so just keep that in mind. Though disabling the state changes between them will give you a clue.
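The switch-it-off-and-measure loop described above could be sketched roughly like this. It assumes you have already measured an average frame time with everything enabled and one with each feature disabled in turn; the feature names and the helper `biggest_win` are hypothetical. When you take those measurements on the CPU, make sure the GPU has actually finished the frame first (for example with `glFinish()` or timer queries), otherwise you are mostly timing command submission.

```cpp
#include <cassert>
#include <map>
#include <string>

// Given the baseline frame time (everything enabled) and the frame time
// measured with each feature switched off in turn, return the name of the
// feature whose removal saved the most milliseconds.
// Returns an empty string if disabling nothing made the frame faster.
std::string biggest_win(double baseline_ms,
                        const std::map<std::string, double>& ms_with_feature_off) {
    assert(baseline_ms > 0.0);
    std::string best;
    double best_saving_ms = 0.0;
    for (const auto& entry : ms_with_feature_off) {
        const double saving_ms = baseline_ms - entry.second;
        if (saving_ms > best_saving_ms) {
            best_saving_ms = saving_ms;
            best = entry.first;
        }
    }
    return best;
}
```

With a 16.0 ms baseline and measurements of 14.0 ms (textures off), 9.5 ms (complex shaders off) and 15.5 ms (state changes off), this points you at the complex shaders first.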

Also keep in mind that while some of your bottlenecks may happen steady-state (i.e. every frame, like clockwork), some of them will be pop-up bottlenecks instigated by some irregular task like texture uploading. Those often spike your frame time high for one frame and are the biggest hit to the user’s experience with your game (a stutter makes a game feel like garbage). Don’t neglect those! If your game doesn’t render butter-smooth at whatever FPS you’re targeting, you’ve got a bottleneck to isolate and get rid of.
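One simple way to catch those pop-up bottlenecks is to log every frame time and flag the outliers instead of only looking at the average. A minimal sketch, assuming you already collect per-frame times in milliseconds; the function name and the 1.5x threshold are arbitrary choices of mine, not a standard.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Return the indices of frames that took longer than `factor` times the
// median frame time. The median (not the mean) is used as the reference so
// that the spikes themselves do not drag the threshold upward.
std::vector<std::size_t> find_spikes(const std::vector<double>& frame_ms,
                                     double factor = 1.5) {
    assert(factor > 1.0);
    std::vector<std::size_t> spikes;
    if (frame_ms.empty()) return spikes;

    std::vector<double> sorted(frame_ms);
    std::sort(sorted.begin(), sorted.end());
    const double median = sorted[sorted.size() / 2];  // upper median for even counts

    for (std::size_t i = 0; i < frame_ms.size(); ++i) {
        if (frame_ms[i] > factor * median) spikes.push_back(i);
    }
    return spikes;
}
```

Once you know which frames spiked, correlate them with what your app was doing at the time (texture uploads, shader compiles, buffer allocation, and so on).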

Ok. Why do you only call out the fragment shader here? What about the vertex shader(s)? How many vertex shader changes are there per frame? Are you using separate shader objects? If so, you could be leaving performance on the table.

And backing up a step, your mention of fragment shader suggests that you think you are fragment limited. Have you tested that? Do you see a roughly linear decrease in frame time with reduced pixel count?
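A rough way to quantify that test: render the same scene at two resolutions and compare how much frame time you saved against how many pixels you removed. This is only a sketch with a made-up helper name, and the interpretation thresholds are a judgment call.

```cpp
#include <cassert>

// Fraction of the "ideal" (perfectly linear) saving that was actually
// achieved when the pixel count was reduced.
//   ~1.0 : frame time tracks pixel count, consistent with being fragment/fill limited
//   ~0.0 : frame time barely moved, so the bottleneck is somewhere else
double pixel_scaling_ratio(double ms_full, double pixels_full,
                           double ms_reduced, double pixels_reduced) {
    assert(ms_full > 0.0 && pixels_full > 0.0);
    assert(pixels_reduced > 0.0 && pixels_reduced < pixels_full);
    const double time_saved = (ms_full - ms_reduced) / ms_full;
    const double pixels_cut = (pixels_full - pixels_reduced) / pixels_full;
    return time_saved / pixels_cut;
}
```

For example, going from 1920x1080 at 20 ms to 960x540 at 6 ms gives a ratio of about 0.93, which looks fragment limited. If the smaller window still takes 19.5 ms, the ratio is about 0.03 and the fragment shader is not your problem.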

Which of your 4 passes consumes the most time? Go after that one!

In terms of shader performance, one thing to be cautious of with ubershaders like this (one big shader, with conditionals evaluated at run time rather than compile time) is that every shader invocation has to reserve the worst-case number of shader core registers, whichever branches it actually takes. That means fewer shader invocations can run in parallel on the shader multiprocessors. Less parallelism = less latency hiding potential = more chance that your shaders will stall the compute units on memory accesses = potentially lower triangle/pixel throughput during rendering. You might compare performance against a test case where you change the terms in your conditional expressions from uniforms to constants (const). If AMD’s GLSL compiler is like NVIDIA’s, this’ll cause a lot of dead code elimination, fewer total shader registers consumed, and smaller/tighter shaders with better performance potential.
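To run that uniform-versus-const comparison without maintaining two shader files, you can generate the declaration on the host before compiling. The sketch below is an assumption about how your shader might look: `USE_DETAIL`, the sampler names and the shader body are all made up for illustration. The resulting string is what you would hand to `glShaderSource`.

```cpp
#include <string>

// Build a fragment shader in which the feature flag is either a run-time
// uniform (ubershader path) or a compile-time constant (specialized variant).
// With the constant form, the compiler is free to drop the dead branch.
std::string make_fragment_source(bool specialize, bool use_detail) {
    std::string src = "#version 330 core\n";
    if (specialize) {
        src += std::string("const bool USE_DETAIL = ") +
               (use_detail ? "true" : "false") + ";\n";
    } else {
        src += "uniform bool USE_DETAIL;\n";
    }
    src +=
        "uniform sampler2D base_map;\n"
        "uniform sampler2D detail_map;\n"
        "in vec2 uv;\n"
        "out vec4 frag_color;\n"
        "void main() {\n"
        "    vec4 c = texture(base_map, uv);\n"
        "    if (USE_DETAIL) {\n"
        "        c *= texture(detail_map, uv * 8.0);\n"
        "    }\n"
        "    frag_color = c;\n"
        "}\n";
    return src;
}
```

Compile one program per flag combination you actually use, time each against the uniform version, and see whether the specialized variants are measurably faster on your hardware.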

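This isn’t what you asked about, but be careful about cases like this where you’re doing filtered texture lookups inside conditionals. If neighboring fragments can take different branches (non-uniform control flow), the implicit derivatives the lookup uses to pick a mip level are undefined.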
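One common workaround is to take the derivatives before the branch, where every fragment executes them, and pass them to `textureGrad` inside it. Here is a sketch of such a shader, held in a C++ string the way you might embed it in your app. I have not seen your shader, so `mask`, `detail_map` and `uv` are hypothetical names.

```cpp
#include <string>

// Fragment shader sketch: the derivatives are computed outside the
// divergent branch and supplied explicitly to textureGrad inside it,
// so the lookup does not rely on implicit derivatives.
const std::string kSafeBranchSampling = R"glsl(#version 330 core
uniform sampler2D detail_map;
in vec2 uv;
in float mask;            // varies per fragment, so the branch below can diverge
out vec4 frag_color;
void main() {
    // Evaluated by every fragment, before any divergent control flow.
    vec2 duv_dx = dFdx(uv);
    vec2 duv_dy = dFdy(uv);

    vec4 c = vec4(0.0);
    if (mask > 0.5) {
        // Explicit gradients: no implicit derivative is needed here.
        c = textureGrad(detail_map, uv, duv_dx, duv_dy);
    }
    frag_color = c;
}
)glsl";
```

The other option is simply to do the `texture()` lookup unconditionally ahead of the `if` and only use the result inside it, at the cost of a fetch you may not need.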