Fragment shader optimization tips needed!

My (simple) game engine has one big fragment shader for the whole scene. Right now, with multiple objects, collision detection and bloom post-processing (4 linear passes) I get frame times of ~18ms at 1080p on my AMD RX550 (2GB) - which is quite okay for me.

But I think that I can squeeze more fps out of my setup. Since increasing the window size dramatically influences the frame times, I figure that the issue might be the fragment shader.

It works a little bit like this:

void main()
{
    outputcolor = objectbasecolor; // from uniform
    if(switchForTextureDiffuse > 0) // >0 means "use a texture"
    {
        outputcolor *= texture(uniformTexture, vTextureCoords);
    }

    // the same goes for normal map, specular map and emissive map
    // with the needed calculations only being done if a switch
    // (i.e. switchForNormalMapping > 0) is set to 1.

    for(int i = 0; i < lightcount; i++) // lightcount is a uniform as well
    {
        // do light calculations (dot product, use light color, calculate falloff, etc.)
        // and add the illumination to the color output
    }
}

Now, I know that if/else is not a good idea, especially in a fragment shader. But I do not want to rebuild this shader completely from scratch only to find out that the ifs/elses did no harm to my frame times at all.

In your experience, what really improves frame times?

  1. Use texture compression (like DXT1, DXT5) instead of just uncompressed bitmaps?
  2. Get rid of all ifs and elses by creating a lot of different shaders (like one for only diffuse and normal maps and another shader for using diffuse, normal and specular maps)?
  3. Reduce the number of frame buffers for bloom effect (right now I have the normal framebuffer, one for vertical blur and one for horizontal blur) by creating more texture attachments for only one framebuffer?

Point 2 would be a lot of work, so I need to be sure that it really has an effect on my frame times. Also, I would have to switch GL programs a lot, because not all of my models have normal or specular maps. And I read that switching render programs also takes a lot of time.

Cheers and thanks for your inputs!

P.S.: I know deferred lighting would also be a huge performance increase if there are many light sources in the scene, but the number of lights will not be > 3.

The if/else shouldn’t matter, provided that the condition is uniform.

Texture compression could help if memory bandwidth were the limiting factor. You can test this by changing GL_TEXTURE_LOD_BIAS with glTexParameterf(); if a small change to the LoD bias results in a change in frame time, memory bandwidth is a limiting factor.

Probably insignificant. Also, you don’t necessarily need the conditionals; if you want a constant colour, you can just use a 1x1 texture. OTOH, there’s no point calculating the TBN matrix if you aren’t using a normal map.

GClements, thank you very much for your reply.
I did not know that GL_TEXTURE_LOD_BIAS can be used to check if memory bandwidth is an issue. That’s great advice!

But your reply also leaves me a bit confused, because my card can run 2016’s Doom at 1080p on medium settings with ~40fps - a game that has significantly more dynamic shadows, more polygons, more moving objects in screen space, more particle effects, DOF and bloom post-processing, more everything.
My scene is so basic (three dynamic lights, two of them casting shadows, and very simple geometry; see screenshot), yet I only get ~15 fps more. I know, these AAA titles are optimized to the brim, but my game is simpler than any game released 10 years ago. I assumed that I’d have > 200fps, to say the least.

Getting "~15 fps more" doesn’t mean anything by itself, because the amount of frame time that an FPS difference represents depends on what your base FPS is. That’s one reason game developers do not talk in terms of FPS, but rather milliseconds. It’s worth looking up a few of the many blog posts and articles on this topic.
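To make the FPS-versus-milliseconds point concrete, here is a small arithmetic sketch using the figures quoted in the thread (~40 fps for Doom, ~15 fps more for the custom engine, and the hoped-for 200 fps). The helper name `frame_time_ms` is made up for illustration.

```python
def frame_time_ms(fps: float) -> float:
    """Convert frames per second to milliseconds per frame."""
    return 1000.0 / fps

# Figures quoted in the thread: Doom at ~40 fps, the custom engine ~15 fps faster.
doom_ms = frame_time_ms(40.0)
engine_ms = frame_time_ms(55.0)
saved_at_low_fps = doom_ms - engine_ms

# The same +15 fps step, but starting from the hoped-for 200 fps.
saved_at_high_fps = frame_time_ms(200.0) - frame_time_ms(215.0)

print(f"40 -> 55 fps saves {saved_at_low_fps:.2f} ms per frame")
print(f"200 -> 215 fps saves {saved_at_high_fps:.2f} ms per frame")
```

The same "+15 fps" is worth several milliseconds at 40 fps but only a fraction of a millisecond at 200 fps, which is why frame times are the more useful unit when comparing.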

It’s going to depend on your specific game and what its primary bottleneck is.

Add some switches to your app to switch on/off specific stages or features in your pipeline (textures, complex shaders, state change groups, whole database layers, etc.) and see which one makes the biggest change in your frame time (not FPS!). Go after that one. Optimize it. And rinse/repeat until satisfied.
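A minimal sketch of how the results of such on/off switches could be compared. All numbers below are hypothetical placeholders (only the ~18 ms baseline comes from the original post), and `rank_by_savings` is a made-up helper, not part of any engine.

```python
# Hypothetical measurements: average frame time in ms with ONE feature disabled.
baseline_ms = 18.0
disabled_ms = {
    "bloom": 15.5,
    "shadow maps": 13.0,
    "normal mapping": 17.4,
    "texture lookups": 16.8,
}

def rank_by_savings(baseline, measurements):
    """Return (feature, ms saved) pairs, biggest saving first."""
    savings = {name: baseline - ms for name, ms in measurements.items()}
    return sorted(savings.items(), key=lambda item: item[1], reverse=True)

ranking = rank_by_savings(baseline_ms, disabled_ms)
for name, saved in ranking:
    print(f"{name:16s} saves {saved:4.1f} ms ({saved / baseline_ms:.0%} of the frame)")
```

Whatever ends up at the top of the list is the first candidate to optimize; after fixing it, the measurements would need to be repeated, since the ranking can change.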

The one thing you won’t easily be able to test with this is the potential benefit of better batching (fewer draw calls), so just keep that in mind. Though disabling the state changes between them will give you a clue.

Also keep in mind that while some of your bottlenecks may happen steady-state (i.e. every frame, like clockwork), some of them will be pop-up bottlenecks instigated by some irregular task like texture uploading. Those often spike your frame time high for one frame and are the biggest hit to the user’s experience with your game (a stutter makes a game feel like garbage). Don’t neglect those! If your game doesn’t render butter-smooth at whatever FPS you’re targeting, you’ve got a bottleneck to isolate and get rid of.
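One possible way to spot such pop-up bottlenecks is to log every frame time and flag the outliers. The sketch below assumes a plain list of per-frame times in milliseconds; the capture data, the `find_spikes` helper and the 2x-median threshold are all illustrative assumptions.

```python
from statistics import median

def find_spikes(frame_times_ms, factor=2.0):
    """Return (frame index, ms) for frames slower than factor * median."""
    typical = median(frame_times_ms)
    return [(i, t) for i, t in enumerate(frame_times_ms) if t > factor * typical]

# Hypothetical capture: steady ~18 ms frames with two stalls.
capture = [18.1, 17.9, 18.0, 18.2, 52.4, 18.0, 17.8, 18.1, 41.0, 18.0]
spikes = find_spikes(capture)
print(spikes)
```

The median is used instead of the mean so that the spikes themselves do not raise the threshold. The flagged frame indices can then be matched against what the application was doing at that moment (for example, uploading a texture).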

Ok. Why do you only call out the fragment shader here? What about the vertex shader(s)? How many vertex shader changes are there per frame? Are you using separate shader objects? (If so, you could be leaving performance on the table.)

And backing up a step, your mention of fragment shader suggests that you think you are fragment limited. Have you tested that? Do you see a roughly linear decrease in frame time with reduced pixel count?
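A rough way to check the "linear with pixel count" question is to measure the average frame time at a few window sizes and fit a straight line through the points. The measurements below are hypothetical placeholders, and `fit_line` is a plain least-squares helper written for this sketch.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical measurements: (width, height) -> average frame time in ms.
samples = {(960, 540): 6.0, (1280, 720): 9.0, (1600, 900): 13.0, (1920, 1080): 18.0}

megapixels = [w * h / 1e6 for (w, h) in samples]
times_ms = list(samples.values())
fixed_ms, ms_per_megapixel = fit_line(megapixels, times_ms)

print(f"fixed cost     ~ {fixed_ms:.1f} ms per frame")
print(f"per-pixel cost ~ {ms_per_megapixel:.1f} ms per megapixel")
```

If the fixed part is small and the per-megapixel part dominates, the frame is probably limited by per-pixel work (fragment shading, fill rate or bandwidth). A large fixed part points instead at vertex work, draw-call overhead or the CPU side.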

Which of your 4 passes consumes the most time? Go after that one!

In terms of shader performance, one thing to be cautious of with ubershaders like this (one big shader, with run-time evaluation of conditionals rather than compile time) is that the shader has to consume the worst case number of shader core register slots for each shader invocation. That means fewer shaders can run in parallel on the shader multiprocessors. Less parallelism = less latency hiding potential = more chance that your shaders will stall the compute units on memory accesses = potentially lower triangle/pixel throughput during rendering. You might compare performance against a test case where you turn the terms in your conditional expressions from "uniform"s to "const"ants. If AMD’s GLSL compiler is like NVIDIA’s, this’ll cause a lot of dead code elimination, fewer total shader registers consumed, and smaller/tighter shaders with better performance potential.

This isn’t what you asked about, but be careful about cases like this where you’re doing filtered texture lookups in conditionals. The derivatives are undefined.

Thank you for your detailed answer. I see that I need to learn so much more about how OpenGL and computer graphics as a whole work.
Maybe I have really bad code that slows everything down.

Right now, it works like this:

glBindFramebuffer(..., #id of multisample framebuffer); 
foreach(GameObject g in ListOfGameObjects)
{
    // first: glUseProgram(..); // may be #1 or #2 (I have a shader with no light calculations and one with everything)
    // then: upload matrices (normal, model-view-projection and model matrix) to shader with glUniform...
    // then: upload shadow mapping texture to shader
    // then: upload light positions and colors to shader
    
    foreach(Mesh m in g.Meshes)
    {
        // then: upload mesh texture and other specifics
        // then: draw call
        // then: unbind textures with glBindTexture(GL_TEXTURE_2D, 0);
    }
    // glUseProgram(0);
}

downsampleMultisampleFramebuffer(); // this blits both color attachments (#0 contains scene, #1 contains pixels to blur later) to a single sample framebuffer (also with two color attachments)
applyBloom(); // 4 linear passes with a small kernel of ~5 texture offsets for each pass, bloom texture is 1/4 of the window resolution
// applyBloom() writes the final iteration to framebuffer #0 for screen output

I do not know if this is insane or quite okay. I don’t know if I really have to bind and unbind textures this often and if a call to glUseProgram() really is that expensive.

I will read through the websites you posted. I will post an update when I find out more.

But one final question:

"The derivatives are undefined."

What does that mean? And what are shader objects?

P.S.: The frame times increase with increasing resolution. That’s why I thought the fragment shader might be the issue.
The fragment shader code is at https://github.com/KWEngine/KWEngine2/blob/master/KWEngine2/Shaders/shader_fragment.glsl

They are only undefined in non-uniform control flow. If the condition is dynamically uniform, then the control flow is uniform and the derivatives are fine.

He’s referring to this.

Ah, I see. That might be the cause of a strange 1282 error I had after calling glDrawElements().
Turned out, I did not always provide the shadow mapping texture. Nvidia and Intel GPUs ignored this, but on my Radeon card, the scene was completely black and the error showed up.

No, it isn’t. You don’t get INVALID_OPERATION errors for missing textures.

Hmm, but after binding the texture correctly the error was gone. And it was only on Radeon cards. Nvidia and Intel gave no error and rendered everything as usual… Man, I really don’t know much about that. ;)

This isn’t related to binding the texture per se, but note that draw calls can generate GL_INVALID_OPERATION if you have sampler uniforms of different types (e.g. sampler2D and sampler2DShadow) which refer to the same texture image unit.

glValidateProgram can be used to detect this. glLinkProgram can’t detect it because it’s an issue with the environment in which the program is executed rather than with the program itself.

GClements, this (or something like this) must have been the case! I am quite sure that I did not use a specific texture unit twice (like texture unit 1 for both a sampler2DShadow and a sampler2D), but I had some texture reads inside the shader that were executed only if a certain integer was set to 1. If (on the C# side of my application) I was sure that this integer would be 0, I did not bother uploading the texture for these texture reads. This somehow resulted in error 1282.
As soon as I bound the texture for both cases, the error was gone. Maybe it was that way because the texture reads were put in a separate method?
Anyway, I will try and rebuild my shader from scratch and put in all of your suggested changes. Thank you all for your very helpful comments!