Deferred Shading Performance Problems

If you’ve got a bunch of calls to an API, each of which does so very, very little? Oh yeah! You want to minimize your state-change API calls.

Plus, as aqnuep pointed out, being limited to only 8 light sources per batch is not good if you’re talking about potentially a lot of lights.

Plus, some of those fixed-function light-related calls can direct the driver to recompile, relink, and re-upload shaders under the hood while you’re rendering – not pretty. You’re at the mercy of the driver for how many shader permutations it keeps track of and avoids rebuilding.

Shader-based lighting gives you a lot more control and efficiency potential.

[QUOTE=Dark Photon;1239095]I betcha you have and have just forgotten :wink:

Remember GLSL pre-1.2 and gl_LightSource[0].diffuse, etc. (and similar Cg syntax in the arb profiles?) – training wheels that let you implement fixed function pipeline and variants in shaders?[/QUOTE]

No, I mean I know it’s possible, but I’ve never heard of anybody using those in a deferred renderer. That’s kind of new to me.

I tried uniform arrays, but they were even slower than the gl_LightSource[…] stuff. So now I am trying UBOs; the shader compiles just fine (no errors), but the program fails to validate and returns a garbage error log length. Any ideas as to what may cause this?

The shader:


// G buffer
uniform sampler2D gPosition;
uniform sampler2D gDiffuse;
uniform sampler2D gSpecular;
uniform sampler2D gNormal;

// Specularity info
uniform vec3 viewerPosition;
uniform float shininess;

uniform Light
{
    vec3 position;
    vec3 color;
    float range;
    float intensity;
} lightData[16];

uniform int numLights;

uniform vec3 lightAttenuation;

void main()
{
    vec3 worldPos = texture2D(gPosition, gl_TexCoord[0].st).xyz;
    vec3 worldNormal = texture2D(gNormal, gl_TexCoord[0].st).xyz;

    vec3 sceneDiffuse = texture2D(gDiffuse, gl_TexCoord[0].st).rgb;
    vec3 sceneSpecular = texture2D(gSpecular, gl_TexCoord[0].st).rgb;

    vec4 finalColor = vec4(0.0, 0.0, 0.0, 0.0);

    for(int i = 0; i < numLights; i++)
    {
        vec3 lightDir = lightData[i].position - worldPos;
        float dist = length(lightDir);

        float lightRange = lightData[i].range;

        if(dist > lightRange)
            continue;

        lightDir /= dist;

        float lambert = dot(lightDir, worldNormal);

        if(lambert <= 0.0)
            continue;

        float fallOff = max(0.0, (lightRange - dist) / lightRange);

        float attenuation = clamp(fallOff * lightData[i].intensity * (1.0 / (lightAttenuation.x + lightAttenuation.y * dist + lightAttenuation.z * dist * dist)), 0.0, 1.0);

        // Specular
        vec3 lightRay = reflect(normalize(-lightDir), worldNormal);
        float specularIntensity = attenuation * pow(max(0.0, dot(lightRay, normalize(viewerPosition - worldPos))), shininess);
        specularIntensity = max(0.0, specularIntensity);

        finalColor += vec4(sceneDiffuse * attenuation * lambert * lightData[i].color + sceneSpecular * specularIntensity * lightData[i].color, 0.0);
    }

    gl_FragColor = finalColor;
}

Shader validation:


bool Shader::Finalize(unsigned int id)
{
    glLinkProgram(id);
    glValidateProgram(id);

    // Check if validation was successful
    int result;

    glGetProgramiv(id, GL_VALIDATE_STATUS, &result);

    if(result == GL_FALSE)
    {
        // Not validated, print out the log
        int logLength;

        // id is a program object, so query it with glGetProgramiv, not
        // glGetShaderiv - using the shader query here is what produces
        // a garbage log length
        glGetProgramiv(id, GL_INFO_LOG_LENGTH, &logLength);

        if(logLength <= 0)
        {
            std::cerr << "Unable to validate shader: Error: Invalid log length \"" << logLength << "\": Could not retrieve error log!" << std::endl;

            return false;
        }

        // Allocate the string
        char* log = new char[logLength];

        glGetProgramInfoLog(id, logLength, &result, log);

        std::cerr << "Unable to validate program: " << log << std::endl;

        // Array delete to match the array new above
        delete[] log;

        return false;
    }

    return true;
}

Sorry about jumping from one problem to the next! The information you have given me so far has been very helpful!

I’m not sure what you’re doing, but in my experience uniform arrays are very fast. Also, I wasn’t trying to suggest this is potentially a uniform arrays vs. gl_LightSource issue (GLSL side), but rather a uniform arrays vs. glLightfv issue (CPU side). In the uniform array case, you can set up all your light attributes in one call (per light attribute, or for all), as in the sketch below. Whereas with glLightfv you set up each attribute for each light with its own call. But again, it goes back to what you are bound on. And we don’t know that yet.
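
To make that concrete, here is a rough sketch of the difference in call counts. The LightCPU struct, the packed attribute arrays, and the uniform locations are placeholders for the sketch, not anything from your code.

#include <vector>
// Assumes an OpenGL header/loader (e.g. GLEW) has already been included.

// Hypothetical CPU-side light record, used only for this sketch.
struct LightCPU
{
    float position[4];  // xyz + w
    float diffuse[4];   // rgba
};

// Fixed-function style: one glLightfv call per attribute per light,
// i.e. 2*N calls here (and at most 8 lights per batch).
void UploadLightsFixedFunction(const std::vector<LightCPU> &lights)
{
    for(size_t i = 0; i < lights.size() && i < 8; ++i)
    {
        glLightfv(GL_LIGHT0 + GLenum(i), GL_POSITION, lights[i].position);
        glLightfv(GL_LIGHT0 + GLenum(i), GL_DIFFUSE,  lights[i].diffuse);
    }
}

// Uniform-array style: one call per attribute for the whole batch,
// regardless of light count (locations fetched once and cached elsewhere).
void UploadLightsUniformArrays(GLint posLoc, GLint diffLoc,
                               const std::vector<float> &packedPositions, // 4 floats per light
                               const std::vector<float> &packedDiffuse,   // 4 floats per light
                               int numLights)
{
    glUniform4fv(posLoc,  numLights, &packedPositions[0]);
    glUniform4fv(diffLoc, numLights, &packedDiffuse[0]);
}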

If I were you I’d do some profiling on the CPU and GPU side. How much time are you actually spending on culling, light re-binning, drawing, etc.? How many state changes are you doing for how many batches? What’s your min/max batch size? Use gDEBugger (or another tool) to dump all the GL calls you’re making in a frame and give it a scan for clues.

I haven’t tried gDEBugger yet, but I’ll look into it :). I did some profiling using timers (timer queries and normal timers), and found that:

  • Culling/tiling lights takes 0 ms (below timer resolution) to 2 ms. I tried various tile sizes (16x16, 32x32, 64x64, 128x128), and I get pretty much the same times.
  • Rendering lights takes 8 ms on the GPU side.
  • The rendering on the CPU side takes 10 - 40 ms (unacceptable!!!).

Lights are batched in groups of 1-16 (depending on how many are on a tile). I tried 32 once, but it just crashed (too many uniforms). Eventually, I will probably query this limit to make the batches as large as the particular machine can handle. If no lights are in a tile, the tile is just ignored.

So the issue is in setting up the uniforms, since that is pretty much all the CPU does when rendering the lights: it loops through all the tiles, sets the uniforms, and draws a quad for each (see the sketch below). The quad drawing is fast.
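
The per-tile loop is basically this, heavily simplified; the Tile struct and the DrawScreenQuad helper are placeholders for the sketch, not my actual code.

#include <vector>
// Assumes an OpenGL header/loader (e.g. GLEW) has already been included.

// Placeholder helper that draws a screen-space quad over the tile's bounds.
void DrawScreenQuad(float x0, float y0, float x1, float y1);

// Simplified per-tile light batch (placeholder layout, not the engine's).
struct Tile
{
    std::vector<float> lightPosRange;       // xyz + range, 4 floats per light
    std::vector<float> lightColorIntensity; // rgb + intensity, 4 floats per light
    int numLights;
    float x0, y0, x1, y1;                   // screen-space bounds of the tile
};

void RenderLightTiles(const std::vector<Tile> &tiles,
                      GLint posLoc, GLint colorLoc, GLint countLoc)
{
    for(size_t t = 0; t < tiles.size(); ++t)
    {
        const Tile &tile = tiles[t];

        if(tile.numLights == 0)
            continue; // empty tiles are skipped entirely

        // A couple of uniform calls per tile...
        glUniform4fv(posLoc,   tile.numLights, &tile.lightPosRange[0]);
        glUniform4fv(colorLoc, tile.numLights, &tile.lightColorIntensity[0]);
        glUniform1i(countLoc,  tile.numLights);

        // ...then one quad covering the tile.
        DrawScreenQuad(tile.x0, tile.y0, tile.x1, tile.y1);
    }
}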

For array uniforms, I just accumulated values for each pass in an array and then set the uniform array at the end. While there are only 2 API calls in this version (setting the array and giving the number of lights used in the pass), it runs as low as 12 fps when a lot of lights are in view. Using glLightfv dropped as low as 30. The non-tiled version never really dropped below 60 (this is all with about 150 lights), but I am running this on a pretty high-spec machine.

Since UBOs bind so quickly, I tried using an array of them and having each light keep its own UBO that it binds when rendering the tile. Almost all of the lights are static, so they don’t even need to be updated that often.
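
Roughly what I have in mind is sketched below, assuming a single Light block per draw rather than the lightData[16] array above, std140 layout, and binding point 0; the struct and function names are just for the sketch.

// Assumes an OpenGL 3.1+ header/loader has already been included.

// CPU-side mirror of the "Light" uniform block, laid out for std140
// (vec3 position @ 0, vec3 color @ 16, float range @ 28, float intensity @ 32).
// Without layout(std140) the offsets are implementation-defined and would
// have to be queried instead.
struct LightBlock
{
    float position[3]; float pad0;
    float color[3];
    float range;
    float intensity;
};

GLuint CreateLightUBO(const LightBlock &light)
{
    GLuint ubo = 0;
    glGenBuffers(1, &ubo);
    glBindBuffer(GL_UNIFORM_BUFFER, ubo);
    glBufferData(GL_UNIFORM_BUFFER, sizeof(LightBlock), &light, GL_STATIC_DRAW);
    glBindBuffer(GL_UNIFORM_BUFFER, 0);
    return ubo;
}

// Once per program: attach the named block to binding point 0.
void BindLightBlock(GLuint program)
{
    GLuint blockIndex = glGetUniformBlockIndex(program, "Light");
    glUniformBlockBinding(program, blockIndex, 0);
}

// Per light, when a tile it touches is rendered: just rebind the buffer.
void BindLightUBO(GLuint ubo)
{
    glBindBufferBase(GL_UNIFORM_BUFFER, 0, ubo);
}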

I set the uniforms like this:


void Shader::SetShaderParameter4fv(const std::string &name, const std::vector<float> &params)
{
    int paramLoc;

    std::unordered_map<std::string, int>::iterator it = m_attributeLocations.find(name);

    if(it == m_attributeLocations.end())
        m_attributeLocations[name] = paramLoc = glGetUniformLocationARB(m_progID, name.c_str());
    else
        paramLoc = it->second;

    // The count argument is the number of vec4 elements, not the number
    // of floats, so divide the flat float array size by 4
    const GLsizei count = static_cast<GLsizei>(params.size() / 4);

    // If location was not found
#ifdef DEBUG
    if(paramLoc == -1)
        std::cerr << "Could not find the uniform " << name << "!" << std::endl;
    else
        glUniform4fvARB(paramLoc, count, &params[0]);
#else
    glUniform4fvARB(paramLoc, count, &params[0]);
#endif
}

Only the array uniform runs slowly, the others are fine. The only thing different between the array and single value forms is the glUniform…ARB(…) call.

EDIT 1: I found that if I purposely error out the shader, it shows the warning “warning(#312) uniform block with instance is supported in GLSL 1.5”. However, if I request #version 150, it complains about gl_TexCoord, and still fails to validate the program. Using #version 400 gets rid of all warnings, but it again fails to validate.

EDIT 2:
I tried TBOs as well, since they allow me to submit ALL lights at once :)! However, I ran into the same problem I did with UBOs: the program fails to validate. What could be the cause of this?
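
For reference, the TBO setup looks roughly like this; the two-texels-per-light packing and the names are just how I laid it out for the sketch.

#include <vector>
// Assumes an OpenGL 3.1+ header/loader has already been included.

GLuint lightBuffer = 0, lightBufferTex = 0;

void CreateLightTBO(const std::vector<float> &packedLights) // 8 floats per light
{
    glGenBuffers(1, &lightBuffer);
    glBindBuffer(GL_TEXTURE_BUFFER, lightBuffer);
    glBufferData(GL_TEXTURE_BUFFER,
                 packedLights.size() * sizeof(float),
                 &packedLights[0], GL_STATIC_DRAW);

    glGenTextures(1, &lightBufferTex);
    glBindTexture(GL_TEXTURE_BUFFER, lightBufferTex);
    // Each texel is a vec4, so one light = 2 texels:
    // texel 0 = position.xyz + range, texel 1 = color.rgb + intensity.
    glTexBuffer(GL_TEXTURE_BUFFER, GL_RGBA32F, lightBuffer);
}

// In the fragment shader the data is read back with something like:
//   uniform samplerBuffer lights;
//   vec4 posRange = texelFetch(lights, lightIndex * 2);
//   vec4 colInt   = texelFetch(lights, lightIndex * 2 + 1);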

I solved the shader issue; it was a stupid mistake :p. I am now uploading all light data in one go using a TBO. However, something weird is happening: according to the normal timer, it takes 0 - 1 ms to do everything lighting related, but 4.5 ms according to the timer queries. That seems pretty unlikely, since I am still getting 30 - 140 fps. It cannot be anything besides the lighting, since without it, I get really high frame rates.

Also, is there a way I can get the minimum/maximum depth of a tile region (for depth culling) without using something like OpenCL? That would probably help boost performance a lot, since the lights are all in a maze-like indoor environment with lots of occluders.

Good deal.

[QUOTE]However, something weird is happening: according to the normal timer, it takes 0 - 1 ms to do everything lighting related, and 4.5 ms according to the timer queries. That seems pretty unlikely, since I am still getting 30 - 140 fps.[/QUOTE]

Hmm, OK. By “normal timer” I assume you mean a CPU-based timer, and by timer query I assume you mean a GPU-based timer. That discrepancy is very possible: in the former you’re timing how long it takes to “queue” the work, and in the latter how long it takes to “do” the work. If you want the former to be closer to the latter, then put a glFinish() right before you stop the CPU timer. That forces the CPU to wait until all the queued GPU work is done before it returns. This introduces a large pipelining bubble, so in practice you’d never do this except possibly at end-of-frame.
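
For concreteness, here is a minimal sketch of measuring both; the RenderLightingPass name is a placeholder, and the query object is assumed to have been created elsewhere with glGenQueries.

#include <chrono>
// Assumes an OpenGL 3.3+ header/loader has already been included.

extern void RenderLightingPass(); // placeholder for the pass being timed

// Returns the CPU-side time in ms; the GPU time is read from the query later.
double TimeLightingPass(GLuint timeElapsedQuery)
{
    glBeginQuery(GL_TIME_ELAPSED, timeElapsedQuery);

    auto cpuStart = std::chrono::high_resolution_clock::now();

    RenderLightingPass();

    // Without this, the CPU timer only measures how long it took to *queue*
    // the GL commands, not how long the GPU took to execute them. glFinish()
    // stalls the pipeline, so only do this while profiling.
    glFinish();

    auto cpuEnd = std::chrono::high_resolution_clock::now();
    glEndQuery(GL_TIME_ELAPSED);

    // Later, once the result is available, read the GPU time in nanoseconds:
    //   GLuint64 ns = 0;
    //   glGetQueryObjectui64v(timeElapsedQuery, GL_QUERY_RESULT, &ns);
    //   double gpuMs = ns / 1.0e6;

    return std::chrono::duration<double, std::milli>(cpuEnd - cpuStart).count();
}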

[QUOTE]That seems pretty unlikely, since I am still getting 30 - 140 fps.[/QUOTE]

I have no idea what you’re implying here. 30-140 fps is 7-33 ms per frame, which even in the best case leaves room for either timing.

[QUOTE]Also, is there a way I can get the minimum/maximum depth of a tile region (for depth culling) without using something like OpenCL? That would probably help boost performance a lot, since the lights are all in a maze-like indoor environment with lots of occluders.[/QUOTE]

You can do the depth buffer reduction with a GLSL shader instead. Try that first. That should still be pretty fast. Ping-pong reduction is sometimes used here.

The main thing that OpenCL/CUDA bring to the table is use of the shared memory on the compute units (GPU multiprocessors). There are cases like reduction where your shader/kernel can execute more quickly if your algorithm takes advantage of that.

[QUOTE]You can do the depth buffer reduction with a GLSL shader instead. Try that first. That should still be pretty fast. Ping-pong reduction is sometimes used here.[/QUOTE]

I’ll try that! However, how do I make sure it always renders the maximum depth? If it just interpolates all the depths, it will give an average instead of a maximum, so it may cull a light when it shouldn’t be culled.

Why would it interpolate?

You’re writing the shader. You can make it do whatever you want. Such as (for instance) read all the input depth values via texelFetch(), compute the min() and max() values across those input depth values, and output those min and max values on 2 MRTs (render targets), 0 and 1.
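
For instance, one step of such a min/max reduction could look like the fragment shader below. This is a sketch only: the 2x2 footprint, #version 330, and the names are assumptions. Pass 0 would sample the scene depth buffer into both outputs, and each later pass halves the resolution until one texel per tile remains, ping-ponging between two FBOs.

// Sketch of one min/max depth reduction step, as a GLSL fragment shader
// embedded in a C++ string. Outputs go to two MRTs (color attachments 0/1),
// selected on the C++ side with:
//   GLenum bufs[2] = { GL_COLOR_ATTACHMENT0, GL_COLOR_ATTACHMENT1 };
//   glDrawBuffers(2, bufs);
static const char *minMaxReduceFrag = R"(
#version 330

uniform sampler2D minTex;   // previous pass min (or scene depth on pass 0)
uniform sampler2D maxTex;   // previous pass max (or scene depth on pass 0)

layout(location = 0) out vec4 outMin;   // MRT 0
layout(location = 1) out vec4 outMax;   // MRT 1

void main()
{
    // Each output texel reduces a 2x2 block of the previous level,
    // so no filtering/interpolation is ever involved.
    ivec2 base = ivec2(gl_FragCoord.xy) * 2;

    float mn = 1.0;
    float mx = 0.0;
    for(int y = 0; y < 2; ++y)
        for(int x = 0; x < 2; ++x)
        {
            ivec2 p = base + ivec2(x, y);
            mn = min(mn, texelFetch(minTex, p, 0).r);
            mx = max(mx, texelFetch(maxTex, p, 0).r);
        }

    outMin = vec4(mn);
    outMax = vec4(mx);
}
)";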