# Best solution for dealing with multiple light types

Hi all,

I am working on my own 3D engine and I recently ran into an issue when trying to combine different light types using a single shader. Multiple lights of a single type work fine, but when I combine a Point light (with cube map shadows), Directional light (with 2D shadows), and Spot lights (also with 2D shadows) things started to break. I found a solution to this problem, but I wonder if there is a better way of doing it. Let me first summarise my initial solution that failed and then talk about the solution I found.

I pass an array of lights to the shader that is used to render a mesh. This array is defined as follows in my shader:

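``````
#version 420

const int nr_lights = 5;

const int DIRECTIONAL_LIGHT = 0;
const int SPOT_LIGHT = 1;
const int POINT_LIGHT = 2;

struct light {
    int type;
    bool enabled;
    vec4 position;
    vec4 diffuse;
    vec4 ambient;
    vec4 specular;

    float constant_attenuation;
    float linear_attenuation;
    float quadratic_attenuation;    // used by the attenuation formula in the fragment shader

    vec3 direction;
    float light_angle;              // spot cone angle measured from the axis, in degrees

    sampler2DShadow depth_texture;  // 2D shadow map (spot and directional lights)
    samplerCube cube_depth_texture; // linearised depth values (point lights)
};
uniform light lights[nr_lights];
``````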

For spot lights and directional lights I use a 2D shadow sampler to project the depth values. For point lights I created a cube texture which contains the linearised depth values. The bulk of the lighting calculation is in the fragment shader and reads as follows:

``````
for (int i = 0; i < lights.length(); ++i)
{
    if (!lights[i].enabled)
    {
        continue;
    }

    vec4 halfVector = normalize(H[i]);
    vec4 lightVector = normalize(L[i]);

    float dotValue = max(dot(normalVector, lightVector), 0.0);
    if (dotValue > 0.0)
    {
        float distance = length(lights[i].position - worldPos);
        float intensity = 1.0;
        if (lights[i].type != DIRECTIONAL_LIGHT)
            intensity = 1.0 / (lights[i].constant_attenuation + lights[i].linear_attenuation * distance + lights[i].quadratic_attenuation * distance * distance);
        vec4 ambient = material_ambient * lights[i].ambient;

        bool inLight = true;

        if (lights[i].type == SPOT_LIGHT)
        {
            vec3 nLightToVertex = vec3(normalize(worldPos - lights[i].position));
            float angleLightToFrag = dot(nLightToVertex, normalize(lights[i].direction));
            float radLightAngle = lights[i].light_angle * 3.141592 / 180.0;

            // Outside the spot cone: no direct light.
            if (angleLightToFrag < cos(radLightAngle))
                inLight = false;
        }

        if (inLight)
        {
            float shadowf = 1.0; // 1.0 = fully lit, 0.0 = in shadow

            if (lights[i].type == SPOT_LIGHT || lights[i].type == DIRECTIONAL_LIGHT)
            {
                // 2D shadow-map lookup on lights[i].depth_texture sets shadowf here.
            }
            else if (lights[i].type == POINT_LIGHT)
            {
                float sampled_distance = texture(lights[i].cube_depth_texture, direction[i].xyz).r;
                float distance = length(direction[i]);

                if (distance > sampled_distance + 0.1)
                    shadowf = 0.0;
            }

            vec4 diffuse = dotValue * lights[i].diffuse * material_diffuse;
            vec4 specular = pow(max(dot(normalVector, halfVector), 0.0), 10.0) * material_specular * lights[i].specular;
            outColor += intensity * shadowf * (diffuse + specular * 100);
        }

        outColor += intensity * ambient;
    }
}
outColor += material_emissive;
``````

This clearly does not work due to non-uniform control flow (a term I only learned about yesterday :)): the texture lookups sit inside branches that depend on per-fragment values.

So, what I have done is to move all the texture lookups out of the non-uniform control flow. However, this means that I need to provide depth textures for all lights (even if they are not used for rendering) and sample both the cube and the 2D shadow textures. Let me show you the updated part of the fragment shader:

``````
for (int i = 0; i < lights.length(); ++i)
{
    // Lookups hoisted out of the per-fragment branches below.
    // The 2D shadow map (lights[i].depth_texture) is sampled here as well.
    float sampled_distance = texture(lights[i].cube_depth_texture, direction[i].xyz).r;
    if (!lights[i].enabled)
    {
        continue;
    }

    vec4 halfVector = normalize(H[i]);
    vec4 lightVector = normalize(L[i]);

    float dotValue = max(dot(normalVector, lightVector), 0.0);
    if (dotValue > 0.0)
    {
        float distance = length(lights[i].position - worldPos);
        float intensity = 1.0;
        if (lights[i].type != DIRECTIONAL_LIGHT)
            intensity = 1.0 / (lights[i].constant_attenuation + lights[i].linear_attenuation * distance + lights[i].quadratic_attenuation * distance * distance);
        vec4 ambient = material_ambient * lights[i].ambient;

        bool inLight = true;

        if (lights[i].type == SPOT_LIGHT)
        {
            vec3 nLightToVertex = vec3(normalize(worldPos - lights[i].position));
            float angleLightToFrag = dot(nLightToVertex, normalize(lights[i].direction));
            float radLightAngle = lights[i].light_angle * 3.141592 / 180.0;

            // Outside the spot cone: no direct light.
            if (angleLightToFrag < cos(radLightAngle))
            {
                inLight = false;
            }
        }

        if (inLight)
        {
            float shadowf = 1.0; // 1.0 = fully lit, 0.0 = in shadow

            if (lights[i].type == SPOT_LIGHT)
            {
                // shadowf is taken from the 2D shadow-map sample made at the top of the loop.
            }
            else if (lights[i].type == POINT_LIGHT)
            {
                float distance = length(direction[i]);

                if (distance > sampled_distance + 0.1)
                    shadowf = 0.0;
            }

            vec4 diffuse = dotValue * lights[i].diffuse * material_diffuse;
            vec4 specular = pow(max(dot(normalVector, halfVector), 0.0), 10.0) * material_specular * lights[i].specular;
            outColor += intensity * shadowf * (diffuse + specular * 100);
        }

        outColor += intensity * ambient;
    }
}

outColor += material_emissive;
``````

This works! In my engine I create two dummy shadow maps of size 1x1: one is a GL_TEXTURE_2D stored as a GL_DEPTH_COMPONENT, the other is a GL_TEXTURE_CUBE_MAP that only stores GL_RED values. When fewer than 5 lights are needed to render a mesh, I bind these dummy textures to the cube_depth_texture and depth_texture samplers of the unused lights and set their enabled flag to false.

While this does work, it creates a lot of overhead. In the worst case, when no lights are being used, it will still sample 10 textures!

Is there a better way around this issue? My engine currently does forward rendering, and it is not clear to me whether using a G-Buffer provides a cleaner solution. If I can, I would like to stick to forward rendering, so any solutions and comments you have are greatly appreciated.

Many thanks!
Bram

P.S. For those interested, my 3D engine Dreaded Portal Engine can be found here: http://bramridder.com/index.php/personal/personal_projects/dreaded-portal-engine

An alternative is to avoid using texture lookup functions which perform implicit derivative calculations, and instead calculate derivatives or LoD explicitly outside of the conditional and pass the result to textureProjGrad() or textureProjLod().

However, this may still perform texture lookups in cases where the condition is false (it depends upon whether the hardware has branch instructions). If you’re going to perform lookups regardless, it would be better to use a 1x1 texture (or force the use of the 1x1 mipmap level of some texture) for cases where you don’t need the result.
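A minimal sketch of the explicit-LoD idea, assuming the shadow maps have no mipmaps, so level 0 is always the level wanted. Every uniform and varying name below (point_shadow_map, spot_shadow_map, shadow_coord and so on) is made up for illustration and does not come from the engine discussed in this thread.

```glsl
#version 420

// Hypothetical names, for illustration only.
uniform samplerCube point_shadow_map;     // linearised depth in .r
uniform sampler2DShadow spot_shadow_map;  // depth-comparison sampler
uniform int light_type;                   // 1 = spot, 2 = point
uniform vec4 light_position;              // world space, w = 1

in vec4 worldPos;      // world-space position, w = 1
in vec4 normal;        // world-space normal, w = 0
in vec4 shadow_coord;  // fragment position in the spot light's shadow-map space

out vec4 outColor;

void main(void)
{
    float shadowf = 1.0; // 1.0 = fully lit, 0.0 = in shadow
    vec4 to_light = light_position - worldPos;

    // Per-fragment condition: everything inside is non-uniform control flow.
    if (dot(normalize(normal), normalize(to_light)) > 0.0)
    {
        if (light_type == 2)
        {
            vec3 direction = -to_light.xyz;
            // textureLod takes the LoD explicitly, so no implicit derivatives are needed.
            float sampled_distance = textureLod(point_shadow_map, direction, 0.0).r;
            if (length(direction) > sampled_distance + 0.1)
                shadowf = 0.0;
        }
        else if (light_type == 1)
        {
            // Projective shadow lookup, again with an explicit LoD.
            shadowf = textureProjLod(spot_shadow_map, shadow_coord, 0.0);
        }
    }

    outColor = vec4(vec3(shadowf), 1.0);
}
```

This only removes the dependence on implicit derivatives. Whether the lookups are actually skipped when the condition is false still depends on the hardware.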

If the hardware doesn’t have branch instructions, then putting code inside a conditional doesn’t avoid the cost of executing it, only the side-effects. So e.g. setting radLightAngle to π would avoid the need to use a conditional for the inside-cone test (cos(π)=-1, so the test will always be false).
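A small sketch of that branch-free cone test, assuming non-spot lights are given a cone angle of 180 degrees and a non-zero direction. The uniform names are hypothetical, and the output just visualises the result.

```glsl
#version 420

// Hypothetical names, for illustration only.
uniform vec4 light_position;   // world space, w = 1
uniform vec3 light_direction;  // spot axis; must be non-zero even for lights that ignore it
uniform float light_angle;     // degrees from the axis; 180.0 for non-spot lights

in vec4 worldPos;              // world-space position, w = 1
out vec4 outColor;

void main(void)
{
    vec3 nLightToFrag = normalize((worldPos - light_position).xyz);
    float angleLightToFrag = dot(nLightToFrag, normalize(light_direction));
    float radLightAngle = radians(light_angle);

    // step(edge, x) is 0.0 when x < edge and 1.0 otherwise, so this is the
    // "angleLightToFrag < cos(radLightAngle)" test without a branch.
    // With radLightAngle = pi the edge is -1 and the result is 1.0 for every
    // direction (up to rounding at the exactly opposite direction).
    float inLight = step(cos(radLightAngle), angleLightToFrag);

    outColor = vec4(vec3(inLight), 1.0);
}
```

The factor can then be multiplied into the diffuse and specular terms in place of the inLight branch.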

> Bram Ridder wrote:
> While this does work, it creates a lot of overhead. In the worst case, when no lights are being used, it will still sample 10 textures!
>
> Is there a better way around this issue? My engine currently does forward rendering, and it is not clear to me whether using a G-Buffer provides a cleaner solution. If I can, I would like to stick to forward rendering, so any solutions and comments you have are greatly appreciated.

I’d definitely see if you can meet your goals with small changes to your shader logic as GClements is suggesting.

If after pursuing those, you bench your app and determine that the performance still isn’t up to the level you need, profile carefully to determine exactly what the biggest bottleneck is (it helps to gather a few worst-case test cases). You can use the results as a filter to evaluate which tech approaches will reduce that inefficiency the most. Just using some intuition about how your rendering algorithms work will save time with this.

If the main bottleneck ends up being the fact that you’re using a shader supporting max(lights) and max(shadows) for all fragments on the entire screen and you can’t easily avoid most of the inefficiency associated with that with small shader changes, consider a tiled or clustered shading approach. Given your desire to stick with forward and the drawbacks of deferred approaches (which aren’t insurmountable, but do require nontrivial effort), I’d suggest looking most closely at tiled or clustered forward shading techniques (websearch: tiled forward, clustered forward, and forward+ for the latest papers, blog posts, and conference presentations). However, be sure to profile other aspects of your rendering too (e.g. shadow casting and culling).

Thanks for the very helpful feedback.

I agree that using textureProjGrad() or textureProjLod() is one way to solve this problem. Although, as Dark Photon mentioned, I first need to check whether doing texture lookups using 1x1 textures actually creates a bottleneck.

Thank you Dark Photon for letting me know about Forward+ and clustered methods. I did not even know these existed, very exciting!

At the moment I cannot use more than 5 lights per mesh. I guess this is because of the limit of 16 textures per shader? Or is there another limit that prohibits using an array of, say, 32 lights?

In any case I have some research and then some coding to do :).

Your “struct light” has 43 components; 6 of those would total 258 components, which may be exceeding some implementation limit. You can get around that by using textures (e.g. buffer textures), or you may be able to use uniform blocks or shader storage blocks. Note that you’d need to keep the samplers separate; you can’t store samplers in uniform blocks, shader storage blocks or textures.

If you hit the limit on the number of texture units, consider using array textures. These effectively allow you to aggregate multiple textures into a single texture, with the constraint that all layers must have the same format and dimensions, and sampling parameters (e.g. filter and wrap modes) apply to the texture as a whole.
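A sketch of how the per-light samplers could collapse into two array textures, assuming all 2D shadow maps share one size and format and all cube shadow maps share another. The names (depth_textures, cube_depth_textures, ShadowData, shadow_matrix, active_lights) are hypothetical, and the output only visualises the average shadow factor.

```glsl
#version 420

const int nr_lights = 60;
const int POINT_LIGHT = 2;

// Two texture units in total instead of two per light (hypothetical names).
uniform sampler2DArrayShadow depth_textures;   // layer i = 2D shadow map of light i
uniform samplerCubeArray cube_depth_textures;  // layer i = cube shadow map of light i
uniform int active_lights;

layout (std140) uniform ShadowData
{
    int light_type[nr_lights];
    vec4 light_position[nr_lights];  // world space, w = 1
    mat4 shadow_matrix[nr_lights];   // world space -> shadow-map texture space, bias included
};

in vec4 worldPos;
out vec4 outColor;

void main(void)
{
    float lit = 0.0;
    int count = min(active_lights, nr_lights);

    // The loop and the branch depend only on uniforms, so the lookups
    // stay in uniform control flow.
    for (int i = 0; i < count; ++i)
    {
        if (light_type[i] == POINT_LIGHT)
        {
            vec3 dir = (worldPos - light_position[i]).xyz;
            // samplerCubeArray: P = (direction.xyz, layer)
            float sampled_distance = texture(cube_depth_textures, vec4(dir, float(i))).r;
            lit += (length(dir) > sampled_distance + 0.1) ? 0.0 : 1.0;
        }
        else
        {
            vec4 sc = shadow_matrix[i] * worldPos;
            sc.xyz /= sc.w;
            // sampler2DArrayShadow: P = (s, t, layer, reference depth)
            lit += texture(depth_textures, vec4(sc.xy, float(i), sc.z));
        }
    }

    outColor = vec4(vec3(lit / float(max(count, 1))), 1.0);
}
```

On the host side, each layer can be attached to a framebuffer for its shadow pass with glFramebufferTextureLayer. For a cube map array the layer index passed there is layer * 6 + face.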

You can use bindless texture or texture arrays to get past the 16 textures/shader.

However, even if textures weren’t limiting you (e.g. no point or spot light shadows), I suspect you’ll hit other problems trying to push the number of lights up to even 32. If I were you, I’d just try it. This will provide valuable profiling data on which to base your future design decisions, and you can also see if you hit any big performance drop-offs or blocks as you increment the number of lights applied simultaneously from 1 to 32.

It’s been years, but it seems like when I pushed up the number of lights being applied simultaneously in every fragment shader execution to 32 I hit a performance cliff or two and a wall before I got there with the way I was doing it. Seems like at least one cliff had to do with the GLSL compiler (in NVidia’s driver) dynamically determining the maximum number of iterations to automatically unroll loops in the shader (at the time, I was generating a shader permutation with the number of lights baked in). When it flipped to not unrolling I hit a big perf drop-off IIRC (NOTE: Whether and when the compiler unrolls loops can be controlled with a #pragma directive).

Pushing the number of lights up even further resulted in hitting a limit with the max amount of uniform space I could pass into the shader using standard uniforms. This of course can be bypassed by any number of methods (SSBOs, UBOs, TBOs, etc.), but with potential performance reductions. Not sure any of this is useful to you nowadays (OpenGL has moved on), but I mention it in case you do hit perf cliffs or walls with your profiling, to give you a few possible causes to check.

But long story short, doing this test made it blatantly obvious that I couldn’t get where I wanted to go with the GPU by just simple forward shading. I ended up implementing Deferred Shading, which supported 100s-1000s of lights even without tile-based deferred, but that was before Tiled/clustered Forward and Forward+ like approaches (nowadays, knowing what I know about deferred’s limitations and challenges, I’d seriously consider using Tiled/clustered Forward/Forward+ like approaches instead).

Brilliant! Thanks for the very insightful replies.

Quick update. I implemented some of your recommendations and I have successfully rendered a scene with 56 lights (including shadow maps)! I now use two UBOs, one for the view and projection matrices and the other for all the lighting information. I ran into an issue with having too many outs in my vertex shader, so I moved all the calculations to the fragment shader (doing so somehow doubled my FPS). While I am happy it works, it really shouldn’t…

As far as I understand I should have exceeded the sampler limit in the fragment shader (GL_MAX_TEXTURE_IMAGE_UNITS), but it just seems to work. Maybe you can help me figure out what is going on. Let me present the shaders I use at the moment:

``````
#version 420
uniform vec4 material_ambient;
uniform vec4 material_diffuse;
uniform vec4 material_specular;
uniform vec4 material_emissive;

const int nr_lights = 60;

const int DIRECTIONAL_LIGHT = 0;
const int SPOT_LIGHT = 1;
const int POINT_LIGHT = 2;

layout (std140) uniform Lights
{
    int type[nr_lights];
    bool enabled[nr_lights];
    vec4 position[nr_lights];
    vec4 diffuse[nr_lights];
    vec4 ambient[nr_lights];
    vec4 specular[nr_lights];

    float constant_attenuation[nr_lights];
    float linear_attenuation[nr_lights];
    float quadratic_attenuation[nr_lights]; // read by the fragment shader; both stages must declare the same block

    vec3 direction[nr_lights];
    float light_angle[nr_lights];
} lights;

uniform samplerCube cube_depth_texture[nr_lights];

layout (std140) uniform Matrices
{
    mat4 projection_matrix;
    mat4 view_matrix;
};

uniform mat4 model_matrix;

in vec3 a_Vertex;
in vec2 a_TexCoord0;
in vec3 a_Normal;

out vec2 texCoord0;
out vec4 worldPos;
out vec4 pos;
out vec4 N;

void main(void)
{
    texCoord0 = a_TexCoord0;
    pos = view_matrix * model_matrix * vec4(a_Vertex, 1.0);   // view-space position
    worldPos = model_matrix * vec4(a_Vertex, 1.0);            // world-space position
    N = view_matrix * model_matrix * vec4(a_Normal, 0.0);     // view-space normal
    gl_Position = projection_matrix * pos;
}
``````

``````
#version 420

uniform vec4 material_ambient;
uniform vec4 material_diffuse;
uniform vec4 material_specular;
uniform vec4 material_emissive;

const int nr_lights = 60;

const int DIRECTIONAL_LIGHT = 0;
const int SPOT_LIGHT = 1;
const int POINT_LIGHT = 2;

layout (std140) uniform Lights
{
    int type[nr_lights];
    bool enabled[nr_lights];
    vec4 position[nr_lights];
    vec4 diffuse[nr_lights];
    vec4 ambient[nr_lights];
    vec4 specular[nr_lights];

    float constant_attenuation[nr_lights];
    float linear_attenuation[nr_lights];
    float quadratic_attenuation[nr_lights];

    vec3 direction[nr_lights];
    float light_angle[nr_lights];
} lights;

uniform sampler2DShadow depth_texture[nr_lights];  // 2D shadow maps (spot and directional lights)
uniform samplerCube cube_depth_texture[nr_lights]; // linearised depth (point lights)

layout (std140) uniform Matrices
{
    mat4 projection_matrix;
    mat4 view_matrix;
};

uniform sampler2D texture0;
uniform float transparency;

in vec2 texCoord0;
in vec4 worldPos;
in vec4 pos;
in vec4 N;

out vec4 outColor;

void main(void) {

    // Alpha test: drop fully transparent texels.
    if (texture(texture0, texCoord0.st).a == 0.0)
    {
        discard;
    }

    outColor = vec4(0, 0, 0, 1);

    vec4 normalVector = N;
    if (N != vec4(0, 0, 0, 0))
    {
        normalVector = normalize(N);
    }

    for (int i = 0; i < nr_lights; ++i)
    {
        // Enabled lights are packed at the front of the arrays.
        if (!lights.enabled[i])
        {
            break;
        }

        vec3 lightPos = (view_matrix * lights.position[i]).xyz;
        vec4 L;
        vec4 H;
        vec4 direction;

        if (lights.type[i] == DIRECTIONAL_LIGHT)
        {
            L = vec4(-lights.direction[i], 0.0);
            H = vec4((-lights.direction[i]).xyz, 1.0) - pos;
            direction = L;
        }
        else
        {
            L = vec4(lightPos - pos.xyz, 0.0);
            H = vec4((lightPos - pos.xyz).xyz, 1.0) - pos;
            direction = worldPos - lights.position[i];
        }

        // Sampled before any per-fragment branch, so the lookup stays in uniform control flow.
        float sampled_distance = texture(cube_depth_texture[i], direction.xyz).r;

        vec4 halfVector = normalize(H);
        vec4 lightVector = normalize(L);

        float dotValue = max(dot(normalVector, lightVector), 0.0);
        if (dotValue > 0.0)
        {
            float distance = length(lights.position[i] - worldPos);
            float intensity = 1.0;

            if (lights.type[i] != DIRECTIONAL_LIGHT)
                intensity = 1.0 / (lights.constant_attenuation[i] + lights.linear_attenuation[i] * distance + lights.quadratic_attenuation[i] * distance * distance);
            vec4 ambient = material_ambient * lights.ambient[i];

            bool inLight = true;

            if (lights.type[i] == SPOT_LIGHT)
            {
                vec3 nLightToVertex = vec3(normalize(worldPos - lights.position[i]));
                float angleLightToFrag = dot(nLightToVertex, normalize(lights.direction[i]));
                float radLightAngle = lights.light_angle[i] * 3.141592 / 180.0;

                // Outside the spot cone: no direct light.
                if (angleLightToFrag < cos(radLightAngle))
                {
                    inLight = false;
                }
            }

            if (inLight)
            {
                float shadowf = 1.0; // 1.0 = fully lit, 0.0 = in shadow

                if (lights.type[i] == SPOT_LIGHT)
                {
                    // 2D shadow-map result for depth_texture[i] sets shadowf here.
                }
                else if (lights.type[i] == POINT_LIGHT)
                {
                    float distance = length(direction);

                    if (distance > sampled_distance + 0.1) {
                        shadowf = 0.0;
                    }
                }
                vec4 diffuse = dotValue * lights.diffuse[i] * material_diffuse;
                vec4 specular = pow(max(dot(normalVector, halfVector), 0.0), 10.0) * material_specular * lights.specular[i];
                outColor += intensity * shadowf * (diffuse + specular * 100);
            }

            outColor += intensity * ambient;
        }
    }

    outColor += material_emissive;
    outColor *= texture(texture0, texCoord0.st);
}
``````

My understanding is that 121 textures are currently used by the fragment shader: texture0 + 60 * (depth_texture + cube_depth_texture). When I check the value of GL_MAX_TEXTURE_IMAGE_UNITS on my GPU it returns 32. GL_MAX_COMBINED_TEXTURE_IMAGE_UNITS returns 160, which would be enough but should not apply solely to the fragment shader. What am I not understanding and what magic is being used?

Thanks again for your help improving my 3D engine :D.

Just for curiosity:

I suppose it is few compared to what the fragment shader can support.

> Silence wrote:
> Just for curiosity:
>
> I suppose it is few compared to what the fragment shader can support.

GL_MAX_VERTEX_TEXTURE_IMAGE_UNITS returns 32 as well. It is actually an AMD card: SAPPHIRE Radeon RX 580 NITRO+ 4 GB GDDR5.

OK. I will try that as well on my side. Thank you.

Well, I just want to know what these values mean. Because it seems to me that I am going over these limits with the number of textures I use, yet everything works fine.

Well, these are depicted here.

And for the combined limit, if two shader stages access the same unit, it counts as 2.

From what I know and what I understand, these should be hard limits, not hints. Reading the relevant parts of the spec might also give some clue.

Cool! Congrats on getting it up and running.

> I ran into an issue with having too many outs in my vertex shader so I moved all the calculation to the fragment shader (doing so somehow doubled my FPS). While I am happy it works, it really shouldn’t…

That’s interesting. A few thoughts: vertex shaders are executed for every vertex submitted, not only for vertices that are 1) inside the view frustum and 2) not occluded by other objects (the latter applies to depth-tested geometry). I wonder if #1 and #2 might help explain the performance difference?

Also, how many interpolators (varyings; aka vertex out/fragment in) were you using before versus now? I think I recall reading that the more of these you use, the fewer vertex shader threads can execute in parallel (on the same GPU), so the slower your vertex transform work executes. Not sure if that correlates with your problem though.

> As far as I understand I should have exceeded the sampler limit in the fragment shader (GL_MAX_TEXTURE_IMAGE_UNITS), but it just seems to work. Maybe you can help me figure out what is going on.

Hmmm. 121 is certainly > 32 (your GL_MAX_TEXTURE_IMAGE_UNITS, which I believe is the bound texture access limit for fragment shaders). GL_MAX_COMBINED_TEXTURE_IMAGE_UNITS is the limit across all shader units, and you’re well under that. It sounds like GL_MAX_TEXTURE_IMAGE_UNITS isn’t really the hard upper limit for fragment shaders (on your GL drivers at least), but it sounds like that’s operating outside of the spec, so your code in general might not work on other such drivers where you’re over-the-limit.

> Dark Photon wrote:
> Cool! Congrats on getting it up and running.

Thanks.

> Dark Photon wrote:
> Also, how many interpolators (varyings; aka vertex out/fragment in) were you using before versus now?

Before I used #lights * 4 + 3, now I just use 3. So I think that might explain it, because even in scenes where all vertices are within the frustum and no occlusion occurs I still get the speedup.

> Dark Photon wrote:
> Hmmm. 121 is certainly > 32 (your GL_MAX_TEXTURE_IMAGE_UNITS, which I believe is the bound texture access limit for fragment shaders). GL_MAX_COMBINED_TEXTURE_IMAGE_UNITS is the limit across all shader units, and you’re well under that. It sounds like GL_MAX_TEXTURE_IMAGE_UNITS isn’t really the hard upper limit for fragment shaders (on your GL drivers at least), but it sounds like that’s operating outside of the spec, so your code in general might not work on other such drivers where you’re over-the-limit.

Something is definitely going on in the drivers. I was trying to break my shaders by going over the limit, but it just did not happen. I was ready to move to array textures, but I never needed to.

I even tested it on a mobile NVIDIA GPU and it does not break a sweat. You are probably right, though, and I should stick to the spec.

> Dark Photon wrote:
> Cool! Congrats on getting it up and running.

Thanks :).

I used to use 4 * #lights + 3, now I only use 5 or so. So that might explain it. I even get the speedup with very simple scenes where all vertices are within the frustum and none of them are occluded.

> Dark Photon wrote:
> Hmmm. 121 is certainly > 32 (your GL_MAX_TEXTURE_IMAGE_UNITS, which I believe is the bound texture access limit for fragment shaders). GL_MAX_COMBINED_TEXTURE_IMAGE_UNITS is the limit across all shader units, and you’re well under that. It sounds like GL_MAX_TEXTURE_IMAGE_UNITS isn’t really the hard upper limit for fragment shaders (on your GL drivers at least), but it sounds like that’s operating outside of the spec, so your code in general might not work on other such drivers where you’re over-the-limit.

Something must be going on in the driver. I was trying to break my shaders by pushing the number of lights higher and higher, but they never broke. Even on a mobile NVIDIA GPU on my laptop it works fine.

You are right though. I should stay within the specs and start using texture arrays or Atlas textures for my depth maps.

Good to know I understand the limits and should be surprised :).

Just out of curiosity, did you try using more than GL_MAX_COMBINED_TEXTURE_IMAGE_UNITS?

I don’t know that that should break it, but it would confirm or refute that (on NVidia drivers at least) this is or isn’t a hard upper limit.

I really doubt that it is though because it seems that (on recent NVidia drivers) the value for this is always 5 times or 6 times the max number of textures per shader stage (which you found is bogus). 5 is the number of shader stages w/o compute, and 6 is the number of shader stages with compute. For instance, on your card 32 × 5 = 160. Here on the NVidia card I was running on, 32 × 6 = 192. For the ×6 case, no clue why they’d count compute as a stage in the shader pipeline, since it doesn’t coexist with the others in a program (AFAIK).

> You are right though. I should stay within the specs and start using texture arrays or Atlas textures for my depth maps.

Also check out Bindless Texture. For some cases, this is much more convenient than texture arrays. And both of these in my opinion have fewer drawbacks than texture atlases (unless you’re targeting really, really old hardware).
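A rough sketch of what the shader side can look like with bindless textures, assuming the driver exposes GL_ARB_bindless_texture. The block name ShadowHandles and the members cube_depth_handle, light_position and active_lights are hypothetical. With the extension enabled, a 64-bit texture handle can be passed as a uvec2 and turned into a sampler with a constructor, so no texture units are consumed per light.

```glsl
#version 420
#extension GL_ARB_bindless_texture : require

const int nr_lights = 60;

layout (std140) uniform ShadowHandles
{
    // One 64-bit texture handle per light, stored as a uvec2
    // (std140 pads each array element to 16 bytes).
    uvec2 cube_depth_handle[nr_lights];
    vec4 light_position[nr_lights];   // world space, w = 1
};

uniform int active_lights;            // hypothetical light count

in vec4 worldPos;
out vec4 outColor;

void main(void)
{
    float lit = 0.0;
    int count = min(active_lights, nr_lights);

    for (int i = 0; i < count; ++i)
    {
        vec3 direction = (worldPos - light_position[i]).xyz;
        // The extension adds sampler constructors that take a uvec2 handle.
        float sampled_distance = texture(samplerCube(cube_depth_handle[i]), direction).r;
        lit += (length(direction) > sampled_distance + 0.1) ? 0.0 : 1.0;
    }

    outColor = vec4(vec3(lit / float(max(count, 1))), 1.0);
}
```

On the host side the handles would come from glGetTextureHandleARB and must be made resident with glMakeTextureHandleResidentARB before the draw call. The loop index here depends only on uniforms, which matters because the extension expects the handle used for a lookup to be dynamically uniform.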

With the same graphics card as the OP (AMD RX 580) I get almost the same values: 32, 32 and 192 (the last one being the combined limit), which looks more like what Dark Photon found on his NVidia card.

I use free drivers on Mesa / Linux here.

> Dark Photon wrote:
> Just out of curiosity, did you try using more than GL_MAX_COMBINED_TEXTURE_IMAGE_UNITS?
>
> I don’t know that that should break it, but it would confirm or refute that (on NVidia drivers at least) this is or isn’t a hard upper limit.
>
> I really doubt that it is though because it seems that (on recent NVidia drivers) the value for this is always 5 times or 6 times the max number of textures per shader stage (which you found is bogus). 5 is the number of shader stages w/o compute, and 6 is the number of shader stages with compute. For instance, on your card 32 × 5 = 160. Here on the NVidia card I was running on, 32 × 6 = 192. For the ×6 case, no clue why they’d count compute as a stage in the shader pipeline, since it doesn’t coexist with the others in a program (AFAIK).

Yeah, if I go over that limit then it does break (tried 162 and got some crazy effects).

Does the latest DOOM not use texture atlases for their depth maps? At this stage I do not know the pros and cons of these different ways of handling textures. But I will look into bindless textures, it seems to promise to be able to draw everything with one draw call which is kind of neat :P.

Interesting. Good to know.

> Does the latest DOOM not use texture atlases for their depth maps?

No clue.

> But I will look into bindless textures, it seems to promise to be able to draw everything with one draw call which is kind of neat :P.

Yeah, I’d just read the wiki page so you’ve got that in your bag of tricks, and can pull it out if/when you determine that you need it.