Sampler Array Limit with bindless textures?

Hello,

I am currently revising my 3D engine using ARB_bindless_texture to improve structure and performance.
The engine uses a deferred rendering pipeline and what I’ve been doing until now was to render all material information into the gbuffer and then use one shader for each light type to calculate lighting in screen space.
After reading the bindless textures chapter in the SuperBible 7, I had the idea to just output position, normal, tangent, uv and one material id to the gbuffer and then do all the lighting calculation in a single huge shader that gets all the textures, shadow maps, etc. as bindless textures.

Right now, the lighting shader looks like this (work in progress):

#version 450 core

#extension GL_ARB_bindless_texture : require


uniform vec3 cam_pos_uni;

uniform sampler2D position_tex_uni;
uniform sampler2D normal_tex_uni;
uniform sampler2D tang_tex_uni;
uniform sampler2D uv_tex_uni;
uniform usampler2D material_tex_uni;

in vec2 uv_coord_var;

out vec4 color_out;


uniform sampler2D diffuse_tex_uni[128];



// ambient

uniform vec3 light_ambient_color_uni;

// point lights

#define MAX_POINT_LIGHTS_COUNT 8

uniform int point_light_count_uni;
uniform vec3 point_light_pos_uni[MAX_POINT_LIGHTS_COUNT];
uniform vec3 point_light_color_uni[MAX_POINT_LIGHTS_COUNT];
uniform float point_light_distance_uni[MAX_POINT_LIGHTS_COUNT];
uniform bool point_light_shadow_enabled_uni[MAX_POINT_LIGHTS_COUNT];


uniform samplerCube point_light_shadow_map_uni[MAX_POINT_LIGHTS_COUNT];


float linstep(float min, float max, float v)
{
    return clamp((v - min) / (max - min), 0.0, 1.0);
}

void main(void)
{
    ivec2 texel_uv = ivec2(uv_coord_var * textureSize(position_tex_uni, 0).xy);

    vec3 position = texelFetch(position_tex_uni, texel_uv, 0).xyz;
    vec3 normal = texelFetch(normal_tex_uni, texel_uv, 0).xyz;
    vec4 tang_data = texelFetch(tang_tex_uni, texel_uv, 0);
    vec3 tang = tang_data.xyz;
    vec2 uv = texelFetch(uv_tex_uni, texel_uv, 0).xy;
    uint material_index = texelFetch(material_tex_uni, texel_uv, 0).x;
    
    normal = normalize(normal);
    tang = normalize(tang);
    vec3 bitang = cross(normal, tang);
    if(tang_data.w < 0.0)
        bitang *= -1.0;
    bitang = normalize(bitang);
        
    vec3 cam_dir = normalize(cam_pos_uni - position.xyz);
    
    // material data
    
    vec3 diffuse = texture(diffuse_tex_uni[material_index], uv).rgb;
    vec4 specular = vec4(1.0, 1.0, 1.0, 32.0); // TODO
    
    
    
    
    // ambient lighting
    
    vec3 color = diffuse * light_ambient_color_uni;
    
    
    // point lighting
    
    float shadow;
    vec3 light_dir;
    float light_dist_sq;
    float light_dist;
    float light_dist_attenuation;
    float light_intensity;
    vec3 specular_color;
    float specular_intensity;
    
    for(int i=0; i<point_light_count_uni; i++)
    {
        shadow = 1.0;
    
        light_dir = point_light_pos_uni[i] - position.xyz; // pos to light
        light_dist_sq = light_dir.x * light_dir.x + light_dir.y * light_dir.y + light_dir.z * light_dir.z; // squared distance
        if(light_dist_sq <= point_light_distance_uni[i] * point_light_distance_uni[i])
        {
            light_dist = sqrt(light_dist_sq); // real distance
            light_dir /= light_dist; // normalized dir
            
            if(point_light_shadow_enabled_uni[i])
            { 
                vec2 moments = texture(point_light_shadow_map_uni[i], -light_dir).rg;
                //vec2 moments = vec2(0.0);
                
                float light_depth = length(point_light_pos_uni[i] - position.xyz) - 0.01;
                            
                // Surface is fully lit. as the current fragment is before the light occluder
                if(light_depth <= moments.x)
                    shadow = 1.0;
                else
                {
                    float p = smoothstep(light_depth-0.00005, light_depth, moments.x);
                    float variance = max(moments.y - moments.x*moments.x, -0.001);
                    float d = light_depth - moments.x;
                    float p_max = linstep(0.3, 1.0, variance / (variance + d*d));
                    
                    shadow = p_max;//clamp(max(p, p_max), 0.0, 1.0);
                }
            }
            else
                shadow = 1.0;
        
        
            light_dist_attenuation = (1.0 - light_dist / point_light_distance_uni[i]);
            light_intensity = max(dot(normal, light_dir), 0.0) *  light_dist_attenuation;
            color += shadow * light_intensity * point_light_color_uni[i] * diffuse.rgb; // diffuse light
        
            //specular
            specular_color = specular.rgb * point_light_color_uni[i];
            specular_intensity = max(dot(normalize(reflect(-light_dir, normal)), cam_dir), 0.0);
            color += max(vec3(0.0, 0.0, 0.0), specular_color * pow(specular_intensity, specular.a)) * shadow * light_dist_attenuation;
        }
    }
    
    
    
    
    
    
    color_out = vec4(color, 1.0);
}

What I’ve implemented so far works fine. I am using glUniformHandleui64vARB to set diffuse_tex_uni and point_light_shadow_map_uni.
The problem is, when I raise the sizes of these arrays in the shader, e.g. point_light_shadow_map_uni[16] instead of point_light_shadow_map_uni[8], the code still seems to compile, but it behaves like it did not (same strange behaviour as when I just write some syntactically wrong stuff).
Also, I am getting a lot of GL_INVALID_OPERATION errors then, which I have not really been able to track yet, because the error message from OpenGL is simply “GL_INVALID_OPERATION error generated. State(s) are invalid: .” and CodeXL, gDebugger and bugle all crash, because they don’t support ARB_bindless_texture.

So, is there a limit on how many textures I can use? I thought there was none.
Or will it only work when I put these uniforms in a uniform block and set them from a buffer instead of glUniformHandleui64vARB?

[QUOTE]I had the idea to just output position, normal, tangent, uv and one material id to the gbuffer and then do all the lighting calculation in a single huge shader that gets all the textures, shadow maps, etc. as bindless textures.[/QUOTE]

Ignoring the issues in your code for a moment, that is a terrible idea. Outputting positions at all is just a waste of bandwidth (a precious resource in any deferred renderer), since you can easily re-generate them in the deferred pass from just the depth.
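As a rough illustration of regenerating position from depth, here is a CPU-side C++ sketch of the math, assuming a standard symmetric OpenGL perspective projection and the default depth range of [0, 1]. The names Projection, depth_from_view_z and view_pos_from_depth are made up for this example; in a real renderer the same arithmetic would run in the lighting shader.

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };

// Hypothetical description of a symmetric perspective projection.
struct Projection {
    float tan_half_fovy; // tan(fovy / 2)
    float aspect;        // width / height
    float z_near;        // positive distance to the near plane
    float z_far;         // positive distance to the far plane
};

// Window-space depth in [0, 1] for a view-space z (negative in front of the
// camera), assuming the classic OpenGL perspective matrix.
float depth_from_view_z(const Projection& p, float view_z)
{
    assert(view_z < 0.0f);
    const float a = -(p.z_far + p.z_near) / (p.z_far - p.z_near);
    const float b = -2.0f * p.z_far * p.z_near / (p.z_far - p.z_near);
    const float ndc_z = (a * view_z + b) / -view_z;
    return 0.5f * ndc_z + 0.5f;
}

// Rebuilds the view-space position from screen UV in [0, 1] and the stored depth.
Vec3 view_pos_from_depth(const Projection& p, float u, float v, float depth)
{
    const float a = -(p.z_far + p.z_near) / (p.z_far - p.z_near);
    const float b = -2.0f * p.z_far * p.z_near / (p.z_far - p.z_near);
    const float ndc_z = 2.0f * depth - 1.0f;
    const float view_z = -b / (ndc_z + a);          // invert the depth mapping
    const float ndc_x = 2.0f * u - 1.0f;
    const float ndc_y = 2.0f * v - 1.0f;
    return { ndc_x * (-view_z) * p.tan_half_fovy * p.aspect,
             ndc_y * (-view_z) * p.tan_half_fovy,
             view_z };
}
```

The reconstruction is only as precise as the depth buffer, and the error grows with distance from the near plane, so it is worth comparing the result against a stored position before dropping the position target.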

Also, it means that, for every lighting pass, you have to sample from your various material property textures (colors, normals, etc) in addition to the gbuffers. Sure, you may not need to do much sampling in your geometry pass, but you’re creating a substantial imbalance here. You’ve made your initial passes faster, only to make your lighting passes at least that much slower, if not more so.

Plus, your material texture accesses will lack coherency. Just because there’s no cost in binding a texture does not mean that you can just fetch willy-nilly across all available textures without performance consequences. If two neighboring fragment shaders have to sample from different textures, then their fetches will not be from the same region of memory. Each execution’s fetch will be entirely separate, so they both pay the penalty of a memory access. If the fetches were adjacent to one another, in the same texture, then there would only need to be one memory fetch, not two (or rather, 2 instead of 4 since you’re probably using mipmapping).

The only advantage of this idea is that you only fetch material data from the visible texels. And you could get that with a simple depth pre-pass.

Overall, this is not likely to help your performance.

[QUOTE]The problem is, when I raise the sizes of these arrays in the shader, e.g. point_light_shadow_map_uni[16] instead of point_light_shadow_map_uni[8], the code still seems to compile, but it behaves like it did not (same strange behaviour as when I just write some syntactically wrong stuff).[/QUOTE]

Wait, if you write syntactically invalid code, then it doesn’t “seem to compile”. Because it doesn’t compile. So you can’t be getting the same behavior.

Are you correctly checking to see if your shader compiles?

Also, be aware that, while bindless texturing allows you to use any number of textures, that doesn’t mean that uniform storage suddenly became infinite. Each sampler in your shader takes up uniform space. So if you cross that limitation, then your code will fail to compile.

This is what UBOs and SSBOs are for.
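One way to move the handles out of default-block uniforms, sketched in C++ under some assumptions: the shader declares a std140 block of uvec4 (the block and array names in the comments are hypothetical), each uvec4 carries two 64-bit handles, and the shader rebuilds a sampler with the sampler2D(uvec2) constructor from ARB_bindless_texture, taking the low 32 bits in x and the high 32 bits in y. The function pack_handles_std140 is an illustrative helper, not part of any API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Packs 64-bit texture handles for a block assumed to be declared as:
//   layout(std140) uniform MaterialTextures { uvec4 diffuse_handles[64]; };
// A uvec4 array has a 16-byte stride in std140, so two handles fit per
// element: (lo0, hi0, lo1, hi1). Assumed shader-side lookup:
//   uvec4 h = diffuse_handles[material_index >> 1];
//   uvec2 bits = ((material_index & 1u) == 0u) ? h.xy : h.zw;
//   vec3 diffuse = texture(sampler2D(bits), uv).rgb;
std::vector<std::uint32_t> pack_handles_std140(const std::vector<std::uint64_t>& handles)
{
    // Round up to an even handle count so the buffer ends on a full uvec4.
    const std::size_t padded = (handles.size() + 1) / 2 * 2;
    std::vector<std::uint32_t> words(padded * 2, 0u);
    for (std::size_t i = 0; i < handles.size(); ++i) {
        words[2 * i]     = static_cast<std::uint32_t>(handles[i] & 0xFFFFFFFFu); // low half
        words[2 * i + 1] = static_cast<std::uint32_t>(handles[i] >> 32);         // high half
    }
    assert(words.size() % 4 == 0);
    return words;
}
```

The resulting words can be uploaded to the buffer backing the block with glBufferData or glBufferSubData; the handles themselves would come from glGetTextureHandleARB and must be made resident before use.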

[QUOTE]Also, I am getting a lot of GL_INVALID_OPERATION errors then, which I have not really been able to track yet, because the error message from OpenGL is simply “GL_INVALID_OPERATION error generated. State(s) are invalid: .” and CodeXL, gDebugger and bugle all crash, because they don’t support ARB_bindless_texture.[/QUOTE]

Try glIntercept, if it’s available for your platform. It gracefully handles extensions that it doesn’t recognize.

Thanks for the quick answer!

[QUOTE=Alfonse Reinheart;1279312]Ignoring the issues in your code for a moment, that is a terrible idea. Outputting positions at all is just a waste of bandwidth (a precious resource in any deferred renderer), since you can easily re-generate them in the deferred pass from just the depth.

Also, it means that, for every lighting pass, you have to sample from your various material property textures (colors, normals, etc) in addition to the gbuffers. Sure, you may not need to do much sampling in your geometry pass, but you’re creating a substantial imbalance here. You’ve made your initial passes faster, only to make your lighting passes at least that much slower, if not more so.

Plus, your material texture accesses will lack coherency. Just because there’s no cost in binding a texture does not mean that you can just fetch willy-nilly across all available textures without performance consequences. If two neighboring fragment shaders have to sample from different textures, then their fetches will not be from the same region of memory. Each execution’s fetch will be entirely separate, so they both pay the penalty of a memory access. If the fetches were adjacent to one another, in the same texture, then there would only need to be one memory fetch, not two (or rather, 2 instead of 4 since you’re probably using mipmapping).

The only advantage of this idea is that you only fetch material data from the visible texels. And you could get that with a simple depth pre-pass.

Overall, this is not likely to help your performance.[/QUOTE]
I’ve already thought about most of these issues. The bindless approach is more or less an experiment to see if it can improve performance, because my biggest issue is that the gbuffer grows with the number of features my materials should have.
Calculating the position from the depth is also something I am planning to do in the future, if the precision of the final values is sufficient, but for now I’m just trying to get it working this way.

[QUOTE=Alfonse Reinheart;1279312]Wait, if you write syntactically invalid code, then it doesn’t “seem to compile”. Because it doesn’t compile. So you can’t be getting the same behavior.

Are you correctly checking to see if your shader compiles?[/QUOTE]
Sorry, I probably did not phrase that clearly enough. The code as it is does compile; GL_COMPILE_STATUS and GL_LINK_STATUS are both GL_TRUE, even with bigger arrays. But with bigger arrays, the output of the shader when rendering looks as if it had not compiled.

[QUOTE=Alfonse Reinheart;1279312]
Also, be aware that, while bindless texturing allows you to use any number of textures, that doesn’t mean that uniform storage suddenly became infinite. Each sampler in your shader takes up uniform space. So if you cross that limitation, then your code will fail to compile.

This is what UBOs and SSBOs are for.[/QUOTE]
So, you do think the problem is that I’m using normal uniforms with glUniform* instead of uniform blocks with UBOs? I think I’ll try that out…

[QUOTE]the output of the shader when rendering looks like it did not compile.[/QUOTE]

That’s the part I don’t understand. If a shader didn’t compile, then you can’t render with it; you’ll get an error instead with a failure to render the requested operation. Are you saying that you’re getting an OpenGL error and a failure to render anything? Or are you saying that you’re getting some visual result which is distinct from commenting out the draw calls?

[QUOTE]So, you do think the problem is that I’m using normal uniforms with glUniform* instead of uniform blocks with UBOs?[/QUOTE]

No, I think the problem may be that you’re blowing past some implementation limit on uniform components when you declare them as non-block uniforms.
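A back-of-the-envelope way to check that suspicion is to add up the default-block uniforms from the shader above and compare the total with the value the driver reports for GL_MAX_FRAGMENT_UNIFORM_COMPONENTS (queried with glGetIntegerv). The C++ sketch below does only the arithmetic. It assumes each bindless sampler uniform costs two components (one 64-bit handle), and it offers a pessimistic mode in which every element is padded to a full vec4, since how a driver actually packs uniforms is implementation-dependent. UniformDecl, estimate_components and lighting_shader_uniforms are names invented for this example.

```cpp
#include <cassert>
#include <vector>

struct UniformDecl {
    int array_size;          // 1 for non-array uniforms
    int components_per_elem; // e.g. 3 for vec3; 2 assumed per bindless sampler
};

// Sums the components used by the given declarations. With pad_to_vec4 set,
// every element is rounded up to 4 components (a pessimistic assumption).
int estimate_components(const std::vector<UniformDecl>& decls, bool pad_to_vec4)
{
    int total = 0;
    for (const UniformDecl& d : decls) {
        assert(d.array_size > 0 && d.components_per_elem > 0);
        const int per_elem = pad_to_vec4 ? (d.components_per_elem + 3) / 4 * 4
                                         : d.components_per_elem;
        total += d.array_size * per_elem;
    }
    return total;
}

// Default-block uniforms of the lighting shader posted in this thread.
std::vector<UniformDecl> lighting_shader_uniforms(int diffuse_count, int max_point_lights)
{
    return {
        {1, 3},                // cam_pos_uni
        {5, 2},                // five gbuffer samplers (counted like an array)
        {diffuse_count, 2},    // diffuse_tex_uni[]
        {1, 3},                // light_ambient_color_uni
        {1, 1},                // point_light_count_uni
        {max_point_lights, 3}, // point_light_pos_uni[]
        {max_point_lights, 3}, // point_light_color_uni[]
        {max_point_lights, 1}, // point_light_distance_uni[]
        {max_point_lights, 1}, // point_light_shadow_enabled_uni[]
        {max_point_lights, 2}, // point_light_shadow_map_uni[]
    };
}
```

Calling estimate_components(lighting_shader_uniforms(128, 16), true) gives the pessimistic figure for the 16-light variant; if that is still below the reported limit, the uniform budget is probably not the whole story and the GL_INVALID_OPERATION errors deserve a closer look.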