glLinkProgram is extremely slow the first time I run it on a specific shader

Hello !

I’m currently having fun with shaders, but I’m noticing an extreme slowdown when running glLinkProgram on shaders in specific cases.

The slowdown only seems to happen the first time a shader is compiled: on the first run of the program, glLinkProgram can take 30–40 s to execute, but if I close the program and open it again (which calls glLinkProgram again), it is instantaneous. The slow link also happens on the first run after I restart my computer, or whenever the shader file is modified.

If the shader fails to compile, the function is instantaneous (and calling glGetShaderiv(vertexShaderID, GL_COMPILE_STATUS, &success); yields an error code, which is expected, since in that case my code falls back to a default basic fragment shader). Weirdly, the same shader, with the same functions and uniforms but that, for example, only returns a black color, is also linked instantaneously…

Here is an example of a shader that is slow (not all of them are; the simple shaders are always pretty fast to link). Also note that these are OpenGL 3.0-era shaders, and I know that this is really old OpenGL. This shader creates fractal noise maps:

Vertex Shader
void main() {
	gl_Position = gl_ModelViewProjectionMatrix * gl_Vertex;
	gl_FrontColor = gl_Color;
	gl_BackColor = gl_Color;
}
Fragment Shader
uniform sampler2D screenTexture;
uniform ivec2 screenResolution;
uniform vec4 region = vec4(0, 0, 1, 1);

uniform int seed;
uniform int numOctaves = 1;
uniform vec2 baseFrequency = vec2(0.01, 0.01);
uniform bool type = false;
uniform bool stitchTiles = false;

uniform float time = 0;

uint hash(uint x) {
	x += ( x << 10u );
	x ^= ( x >>  6u );
	x += ( x <<  3u );
	x ^= ( x >> 11u );
	x += ( x << 15u );
	return x;
}

float random(int i, vec3 pos) {
	const uint mantissaMask = 0x007FFFFFu;
	const uint one          = 0x3F800000u;
	
	uint h = hash(i + hash(floatBitsToUint(pos.x)) + hash(hash(floatBitsToUint(pos.y))) + hash(hash(hash(floatBitsToUint(pos.z)))));
	h &= mantissaMask;
	h |= one;
	
	float r2 = uintBitsToFloat(h);
	return r2 - 1.0;
}

vec3 hashVect(int i, vec3 p) {
	return vec3(
		2 * random(i  , p) - 1,
		2 * random(i+1, p) - 1,
		2 * random(i+2, p) - 1);
}

float noise(int seed, vec3 p, float freq, bool repeat){
	vec3 i = floor(p * freq);
	vec3 f = fract(p * freq);
	
	vec3 u = f*f*f*(10.0-15.0*f+6.0*f*f);

	float v1, v2, v3, v4, v5, v6, v7, v8;
	if(repeat){
		float fact1 = (round(screenResolution.x * max(baseFrequency.x, 0)) * freq == i.x + 1) ? 0 : 1;
		float fact2 = (round(screenResolution.y * max(baseFrequency.y, 0)) * freq == i.y + 1) ? 0 : 1;

		v1 = dot(hashVect(seed, vec3(i.x              , i.y              , i.z    )), f - vec3(0.,0.,0.));
		v2 = dot(hashVect(seed, vec3(fact1 * (i.x + 1), i.y              , i.z    )), f - vec3(1.,0.,0.));
		v3 = dot(hashVect(seed, vec3(i.x              , fact2 * (i.y + 1), i.z    )), f - vec3(0.,1.,0.));
		v4 = dot(hashVect(seed, vec3(fact1 * (i.x + 1), fact2 * (i.y + 1), i.z    )), f - vec3(1.,1.,0.));
		v5 = dot(hashVect(seed, vec3(i.x              , i.y              , i.z + 1)), f - vec3(0.,0.,1.));
		v6 = dot(hashVect(seed, vec3(fact1 * (i.x + 1), i.y              , i.z + 1)), f - vec3(1.,0.,1.));
		v7 = dot(hashVect(seed, vec3(i.x              , fact2 * (i.y + 1), i.z + 1)), f - vec3(0.,1.,1.));
		v8 = dot(hashVect(seed, vec3(fact1 * (i.x + 1), fact2 * (i.y + 1), i.z + 1)), f - vec3(1.,1.,1.));
	}
	else{
		v1 = dot(hashVect(seed, i + vec3(0.,0.,0.)), f - vec3(0.,0.,0.));
		v2 = dot(hashVect(seed, i + vec3(1.,0.,0.)), f - vec3(1.,0.,0.));
		v3 = dot(hashVect(seed, i + vec3(0.,1.,0.)), f - vec3(0.,1.,0.));
		v4 = dot(hashVect(seed, i + vec3(1.,1.,0.)), f - vec3(1.,1.,0.));
		v5 = dot(hashVect(seed, i + vec3(0.,0.,1.)), f - vec3(0.,0.,1.));
		v6 = dot(hashVect(seed, i + vec3(1.,0.,1.)), f - vec3(1.,0.,1.));
		v7 = dot(hashVect(seed, i + vec3(0.,1.,1.)), f - vec3(0.,1.,1.));
		v8 = dot(hashVect(seed, i + vec3(1.,1.,1.)), f - vec3(1.,1.,1.));
	}

    return mix(
		    	mix(
		    		mix(v1, v2, u.x),
		    		mix(v3, v4, u.x), u.y), 
		    	mix(
		    		mix(v5, v6, u.x), 
		    		mix(v7, v8, u.x), u.y), u.z);
}

float floatAbs(float f){
	if(f >= 0) return f;
	return -f;
}

vec4 pNoise(int s, vec3 p, int octave, bool repeat, bool turbulent){
	vec4 col = vec4(0, 0, 0, 0);

	float amp = 0;

	for(int i = octave; i > 0; i--){
		if(turbulent){
			vec4 colNew = vec4(
				pow(floatAbs(noise(s+1, p, pow(2, i - 1), repeat)), 0.8),
				pow(floatAbs(noise(s+2, p, pow(2, i - 1), repeat)), 0.8),
				pow(floatAbs(noise(s+3, p, pow(2, i - 1), repeat)), 0.8),
				pow(floatAbs(noise(s  , p, pow(2, i - 1), repeat)), 0.8));

			col = col + colNew * pow(0.5, i) * (1 - col.a);
			amp = pow(0.5, i);
		}
		else{
			float colNewA = 0.5 * noise(s, p, pow(2, i - 1), repeat) + 0.5;

			vec4 colNew = vec4(
				0.5 * noise(s+1, p, pow(2, i - 1), repeat) + 0.5,
				0.5 * noise(s+2, p, pow(2, i - 1), repeat) + 0.5,
				0.5 * noise(s+3, p, pow(2, i - 1), repeat) + 0.5,
				0.5 * noise(s, p, pow(2, i - 1), repeat) + 0.5);

			col = pow(0.5, i) * colNew + col;
			amp += pow(0.5, i);
		}
	}

	return col/amp;
}

bool insideRect(vec2 position, vec4 box){
	return position.x >= box.x && position.y >= box.y && position.x <= box.x + box.z && position.y <= box.y + box.w;
}

void main(){
	if(insideRect(gl_FragCoord.xy / screenResolution, region)){
		vec2 actualFrequency;

		if(stitchTiles) actualFrequency = round(screenResolution * max(baseFrequency, vec2(0, 0))) / screenResolution;
		else actualFrequency = max(baseFrequency, vec2(0, 0));

		gl_FragColor = pNoise(seed, vec3(gl_FragCoord.xy * actualFrequency, time), numOctaves, stitchTiles, !type);
	}
	else gl_FragColor = vec4(0, 0, 0, 0);
}
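
As a sanity check, the bit trick in random() can be reproduced on the CPU. Below is a rough C++ sketch of it (hash32 and hash_to_unit_float are names I made up for this port; they are not part of the shader):

```cpp
#include <cstdint>
#include <cstring>

// CPU port of the shader's integer hash: a sequence of shift/add/xor
// mixing steps, identical to the GLSL version above.
static std::uint32_t hash32(std::uint32_t x) {
	x += x << 10u;
	x ^= x >> 6u;
	x += x << 3u;
	x ^= x >> 11u;
	x += x << 15u;
	return x;
}

// The mantissa trick: keep the low 23 mantissa bits, OR in the bit pattern
// of 1.0f (0x3F800000), reinterpret as float to get a value in [1, 2),
// then subtract 1.0 to land in [0, 1). memcpy is C++'s uintBitsToFloat.
static float hash_to_unit_float(std::uint32_t h) {
	h &= 0x007FFFFFu;  // mantissa mask
	h |= 0x3F800000u;  // exponent/sign bits of 1.0f
	float f;
	std::memcpy(&f, &h, sizeof f);
	return f - 1.0f;
}
```

Every output of hash_to_unit_float is guaranteed to fall in [0, 1) by construction, since the assembled bit pattern is always a float in [1, 2).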

Thanks for any help :slight_smile:

Which GPU and driver? Post your GL_VERSION, GL_RENDERER, and GL_VENDOR.

Linking is the expensive part; that’s where the driver’s optimizations happen.

Delete the driver’s shader cache and you should see consistent results every run.

Well, the GPU is an NVIDIA GeForce GTX 960, and the driver version is 21.21.13.7563.

GL_VERSION : 4.5.0 NVIDIA 375.63
GL_RENDERER : GeForce GTX 960/PCIe/SSE2
GL_VENDOR : NVIDIA Corporation

Update : I noticed that my GPU driver version is reaaaaally old and dates back to 2016 (the current one is 471.41 ^^’) so I’m currently updating it :slight_smile:

How can the shader cache be deleted? And should the shaders themselves be optimized, or can the problem lie somewhere else?

Thanks for your help by the way :slight_smile:

Alright so with the drivers updated linking is instantaneous now, even for the shaders that were really slow :smiley:

Thanks a lot @Dark_Photon ! I’d still be interested in ways to optimize shader compilation and execution, of course, if you have any resources, as well as in a way to delete the driver cache.

Great! Glad you solved it!

Well by default, the NVIDIA driver’s OpenGL shader cache is stored at:

  • $HOME/.nv/GLCache/ on Linux and
  • %LOCALAPPDATA%\NVIDIA\GLCache\ on Windows
    (e.g. C:\Users\YOUR_USERNAME\AppData\Local\NVIDIA\GLCache\)

However, you can change this location, tune or remove its default size limits, and/or disable it entirely with environment variables:

   __GL_SHADER_DISK_CACHE      (bool)         Enable/disable shader cache
   __GL_SHADER_DISK_CACHE_PATH (string)       Set shader cache storage dir
   __GL_SHADER_DISK_CACHE_SIZE (integer)      Set max shader cache size (units?) (default = 128 MB)
   __GL_SHADER_DISK_CACHE_SKIP_CLEANUP (bool) If set, no size limitation

Hey, on this subject. Just for kicks, I tossed your shaders into an old 2012 version of the NVIDIA GLSL compiler (the one baked into the Cg compiler, version 3.1). The vertex shader of course is trivial, takes almost no time to compile, and produces a shader with only 6 assembly instructions and consuming 1 R-register. The fragment shader however is more interesting.

First of all, it doesn’t even compile without a version header. Since it uses gl_FragColor, I added this:

#version 420 compatibility

With that, it compiles successfully but takes 1.4 seconds to do so. That’s just the compile. Compiling to the gp5fp profile, the resulting shader has:

# 11703 instructions, 12 R-regs

That is, the compiled shader has 11,703 assembly language instructions and uses 12 R-registers. That’s a very hefty instruction count!

So I played with this a bit. This instruction-count explosion appears to be due to inlining all of the nested function calls in your shader. There are various NVIDIA GLSL compiler pragmas you can use to tune behavior like this (including loop unrolling, etc.). And when I added this one to the top of the shader:

#pragma optionNV(inline 0)

to disable inlining, then the produced assembly language instruction counts dropped dramatically!:

# 364 instructions, 13 R-regs

and the compile time dropped from 1.407 seconds to 0.028 sec. That’s a 50X frag shader compile time speed-up right there! However, the produced shader assembly of course contains a number of subroutine calls in it whereas the previous “inlined” version had none.

So my guess is that at least some of the slowdown you were originally seeing might be due to the time consumed by inlining a bunch of nested function calls. And if you sometimes compile with numOctaves > 1 (which results in multiple for loop iterations), the time consumed by loop unrolling could be a factor here as well.
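
For a back-of-the-envelope sense of the explosion, here is the call-site arithmetic for this particular shader, under the assumption that the compiler pastes each callee’s body at every call site (the constant names below are just illustrative):

```cpp
// noise() contains 8 hashVect() call sites in each of its two branches
// (the compiler emits both branches), hashVect() calls random() 3 times,
// and random() contains 7 hash() calls: the nested hash(hash(...))
// chains contribute 1 + 1 + 2 + 3 call sites.
constexpr int hashVectSites     = 8 * 2;  // both branches of noise()
constexpr int randomPerHashVect = 3;
constexpr int hashPerRandom     = 7;

// Copies of hash()'s body that full inlining pastes per noise() call site.
constexpr int inlinedHashCopies =
	hashVectSites * randomPerHashVect * hashPerRandom;  // 336
```

So a single textual call to noise() can drag in hundreds of copies of hash(), before loop unrolling in pNoise() multiplies that further.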

Thank you so much :slight_smile:

Yup, found it where you said. And since these are .bin files, they’re just the compiled result of the shaders, right?

Oh, that’s interesting… In my case I actually couldn’t get it to compile with any specified #version, and I still can’t (compilation fails), which doesn’t make much sense to me. I’m actually building a rendering engine, and I plan on updating it to work with the latest stable OpenGL version (assuming that older GPUs could still run my code?).

Oh, I didn’t know the compiler did that. So when, for example, noise calls hashVect 16 times, which itself calls random 3 times, which calls hash 7 times, the compiler actually copies the hash code all 336 times it’s needed?

I tried putting the #pragma optionNV(inline 0) directive in my shader, but although it still compiles with no errors, the shader then does nothing :confused:

Anyways, thanks for the huge help !

That’s weird. I’d want to get to the bottom of that one. The ShaderInfoLog message will likely point out the problem.

Yup, if the shader is fully inlined.

The compiler might choose to do this to avoid wasting time on the overhead of subroutine calls and returns. But this might not always be what you want.

Hmmm. Did the shader seem to work properly without this directive in there?

Are you querying and dumping the ShaderInfoLog and testing for compile errors properly? Might double-check that. This and the #version behavior you’re seeing seems fishy.

Yep it’s working fine…

Actually, no, but now I did… For this fragment shader, for example, here is the ShaderInfoLog:

0(1) : warning C7583: Initialization of uniform variables requires #version 120 or later
0(1) : warning C7583: Initialization of uniform variables requires #version 120 or later
0(1) : warning C7583: Initialization of uniform variables requires #version 120 or later
0(1) : warning C7583: Initialization of uniform variables requires #version 120 or later
0(1) : warning C7583: Initialization of uniform variables requires #version 120 or later
0(1) : warning C7583: Initialization of uniform variables requires #version 120 or later
0(1) : warning C7011: implicit cast from "int" to "float"
0(1) : warning C7532: global type uint requires "#version 130" or later
0(1) : warning C0000: ... or #extension GL_EXT_gpu_shader4 : enable
0(1) : warning C7548: '<<' requires "#extension GL_EXT_gpu_shader4 : enable" before use
0(1) : warning C7548: '>>' requires "#extension GL_EXT_gpu_shader4 : enable" before use
0(1) : warning C7548: '<<' requires "#extension GL_EXT_gpu_shader4 : enable" before use
0(1) : warning C7548: '>>' requires "#extension GL_EXT_gpu_shader4 : enable" before use
0(1) : warning C7548: '<<' requires "#extension GL_EXT_gpu_shader4 : enable" before use
0(1) : warning C7532: global function floatBitsToUint requires "#version 330" or later
0(1) : warning C0000: ... or #extension GL_ARB_shader_bit_encoding : enable
0(1) : warning C0000: ... or #extension GL_ARB_gpu_shader5 : enable
0(1) : warning C7011: implicit cast from "int" to "uint"
0(1) : warning C7532: global function uintBitsToFloat requires "#version 330" or later
0(1) : warning C0000: ... or #extension GL_ARB_shader_bit_encoding : enable
0(1) : warning C0000: ... or #extension GL_ARB_gpu_shader5 : enable
0(1) : warning C7011: implicit cast from "int" to "float"
0(1) : warning C7011: implicit cast from "int" to "float"
0(1) : warning C7011: implicit cast from "int" to "float"
0(1) : warning C7011: implicit cast from "int" to "float"
0(1) : warning C7011: implicit cast from "int" to "float"
0(1) : warning C7011: implicit cast from "int" to "float"
0(1) : warning C7011: implicit cast from "int" to "float"
0(1) : warning C7011: implicit cast from "int" to "float"
0(1) : warning C7532: global function round requires "#version 130" or later
0(1) : warning C0000: ... or #extension GL_EXT_gpu_shader4 : enable
0(1) : warning C7011: implicit cast from "int" to "float"
0(1) : warning C7011: implicit cast from "int" to "float"
0(1) : warning C7011: implicit cast from "int" to "float"
0(1) : warning C7011: implicit cast from "int" to "float"
0(1) : warning C7011: implicit cast from "int" to "float"
0(1) : warning C7011: implicit cast from "int" to "float"
0(1) : warning C7011: implicit cast from "int" to "float"
0(1) : warning C7011: implicit cast from "int" to "float"
0(1) : warning C7011: implicit cast from "int" to "float"
0(1) : warning C7011: implicit cast from "int" to "float"
0(1) : warning C7011: implicit cast from "int" to "float"
0(1) : warning C7011: implicit cast from "int" to "float"
0(1) : warning C7011: implicit cast from "int" to "float"
0(1) : warning C7011: implicit cast from "int" to "float"
0(1) : warning C7011: implicit cast from "int" to "float"
0(1) : warning C7011: implicit cast from "int" to "float"
0(1) : warning C7011: implicit cast from "int" to "float"
0(1) : warning C7011: implicit cast from "int" to "float"
0(1) : warning C7011: implicit cast from "int" to "float"
0(1) : warning C7011: implicit cast from "int" to "float"
0(1) : warning C7011: implicit cast from "int" to "float"
0(1) : warning C7011: implicit cast from "int" to "float"
0(1) : warning C7011: implicit cast from "int" to "float"
0(1) : warning C7011: implicit cast from "int" to "float"
0(1) : warning C7011: implicit cast from "int" to "float"
0(1) : warning C7011: implicit cast from "int" to "float"
0(1) : warning C7011: implicit cast from "int" to "float"
0(1) : warning C7011: implicit cast from "int" to "float"
0(1) : warning C7011: implicit cast from "int" to "float"
0(1) : warning C7011: implicit cast from "int" to "float"
0(1) : warning C7011: implicit cast from "int" to "float"
0(1) : warning C7011: implicit cast from "int" to "float"
0(1) : warning C7011: implicit cast from "int" to "float"
0(1) : warning C7011: implicit cast from "int" to "float"
0(1) : warning C7011: implicit cast from "int" to "float"
0(1) : warning C7011: implicit cast from "int" to "float"
0(1) : warning C7011: implicit cast from "int" to "float"
0(1) : warning C7011: implicit cast from "int" to "float"
0(1) : warning C7011: implicit cast from "int" to "float"
0(1) : warning C7011: implicit cast from "int" to "float"
0(1) : warning C7011: implicit cast from "int" to "float"
0(1) : warning C7011: implicit cast from "int" to "float"
0(1) : warning C7011: implicit cast from "int" to "float"
0(1) : warning C7011: implicit cast from "ivec2" to "vec2"
0(1) : warning C7011: implicit cast from "ivec2" to "vec2"
0(1) : warning C7532: global function round requires "#version 130" or later
0(1) : warning C0000: ... or #extension GL_EXT_gpu_shader4 : enable
0(1) : warning C7011: implicit cast from "ivec2" to "vec2"
0(1) : warning C7548: '<<' requires "#extension GL_EXT_gpu_shader4 : enable" before use
0(1) : warning C7548: '>>' requires "#extension GL_EXT_gpu_shader4 : enable" before use
0(1) : warning C7548: '<<' requires "#extension GL_EXT_gpu_shader4 : enable" before use
0(1) : warning C7548: '>>' requires "#extension GL_EXT_gpu_shader4 : enable" before use
0(1) : warning C7548: '<<' requires "#extension GL_EXT_gpu_shader4 : enable" before use

Sooooo yeah, there’s a lot of warnings haha (although, other than the version warnings, most of them are cast warnings, which I should get to at some point ^^). Here is the log if I add the #version 420 core directive:

0(1) : error C0205: invalid profile "coreuniform"
0(1) : error C0206: invalid token "sampler2D" in version line

So it seems like the compiler does not recognise the line breaks between the version line and the first uniforms… Here is the beginning of the GLSL shader that causes this:

#version 420 core

uniform sampler2D screenTexture;
uniform ivec2 screenResolution;
uniform vec4 region = vec4(0, 0, 1, 1);

uniform int seed;
uniform int numOctaves = 1;
uniform vec2 baseFrequency = vec2(0.01, 0.01);
uniform bool type = false;
uniform bool stitchTiles = false;

uniform float time = 0;

uint hash(uint x) {
	x += ( x << 10u );
	x ^= ( x >>  6u );
	x += ( x <<  3u );
	x ^= ( x >> 11u );
	x += ( x << 15u );
	return x;
}
...

If you can see anything wrong :slight_smile:

And again thanks for the help !

It’s far more likely that your text-loading function is what broke your line breaks. It’s a pretty common problem: lots of people don’t know a simple way to load a whole text file at once, so they do repeated std::getline calls without realizing that getline discards the character that terminates each line.

Doesn’t the glShaderSource function work as if line breaks were placed between each source string? I’ve verified that the source code makes sense (as in: the lines are as they should be), but there are no line-break characters, since I assumed that, with a char** as source, each string is considered a single line.

If not, I could try to add line breaks at the end of each line of course :slight_smile:

EDIT: I did try adding line breaks, and it all seems to work now :slight_smile: Although, of course, I’m forced to use #version 120, which is less than ideal, but this should be dealt with later on.

Thanks !

You mean if you provide an array of multiple char * pointers to glShaderSource()? Nope. From the OpenGL Spec:

Nothing here about inserting newline character(s) between the strings passed in.

When you do insert newlines, just make sure you use the same newline convention used for the other embedded newlines in your source file (e.g. 0x0D 0x0A (DOS/Windows), or 0x0A (UNIX); aka ‘\r\n’ or ‘\n’).
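
If you do keep the source as a list of lines, the fix can be as small as joining them with an explicit newline before the glShaderSource call. A quick sketch (join_lines is just an illustrative name; '\n' is accepted by GLSL compilers on every platform):

```cpp
#include <string>
#include <vector>

// Join per-line shader source into one string, appending a newline after
// every line so tokens on adjacent lines can never fuse together.
std::string join_lines(const std::vector<std::string>& lines) {
	std::string out;
	for (const auto& line : lines) {
		out += line;
		out += '\n';
	}
	return out;
}
```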

No. Think strcpy + strcat + strcat + strcat …

Good plan.

What!? Why? Just replace this with #version 420 compatibility. That should resolve all of those implicit casting warnings, and you need compatibility due to your use of gl_FragColor.

… no. glShaderSource concatenates all of the strings together before it compiles them (or at least, it treats them as if they were concatenated). It doesn’t put anything between them.

The multiple-string functionality of glShaderSource is not there so that you can build a list of lines in a file and shove them at OpenGL. It’s there so that you can load multiple files and compile them all into a single shader object. Each string is supposed to be like a separate file, as if included via #include in C and C++, which, as you may know, does a straight copy-and-paste of the source file without inserting any line separators.

Just like OpenGL.
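
You can see why your per-line strings broke by simulating that concatenation yourself (the function name here is made up; it just mirrors the glue-with-no-separator behavior described above):

```cpp
#include <string>
#include <vector>

// glShaderSource-style concatenation: the strings are glued together with
// nothing in between, so the end of each "line" fuses with the start of
// the next one.
std::string concat_like_glShaderSource(const std::vector<std::string>& parts) {
	std::string out;
	for (const auto& p : parts) out += p;  // no separator inserted
	return out;
}
```

Feeding it {"#version 420 core", "uniform sampler2D t;"} yields a source containing the token coreuniform, which is exactly the invalid profile the compiler complained about earlier in the thread.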

Just read all of the text into a single string with one read. It’s not even that hard. In C++, the code looks like this:

std::stringstream sstr;
sstr << shader_file_stream.rdbuf();
std::string shader_source = sstr.str();

If you don’t like that, there are other alternatives.

Caveat: you need to store the returned std::string in a variable as above, and that variable’s lifetime must include the glShaderSource call. You can’t use e.g.:

char* src = sstr.str().c_str();

because the temporary std::string returned by str() is destroyed immediately, leaving src a dangling pointer.
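
A complete sketch of the safe pattern (read_whole_file is an illustrative helper; the glShaderSource line is commented out since it needs a live GL context):

```cpp
#include <fstream>
#include <sstream>
#include <string>

// Read an entire text file into one std::string in a single operation.
std::string read_whole_file(const char* path) {
	std::ifstream file(path);
	std::stringstream sstr;
	sstr << file.rdbuf();
	return sstr.str();
}

// Usage sketch: the named std::string owns the storage, so the pointer
// obtained from c_str() stays valid across the glShaderSource call.
//
//   std::string source = read_whole_file("noise.frag");
//   const char* src = source.c_str();   // valid while 'source' lives
//   glShaderSource(shaderID, 1, &src, nullptr);
```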


Yeah! I tried the compatibility profile yesterday with too old a version, I think, and it didn’t work, but #version 420 compatibility works fine, thanks!

Yup, I’m simply using “\r\n” if this doesn’t cause any issues on UNIX :smiley:

Yeah, it now works great :slight_smile:

As I’m actually using the JOGL Java wrapper, destructors aren’t an issue here for me, but good to know :slight_smile:

Thanks to you all :3