Nvidia Compiler "Optimization": running out of temporaries!

Hi,
I'm currently trying to abuse the graphics hardware to do image convolution plus some additional calculations on the convolution result.
To be precise, the algorithm has 3 passes:

  • calculate x and y derivatives, step 1
  • calculate x and y derivatives, step 2
  • for every pixel, sum the 3 possible products between dx and dy over its neighbourhood (the formula this boils down to is spelled out right below this list)
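Put differently, with dx and dy from passes 1 and 2, and gxx, gxy, gyy being the kernel-weighted sums of dx*dx, dx*dy and dy*dy, the last line of pass 3 works out to the smaller eigenvalue of the 2x2 matrix [gxx gxy; gxy gyy], per pixel:

    lambda_min = (gxx + gyy - sqrt((gxx - gyy)^2 + 4*gxy^2)) / 2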

Passes 1 and 2 work like a charm; pass 3 does not, because the compiler allocates an excessive number of temporaries without any need to do so …
Here is the GLslang code:

  
const int KERNEL_WIDTH = 5;
const int KERNEL_HEIGHT = 5;
uniform samplerRECT textures[1];
uniform vec4 kernel[4][KERNEL_WIDTH];   // indexed as kernel[k][j] in the inner loop

void main(void)
{
    int i, j, k;
    // accumulators for the kernel-weighted sums of gxx, gxy, gyy
    // (locals are not zero-initialised in GLSL)
    vec4 accum[3];
    accum[0] = vec4(0.0);
    accum[1] = vec4(0.0);
    accum[2] = vec4(0.0);
    vec2 addr = gl_TexCoord[0].xy;

    for (i = 0; i < KERNEL_HEIGHT; i++)
    {
        addr.x = gl_TexCoord[0].x;
        for (j = 0; j < KERNEL_WIDTH; j++)
        {
            // each texel packs 8 half floats: 4 x-derivatives (gx) and 4 y-derivatives (gy)
            vec4 tex = texRECT(textures[0], addr);
            half4 gx, gy;
            gx.xy = unpack_2half(tex.x);
            gx.zw = unpack_2half(tex.y);
            gy.xy = unpack_2half(tex.z);
            gy.zw = unpack_2half(tex.w);
            vec4 gxx = gx * gx;
            vec4 gxy = gx * gy;
            vec4 gyy = gy * gy;
            // accumulate the kernel-weighted products for this tap
            for (k = 0; k < 4; k++)
            {
                accum[0][k] += dot(gxx, kernel[k][j]);
                accum[1][k] += dot(gxy, kernel[k][j]);
                accum[2][k] += dot(gyy, kernel[k][j]);
            }
            addr.x += 1.0;
        }
        addr.y += 1.0;
    }

    vec4 gxx = vec4(accum[0][0], accum[0][1], accum[0][2], accum[0][3]);
    vec4 gxy = vec4(accum[1][0], accum[1][1], accum[1][2], accum[1][3]);
    vec4 gyy = vec4(accum[2][0], accum[2][1], accum[2][2], accum[2][3]);
    // smaller eigenvalue of [gxx gxy; gxy gyy], computed per component
    gl_FragColor = (gxx + gyy - sqrt((gxx - gyy) * (gxx - gyy) + 4.0 * gxy * gxy)) / 2.0;
}

Now, looking at the asm output (I can provide it, but it's rather lengthy), the compiler tries to make the texture fetches independent from each other by using different temporary variables for the accums and summing them up only at the very end. This way every loop iteration uses another set of 3 temporaries, so I'm quickly hitting the 32-temporary limit of the GeForce FX.
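To illustrate what I mean (this is only a schematic GLSL reconstruction of the structure I see, not the actual asm output; v[] is a stand-in for the value each unrolled iteration produces):

uniform vec4 v[4];   // stand-in for the per-iteration results of the dot()s
void main(void)
{
    // what I wrote: a single running accumulator
    vec4 accum = vec4(0.0);
    for (int i = 0; i < 4; i++)
        accum += v[i];

    // what the unrolled output effectively does: one fresh temporary per
    // iteration, all of them summed only at the very end
    vec4 t0 = v[0];
    vec4 t1 = v[1];
    vec4 t2 = v[2];
    vec4 t3 = v[3];
    vec4 split = t0 + t1 + t2 + t3;

    gl_FragColor = accum + split;   // keep both versions alive for the example
}

With 3 accumulators and KERNEL_WIDTH * KERNEL_HEIGHT iterations that adds up very quickly.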
Has anyone experienced similar behaviour? Is there a way to tell the compiler not to use this "optimization", or better: shouldn't the compiler be clever enough to disable it when the optimization makes it run out of temporaries? I'm using a GeForce FX 5600 with Forceware 66.32.
Best regards,
Martin Kraus

I'm not sure, but try moving tex, gx and gy to the top of the shader. The compiler does loop unrolling and maybe it's not smart enough to recognize that a variable goes out of scope.

I tried to compile your shader but there are syntax problems (sampler2DRect instead of samplerRECT, …), so please post your original shader.

yooyo

OK… after fixing the shader it compiles. Just move tex, gx, gy, gxx, gxy and gyy to the top of the shader.
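Something along these lines, just a minimal standalone sketch of the idea (hypothetical sampler name, and plain sampler2D/texture2D so it compiles anywhere; not your actual shader):

uniform sampler2D image;
void main(void)
{
    // declare the per-iteration temporaries once, before the loop,
    // so the unroller can (hopefully) reuse the same registers
    vec4 tex;
    vec4 sum = vec4(0.0);
    vec2 addr = gl_TexCoord[0].xy;
    for (int j = 0; j < 5; j++)
    {
        tex = texture2D(image, addr);   // same temporary reused every iteration
        sum += tex;
        addr.x += 0.01;                 // arbitrary step, just for the sketch
    }
    gl_FragColor = sum;
}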

yooyo

Hi,
thanks for your answer, I will try this tomorrow when I'm at work. This is the original shader, but both the unpacking function and texRECT are "borrowed" from Cg; the Nvidia compiler supports mixing the two ... at least for pack / unpack I don't think there is anything in GLSL that can do this …
P.S.: the ideal parameter I'd like to use would be 11 for KERNEL_HEIGHT ... I did some reordering of the code to use fewer instructions ... I'll update this tomorrow.
Bye,
Martin Kraus
