ATI fixed bug, but...

This code is for testing static loops

const vec2 const0=vec2(0.01, 0.025);
const int loopcount=XXX;

void main()
{
  gl_Position = ftransform();

  int i;
  vec2 texCoord=vec2(0.0, 0.0);
  for(i=0; i<loopcount; i++)
  {
    texCoord += const0;
  }
  gl_TexCoord[0].xy = texCoord;
}

When I used to set loopcount to 30, I got:

Link successful. The GLSL vertex shader will run in software - available number of
temporary registers exceeded. The GLSL fragment shader will run in hardware.

Now it works, but if loopcount >= 250, weird things happen:
polygons are rendered all over the place.

Can you fix it?

Since you are compiling the GLSL code, I assume that you have, at least, a Radeon 9600 and a recent version of ATI Catalyst.
The question is: does this code really exceed the hardware specification?

Here are some comparisons between different implementations (based on your code).
Cg conversion of your GLSL program. This is the way the nVidia / GLSL will compile your code:

#const c[0] = 0.3 0.7499998
PARAM c[5] = { { 0.29999998, 0.74999982 },
		program.local[1..4] };
MOV result.texcoord[0].xy, c[0];
DP4 result.position.w, vertex.position, c[4];
DP4 result.position.z, vertex.position, c[3];
DP4 result.position.y, vertex.position, c[2];
DP4 result.position.x, vertex.position, c[1];
# 5 instructions, 0 R-regs

We see that the nVidia implementation is pretty clever, because it has precomputed the values for you: the loop result, 30 * const0 = (0.3, 0.75), is stored directly in c[0].
So it will be very fast, and it will work on a Geforce 2MX, for example.

HLSL version (DX9)

Here is the converted code:

const float2 const0=float2(0.01, 0.025);
const int loopcount=30;

void main(uniform float4x4 ModelViewMatrixProj, in float4 gl_Vertex:POSITION, out float4 gl_Position:POSITION, out float2 TexCoord:TEXCOORD0)
{
  gl_Position = mul( ModelViewMatrixProj, gl_Vertex);
  int i;
  float2 texCoord=float2(0.0, 0.0);
  for(i=0; i<loopcount; i++)
  {
    texCoord += const0;
  }
  TexCoord = texCoord;
}

Now converted to Vertex Shader 2.0

// Default values:
//   loopcount
//     i0   = { 30, 0, 1, 0 };
//   const0
//     c4   = { 0.01, 0.025, 0, 0 };

    def c5, 0, 0, 0, 0
    dcl_position v0
    mul r0, v0.y, c1
    mad r0, c0, v0.x, r0
    mad r0, c2, v0.z, r0
    mad oPos, c3, v0.w, r0
    mov r0.xy, c5.x
    rep i0
      add r0.xy, r0, c4
    endrep
    mov oT0.xy, r0

// approximately 9 instruction slots used

In vertex shader 2.0, the for loop over i < loopcount is compiled to the rep i0/endrep instructions,
and it fits within the vertex shader 2.0 specification.

So there is indeed a ‘problem’ with the loop implementation in GLSL on ATI/PC (this code would work on MacOS 10.4.3 on ATI).

In a GLSL shader on ATI, if I use a const int as the loop count, my shader does get unrolled.

If I use a const float, it doesn’t.

And, well, using a variable instead of a constant doesn’t work at all, but that’s a known issue :(


The loop should not get unrolled, since the R300 (9500 to 9800) can do loops; this matters especially when the loop count is above 256.
We are supposed to be able to execute >65000 instructions on VS 2.0 hardware, and that is what my test is for.

Re: "This is the way the nVidia / GLSL will compile your code:"
Yes, that’s one of the things to watch out for. Try a large loopcount this time.

The previous ARB vertex program I posted used the standard ARB_vertex_program profile, which is not the default on nVidia.

With the vp30 (Geforce FX) or vp40 (Geforce 6, here) profiles, the compiler keeps a real loop. Here is your GLSL code compiled as it would run on a Geforce 6x or better. No unrolling this time (the code for Geforce FX is different but similar, using NV_vertex_program2).

Note that it stays at 12 instructions, even with a larger loop count. The limit is 65535 instructions in the shader, not 65535 executions of an instruction.

#var float2 const0 :  : c[6] : -1 : 1
#var int loopcount :  : c[5] : -1 : 1
#const c[4] = 0 1
#default const0 = 0.01 0.025
#default loopcount = 30

OPTION NV_vertex_program3;

PARAM c[7] = { program.local[0..3],
		{ 0, 1 },
		program.local[5..6] };
MOV   R0.xy, c[4].x;
DP4   result.position.w, vertex.attrib[0], c[3];
DP4   result.position.z, vertex.attrib[0], c[2];
DP4   result.position.y, vertex.attrib[0], c[1];
DP4   result.position.x, vertex.attrib[0], c[0];
MOV   R0.z, c[4].x;
BB2:
SLTC  CC.x, R0.z, c[5];
BRA   BB4 (EQ.x);
ADD   R0.xy, R0, c[6];
ADD   R0.z, R0, c[4].y;
BRA   BB2;
BB4:
MOV   result.texcoord[0].xy, R0;
# 12 instructions, 1 R-regs
