Since you are compiling GLSL code, I assume that you have at least a Radeon 9600 and a recent version of ATI Catalyst.
The question is whether this code really exceeds the hardware specification.
Here is a comparison between different implementations (based on your code).
Cg conversion of your GLSL program. This is how the nVidia GLSL compiler will compile your code:
!!ARBvp1.0
#const c[0] = 0.3 0.7499998
PARAM c[5] = { { 0.29999998, 0.74999982 },
program.local[1..4] };
MOV result.texcoord[0].xy, c[0];
DP4 result.position.w, vertex.position, c[4];
DP4 result.position.z, vertex.position, c[3];
DP4 result.position.y, vertex.position, c[2];
DP4 result.position.x, vertex.position, c[1];
END
# 5 instructions, 0 R-regs
We see that the nVidia implementation is pretty clever: it has removed the loop entirely and precomputed the result at compile time, so the texture coordinate is just the constant c[0].
So it will be very fast, and it will even work on a GeForce 2MX, for example.
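To see where the two values in c[0] come from, here is a small C sketch that runs the same accumulation on the CPU in single precision. It is not part of your shader: the names vec2, accumulate_texcoord and print_texcoord are made up for illustration, and I am assuming the loop adds (0.01, 0.025) thirty times, as in the HLSL version further down.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical 2-component float vector, standing in for float2 / vec2. */
typedef struct { float x, y; } vec2;

/* Mimics the shader loop: adds the per-iteration step loopcount times,
   using single-precision floats as the compiler would when folding. */
vec2 accumulate_texcoord(int loopcount)
{
    const vec2 step = { 0.01f, 0.025f };  /* const0 in the shader */
    vec2 tc = { 0.0f, 0.0f };

    assert(loopcount >= 0);
    for (int i = 0; i < loopcount; i++) {
        tc.x += step.x;
        tc.y += step.y;
    }
    return tc;
}

/* Prints the accumulated pair so it can be compared with c[0] above. */
void print_texcoord(int loopcount)
{
    vec2 tc = accumulate_texcoord(loopcount);
    printf("%.8f %.8f\n", tc.x, tc.y);
}
```

Calling print_texcoord(30) should give values very close to 0.3 and 0.75. The last digits depend on float rounding, which is presumably why the Cg output shows 0.29999998 and 0.74999982 rather than the exact values.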
HLSL version (DX9)
Here is the converted code:
const float2 const0=float2(0.01, 0.025);
const int loopcount=30;
void main(uniform float4x4 ModelViewMatrixProj, in float4 gl_Vertex:POSITION, out float4 gl_Position:POSITION, out float2 TexCoord:TEXCOORD0)
{
    gl_Position = mul( ModelViewMatrixProj, gl_Vertex);
    int i;
    float2 texCoord=float2(0.0, 0.0);
    for(i=0; i<loopcount; i++)
    {
        texCoord+=const0;
    }
    TexCoord.xy=texCoord;
}
Now converted to Vertex Shader 2.0
// Default values:
//
// loopcount
// i0 = { 30, 0, 1, 0 };
//
// const0
// c4 = { 0.01, 0.025, 0, 0 };
//
vs_2_0
def c5, 0, 0, 0, 0
dcl_position v0
mul r0, v0.y, c1
mad r0, c0, v0.x, r0
mad r0, c2, v0.z, r0
mad oPos, c3, v0.w, r0
mov r0.xy, c5.x
rep i0
add r0.xy, r0, c4
endrep
mov oT0.xy, r0
// approximately 9 instruction slots used
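In vertex shader 2.0, the for loop over i/loopcount is compiled to the rep i0/endrep instructions, with the iteration count (30) taken from the integer constant register i0.
So the loop is kept as a real loop here, and it still fits within the vertex shader 2.0 specification.
So there is indeed a ‘problem’ with the loop implementation in GLSL on ATI/PC (this code would work on MacOS 10.4.3 on ATI).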