Hi there

Some time ago I tried glSlang on my Radeon 9700, and back then it was not very usable.
Yesterday I tried again, hoping that it had improved since then.
My first impression is that it has in fact improved quite a bit; however, there are still some annoying things.

At the moment I am doing skinning on the GPU. The easiest method is to use a for-loop. However, the hardware does not seem to support this: my vertex shader falls back to software rendering. I can live with that; it’s a hardware limitation and I COULD work around it (static loops will be unrolled, no?).

Anyway, another problem with skinning is the amount of data for the bones. A 4x4 matrix takes 16 uniform components (floats), and my card seems to support 1024 of them, which makes 64 bones (61 in reality).

Now, I THOUGHT I could reduce this by sending the rotation as a 3x3 matrix and the translation as a vec3, reducing one bone to 12 floats, which should allow me to use around 80 bones.

However, after decomposing the matrices and doing the extra maths in the shader, I increased the bone count in the shader and, surprise, I did not get more than 60 bones. In fact, with 4x4 matrices I could get 61; with 3x3 + vec3 I get 60 bones.

Now THAT’s not a hardware limitation anymore, I think. It SHOULD support more bones.

This sounds as if every data element needs to be “aligned” to some even address (a multiple of 16 for ALL matrices, a multiple of 4 for ALL vectors???).

If that’s true, ATI is limiting the potential of their cards quite heavily.

Can anyone confirm this? Is it the same on GeForce 6+?


I guess quaternions will help you fit more bones into the available slots. Quaternions also avoid the well-known gimbal lock issue of Euler angles. The main problem with 3x3 matrices, as far as I know, is that they don’t align well in memory, so a lot of space can be wasted.

I’d like to try that out, but none of the articles I could find show how you actually rotate a vertex by a quaternion.

In fact, I already use quaternions to interpolate animations. But to actually deform the mesh I convert the quaternions into matrices, because I don’t know how to apply the rotation to the vertex directly through the quaternion.

If you have a link to where it is explained, please post it. I will keep searching for it.


OK, the first 10 articles I found didn’t mention it at all, but immediately after posting here, the next article I found explains it in detail… doh! :)

Thanks, anyway.
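For anyone with the same question, here is a minimal CPU-side sketch in Python of rotating a vector directly by a unit quaternion, using v' = q * (v, 0) * conjugate(q). It is not taken from the article mentioned above. The function names and the (x, y, z, w) component order are assumptions made for illustration only.

```python
import math

def quat_mul(a, b):
    """Hamilton product of two quaternions stored as (x, y, z, w)."""
    ax, ay, az, aw = a
    bx, by, bz, bw = b
    return (
        aw * bx + bw * ax + (ay * bz - az * by),
        aw * by + bw * ay + (az * bx - ax * bz),
        aw * bz + bw * az + (ax * by - ay * bx),
        aw * bw - (ax * bx + ay * by + az * bz),
    )

def quat_rotate(q, v):
    """Rotate 3-vector v by unit quaternion q via q * (v, 0) * conjugate(q)."""
    x, y, z, w = q
    conj = (-x, -y, -z, w)  # for a unit quaternion the conjugate is the inverse
    r = quat_mul(quat_mul(q, (v[0], v[1], v[2], 0.0)), conj)
    return r[:3]

def quat_from_axis_angle(axis, angle):
    """Unit quaternion for a rotation of `angle` radians about a unit-length axis."""
    s = math.sin(angle / 2.0)
    return (axis[0] * s, axis[1] * s, axis[2] * s, math.cos(angle / 2.0))

# A 90 degree rotation about +Z should take the X axis to the Y axis.
q = quat_from_axis_angle((0.0, 0.0, 1.0), math.pi / 2.0)
rotated = quat_rotate(q, (1.0, 0.0, 0.0))
```

The same two products translate directly to a shader with dot() and cross(), so no conversion to a matrix is needed as long as the quaternion is normalized.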

Your 3x3 is probably being represented as a 3x4, so when you add the translation column separately, you’re up to 4x4 (effectively) space. Why not represent your matrices as 3x4? That should be fine if you don’t want to generate an output w coordinate and you still get translation & rotation.
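A rough CPU-side sketch in Python of the 3x4 idea: keep only the three rows that produce x, y and z, each stored as four floats (rotation row plus one translation component), and take one dot product per row. The helper names pack_3x4 and transform_3x4 are made up for this illustration, and the row layout is an assumption rather than anything a driver requires.

```python
def pack_3x4(rotation, translation):
    """Pack a row-major 3x3 rotation and a translation into three 4-float rows."""
    return [tuple(rotation[i]) + (translation[i],) for i in range(3)]

def transform_3x4(rows, v):
    """Apply the packed bone to point v; the implicit w of the point is 1."""
    p = (v[0], v[1], v[2], 1.0)
    return tuple(sum(r * c for r, c in zip(row, p)) for row in rows)

# 90 degree rotation about Z, followed by a translation of 10 along X.
rot_z_90 = [(0.0, -1.0, 0.0),
            (1.0,  0.0, 0.0),
            (0.0,  0.0, 1.0)]
bone = pack_3x4(rot_z_90, (10.0, 0.0, 0.0))
moved = transform_3x4(bone, (1.0, 0.0, 0.0))
```

Each row fills exactly four floats, so a bone stored this way should need three 4-float slots instead of four.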

You have 256 uniforms (sometimes 96 on old hardware), not 1024.
1 uniform = 1 vec4 (4 floats).
You cannot allocate anything smaller than this.
The same limitation exists in DX9 and on all hardware.

So it is not an ATI-related issue. It’s generic.
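To make the arithmetic concrete, here is a small Python sketch that counts vec4 slots per bone for a few layouts. It assumes every vector or matrix column is padded up to a whole vec4 slot, that a mat3 is stored as three columns, and that the budget is 256 slots (1024 floats); real drivers may pack differently, and other uniforms such as the projection matrix also take slots, so treat the results as upper bounds.

```python
SLOT_FLOATS = 4      # one uniform slot = one vec4
BUDGET_SLOTS = 256   # assumed budget: 1024 floats / 4

def slots(floats_per_element, elements=1):
    """vec4 slots used when every element is padded up to a whole slot."""
    per_element = -(-floats_per_element // SLOT_FLOATS)  # ceiling division
    return elements * per_element

layouts = {
    "mat4": slots(4, 4),                      # four vec4 columns
    "mat3 + vec3": slots(3, 3) + slots(3),    # three padded columns + padded vec3
    "3x4 (three vec4 rows)": slots(4, 3),
    "quaternion + vec3": slots(4) + slots(3),
}

max_bones = {name: BUDGET_SLOTS // n for name, n in layouts.items()}

for name, n in layouts.items():
    print(f"{name}: {n} slots per bone, at most {max_bones[name]} bones")
```

Under these assumptions, mat3 + vec3 costs the same four slots as a mat4, which would explain why splitting the matrix did not raise the bone count.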

And yes, you cannot use loops with GLSL on ATI. They are badly implemented (the compiler runs out of temporary registers and seems to unroll the loop instead of using the loop instructions that are part of the Shader Model 2.0 spec). The same shader ported to DX9/HLSL works on the same hardware with full acceleration.

You cannot use Cg/OpenGL either, because on ATI, Cg will try to use ARB_vertex_program, which is hopeless without GL_NV_vertex_program2_option support (which gives you loop support and an A0 index register with 4 components instead of one, both very important for good skinning shader code).

So the options are few …

Originally posted by execom_rt:
You have 256 uniforms (sometimes 96 on old hardware), not 1024.
1 uniform = 1 vec4 (4 floats).

The GLSL spec says that an implementation must support at least 512 single-component (1 float) uniforms (MAX_VERTEX_UNIFORM_COMPONENTS_ARB).

That means up to 32 matrices (16 floats each). I get up to 61 matrices => 1024 floats available.

You are certainly right that in practice one uniform always occupies 4 components, so you might not be able to get more than 128 uniforms on hardware that supports 512 floats, but that’s a different matter.

Anyway, thanks for the information.


Yes, you are right about the 1024 vec4 uniforms (or 4096 floats) for ARB_vertex_shader.

It’s ARB_vertex_program that is limited to 256 vec4 uniforms.

OK, the first 10 articles I found didn’t mention it at all, but immediately after posting here, the next article I found explains it in detail… doh!

Thanks, anyway.

I don’t suppose you could post that article??

I implemented a Quaternion-based HW skinning solution a couple of months ago. Unfortunately, as soon as I modified a single uniform the shader kicked back to software rendering (Radeon 9800 Pro). It still ran at a reasonable proof-of-concept rate of roughly 20fps (with nothing else). I have been able to work with CPU skinning since then, but I’d like to know what could have been going wrong there.

// Mike


My shader:


//uniform mat4 BoneMatrices[10];
uniform vec4 BoneRotations[100];	// unit quaternions, stored as (x, y, z, w)
uniform vec3 BoneTranslations[100];

// Quaternion product q1 * q2
vec4 mult (vec4 q1, vec4 q2)
{
	vec4 r;
	r.w = q1.w * q2.w - dot (q1.xyz, q2.xyz);
	r.xyz = q1.w * q2.xyz + q2.w * q1.xyz + cross (q1.xyz, q2.xyz);
	return (r);
}

// Rotate v by the unit quaternion q: q * v * conjugate (q)
vec3 rotate (vec4 q, vec3 v)
{
	vec4 r = mult (q, vec4 (v.x, v.y, v.z, 0.0));
	r = mult (r, vec4 (-q.x, -q.y, -q.z, q.w));
	return (r.xyz);
}

// Transform the vertex by one bone: rotate, then translate
vec3 doit (vec4 q, vec3 t)
{
	vec3 r = rotate (q, gl_Vertex.xyz);

	return (r + t);
}

void main (void)
{
	vec4 Position = vec4 (0.0, 0.0, 0.0, 1.0);

	vec4 Weight = gl_MultiTexCoord1;
	vec4 Index  = gl_MultiTexCoord0;
//	vec4 Number = gl_MultiTexCoord2;
	const int iLoop = 4;

//	int iLoop = int (Number.x);
	for (int i = 0; i < iLoop; i += 1)
	{
//		Position.xyz = (Position + Weight.x * (BoneMatrices[int (Index.x)] * gl_Vertex)).xyz;
		Position.xyz = (Position.xyz + Weight.x * doit (BoneRotations[int (Index.x)], BoneTranslations[int (Index.x)])).xyz;
		// Rotate the next weight/index pair into the .x component
		Weight = Weight.yzwx;
		Index = Index.yzwx;
	}

	vec4 temp = (gl_ModelViewProjectionMatrix * Position);

	gl_Position = temp;
}

Hope that helps you. With a fixed loop-count, this shader runs in hardware on my Radeon 9700.

