Efficiency: vector*matrix or matrix*vector

Setting aside the question of transposing the matrix, which is supposed to be faster: multiplying with the vector on the left (vector * matrix) or on the right (matrix * vector)?

I know this is something of an implementation detail. And scalar-based shader systems (G80 and above) generally don’t care. But what’s the answer for vector-based hardware?

MADD must be faster than DOT4. For example, matrix * vector compiles to one MUL followed by three MADs:

uniform mat4 mmmmm;

void main(){
	gl_Position = mmmmm * gl_Vertex; // matrix * column vector
}

PARAM c[4] = { program.local[0..3] };
ATTRIB vertex_attrib[] = { vertex.attrib[0..0] };
MUL.F R0, vertex.attrib[0].y, c[1];
MAD.F R0, vertex.attrib[0].x, c[0], R0;
MAD.F R0, vertex.attrib[0].z, c[2], R0;
MAD.F result.position, vertex.attrib[0].w, c[3], R0;

That contains a shuffle, though. (Some, if not all, RHD parts need an extra cycle for a shuffle, IIRC.)
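
For anyone comparing the two forms, here is a rough C sketch of what each one computes. It is my own illustration, not driver output: the names vec4, scale, mad, dot4, mul_mad and mul_dp4 are made up, and it assumes the matrix is available either as four columns (like c[0..3] above) or as four rows. mul_mad follows the MUL/MAD order of the compiled code; mul_dp4 is the one-dot-product-per-component alternative. The per-component scalar (v.x, v.y, ...) applied to a whole column is presumably the shuffle being discussed.

```c
/* Hypothetical 4-component vector type, for illustration only. */
typedef struct { float x, y, z, w; } vec4;

/* s * a, the equivalent of a single MUL with a broadcast scalar. */
static vec4 scale(float s, vec4 a)
{
    return (vec4){ s * a.x, s * a.y, s * a.z, s * a.w };
}

/* s * a + acc, the equivalent of a single MAD with a broadcast scalar. */
static vec4 mad(float s, vec4 a, vec4 acc)
{
    return (vec4){ s * a.x + acc.x, s * a.y + acc.y,
                   s * a.z + acc.z, s * a.w + acc.w };
}

/* Four-component dot product, the equivalent of a single DP4. */
static float dot4(vec4 a, vec4 b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z + a.w * b.w;
}

/* matrix * vector with the matrix given as columns c[0..3]:
   one MUL and three MADs, in the same order as the listing above. */
static vec4 mul_mad(const vec4 c[4], vec4 v)
{
    vec4 r = scale(v.y, c[1]);
    r = mad(v.x, c[0], r);
    r = mad(v.z, c[2], r);
    return mad(v.w, c[3], r);
}

/* The same product with the matrix given as rows: four DP4s,
   each producing one component of the result. */
static vec4 mul_dp4(const vec4 row[4], vec4 v)
{
    return (vec4){ dot4(row[0], v), dot4(row[1], v),
                   dot4(row[2], v), dot4(row[3], v) };
}
```

Both functions return the same vector when row[] holds the transpose of c[]; which one is cheaper depends on how the hardware handles the scalar broadcast versus the horizontal add inside a dot product.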