SSE and row or column matrices?


I’m writing some SSE code to optimize some math stuff like matrices.

I am using column-ordered matrices, but that doesn’t mean I have to store them column-ordered.

Column-ordered matrices with row storage are very efficient for the matrix-vector product, for example. On the other hand, position access and matrix-by-matrix products seem more efficient with column storage.

Any advice or experiences on that?


Of course it does - that’s exactly what “column ordered” (by which I assume you mean the more standard term column major) means. A column major matrix is a matrix that is stored in a 1D array as a series of columns. A row major matrix is stored as a series of rows. However, for a proper matrix class API, the storage mechanism is irrelevant, as you said.

A matrix-matrix product is equivalent to multiple matrix-vector products, so if matrix-vector is efficient, then so is matrix-matrix by corollary. And how does position access change between row or column major storage?

As stephen mentioned, this is confused. Storage order is storage order.

What you may be referring to is the fact that these two:
[ul]
[li]row-major, vector-on-the-left (e.g. v1 * MVP = v2)[/li]
[li]column-major, vector-on-the-right (e.g. PVM * v1 = v2)[/li]
[/ul]
These result in storing the elements of your matrices in <u>exactly</u> the same memory locations. This is the usual convenient way to reconcile OpenGL’s column-major storage order with C’s row-major storage order.

Storage order is definitely NOT major order:
(x, y, z, p)
(x, y, z, p)
(x, y, z, p)
(0, 0, 0, 1)
This is a column-major matrix: x, y, z are fresnel vectors and p is the position.

(x, x, x, 0)
(y, y, y, 0)
(z, z, z, 0)
(p, p, p, 1)
This is a row-major matrix, but writing it this way doesn’t imply at all how it is stored.

If it is a column-ordered matrix, this means:
__m128 data[4];
data[0] == “x0, x1, x2, x3”; Yes, we can write it that way, but that’s the way it is.

If the column-major matrix is row-major ordered, then:
data[0] == “x0, y0, z0, w0”

This makes it possible to do something like dot(data[0], vec4) directly, since the first row of the matrix is already in an SSE register, just like vec4.

Basically, code like this:
mov esi, data                               ; esi -> matrix row 0
mov edi, vec4                               ; edi -> vector
movaps xmm0, [esi]                          ; load row 0
movaps xmm1, [edi]                          ; load vec4
mulps xmm0, xmm1                            ; lane-wise products
movaps xmm1, xmm0
shufps xmm0, xmm0, _MM_SHUFFLE(2, 3, 0, 1)  ; swap lanes within each pair
addps xmm0, xmm1                            ; pairwise sums
movaps xmm1, xmm0
shufps xmm0, xmm0, _MM_SHUFFLE(0, 1, 2, 3)  ; reverse lane order
addps xmm0, xmm1                            ; full sum in every lane
movss result, xmm0                          ; store the dot product
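For comparison, the same horizontal-add dot product can be sketched with SSE intrinsics; this is just an illustration (the dot4 helper name is mine, not from the thread):

```c
#include <xmmintrin.h>  /* SSE */

/* Dot product of two 4-float vectors, mirroring the
   mulps/shufps/addps sequence above. */
static float dot4(__m128 a, __m128 b)
{
    __m128 prod = _mm_mul_ps(a, b);                                    /* lane-wise products */
    __m128 swap = _mm_shuffle_ps(prod, prod, _MM_SHUFFLE(2, 3, 0, 1)); /* swap within pairs */
    __m128 pair = _mm_add_ps(prod, swap);                              /* pairwise sums */
    __m128 rev  = _mm_shuffle_ps(pair, pair, _MM_SHUFFLE(0, 1, 2, 3)); /* reverse lanes */
    __m128 sum  = _mm_add_ps(pair, rev);                               /* total in all lanes */
    return _mm_cvtss_f32(sum);                                         /* extract lane 0 */
}
```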

With a column-major, column-ordered matrix, the issue is that we first need some swizzling to gather each row’s components into a single SSE register… and that is fairly slow relative to the cost of a matrix-vector product: 12 cycles and 5 registers for that swizzle, when a matrix-vector product with a column-major, row-ordered matrix costs “only” 47 cycles on an Athlon X2 6000+.

This is just an example!

So, I’m looking for feedback from people who have already done these experiments; it would be very valuable to me.


Column-major and row-major are independent of the data stored within the matrix. What you’re talking about is particular to your problem - it is not a general concept in matrix math. Specifically, for a matrix M:

[[ m11, m12, m13, m14 ]
 [ m21, m22, m23, m24 ]
 [ m31, m32, m33, m34 ]
 [ m41, m42, m43, m44 ]]

Storing the matrix row-major means the layout in a 1D array would be

[ m11, m12, m13, m14, m21, m22, m23, m24, m31, m32, m33, m34, m41, m42, m43, m44 ]

whereas storing the matrix column-major means the layout would be

[ m11, m21, m31, m41, m12, m22, m32, m42, m13, m23, m33, m43, m14, m24, m34, m44 ]

You’ll notice that these definitions are irrespective of what the values of mij actually are. Your issue is you are storing your matrix of “fresnel vectors” (I don’t know what these are and a google search turned up no useful answers), and you don’t know if you want the matrix or its transpose. That I can’t answer for you.
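In code, the two layouts differ only in their index mapping (0-based, hypothetical helper names), and a row-major array of M holds exactly the same values, in the same order, as a column-major array of M transposed:

```c
/* Offset of element (i, j) of a 4x4 matrix in a 1D array, 0-based. */
static int row_major_index(int i, int j) { return i * 4 + j; }
static int col_major_index(int i, int j) { return j * 4 + i; }
```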

You can look at the operations you want to do more closely however. E.g. matrix-vector multiplication, where the matrix is multiplied on the right by a column-vector:

           M             *   u    =   v
[[ m11, m12, m13, m14 ]    [ u1 ]   [ v1 ]
 [ m21, m22, m23, m24 ]  * [ u2 ] = [ v2 ]
 [ m31, m32, m33, m34 ]    [ u3 ]   [ v3 ]
 [ m41, m42, m43, m44 ]]   [ u4 ]   [ v4 ]

You can look at this operation in two different ways. Either you can see it as a series of dot products:

v1 = dot( m1j, u )
v2 = dot( m2j, u )
v3 = dot( m3j, u )
v4 = dot( m4j, u )

or you could see it as a series of scaled vector additions:

v = mi1 * u1
v += mi2 * u2
v += mi3 * u3
v += mi4 * u4

Obviously, the first technique addresses the matrix row-by-row, so it needs a row-major matrix for efficient access (to avoid a million swizzles), and the series of scaled vector adds needs a column-major matrix (for the same reason). Even given the correct storage, there isn’t a single dot-product SSE instruction, so each dot product costs a multiply followed by a horizontal add (a few SSE ops, depending on the level of SSE available), whereas scaled vector adds are quick and easy. Therefore, for right-multiplication of matrices by column vectors, column-major is the more efficient storage option.
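A minimal sketch of the scaled-vector-add form with SSE intrinsics (my own illustration, assuming the matrix is stored as four __m128 columns):

```c
#include <xmmintrin.h>  /* SSE */

/* v = M * u, with M stored column-major: col[j] is column j of M.
   Each component of u is splatted, scaled against a whole column,
   and accumulated -- no swizzling of the matrix itself is needed. */
static __m128 mat_vec_col_major(const __m128 col[4], __m128 u)
{
    __m128 v;
    v = _mm_mul_ps(col[0], _mm_shuffle_ps(u, u, _MM_SHUFFLE(0, 0, 0, 0)));
    v = _mm_add_ps(v, _mm_mul_ps(col[1], _mm_shuffle_ps(u, u, _MM_SHUFFLE(1, 1, 1, 1))));
    v = _mm_add_ps(v, _mm_mul_ps(col[2], _mm_shuffle_ps(u, u, _MM_SHUFFLE(2, 2, 2, 2))));
    v = _mm_add_ps(v, _mm_mul_ps(col[3], _mm_shuffle_ps(u, u, _MM_SHUFFLE(3, 3, 3, 3))));
    return v;
}
```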

The process is the same to show that for left-multiplication by row vectors, row-major storage is more efficient, but I’ll leave that as an exercise for the reader. =) And again, like I said before, matrix-matrix multiplication is just multiple matrix-vector multiplies, so that’s what determines how efficient it will be. And accessing individual elements costs the same in both storage formats.

You’re 100% correct about the matrix-vector multiplications. But for matrix-matrix multiplications row-major or column major storage won’t make a difference.

Makes sense: since it needs to access a row from one matrix and a column from another matrix to calculate one output value, both storage modes will be equally efficient.
| A11 A12 A13 A14 | | B11 B12 B13 B14 |   | C11 C12 C13 C14 | 
| A21 A22 A23 A24 | | B21 B22 B23 B24 |   | C21 C22 C23 C24 | 
| A31 A32 A33 A34 |*| B31 B32 B33 B34 | = | C31 C32 C33 C34 | 
| A41 A42 A43 A44 | | B41 B42 B43 B44 |   | C41 C42 C43 C44 | 

//column major
Ci1 = B11*Ai1 + B21*Ai2 + B31*Ai3 + B41*Ai4
Ci2 = B12*Ai1 + B22*Ai2 + B32*Ai3 + B42*Ai4
Ci3 = B13*Ai1 + B23*Ai2 + B33*Ai3 + B43*Ai4
Ci4 = B14*Ai1 + B24*Ai2 + B34*Ai3 + B44*Ai4

//row major
C1i = A11*B1i + A12*B2i + A13*B3i + A14*B4i
C2i = A21*B1i + A22*B2i + A23*B3i + A24*B4i
C3i = A31*B1i + A32*B2i + A33*B3i + A34*B4i
C4i = A41*B1i + A42*B2i + A43*B3i + A44*B4i
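Those column-major formulas map directly onto four matrix-vector multiplies; a sketch with SSE intrinsics (helper name mine, matrices stored as four __m128 columns):

```c
#include <xmmintrin.h>  /* SSE */

/* C = A * B, all three stored column-major as four __m128 columns.
   Column j of C is A times column j of B, i.e. the scaled-vector-add
   form: C[j] = A.col0*B1j + A.col1*B2j + A.col2*B3j + A.col3*B4j. */
static void mat_mul_col_major(const __m128 a[4], const __m128 b[4], __m128 c[4])
{
    for (int j = 0; j < 4; ++j) {
        __m128 bj  = b[j];
        __m128 acc = _mm_mul_ps(a[0], _mm_shuffle_ps(bj, bj, _MM_SHUFFLE(0, 0, 0, 0)));
        acc = _mm_add_ps(acc, _mm_mul_ps(a[1], _mm_shuffle_ps(bj, bj, _MM_SHUFFLE(1, 1, 1, 1))));
        acc = _mm_add_ps(acc, _mm_mul_ps(a[2], _mm_shuffle_ps(bj, bj, _MM_SHUFFLE(2, 2, 2, 2))));
        acc = _mm_add_ps(acc, _mm_mul_ps(a[3], _mm_shuffle_ps(bj, bj, _MM_SHUFFLE(3, 3, 3, 3))));
        c[j] = acc;
    }
}
```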

Ah, you’re right of course, you can look at it as a matrix on the left multiplied by multiple column vectors on the right, or a matrix on the right multiplied by multiple row vectors on the left, so whatever your storage, you just do four fast matrix-vector multiplies for a matrix-matrix multiply.

I’m just used to column major and column vectors, so thinking about row major and row vectors is a little weird for me. =)

I hear ya. I’m still using glLoadTransposeMatrix and GL_TRANSPOSE_*_MATRIX because I’m used to row-major storage and I don’t want to break my head over this. :)

I suggest reading the whole series, but I think this might be insightful for you:

I can’t agree with that. Row or column major matrices are a mathematical concept.
If matrices are column major
M = M1 * M2 * M3
else if matrices are row major
M = M3 * M2 * M1

It doesn’t mean that the storage, a “computer concept”, has to work the same way. I prefer the usual mathematical view of a matrix, column major, but I’m not sure that’s the more efficient one. It needs to fit the SSE registers to prevent swizzling. I have another good example with matrix inversion, which needs 53 products but 80 swizzles…

“Storage” as you put it affects the definition of operations over matrices.

Consider the case where I create a mapping from one matrix layout to another, as well as the inverse of that mapping. If I take an input matrix, apply my mapping so the elements are reordered, perform some operation without respect to my mapping, then apply the inverse mapping, I’m going to (likely) get a different result than if I had just performed the operation on the original matrix. Stating that your physical layout and your conceptual layout are different is fine; you’ve just created such a mapping.

As such, we see that “storage” is as much a mathematical concept (it defines a mapping) as it is “a computer concept”. Of course you’re welcome to redefine your operations with respect to your mapping.

Which brings me back to the article I posted above. In stating that you’re thinking about efficiency in terms of matrix layout, you’re thinking about memory access and performance.

If your matrices were large enough, you should expect to see a performance increase if you transposed one of them first so you could access them both in linear order. If you’re doing 4x4 then I don’t expect they’re large enough to see such a benefit, but try it out and let us know.
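A plain-C sketch of that idea for large n-by-n, row-major matrices (hypothetical helper, not measured code): transposing B up front turns the strided column walk into a linear scan of both operands.

```c
#include <stdlib.h>

/* C = A * B for row-major n x n float matrices. B is transposed into a
   scratch buffer first so the inner loop reads both operands sequentially. */
static void mat_mul_transposed(const float *a, const float *b, float *c, int n)
{
    float *bt = malloc((size_t)n * n * sizeof *bt);  /* bt = B transposed */
    if (!bt)
        return;
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            bt[j * n + i] = b[i * n + j];
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            float sum = 0.0f;
            for (int k = 0; k < n; ++k)              /* both rows scanned linearly */
                sum += a[i * n + k] * bt[j * n + k];
            c[i * n + j] = sum;
        }
    free(bt);
}
```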


Okay, so is the matrix

[[ 1 2 3 4 ]
 [ 5 6 7 8 ]
 [ 7 6 5 4 ]
 [ 3 2 1 0 ]]

row ordered or column ordered? Please give a reference to a definition of this concept.

That’s pretty much the idea. A matrix transpose remains quite expensive, even though shuffle instructions have become really cheap (it lowers the fun too, but anyway…). I had astonishing results from the matrix-vector product, row or column ordered: the codes are very different, but the efficiency is close.