Say maybe faster processing with hardware(i couldn’t think how that could be possible).

Before we begin, I want to make sure you understand something. Graphics programmers have been using matrices in transformations for at least 3 decades now. Graphics hardware has had matrix transformations built into it, in one way or another, for a decade. People who are paid a lot of money to make things faster have decided that this is the best way of doing things.

So before you start questioning whether people with Ph.D.s and 6–7 figure salaries can actually do their jobs, consider that you perhaps don’t know enough about the subject at hand.

OK, let’s start with a simple problem. You have some vertices and you want to transform them from model space to camera space. All of these vertices will be transformed by the same transformation sequence. So we have one arbitrarily big model that will go through some transformation.

Let’s look at the simplest case: translation. Your transformation will simply be some 3D translation operation.

What’s the per-vertex cost with matrices? Using standard 4D vectors and a standard 4x4 matrix, you have 16 floating-point multiplications and 12 floating-point additions. The per-vertex cost of using the regular math directly is 3 floating-point additions. Seems like a loss for the matrix side.
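To make those counts concrete, here is a minimal Python sketch of both approaches. The function names and the row-major list-of-lists layout are my own choices for illustration, not anything a graphics API prescribes.

```python
def mat4_mul_vec4(m, v):
    """Row-major 4x4 matrix times a 4D vector: 4 multiplies and 3 adds per row."""
    return [
        m[r][0] * v[0] + m[r][1] * v[1] + m[r][2] * v[2] + m[r][3] * v[3]
        for r in range(4)
    ]

def translation_matrix(tx, ty, tz):
    """Homogeneous translation matrix; the offset sits in the last column."""
    return [
        [1.0, 0.0, 0.0, tx],
        [0.0, 1.0, 0.0, ty],
        [0.0, 0.0, 1.0, tz],
        [0.0, 0.0, 0.0, 1.0],
    ]

def translate_direct(p, t):
    """The 'regular math' version: just 3 additions."""
    return [p[0] + t[0], p[1] + t[1], p[2] + t[2]]

vertex = [1.0, 2.0, 3.0]
offset = [10.0, 20.0, 30.0]

# The matrix path needs w = 1 appended so the translation column takes effect.
via_matrix = mat4_mul_vec4(translation_matrix(*offset), vertex + [1.0])
via_direct = translate_direct(vertex, offset)
```

Both paths land on the same position; the matrix path just spends most of its 16 multiplies on zeros and ones.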

Let’s say we want to do a rotation. And let’s say that this rotation is axis-aligned. So it is a rotation about the X, Y or Z axis. The matrix cost is 16 multiplies and 12 additions. The regular math cost is 4 multiplies and 2 additions. Again, seems like a loss for matrices.

But how about when we do arbitrary angle/axis rotation, rather than using a cardinal axis? The matrix version is still 16 multiplies, 12 additions. But the regular math version jumps to 9 multiplies and 6 additions. A loss for the matrices, but the gap is dwindling.

Now, doing just a rotation is useless; you almost always want some translation in there, yes? So now we do an angle/axis rotation followed by a translation. What’s the cost? Matrices are still 16 multiplies, 12 additions. But the math version is now 9 multiplies and 9 additions.
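As a rough sketch of where the 9 multiplies and 9 additions come from: the angle/axis rotation can be expanded once per object into nine coefficients (the standard Rodrigues form), after which each vertex costs three 3-term dot products plus the translation. The function names below are made up for this example.

```python
import math

def axis_angle_matrix(axis, angle):
    """Nine rotation coefficients for a rotation of `angle` radians about `axis`.

    This is per-object setup work, so it is not counted in the per-vertex cost.
    """
    x, y, z = axis
    n = math.sqrt(x * x + y * y + z * z)
    x, y, z = x / n, y / n, z / n          # normalise the axis
    c, s = math.cos(angle), math.sin(angle)
    t = 1.0 - c
    return [
        [t * x * x + c,     t * x * y - s * z, t * x * z + s * y],
        [t * x * y + s * z, t * y * y + c,     t * y * z - s * x],
        [t * x * z - s * y, t * y * z + s * x, t * z * z + c],
    ]

def rotate_then_translate(r, t, p):
    """Per vertex: 9 multiplies, 6 adds for the rotation, 3 adds for the offset."""
    return [
        r[i][0] * p[0] + r[i][1] * p[1] + r[i][2] * p[2] + t[i]
        for i in range(3)
    ]

# Quarter turn about Z, then shift 5 units along X.
rotation = axis_angle_matrix([0.0, 0.0, 2.0], math.pi / 2)
moved = rotate_then_translate(rotation, [5.0, 0.0, 0.0], [1.0, 0.0, 0.0])
```

Note that the "regular math" here is already a 3x3 matrix multiply in disguise, plus one vector add.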

Fair enough. The regular math wins…

Unless we stop playing around and do something for real.

Consider a hierarchical model of a human. Each transform is relative to its parent transform. Again and again, all the way up to the root. For the typical human figure used in various software, the fingertip transform has about 10 transforms between it and the root (pelvis, lower-spine, mid-spine, upper-spine, clavicle, upper-arm, lower-arm, wrist, finger-joint-1, finger-joint-2, finger-tip).

What’s the cost of doing 10 separate translations+rotations? Well, with regular math, you can’t concatenate transforms. So you have to do each one in turn. The overall cost is therefore 10x the cost of doing one transform. So it’s 90 multiplies and 90 additions.

Matrices? Because you can concatenate transforms, it’s just one matrix multiply: 16 multiplies and 12 additions.
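Here is a small Python sketch of that concatenation. The joint angles and offsets are invented purely for illustration; the point is only that one combined matrix gives the same result as walking the whole chain for every vertex.

```python
import math

def mat4_mul(a, b):
    """4x4 times 4x4, row-major lists."""
    return [[sum(a[r][k] * b[k][c] for k in range(4)) for c in range(4)]
            for r in range(4)]

def mat4_mul_vec4(m, v):
    """4x4 times 4D vector."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]

def joint(angle, tx, ty, tz):
    """Hypothetical joint transform: rotate about Z by `angle`, then translate."""
    c, s = math.cos(angle), math.sin(angle)
    return [
        [c,  -s,  0.0, tx],
        [s,   c,  0.0, ty],
        [0.0, 0.0, 1.0, tz],
        [0.0, 0.0, 0.0, 1.0],
    ]

# Ten made-up joints, root first, fingertip last.
chain = [joint(0.1 * (i + 1), 1.0, 0.5 * i, 0.0) for i in range(10)]

# Per-object work: fold the whole chain into a single matrix.
combined = chain[0]
for m in chain[1:]:
    combined = mat4_mul(combined, m)

def transform_sequential(transforms, v):
    """Per-vertex work without concatenation: apply the innermost joint first."""
    for m in reversed(transforms):
        v = mat4_mul_vec4(m, v)
    return v

vertex = [0.25, -1.0, 2.0, 1.0]
fast = mat4_mul_vec4(combined, vertex)      # one matrix-vector multiply
slow = transform_sequential(chain, vertex)  # ten of them
```

`fast` and `slow` agree up to floating-point rounding, but only `fast` has a per-vertex cost that stays flat as the hierarchy gets deeper.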

Now, you may say that it’s not fair. After all, concatenating those transforms takes time, right? But we’re only looking at per-vertex cost. The cost per-object to do this concatenation, the various matrix multiplies to compute the current matrix, is irrelevant. If each object has a large number of vertices, the overall performance will be governed by the per-vertex cost, not the per-object cost.
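A back-of-the-envelope cost model shows how quickly the concatenation pays for itself. This sketch assumes the operation counts used so far (16 + 12 per matrix-vector multiply, 9 + 9 per rotation+translation), counts a 4x4 by 4x4 multiply as 64 multiplies and 48 additions, and ignores the cost of building each individual transform, which both approaches pay anyway.

```python
MATRIX_VERTEX_OPS = 16 + 12   # one 4x4 matrix times one 4D vector
MATRIX_CONCAT_OPS = 64 + 48   # one 4x4 matrix times another 4x4 matrix
DIRECT_VERTEX_OPS = 9 + 9     # one rotation + translation done directly

def matrix_cost(depth, vertices):
    """Concatenate `depth` matrices once, then one multiply per vertex."""
    return (depth - 1) * MATRIX_CONCAT_OPS + vertices * MATRIX_VERTEX_OPS

def direct_cost(depth, vertices):
    """Apply all `depth` transforms to every vertex, one after another."""
    return depth * DIRECT_VERTEX_OPS * vertices

def break_even(depth):
    """Smallest vertex count at which the matrix path is cheaper, or None."""
    if depth * DIRECT_VERTEX_OPS <= MATRIX_VERTEX_OPS:
        return None  # the matrix path never catches up for such a short chain
    n = 1
    while matrix_cost(depth, n) >= direct_cost(depth, n):
        n += 1
    return n

fingertip_break_even = break_even(10)
```

Under these assumptions the ten-deep chain breaks even at 7 vertices. Anything bigger than a box is already on the matrix side.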

If you’re just drawing boxes, then the per-object cost may matter. But if you’re drawing something real, then it almost certainly doesn’t. And if you’re drawing boxes… who cares? My embedded HD 3300 can churn out boxes by the thousand.

Not to mention the simple shader complexity issue. Doing a translation followed by a rotation with regular math requires different shader logic from a rotation followed by a translation. That means you need two different vertex shaders to do these two things. If you need scaling, you now need a third vertex shader. To get all possible orderings of a single scale, translation, and rotation, you need six shaders.

Do you really need your shaders to actually specifically encode the order of transformations? It’s much simpler to just pass matrix data, where the order of operations is encoded in the matrix (T*R is not the same matrix as R*T).
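A tiny Python sketch of that idea: the function standing in for the vertex shader never changes, and the order of operations lives entirely in the matrix handed to it. The names here are invented for the example.

```python
def mat4_mul(a, b):
    """4x4 times 4x4, row-major lists."""
    return [[sum(a[r][k] * b[k][c] for k in range(4)) for c in range(4)]
            for r in range(4)]

def vertex_shader(matrix, v):
    """Stand-in for the one and only shader: it just computes M * v."""
    return [sum(matrix[r][c] * v[c] for c in range(4)) for r in range(4)]

# Quarter turn about Z (exact values, so no rounding noise).
R = [
    [0.0, -1.0, 0.0, 0.0],
    [1.0,  0.0, 0.0, 0.0],
    [0.0,  0.0, 1.0, 0.0],
    [0.0,  0.0, 0.0, 1.0],
]

# Translate by 1 along X.
T = [
    [1.0, 0.0, 0.0, 1.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]

v = [1.0, 0.0, 0.0, 1.0]

rotate_then_translate = vertex_shader(mat4_mul(T, R), v)  # T*R: R is applied first
translate_then_rotate = vertex_shader(mat4_mul(R, T), v)  # R*T: T is applied first
```

The first result is (1, 1, 0) and the second is (0, 2, 0): same shader, different matrix, different ordering.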

And even all of that doesn’t change one simple fact: matrix multiplies are really fast.

See, multiplying a vector by a matrix is a very simple operation. It is a vector-vector multiplication, followed by 3 vector-vector multiply/add operations. It looks like this:

```
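# mat.x .. mat.w are the four matrix columns; vec.xxxx broadcasts one component.
MUL temp, mat.x, vec.xxxx;
MAD temp, mat.y, vec.yyyy, temp;
MAD temp, mat.z, vec.zzzz, temp;
MAD out, mat.w, vec.wwww, temp;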
```

Each operation is dependent on the previous, but each channel of the values is independent of the others. Because of that, the shader compiler can boil this down into 4 independent sets of 4 multiplies and 3 adds. It can execute each of those in parallel (because that’s what GPUs do). Therefore, this will take no more than 4 cycles to complete (outside of any pipelining tricks and such).
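To see why one MUL and three MADs are enough, here is a Python emulation of that sequence. It assumes, as I read the listing, that `mat.x` through `mat.w` are the four matrix columns and that each one is scaled by the matching component of the vector; the helper names are invented.

```python
def mul4(a, b):
    """Component-wise multiply of two 4D vectors (the MUL opcode)."""
    return [a[i] * b[i] for i in range(4)]

def mad4(a, b, c):
    """Component-wise multiply-add, a * b + c (the MAD opcode)."""
    return [a[i] * b[i] + c[i] for i in range(4)]

def splat(s):
    """Broadcast one scalar to all four channels."""
    return [s, s, s, s]

def transform_mul_mad(cols, v):
    """M * v as one MUL and three MADs over the matrix columns."""
    temp = mul4(cols[0], splat(v[0]))
    temp = mad4(cols[1], splat(v[1]), temp)
    temp = mad4(cols[2], splat(v[2]), temp)
    return mad4(cols[3], splat(v[3]), temp)

def transform_dot_rows(cols, v):
    """Textbook form for comparison: one dot product per output component."""
    return [sum(cols[c][r] * v[c] for c in range(4)) for r in range(4)]

# Arbitrary column-major test matrix and vector.
columns = [
    [1.0, 2.0, 3.0, 4.0],
    [5.0, 6.0, 7.0, 8.0],
    [9.0, 10.0, 11.0, 12.0],
    [13.0, 14.0, 15.0, 16.0],
]
vec = [1.0, 0.0, 2.0, 1.0]

by_opcodes = transform_mul_mad(columns, vec)
by_rows = transform_dot_rows(columns, vec)
```

Each output channel only ever reads its own channel of `temp`, which is what lets the four channels run side by side.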

While this works just as well for regular math in theory, because regular math is… well, regular math, the compiler has to do more work to optimize it as well as this. So you need to spend time getting the transform operation written in exactly the form that the compiler will recognize and optimize. Compilers know what M*v means, and they’re good at optimizing that. Optimizing more arbitrary code is not as foolproof.

And let’s say you do get it perfectly correct. Let’s say you get the angle/axis + translation optimized perfectly, just like the matrix multiplication case. What do you save?

In many cases, nothing.

Doing a generalized angle/axis transformation is mathematically no different from doing a 3x3 matrix multiply against a 3D vector. You just don’t have that fourth component getting in there. So it’s 1 MUL and 2 MADs, but on 3D rather than 4D vectors. And you need one more opcode to do the addition for the translation, again on 3D coordinates. So it’s 1 MUL, 2 MADs, and one ADD, all 3D instead of 4D.

So, with each opcode, there is the chance to do something with that 4th component (that’s how shaders work on GPUs).

If you’re talking about NVIDIA hardware, GeForce 8xxx or above, or ATI hardware of the new Southern Islands generation, then they can find something to do with the 4th component fairly often. They are really scalar hardware that can execute 4 separate opcodes on a single shader.

However, if you’re looking at any pre-Southern Islands ATI GPU, or any GeForce 7xxx or below, then you’re in trouble. These are vector hardware, so each component of each opcode pretty much has to be doing the same thing. So each operation is 4D, even if you don’t do anything with the fourth component. So unless you have some scalar operation somewhere later (that isn’t dependent on the result of this) which needs some MAD, MUL, or ADD work, that fourth component will go unused.

In short: 3D math costs you just as much as 4D math on that hardware. Even on the scalar hardware, if you don’t have any scalar work (or maybe just some vector input-to-output copying), the compiler’s scheduler is going to have a hard time finding a way to put those extra 4 scalar opcodes to good use.

Can you make regular math transformations faster than matrix math? Yes. Particularly for simple 2D transforms. Should you do this in the general case? Absolutely not.