4x4 matrix multiplication

Before starting to zoom/pan/rotate the scene, we flatten the scene graph into a long list of root-level objects with accumulated (level-by-level) transformations.

When the scene hierarchy grows to 10-15 levels deep, computing the accumulated transformations becomes a bottleneck.

Currently, we accumulate each transformation as 16 double values to keep precision high.

Is there any trick to speed up the 4x4 matrix multiplications? Is this something we could consider moving to the GPU?

Thanks,

Alberto

Not really an OpenGL question, but certainly OpenGL-related (scene graph).

A few thoughts for you to consider:

  • Have you run a CPU profiler on this to see where you’re spending your time? You might be surprised. (e.g. Is it really all matrix mult cost? Or is a good bit of it mem fetch cost?)
  • How many matrices (transform nodes) are we talking about here?
  • Are they really all changing, or just one or a few of them? Don’t waste time on work that doesn’t need doing.
  • Is all this really needed every frame (e.g. only a tiny portion is visible)? Could you cull away needless subgraphs that you don’t care about this frame?
  • Have you pushed static transforms down into the subgraphs and applied them to the vertex positions? This reduces the number of transform nodes required (and can allow you to merge batches with the same material).
  • Are you storing all of this in an old-school OOP heap-based, pointer-linked scene graph of node objects? If so, and profiling reveals that memory fetch is an issue, consider switching to a struct-of-arrays (SoA) approach with all your node matrices in one array and all your child linkages in another. Your CPU cache and memory prefetcher will thank you for it.
  • Typically you don’t need double-precision for everything. Just those at the highest levels which place subtrees in a large coordinate frame. So much of this could probably be done in 32-bit float.
  • Are you using SIMD? (or is your compiler doing it for you under-the-hood?)
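The SoA point above can be sketched roughly like this (Java used for illustration since the thread’s code is C#; the class name, row-major layout, and `-1`-for-roots convention are all assumptions, not anything from the original code):

```java
// Struct-of-arrays scene storage: one flat array of 4x4 matrices
// (16 doubles each, row-major) plus a parallel array of parent indices.
// Contiguous storage keeps the CPU cache and memory prefetcher happy.
public class SceneArrays {
    public final double[] local;   // node i's local matrix at local[i*16 .. i*16+15]
    public final double[] global;  // accumulated (world) matrices, same layout
    public final int[] parent;     // parent[i] = index of node i's parent, -1 for roots

    public SceneArrays(int nodeCount) {
        local  = new double[nodeCount * 16];
        global = new double[nodeCount * 16];
        parent = new int[nodeCount];
    }

    // Multiply two row-major 4x4 matrices stored at offsets la and ra,
    // writing the product at offset oa. No per-call allocation.
    public static void mul(double[] l, int la, double[] r, int ra, double[] o, int oa) {
        for (int i = 0; i < 4; i++)
            for (int j = 0; j < 4; j++) {
                double s = 0;
                for (int k = 0; k < 4; k++)
                    s += l[la + i * 4 + k] * r[ra + k * 4 + j];
                o[oa + i * 4 + j] = s;
            }
    }
}
```

Beyond cache friendliness, flat arrays of primitives also give an optimizer (or hand-written SIMD) a much easier target than a heap of small objects.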

Thanks Dark_Photon,

Are they really all changing, or just one or a few of them? Don’t waste time on work that doesn’t need doing.

We flatten the scene tree on mouse down and clear everything on mouse up. Keeping a secondary/twin representation alive all the time is too risky (the two can go out of sync).

How many matrices (transform nodes) are we talking about here?

We are talking about 120,000.

Is all this really needed every frame (e.g. only a tiny portion is visible)? Could you cull away needless
subgraphs that you don’t care about this frame?

During rotation, something can jump back inside the view frustum, so it’s difficult to make good choices.

Typically you don’t need double-precision for everything. Just those at the highest levels which place
subtrees in a large coordinate frame. So much of this could probably be done in 32-bit float.

How do I determine ‘highest’?

Thanks again,

Alberto

OK. How about putting BSpheres above large subtrees? During culling, transform the BSphere to eye space and cull it against the view frustum. If the BSphere is outside the frustum, you can prune that whole subtree, including all the matrix transform work within it that you’d otherwise do.
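A minimal sketch of the BSphere-versus-frustum test (all names hypothetical; assumes frustum planes are given in eye space with normals pointing into the frustum):

```java
public class FrustumCull {
    // A plane in the form n·p + d = 0, with n pointing into the frustum.
    public static class Plane {
        final double nx, ny, nz, d;
        public Plane(double nx, double ny, double nz, double d) {
            this.nx = nx; this.ny = ny; this.nz = nz; this.d = d;
        }
    }

    // Returns true if a sphere (center cx,cy,cz, radius r) lies entirely
    // behind any frustum plane -- in that case the whole subtree under it
    // can be pruned, skipping all of its matrix transform work.
    public static boolean outside(Plane[] frustum, double cx, double cy, double cz, double r) {
        for (Plane p : frustum) {
            double dist = p.nx * cx + p.ny * cy + p.nz * cz + p.d;
            if (dist < -r) return true;  // completely behind this plane
        }
        return false;  // inside or intersecting: recurse into the subtree
    }
}
```

Note this is conservative: a sphere that straddles a plane still forces you to descend, but a single dot product per plane is far cheaper than re-accumulating every matrix underneath it.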

And to your 120,000 transforms: when rotating, you’re just applying a single transform on top of the precomputed global modeling transforms in a subgraph. Why not pass both of these matrices (precomputed modeling + active rotation) into the shader and let it transform vertex positions (and normals) by both matrices rather than just one? Then there’s no need to rebuild 120,000 transforms on the CPU (…even if you were to do no coarse-grained culling with bounding volumes such as BSpheres, which you probably should).

Based on the size and precision requirements of the parent space.

For instance, if the max size of a subgraph is 1 km and you need ~0.01 m precision, you’re probably fine using 32-bit single-precision float for it. OTOH, if the size of your parent space is 12,800 km and you need ~0.01 cm precision, you’re going to want to step up to a 64-bit double-precision float to represent positions in that space, and similarly for transforms to/from that space.
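You can check this directly from the spacing between adjacent representable floats (`Math.ulp` in Java; the numbers below follow from the IEEE 754 single-precision format): at 12,800 km the nearest single-precision neighbours are a full metre apart, while within a 1 km subgraph they are less than a tenth of a millimetre apart.

```java
public class PrecisionCheck {
    public static void main(String[] args) {
        // Float spacing at 12,800 km (working in metres): exactly 1 m,
        // far coarser than a ~0.01 cm requirement -- doubles needed here.
        System.out.println(Math.ulp(12_800_000f));  // prints 1.0

        // Float spacing at 1 km: ~6.1e-5 m, comfortably finer than ~0.01 m,
        // so 32-bit floats are fine within a subgraph of this size.
        System.out.println(Math.ulp(1_000f));
    }
}
```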

Ah, uhm… these are both hard to implement in our pipeline…

Is there any way to skip some cells of this 4x4 multiplication under certain circumstances?

    public static Transformation operator *(Transformation left, Transformation right)
    {
        double[,] m = new double[4, 4];

        m[0, 0] = left[0, 0] * right[0, 0] + left[0, 1] * right[1, 0] + left[0, 2] * right[2, 0] + left[0, 3] * right[3, 0];
        m[0, 1] = left[0, 0] * right[0, 1] + left[0, 1] * right[1, 1] + left[0, 2] * right[2, 1] + left[0, 3] * right[3, 1];
        m[0, 2] = left[0, 0] * right[0, 2] + left[0, 1] * right[1, 2] + left[0, 2] * right[2, 2] + left[0, 3] * right[3, 2];
        m[0, 3] = left[0, 0] * right[0, 3] + left[0, 1] * right[1, 3] + left[0, 2] * right[2, 3] + left[0, 3] * right[3, 3];

        m[1, 0] = left[1, 0] * right[0, 0] + left[1, 1] * right[1, 0] + left[1, 2] * right[2, 0] + left[1, 3] * right[3, 0];
        m[1, 1] = left[1, 0] * right[0, 1] + left[1, 1] * right[1, 1] + left[1, 2] * right[2, 1] + left[1, 3] * right[3, 1];
        m[1, 2] = left[1, 0] * right[0, 2] + left[1, 1] * right[1, 2] + left[1, 2] * right[2, 2] + left[1, 3] * right[3, 2];
        m[1, 3] = left[1, 0] * right[0, 3] + left[1, 1] * right[1, 3] + left[1, 2] * right[2, 3] + left[1, 3] * right[3, 3];

        m[2, 0] = left[2, 0] * right[0, 0] + left[2, 1] * right[1, 0] + left[2, 2] * right[2, 0] + left[2, 3] * right[3, 0];
        m[2, 1] = left[2, 0] * right[0, 1] + left[2, 1] * right[1, 1] + left[2, 2] * right[2, 1] + left[2, 3] * right[3, 1];
        m[2, 2] = left[2, 0] * right[0, 2] + left[2, 1] * right[1, 2] + left[2, 2] * right[2, 2] + left[2, 3] * right[3, 2];
        m[2, 3] = left[2, 0] * right[0, 3] + left[2, 1] * right[1, 3] + left[2, 2] * right[2, 3] + left[2, 3] * right[3, 3];

        m[3, 0] = left[3, 0] * right[0, 0] + left[3, 1] * right[1, 0] + left[3, 2] * right[2, 0] + left[3, 3] * right[3, 0];
        m[3, 1] = left[3, 0] * right[0, 1] + left[3, 1] * right[1, 1] + left[3, 2] * right[2, 1] + left[3, 3] * right[3, 1];
        m[3, 2] = left[3, 0] * right[0, 2] + left[3, 1] * right[1, 2] + left[3, 2] * right[2, 2] + left[3, 3] * right[3, 2];
        m[3, 3] = left[3, 0] * right[0, 3] + left[3, 1] * right[1, 3] + left[3, 2] * right[2, 3] + left[3, 3] * right[3, 3];

        return new Transformation(m);
    }

You could check the transformation to see whether it involves rotations or is translation-only. If it’s translation-only, you don’t need a full 4x4 multiply; it’s just a few adds instead.
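That check could look like this (a hypothetical sketch in Java mirroring the C# above; it assumes row-major matrices with the translation in the last column, so verify against your own convention):

```java
public class FastCompose {
    // True if the upper-left 3x3 of a row-major 4x4 matrix is the identity
    // and the bottom row is [0 0 0 1], i.e. the matrix is a pure translation.
    public static boolean isTranslationOnly(double[][] m) {
        for (int i = 0; i < 4; i++)
            for (int j = 0; j < 4; j++) {
                if (j == 3 && i < 3) continue;  // translation column: anything goes
                if (m[i][j] != (i == j ? 1.0 : 0.0)) return false;
            }
        return true;
    }

    // Composes two matrices, taking the cheap path (three adds) when both
    // are pure translations; falls back to the full multiply otherwise.
    public static double[][] compose(double[][] l, double[][] r) {
        if (isTranslationOnly(l) && isTranslationOnly(r)) {
            return new double[][] { {1, 0, 0, l[0][3] + r[0][3]},
                                    {0, 1, 0, l[1][3] + r[1][3]},
                                    {0, 0, 1, l[2][3] + r[2][3]},
                                    {0, 0, 0, 1} };
        }
        double[][] m = new double[4][4];
        for (int i = 0; i < 4; i++)
            for (int j = 0; j < 4; j++)
                for (int k = 0; k < 4; k++)
                    m[i][j] += l[i][k] * r[k][j];
        return m;
    }
}
```

If you tag each node with a flag (translation-only / affine / general) when its local transform is set, the per-multiply check becomes a single branch instead of sixteen comparisons.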

You also seem to be hitting memory fairly hard with a lot of allocation of new objects here, so there’s certainly scope for optimizing a lot of that out.
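One concrete way to cut that allocation cost (again a hypothetical Java sketch of the idea, not your actual API): have callers pass in a preallocated destination instead of allocating a fresh `double[4, 4]` on every multiply.

```java
public class MatMul {
    // Multiplies 4x4 matrices into a caller-supplied destination, so a
    // tight loop over thousands of nodes performs zero allocations.
    // 'dest' must not alias 'left' or 'right'.
    public static void multiplyInto(double[][] left, double[][] right, double[][] dest) {
        for (int i = 0; i < 4; i++)
            for (int j = 0; j < 4; j++) {
                double s = 0;
                for (int k = 0; k < 4; k++)
                    s += left[i][k] * right[k][j];
                dest[i][j] = s;
            }
    }
}
```

In C# you could go further and make `Transformation` a struct over a fixed-size buffer, keeping the matrices off the garbage-collected heap entirely.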

You are right! We will try to improve it.

One option is to move from 4x4 doubles to 3x3 floats and 3x1 doubles (i.e. just use doubles for the translation part). There’s no point in storing the rotation component as doubles if the inputs (e.g. rotation angles) are single-precision or worse (e.g. user input via a mouse). Storing translation as doubles is often needed to preserve relative positions when adding a large offset.

You almost certainly don’t need the fourth row for a scene graph; it should always be [0 0 0 1].
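A sketch of that mixed-precision representation (hypothetical names; assumes column-vector convention, p' = R·p + t, so composing A∘B gives R = R_A·R_B and t = R_A·t_B + t_A, with the fourth row never stored or multiplied):

```java
public class AffineTransform {
    public final float[] r = new float[9];    // 3x3 rotation/scale, row-major, single precision
    public final double[] t = new double[3];  // translation kept in double precision
    // The implicit fourth row is always [0 0 0 1].

    public static AffineTransform identity() {
        AffineTransform a = new AffineTransform();
        a.r[0] = a.r[4] = a.r[8] = 1f;
        return a;
    }

    // this ∘ other: apply 'other' first, then 'this'.
    public AffineTransform compose(AffineTransform o) {
        AffineTransform m = new AffineTransform();
        for (int i = 0; i < 3; i++)
            for (int j = 0; j < 3; j++)
                for (int k = 0; k < 3; k++)
                    m.r[i * 3 + j] += r[i * 3 + k] * o.r[k * 3 + j];
        for (int i = 0; i < 3; i++) {
            m.t[i] = t[i];
            for (int k = 0; k < 3; k++)
                m.t[i] += r[i * 3 + k] * o.t[k];  // R_this * t_other + t_this
        }
        return m;
    }
}
```

This is 9 floats + 3 doubles (60 bytes) per node instead of 16 doubles (128 bytes), and the compose is 27 float multiplies plus 9 double ones instead of 64 double multiplies; the large offsets still accumulate in full double precision.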

Yes.

For a transformation hierarchy, you’d typically have an array of parent-relative transformations and another array of parent indices. You can then populate the array of global transformations with:

global_transformation[id] = global_transformation[parent[id]] * local_transformation[id];

You can perform this calculation concurrently for every node on a given level, but the levels need to be processed sequentially so that the parent’s global transformation has been computed before attempting to compute the child’s global transformation.
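If the flattened arrays additionally store every parent before its children (a common invariant when flattening a tree), the whole propagation is a single sequential loop, and the nodes of each level can also be farmed out to worker threads. A sketch over flat row-major 4x4 matrices (hypothetical layout and names):

```java
public class TransformPropagation {
    // Fills 'global' from 'local' and 'parent', assuming parent[id] < id
    // (parents stored before children) and parent[id] == -1 for roots.
    // Matrices are row-major 4x4, 16 doubles per node.
    public static void propagate(double[] local, double[] global, int[] parent) {
        for (int id = 0; id < parent.length; id++) {
            if (parent[id] < 0) {
                System.arraycopy(local, id * 16, global, id * 16, 16);
            } else {
                mul(global, parent[id] * 16, local, id * 16, global, id * 16);
            }
        }
    }

    static void mul(double[] l, int la, double[] r, int ra, double[] o, int oa) {
        double[] tmp = new double[16];  // guards against output aliasing an input
        for (int i = 0; i < 4; i++)
            for (int j = 0; j < 4; j++)
                for (int k = 0; k < 4; k++)
                    tmp[i * 4 + j] += l[la + i * 4 + k] * r[ra + k * 4 + j];
        System.arraycopy(tmp, 0, o, oa, 16);
    }
}
```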

Alternatively, you can perform the above calculation recursively for every leaf node, at the cost of redundant calculations of higher-level nodes (a non-leaf node’s global transformation is computed by every child). You may be able to optimise this by using atomics to check whether a transformation has already been computed, but I can’t speak to the actual performance; you’d need to measure it.

The arrays of transformations and parent IDs (and possibly other data) should be your primary data structure, not some graph of objects-containing-pointers-to-objects out of OOP 101. If you need such a graph for other purposes (e.g. a public API), the nodes should essentially be proxies for the actual data, obtaining it from the arrays via the node ID.