Inverse of Matrix 4x4 using partitioning in Altivec

(Please see for the typedefs and definitions used.)

We tackle the 4x4 matrix inversion using the matrix partitioning method, as described in the "Numerical Recipes in C" book (2nd ed., though I guess it will be similar in the 3rd edition). Using the AltiVec SIMD unit, we achieve almost 300% increase in performance, making the routine the fastest -at least known to us, matrix inversion method!

The function prototype will be:

void Mat44Inverse(Mat44 minv, Mat44 mat);

Math intro

It is proven (it's not in the scope of this article nor this site, to provide mathematical proofs for well-known subjects, check a Linear Algebra book), that one of the methods to get the inverse matrix of $\textbf{M}$ is matrix partitioning.
Let's begin by describing the math behind this matrix inversion method, first. Let's assume a matrix $\textbf{M}$:

\textbf{M} =
a_1 & a_2 & a_3 & a_4\\
b_1 & b_2 & b_3 & b_4\\
c_1 & c_2 & c_3 & c_4\\
d_1 & d_2 & d_3 & d_4

We partition the 4x4 matrix into 4 2x2 matrices, thus:

\textbf{M} =
\textbf{P} & \textbf{Q}\\
\textbf{R} & \textbf{S}\\


\textbf{P} =
a_1 & a_2\\
b_1 & b_2
\textbf{Q} =
a_3 & a_4\\
b_3 & b_4
\textbf{R} =
c_1 & c_2\\
d_1 & d_2
\textbf{S} =
c_3 & c_4\\
d_3 & d_4

The inverse matrix $\textbf{M}^{-1}$ can be calculated with the following formulas:

\textbf{M}^{-1} =
\tilde{\textbf{P}} & \tilde{\textbf{Q}}\\
\tilde{\textbf{R}} & \tilde{\textbf{S}}


\tilde{\textbf{P}} & = \textbf{P}^{-1} - (\textbf{P}^{-1} \cdot \textbf{Q}) \cdot \tilde{\textbf{R}}\\
\tilde{\textbf{Q}} & = -(\textbf{P}^{-1} \cdot \textbf{Q}) \cdot \tilde{\textbf{S}}\\
\tilde{\textbf{R}} & = -\tilde{\textbf{S}} \cdot (\textbf{R} \cdot \textbf{P}^{-1})\\
\tilde{\textbf{S}} & = (\textbf{S} - \textbf{R} \cdot \textbf{P}^{-1} \cdot \textbf{Q})^{-1}\\

In this rationale, the determinant of $\textbf{M}$ can be calculated with the following formula:

|\textbf{M}| = |\textbf{P}| \cdot |\textbf{S} - \textbf{R} \cdot \textbf{P}^{-1} \cdot \textbf{Q}|

One might hastily deduct that if $|\textbf{P}| = 0$ then the $|\textbf{M}| = 0$ and that the matrix has no inverse. However this matrix partitioning method doesn't catter for special cases, like this one:

\textbf{M} =
1 & 1 & 0 & 0 \\
1 & 1 & 1 & 0 \\
0 & 1 & 1 & 0 \\
0 & 0 & 0 & 1

for which $|\textbf{O}| = 0$ but $|\textbf{M}| = -1$, and $\textbf{M}$ has an inverse. Well, in that particular case, we have to call a generic -but slower- function. But for the majority of matrices used eg. in OpenGL/3D do not fall in this category and matrix inversion will benefit greatly from our method (benchmarks at the bottom of this page).


So, the workflow is this:

  • Calculate $|\textbf{P}|$ ($\textbf{P}$'s determinant). If that is equal to zero, use another more generic method.

  • Since $|\textbf{P}| \neq 0$ get its inverse $\textbf{P}^{-1}$. (Actually get its adjoint first, and then divide this with $|\textbf{P}|$.). This is a 2x2 matrix so getting its inverse will be trivial.

  • Calculate in sequence the matrix products $\textbf{R} \cdot \textbf{P}^{-1}$, $\textbf{P}^{-1} \cdot \textbf{Q}$ and $\textbf{R} \cdot \textbf{P}^{-1} \cdot \textbf{Q}$.

  • Subtract the result from $\textbf{S}$. We now have the quantity $(\textbf{S} - \textbf{R} \cdot \textbf{P}^{-1} \cdot \textbf{Q})$.

  • Get the determinant of this quantity. If it is zero, then $|\textbf{M}| = 0$ and the matrix definitely does not have an inverse (just return NULL in that case).

  • Since it is non-zero, we can calculate its inverse (using the same method as used in calculating $\textbf{P}^{-1}$). Negate the result and we have calculated $\tilde{\textbf{S}}$.

  • Now that we have all the needed quantities, calculate in this sequence the matrices $\tilde{\textbf{R}}$, $\tilde{\textbf{Q}}$ and $\tilde{\textbf{P}}$.

  • Final stage, merge these 2x2 matrices into the final 4x4 $\textbf{M}^{-1}$.

Calculate $|\textbf{P}|$

Ok, just as in the rest of the routines, first thing is to load the matrix Mat44 mat into vector registers:

        vector float vm_1, vm_2, vm_3, vm_4, vp_1, vp_2, vq_1, vq_2,
                     vpq_1, vpq_2, vrp_1, vrp_2,
                     vrpq_1, vrpq_2, vsrpq_1, vsrpq_2, vt_1, vt_2, vt_3, vt_4,
                     vdet, vdetSRPQ, invdet, v0f, v05;
        vector unsigned int v1_, v0, v0_, u_znzn, u_nznz;
        // Load matrix
        LOAD_MATRIX(mat, vm_1, vm_2, vm_3, vm_4);

Let's define and set a few constants, these will be used eventually:

// Useful constants, esp the v0_ is used to fix the +/- sign to some float vectors
        v0 = vec_splat_u32(0);
        v1_ = vec_splat_u32(-1);
        v0_ = vec_sl(v1_, v1_);
        v0f = (vector float)v0;
        u_nznz = vec_mergeh(v0_, v0);
        u_znzn = vec_mergeh(v0, v0_);

(Ignore these for now).

Vectors vm_1 and vm_2 carry $\textbf{P}$, let's show how:

vm_1 = | a_1 | a_2 | . | . |
vm_2 = | b_1 | b_2 | . | . |

It's convenient to merge this into one vector, and AltiVec has just the instruction for it: vec_mergeh.

        // Calculate determinant of submatrix P
        // | a1 a2 .  . | -> mergeh -> | b1 a1 b2 a2 |
        // | b1 b2 .  . |
        vt_1 = vec_mergeh(vm_2, vm_1);

Now do a bit of shifting and multiply the resulting vector with vt_1:

        // shift left by 32-bits -> | a1 b2 a2 b1 |
        vt_2 = vec_sld(vt_1, vm_2, 4);
        // multiply add vt_1 x vt_2 -> | a1b1 | a1b2 | b2a2 | a2b1 |
        vt_3 = vec_madd(vt_1, vt_2, v0f);

Again, shift vt_3 by 8 bytes and subtract from itself:

        // shift left vt_3 by 64-bits - > | b2a2 | a2b1 | 0 | 0 |
        vt_4 = vec_sld(vt_3, vt_3, 8);
        // subtract vt_3 from vt_4 -> | .. | a1b2-a2b1 | .. | .. |
        vdet = vec_sub(vt_3, vt_4);

Now the 2nd float (index 1) in the vector contains $|\textbf{P}|$, splat it to the vector:

        // Splat det
        vdet = vec_splat(vdet, 1);

Testing equality with zero is easy on AltiVec:

if (vec_all_eq(vdet, v0f)) return NULL;

(Actually, we have to call a generic Mat44InverseGeneric() function in this case)

Calculate $\textbf{P}^{-1}$

Next step is to get $\textbf{P}^{-1}$ or rather his adjoint and then divide this with the determinant. The adjoint for the 2x2 $\textbf{P}$ is defined as:

\textbf{P}^{T} =
b_2 & -a2\\
-b_1 & a1

So we need two vectors with the following values:

vp_1 = |  b_1 | -a_2 | . | . |
vp_2 = | -b_1 |  a_2 | . | . |

To get the adjoint, one obvious way to do it, would be using AltiVec's vec_perm(), but there is an easier way and it saves us from loading the permute masks from memory, vec_mergeh(), which we already used. So, assuming vp_1, vp_2, will be the vectors holding the adjoint/inverse $\textbf{P}^{-1}$ matrix, we get vp_2 first and then right shift if to get vp_1.

        // vp_2 -> vec_mergeh(vm_2, vm_1) -> | b1 a1 b2 a2 |
        // vp_1 -> vp_2 << 64 = | b2 a2 0 0 |
        vp_2 = vec_mergeh(vm_2, vm_1);
        vp_1 = vec_sld(vp_2, vp_2, 8);
        // Fix the signs in vp_1:
        // vp_1 -> |  b2 -a2 .  .|
        // vp_2 -> | -b1  a1 .  .|
        vp_1 = (vector float) vec_xor((vector unsigned int) vp_1, u_znzn);
        vp_2 = (vector float) vec_xor((vector unsigned int) vp_2, u_nznz);

Now that we have $\textbf{P}^{T}$, we can simply multiply each element with $1/|\textbf{P}|$ to get $\textbf{P}^{-1}$:

        // Get 1/detP
        invdet = vec_re(vdet);
        vp_1 = vec_madd(vp_1, invdet, v0f);
        vp_2 = vec_madd(vp_2, invdet, v0f);

Note: vec_re(x) intrinsic produces a reciprocal approximation of $1/x$, with an accuracy of $1/4096$. While this is adequate in most cases, higher accuracy involves a subsequent Newton-Raphson approximation after the reciprocal approximation. The version found in also does not do further approximation (only one call to vec_re() though).

Calculate the products $\textbf{R} \cdot \textbf{P}^{-1}$, $\textbf{P}^{-1} \cdot \textbf{Q}$ and $\textbf{R} \cdot \textbf{P}^{-1} \cdot \textbf{Q}$

Now that we know $\textbf{P}^{-1}$, it's easy to do the calculations. However, we should make a note about the multiplication. The product matrices are 2x2, but we have 2x4 matrices (2 SIMD vectors actually). Will that affect us? Well, in truth no, we'll just ignore the rest 2x2 submatrix in the vector without it affecting the result. But let us see this in practice:

In we've seen how the matrix multiplication works in AltiVec. Here we will use the same technique, but only for 2 vectors at a time (remember it's a 2x2 matrix).

Let's start with the $\textbf{R} \cdot \textbf{P}^{-1}$ product. $\textbf{R}$ exists in the leftmost part of the vectors vm_3, vm_4:

vm_3 = | c_1 | c_2 | . | . |
vm_4 = | d_1 | d_2 | . | . |

Do the multiplication:

        // The product is done like for a 4x4 matrix but only for the first 2 lines
        // and the first 2 columns
        // Calculate the first column of p1
        vt_1 = vec_madd( vec_splat( vm_3, 0 ), vp_1, v0f );
        vt_2 = vec_madd( vec_splat( vm_4, 0 ), vp_1, v0f );
        // Multiply the 2nd column of p1, add to previous result
        vrp_1 = vec_madd( vec_splat( vm_3, 1 ), vp_2, vt_1 );
        vrp_2 = vec_madd( vec_splat( vm_4, 1 ), vp_2, vt_2 );

Let's use this result to calculate $\textbf{R} \cdot \textbf{P}^{-1} \cdot \textbf{Q}$:

        // Calculate R*P^(-1)*Q, just multiply previous result with Q
        // using same method
        vrpq_1 = vec_madd( vec_splat( vrp_1, 0 ), vm_1, v0f );
        vrpq_2 = vec_madd( vec_splat( vrp_2, 0 ), vm_1, v0f );
        // Multiply the 2nd column of p1, add to previous result
        vrpq_1 = vec_madd( vec_splat( vrp_1, 1 ), vm_2, vrpq_1 );
        vrpq_2 = vec_madd( vec_splat( vrp_2, 1 ), vm_2, vrpq_2 );

Finally, let's calculate $\textbf{P}^{-1} \cdot \textbf{Q}$. In this case however, we have to left shift the contents of vm_1 and vm_2, since $\textbf{Q}$ takes the right 2x2 part of the vectors.

        // Calculate P^(-1)*Q also
        vq_1 = vec_sld(vm_1, v0f, 8);
        vq_2 = vec_sld(vm_2, v0f, 8);

Now, vq_1, vq_2 have the following values:

vq_1 = | a_3 | a_4 | . | . |
vq_2 = | b_3 | b_4 | . | . |

Do the calculation:

        // Calculate P^(-1)*Q also
        vpq_1 = vec_madd( vec_splat( vp_1, 0 ), vq_1, v0f );
        vpq_2 = vec_madd( vec_splat( vp_2, 0 ), vq_1, v0f );
        // Multiply the 2nd column of p1, add to previous result
        vpq_1 = vec_madd( vec_splat( vp_1, 1 ), vq_2, vpq_1 );
        vpq_2 = vec_madd( vec_splat( vp_2, 1 ), vq_2, vpq_2 );

All the products have been calculated, let's proceed.

Calculate $(\textbf{S} - \textbf{R} \cdot \textbf{P}^{-1} \cdot \textbf{Q})$

That part is easy, just subtract the result from $\textbf{S}$:

        // S - R*P^(-1)*Q
        // Finally, just subtract the previous result from S
        vsrpq_1 = vec_sub(vm_3, vrpq_1);
        vsrpq_2 = vec_sub(vm_4, vrpq_2);

Get the determinant $|\textbf{S} - \textbf{R} \cdot \textbf{P}^{-1} \cdot \textbf{Q}|$

We need to rearrange the elements into a single vector to make the calculation easier. Right now, the $(\textbf{S} - \textbf{R} \cdot \textbf{P}^{-1} \cdot \textbf{Q})$ matrix is in the right-most 2x2 part of vsrpq_1 and vsrpq_2. We need to get both of these in a single line:

vsrpq_1 = | . | . | s_1 | s_2 |
vsrpq_2 = | . | . | s_3 | s_4 |

First shift left vsrpq_2 by 64-bits (2 32-bit floats) and then shift left vsqpq_1 again by 64-bits and combine with vt_2 (vec_sld() allows to shift and combine).

        vt_2 = vec_sld(vsrpq_2, v0f, 8);
        vt_1 = vec_sld(vsrpq_1, vt_2, 8);

Now vt_1 holds the data in the following form:

vt_1 = | s_1 | s_2 | s_3 | s_4 |

To save on some instructions we'll use vec_nmsub(). Let's describe the procedure in detail:

        // Get the determinant of (S - R*P^(-1)*Q) and multiply with detSRPQ
        vt_3 = vec_sld(vt_1, vt_1, 4);

vt_3 is rotated by 32-bits:

vt_3 = | s_2 | s_3 | s_4 | s_1 |

We multiply vt_1 and -vt_3 (nmsub negates before multiplying), so vt_4

        vt_4 = vec_nmsub(vt_1, vt_3, v0f);

So vt_4 holds the following values:

vt_4 = | s_1s_2 | s_2s_3 | s_3s_4 | s_4s_1 |

It should become obvious what we're trying to achieve, but we'll say it anyway. :)
We'll rotate left vt_4 by 64-bits and then subtract from itself:

        vt_3 = vec_sld(vt_4, vt_4, 8);
        vdetSRPQ = vec_sub(vt_3, vt_4);

Now vdetSRPQ holds the following value in the last element:

vt_4 = | . | . | . | s_4s_1 - s2s3 |

Which is in fact $|\textbf{S} - \textbf{R} \cdot \textbf{P}^{-1} \cdot \textbf{Q}|$. Last step, splat this value on the whole vector and multiply with $|\textbf{P}|$, calculated previously:

        vdetSRPQ = vec_splat(vdetSRPQ, 3);
        // Multiply with detP and get the full determinant
        vdet = vec_madd(vdet, vdetSRPQ, v0f);
        // If determinant is equal to 0.0 then we have no inverse and return NULL
        if (vec_all_eq(vdet, v0f)) return NULL;

Notice that if vdet holds a zero vector then $|\textbf{M}| = 0$ and the matrix has no inverse.

Calculate the inverse $(\textbf{S} - \textbf{R} \cdot \textbf{P}^{-1} \cdot \textbf{Q})^{-1}$.

To get the inverse we follow the same procedure as with the $\textbf{P}^{-1}$, so we don't have to do any special explanation:

        // determinant != 0.0 so inverse exists, calculate it
        // Get the inverse of (S - R*P^(-1)*Q), repeat the procedure
        // Get 1/detSRPQ
        invdet = vec_re(vdetSRPQ);
        vt_2 = vec_mergeh(vt_2, vt_1);
        vt_1 = vec_sld(vt_2, vt_2, 8);
        vector float vP_1, vP_2, vQ_1, vQ_2, vR_1, vR_2, vS_1, vS_2;
        // Fix the signs in vS_1:
        vS_1 = (vector float) vec_xor((vector unsigned int) vt_1, u_znzn);
        vS_2 = (vector float) vec_xor((vector unsigned int) vt_2, u_nznz);
        // We now have the adjoint of P
        // Multiply each element with inverse detP to get (S - R*P^(-1)*Q)^(-1)
        vS_1 = vec_madd(vS_1, invdet, v0f);
        vS_2 = vec_madd(vS_2, invdet, v0f);

Now we have calculated $\tilde{\textbf{S}}$!

Calculate the matrices $\tilde{\textbf{R}}$, $\tilde{\textbf{Q}}$ and $\tilde{\textbf{P}}$.

First, we calculate $\tilde{\textbf{R}}$. The method is the same, but since we have to multiply with $-\tilde{\textbf{S}}$, to save one instruction, instead we use vec_nmsub().

        // R' = -S'*(R*P^(-1))
        // Just multiply previous result with vrp
        // using same method
        // Since we have to negate the result, we use instead vec_nmsub()
        vR_1 = vec_nmsub( vec_splat( vS_1, 0 ), vrp_1, v0f );
        vR_2 = vec_nmsub( vec_splat( vS_2, 0 ), vrp_1, v0f );
        // Multiply the 2nd column of p1, add to previous result (using nmsub()
        vR_1 = vec_nmsub( vec_splat( vS_1, 1 ), vrp_2, vR_1 );
        vR_2 = vec_nmsub( vec_splat( vS_2, 1 ), vrp_2, vR_2 );
        // Checkpoint #2: We got R'

Same for $\tilde{\textbf{Q}}$:

        // Q' = -(P^(-1)*Q)*S'
        vQ_1 = vec_nmsub( vec_splat( vpq_1, 0 ), vS_1, v0f );
        vQ_2 = vec_nmsub( vec_splat( vpq_2, 0 ), vS_1, v0f );
        // Multiply the 2nd column of p1, add to previous result
        vQ_1 = vec_nmsub( vec_splat( vpq_1, 1 ), vS_2, vQ_1 );
        vQ_2 = vec_nmsub( vec_splat( vpq_2, 1 ), vS_2, vQ_2 );
        // Checkpoint #3: We got Q'

For $\tilde{\textbf{P}}$, we use plain vec_madd() and after the multiplication we subtract that from $\textbf{P}^{-1}$:

        // P' = P^(-1) - (P^(-1)*Q)*R'
        vP_1 = vec_madd( vec_splat( vpq_1, 0 ), vR_1, v0f );
        vP_2 = vec_madd( vec_splat( vpq_2, 0 ), vR_1, v0f );
        // Multiply the 2nd column of p1, add to previous result
        vP_1 = vec_madd( vec_splat( vpq_1, 1 ), vR_2, vP_1 );
        vP_2 = vec_madd( vec_splat( vpq_2, 1 ), vR_2, vP_2 );
        vP_1 = vec_sub(vp_1, vP_1);
        vP_2 = vec_sub(vp_2, vP_2);
        // Checkpoint #4: We got P'

Merge the matrices into $\textbf{M}^{-1}$.

All the $\tilde{\textbf{P}}$, $\tilde{\textbf{Q}}$, $\tilde{\textbf{R}}$, $\tilde{\textbf{S}}$ matrices occupy the left-most 2x2 part in the 4x4 vectors. We have to do a bit more work to merge these back into our final inverse matrix. Unfortunately there is no builtin command that can do this, so we have to use vec_perm() with a loaded permute mask (in the final version, we load this at the beginning of the function to increase efficiency, all loads should happen as early as possible).

        // Finished, now let's put these back together to one matrix (4 vectors)
        // all P', Q', R', S' matrices have the needed block in the left 2x2 part
        // in the vP_*, vQ_*, vR_*, vS_* vectors. We need to shift right by 64 bits
        // the vQ_* and vS_* vectors and merge left 2x2 block of vP_* with the right
        // 2x2 block of vQ_*. Same with vR_* and vS_*. And then we have our inverse
        // matrix
        vector char perm = { 0, 1, 2, 3, 4, 5, 6, 7, 16, 17, 18, 19, 20, 21, 22, 23 };
        vm_1 = vec_perm(vP_1, vQ_1, perm);
        vm_2 = vec_perm(vP_2, vQ_2, perm);
        vm_3 = vec_perm(vR_1, vS_1, perm);
        vm_4 = vec_perm(vR_2, vS_2, perm);
        // Store the resulting inverse matrix
        STORE_MATRIX(minv, vm_1, vm_2, vm_3, vm_4);

Performance Results

Here are the results on doing 10000000 matrix inversions on the same matrix (with random data):

File matrix_inverse_altivec.c8.68 KB