Matrix 4x4 Transpose (floats)

(Please see Matrix 4x4 addition/subtraction (floats) for the typedefs and definitions used.)

For the theory behind matrix transposition, please see here.

So, the 4x4 transpose would be:

| A1 | A2 | A3 | A4 |T    | A1 | B1 | C1 | D1 |
| B1 | B2 | B3 | B4 |     | A2 | B2 | C2 | D2 |
| C1 | C2 | C3 | C4 |  =  | A3 | B3 | C3 | D3 |
| D1 | D2 | D3 | C4 |     | A4 | B4 | C4 | D4 |

Basically we want to shuffle the elements in each vector in a particular way to end up with the transposed matrix. We could do it with a series of vec_permute instructions, but AltiVec offers us an easier and faster way: vec_mergeh/vec_mergel. Basically, what they do is this:

| A1 | A2 | A3 | A4 |
| B1 | B2 | B3 | B4 |
        |            \
   vec_mergeh      vec_mergel
        |              \
        v               \
| A1 | B1 | A2 | B2 |    | A3 | B3 | A4 | B4 |

So in effect we do two steps of mergeh/mergel operations:

vec_mergeh(Line 1, Line3) -> | A1 | C1 | A2 | C2 |
vec_mergel(Line 1, Line3) -> | A3 | C3 | A4 | C4 |
vec_mergeh(Line 2, Line4) -> | B1 | D1 | B2 | D2 |
vec_mergel(Line 2, Line4) -> | B3 | D3 | B4 | D4 |

2nd step:

vec_mergeh(Line 1, Line3) -> | A1 | B1 | C1 | D1 |
vec_mergel(Line 1, Line3) -> | A2 | B2 | C2 | D2 |
vec_mergeh(Line 2, Line4) -> | A3 | B3 | C3 | D3 |
vec_mergel(Line 2, Line4) -> | A4 | B4 | C4 | D4 |

And here is the final code:

void Mat44Transp(Mat44 m)
{
        vector float vm_1, vm_2, vm_3, vm_4,
                     vr_1, vr_2, vr_3, vr_4;
        // Load matrix
        LOAD_ALIGNED_MATRIX(m, vm_1, vm_2, vm_3, vm_4);
 
        // Do the transpose, first set of moves
        vr_1 = vec_mergeh(vm_1, vm_3);
        vr_2 = vec_mergel(vm_1, vm_3);
        vr_3 = vec_mergeh(vm_2, vm_4);
        vr_4 = vec_mergel(vm_2, vm_4);
        // Get the resulting vectors
        vm_1 = vec_mergeh(vr_1, vr_3);
        vm_2 = vec_mergel(vr_1, vr_3);
        vm_3 = vec_mergeh(vr_2, vr_4);
        vm_4 = vec_mergel(vr_2, vr_4);
 
        // Store back the result
        STORE_ALIGNED_MATRIX(m1, vm_1, vm_2, vm_3, vm_4);
}