Matrix 4x4 Identity matrix

(Please see Matrix 4x4 addition/subtraction (floats) for the typedefs and definitions used.)

The nice thing about the identity matrix is that we never have to read the destination matrix, because the form of the identity matrix is already known:

    | 1 0 0 0 |
I = | 0 1 0 0 |
    | 0 0 1 0 |
    | 0 0 0 1 |

We can simply play a bit with vec_splat_u32 and the vector shift operations to create 4 vectors, which we then only write to the destination matrix. No read access to memory is needed!

First we splat 1.0 into a vector float (actually, we splat the 32-bit integer 1 into a vector unsigned int with vec_splat_u32() and use vec_ctf() to convert it to a vector float). So we have:

| 1.0 | 1.0 | 1.0 | 1.0 |

Then we need to shift that right by 12 bytes to get the fourth row of the matrix:

| 0.0 | 0.0 | 0.0 | 1.0 |

We achieve that with the vec_sro instruction (shift right by octet), but it takes its shift count in another vector. So could we just splat 12 into a vector and be done? Well, not quite. vec_sro expects the byte count shifted left by 3 bits. So, in essence, we do:

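vector unsigned int sh;
sh = vec_splat_u32(12);
// Per-element shift: 12 << 3 = 96. (vec_sll would require the same
// count in every byte of its shift operand, which a splat of 3 is not.)
sh = vec_sl(sh, vec_splat_u32(3));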
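
To see why the count has to be pre-shifted, here is a small scalar C sketch that does not use AltiVec at all. It assumes that vec_sro reads its byte count from bits 121:124 of the shift vector, which are bits 3 to 6 of the last byte when counting from the least significant bit. Both helper names are hypothetical and exist only for illustration.

```c
#include <stdint.h>

/* Hypothetical helper: encode a byte count (0..15) the way vec_sro expects
   to find it in the last byte of the shift vector, i.e. shifted left by 3. */
static uint8_t encode_octet_shift(unsigned nbytes)
{
    return (uint8_t)((nbytes & 0xFu) << 3);
}

/* Hypothetical helper: recover the byte count that would be read back
   from that last byte (assumption: bits 3..6, LSB = bit 0). */
static unsigned decode_octet_shift(uint8_t last_byte)
{
    return (last_byte >> 3) & 0xFu;
}
```

Under that assumption, encode_octet_shift(12) gives 96, while an unshifted 12 would decode as a shift of only 1 byte, which is why a plain splat of 12 is not enough.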

And finally, we can produce the third, second, and first rows in sequence with the vec_sld instruction, shifting left by 4 bytes at a time. Here is the final code:

void Mat44Identity(Mat44 m)
{
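        vector float vi1, vi2, vi3, vi4;
        vector unsigned int sh;
 
        // vi4 = | 1.0 | 1.0 | 1.0 | 1.0 |
        vi4 = vec_ctf(vec_splat_u32(1), 0);
        // Shift count for vec_sro: 12 bytes, encoded as 12 << 3
        sh = vec_splat_u32(12);
        sh = vec_sl(sh, vec_splat_u32(3));
        // vi4 = | 0.0 | 0.0 | 0.0 | 1.0 |
        vi4 = vec_sro(vi4, (vector unsigned char)sh);
        // Rotate left by 4 bytes to get each preceding row
        vi3 = vec_sld(vi4, vi4, 4);
        vi2 = vec_sld(vi3, vi3, 4);
        vi1 = vec_sld(vi2, vi2, 4);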
 
        // Store back the identity matrix
        STORE_ALIGNED_MATRIX(m, vi1, vi2, vi3, vi4);
}
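
For readers without AltiVec hardware, the sequence above can be traced with a portable byte-level model. This is only a sketch: Vec128, model_sro, model_sld and model_mat44_identity are hypothetical names, the model treats byte 0 as the leftmost byte of the register, and it only shifts by multiples of 4 bytes so the host's float byte order does not matter.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a 128-bit vector register. */
typedef struct { uint8_t b[16]; } Vec128;

/* Models a shift right by n bytes (0..15), filling with zeros on the left. */
static Vec128 model_sro(Vec128 v, unsigned n)
{
    Vec128 r;
    memset(r.b, 0, sizeof r.b);
    if (n < 16)
        memcpy(r.b + n, v.b, 16 - n);
    return r;
}

/* Models vec_sld(a, b, n): the 16 bytes starting at offset n (0..15)
   of the 32-byte concatenation a || b. */
static Vec128 model_sld(Vec128 a, Vec128 b, unsigned n)
{
    uint8_t cat[32];
    Vec128 r;
    memcpy(cat, a.b, 16);
    memcpy(cat + 16, b.b, 16);
    memcpy(r.b, cat + n, 16);
    return r;
}

/* Builds a row-major 4x4 identity with the same steps as the listing above. */
static void model_mat44_identity(float m[16])
{
    const float ones[4] = { 1.0f, 1.0f, 1.0f, 1.0f };
    Vec128 i1, i2, i3, i4;

    memcpy(i4.b, ones, 16);        /* | 1.0 | 1.0 | 1.0 | 1.0 | */
    i4 = model_sro(i4, 12);        /* | 0.0 | 0.0 | 0.0 | 1.0 | */
    i3 = model_sld(i4, i4, 4);     /* | 0.0 | 0.0 | 1.0 | 0.0 | */
    i2 = model_sld(i3, i3, 4);     /* | 0.0 | 1.0 | 0.0 | 0.0 | */
    i1 = model_sld(i2, i2, 4);     /* | 1.0 | 0.0 | 0.0 | 0.0 | */

    memcpy(m + 0,  i1.b, 16);
    memcpy(m + 4,  i2.b, 16);
    memcpy(m + 8,  i3.b, 16);
    memcpy(m + 12, i4.b, 16);
}
```

Passing the same register as both operands of model_sld turns the shift into a rotate, which is what moves the single 1.0 one column to the left on each step.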

You might ask: why bother? One could just memset the matrix to zero and then set the diagonal elements to 1.0. Well, see for yourself in the following benchmarks:

And here is the explanation: calling memset() incurs the overhead of a function call, hence the very slow speed. But even if we unroll the loop and just set every element directly, the performance is still ~50% of the AltiVec performance, because the FPU has half the bandwidth of the AltiVec unit (64-bit instead of 128-bit).
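
For reference, the scalar alternative discussed above might look like the following sketch. The function name is hypothetical, and a plain float[4][4] stands in for the Mat44 typedef, which is defined on the addition/subtraction page and may differ.

```c
#include <string.h>

/* Hypothetical scalar baseline: clear the matrix, then set the diagonal.
   Assumes IEEE 754 floats, where all-zero bytes represent 0.0f. */
static void mat44_identity_scalar(float m[4][4])
{
    int i;

    memset(m, 0, sizeof(float) * 16);
    for (i = 0; i < 4; ++i)
        m[i][i] = 1.0f;
}
```

Unlike the AltiVec version, this one touches the matrix twice: once in memset() and once more for the diagonal.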