Matrix 4x4 multiplication (floats)

(Please see Matrix 4x4 addition/subtraction (floats) for the typedefs and definitions used.)

Matrix multiplication is done on a column x row basis. Given two input matrices m2, m3 we do the multiplication and store the result back to an output matrix m1. Hence the function prototype:

void Mat44MulTo(Mat44 m1, Mat44 m2, Mat44 m3);

The calculation of matrix m1 is done on a per column basis. Since we operate on 128-bit vectors, each vector can hold 4 32-bit floats, or one line of a 4x4 matrix. When working with columns, we need to calculate 4 128-bit vectors, but we only produce results for the first column. To calculate the full matrix product, we add the column results (again 4 128-bit vectors, but each time only the relevant column is non-zero) together.

void Mat44MulTo(Mat44 m1, Mat44 m2, Mat44 m3)
{
        vector float zero;
        vector float vA1, vA2, vA3, vA4, vB1, vB2, vB3, vB4;
        vector float vC1, vC2, vC3, vC4;
 
        // Load matrices and multiply the first row while we wait for the next row
        zero = (vector float) vec_splat_u32(0);
 
        LOAD_ALIGNED_MATRIX(m2, vA1, vA2, vA3, vA4);
        LOAD_ALIGNED_MATRIX(m3, vB1, vB2, vB3, vB4);
 
        // Calculate the first column of m1
        vC1 = vec_madd( vec_splat( vA1, 0 ), vB1, zero );
        vC2 = vec_madd( vec_splat( vA2, 0 ), vB1, zero );
        vC3 = vec_madd( vec_splat( vA3, 0 ), vB1, zero );
        vC4 = vec_madd( vec_splat( vA4, 0 ), vB1, zero );
 
        // By now we should have loaded both matrices and be done with the first row
        // Multiply vA x vB2, add to previous results, vC
        vC1 = vec_madd( vec_splat( vA1, 1 ), vB2, vC1 );
        vC2 = vec_madd( vec_splat( vA2, 1 ), vB2, vC2 );
        vC3 = vec_madd( vec_splat( vA3, 1 ), vB2, vC3 );
        vC4 = vec_madd( vec_splat( vA4, 1 ), vB2, vC4 );
 
        // Multiply vA x vB3, add to previous results, vC
        vC1 = vec_madd( vec_splat( vA1, 2 ), vB3, vC1 );
        vC2 = vec_madd( vec_splat( vA2, 2 ), vB3, vC2 );
        vC3 = vec_madd( vec_splat( vA3, 2 ), vB3, vC3 );
        vC4 = vec_madd( vec_splat( vA4, 2 ), vB3, vC4 );
 
        // Multiply vA x vB3, add to previous results, vC
        vC1 = vec_madd( vec_splat( vA1, 3 ), vB4, vC1 );
        vC2 = vec_madd( vec_splat( vA2, 3 ), vB4, vC2 );
        vC3 = vec_madd( vec_splat( vA3, 3 ), vB4, vC3 );
        vC4 = vec_madd( vec_splat( vA4, 3 ), vB4, vC4 );
 
        // Store back the result
        STORE_ALIGNED_MATRIX(m1, vC1, vC2, vC3, vC4);
}