Matrix 4x4 addition/subtraction (floats)

Let's assume we want to add or subtract two 4x4 32-bit float matrices. The first step is to load the arrays. We will assume the arrays are 16-byte aligned (most SIMD engines require this), which also gives a nice performance boost. Let's assume we have the following typedef:

// This is to save us the typing
#define ALIGNED16 __attribute__((aligned(16)))
 
typedef float Mat44[4][4] ALIGNED16;

Now Mat44 is a float[4][4] type aligned on a 16-byte boundary. Let's assume we have two functions:

void Mat44Add(Mat44 m1, const Mat44 m2);
void Mat44AddTo(Mat44 m1, const Mat44 m2, const Mat44 m3);

where Mat44Add performs in-place addition, computing m1 + m2 and storing the result back into m1, while Mat44AddTo computes m2 + m3 and stores the result in m1, leaving m2 and m3 untouched. We will show the Mat44AddTo version; Mat44Add is trivial to derive from it.

The first step is to load the matrices. Let's set up a macro first:

#define LOAD_ALIGNED_MATRIX(m, vm1, vm2, vm3, vm4)  \
do {                                                \
        vm1 = vec_ld(0,  (float *)(m));             \
        vm2 = vec_ld(16, (float *)(m));             \
        vm3 = vec_ld(32, (float *)(m));             \
        vm4 = vec_ld(48, (float *)(m));             \
} while (0)

Now, we can load both matrices with the following short code:

        // Load both float matrices
        vector float vm1_1, vm1_2, vm1_3, vm1_4,
                     vm2_1, vm2_2, vm2_3, vm2_4,
                     vm3_1, vm3_2, vm3_3, vm3_4;
 
        LOAD_ALIGNED_MATRIX(m2, vm2_1, vm2_2, vm2_3, vm2_4);
        LOAD_ALIGNED_MATRIX(m3, vm3_1, vm3_2, vm3_3, vm3_4);

The next step is to do the addition:

        // Do the addition
        vm1_1 = vec_add(vm2_1, vm3_1);
        vm1_2 = vec_add(vm2_2, vm3_2);
        vm1_3 = vec_add(vm2_3, vm3_3);
        vm1_4 = vec_add(vm2_4, vm3_4);

For subtraction, the code is similar:

        // Do the subtraction
        vm1_1 = vec_sub(vm2_1, vm3_1);
        vm1_2 = vec_sub(vm2_2, vm3_2);
        vm1_3 = vec_sub(vm2_3, vm3_3);
        vm1_4 = vec_sub(vm2_4, vm3_4);

and finally we write the result back to the m1 matrix, which again we assume to be 16-byte aligned. We define another C macro:

#define STORE_ALIGNED_MATRIX(m, vm1, vm2, vm3, vm4)  \
do {                                                 \
        vec_st(vm1,  0, (float *)(m));               \
        vec_st(vm2, 16, (float *)(m));               \
        vec_st(vm3, 32, (float *)(m));               \
        vec_st(vm4, 48, (float *)(m));               \
} while (0)

so we can write:

        // Store back the result
        STORE_ALIGNED_MATRIX(m1, vm1_1, vm1_2, vm1_3, vm1_4);

So the function finally becomes:

void Mat44AddTo(Mat44 m1, const Mat44 m2, const Mat44 m3)
{
        // Load both float matrices
        vector float vm1_1, vm1_2, vm1_3, vm1_4,
                     vm2_1, vm2_2, vm2_3, vm2_4,
                     vm3_1, vm3_2, vm3_3, vm3_4;
 
        LOAD_ALIGNED_MATRIX(m2, vm2_1, vm2_2, vm2_3, vm2_4);
        LOAD_ALIGNED_MATRIX(m3, vm3_1, vm3_2, vm3_3, vm3_4);
 
        // Do the addition
        vm1_1 = vec_add(vm2_1, vm3_1);
        vm1_2 = vec_add(vm2_2, vm3_2);
        vm1_3 = vec_add(vm2_3, vm3_3);
        vm1_4 = vec_add(vm2_4, vm3_4);
 
        // Store back the result
        STORE_ALIGNED_MATRIX(m1, vm1_1, vm1_2, vm1_3, vm1_4);
}

and for subtraction:

void Mat44SubTo(Mat44 m1, const Mat44 m2, const Mat44 m3)
{
        // Load both float matrices
        vector float vm1_1, vm1_2, vm1_3, vm1_4,
                     vm2_1, vm2_2, vm2_3, vm2_4,
                     vm3_1, vm3_2, vm3_3, vm3_4;
 
        LOAD_ALIGNED_MATRIX(m2, vm2_1, vm2_2, vm2_3, vm2_4);
        LOAD_ALIGNED_MATRIX(m3, vm3_1, vm3_2, vm3_3, vm3_4);
 
        // Do the subtraction
        vm1_1 = vec_sub(vm2_1, vm3_1);
        vm1_2 = vec_sub(vm2_2, vm3_2);
        vm1_3 = vec_sub(vm2_3, vm3_3);
        vm1_4 = vec_sub(vm2_4, vm3_4);
 
        // Store back the result
        STORE_ALIGNED_MATRIX(m1, vm1_1, vm1_2, vm1_3, vm1_4);
}

The bar chart shows the performance gain of the AltiVec routines over the scalar ones.