Matrix 4x4 scaling (floats)

(Please see Matrix 4x4 addition/subtraction (floats) for the typedefs and definitions used.)

Scaling a matrix means multiplying each of its elements by a float. Assume the following prototype:

void Mat44ScaleTo(Mat44 m1, Mat44 m2, float f);

where we multiply matrix m2 by the float f and store the result into m1.
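For comparison, a plain scalar version of the same operation could look like the sketch below. The Mat44 typedef shown here is an assumption (a 4x4 array of floats); the real definition is on the addition/subtraction page. The name Mat44ScaleToRef is hypothetical and only serves to keep it apart from the AltiVec function.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed layout; the actual typedef is given on the
   Matrix 4x4 addition/subtraction (floats) page. */
typedef float Mat44[4][4];

/* Scalar reference (hypothetical name): m1 = m2 * f, one element at a time. */
void Mat44ScaleToRef(Mat44 m1, Mat44 m2, float f)
{
    assert(m1 != NULL && m2 != NULL);
    for (int row = 0; row < 4; row++)
        for (int col = 0; col < 4; col++)
            m1[row][col] = m2[row][col] * f;
}
```

This does 16 separate multiplications. A version like this can also be useful for checking the output of the vectorized function.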

A SIMD unit can do four multiplications in parallel, so ideally we would multiply each matrix row (a 4-float vector) by the float in a single instruction. AltiVec, however, has no vector-by-scalar multiply. It can multiply a vector by another vector, so we first build a scaling vector whose four elements all hold the given float (f in this case). This can be accomplished with the vec_splat AltiVec intrinsic: we store f in the first element of a small array, load the array into a vector, and copy element 0 into all four positions.

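        // vec_ld ignores the low 4 bits of the address, so keep this 16-byte aligned
        float fscalar[4] __attribute__((aligned(16)));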
        vector float vscalar;
 
        // Set up scalar vector
        fscalar[0] = f;
        vscalar = vec_ld(0, fscalar);
        vscalar = vec_splat(vscalar, 0);
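To make the load-and-splat step concrete, here is a small portable model of the two intrinsics in plain C. It is only a sketch: Vec4f, model_vec_ld and model_vec_splat are hypothetical names, not part of AltiVec. The model relies on the fact that vec_ld ignores the low four bits of the effective address, and it assumes a GCC-compatible compiler for the alignment attribute used in the usage note.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for "vector float": four packed floats. */
typedef struct { float e[4]; } Vec4f;

/* Models vec_ld(offset, ptr): the low four bits of the effective
   address are dropped, so the load always starts on a 16-byte boundary. */
static Vec4f model_vec_ld(size_t offset, const float *ptr)
{
    uintptr_t ea = ((uintptr_t)ptr + offset) & ~(uintptr_t)15;
    Vec4f v;
    memcpy(v.e, (const void *)ea, sizeof v.e);
    return v;
}

/* Models vec_splat(v, idx): copies element idx into all four lanes. */
static Vec4f model_vec_splat(Vec4f v, unsigned idx)
{
    assert(idx < 4);
    Vec4f r = { { v.e[idx], v.e[idx], v.e[idx], v.e[idx] } };
    return r;
}
```

With a buffer declared as float buf[8] __attribute__((aligned(16))), model_vec_ld(0, &buf[1]) returns buf[0] to buf[3], not buf[1] to buf[4]. This is why the array that holds f should be 16-byte aligned: otherwise f may not end up in element 0, and the splat would copy a different value.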

We also have to load the matrix; as before, we'll use the LOAD_ALIGNED_MATRIX() macro.

        // Load matrix
        LOAD_ALIGNED_MATRIX(m2, vm2_1, vm2_2, vm2_3, vm2_4);

The scaling is done using the vec_madd() intrinsic. It takes three arguments because it performs the a*b+c operation (multiply-add, where a, b and c are vector floats). Since we only want the product, we pass a zero vector in place of c. We create that vector with vec_splat_u32(0) and cast it to vector float; the all-zero bit pattern is 0.0f.

        vector float v0 = (vector float)vec_splat_u32(0);
        // Do scaling
        vm1_1 = vec_madd(vm2_1, vscalar, v0);
        vm1_2 = vec_madd(vm2_2, vscalar, v0);
        vm1_3 = vec_madd(vm2_3, vscalar, v0);
        vm1_4 = vec_madd(vm2_4, vscalar, v0);
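The per-element arithmetic of one such multiply-add can be modelled in plain C as in the sketch below. The function name model_vec_madd is hypothetical, and the model makes no attempt to reproduce the hardware's rounding behaviour; it only shows what each lane computes.

```c
#include <assert.h>
#include <stddef.h>

/* Models vec_madd(a, b, c) on four floats: out[i] = a[i] * b[i] + c[i]. */
static void model_vec_madd(const float a[4], const float b[4],
                           const float c[4], float out[4])
{
    assert(a != NULL && b != NULL && c != NULL && out != NULL);
    for (int i = 0; i < 4; i++)
        out[i] = a[i] * b[i] + c[i];
}
```

Passing a matrix row as a, the splatted scalar as b and an all-zero vector as c leaves just the scaled row in out, which is the operation applied to each of the four rows.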

And we store the result:

        // Store back the result
        STORE_ALIGNED_MATRIX(m1, vm1_1, vm1_2, vm1_3, vm1_4);

The final form of the function is the following:

void Mat44ScaleTo(Mat44 m1, Mat44 m2, float f)
{
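        // vec_ld ignores the low 4 bits of the address, so keep this 16-byte aligned
        float fscalar[4] __attribute__((aligned(16)));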
        vector float vscalar, v0, 
                     vm1_1, vm1_2, vm1_3, vm1_4, 
                     vm2_1, vm2_2, vm2_3, vm2_4;
 
        // Set up scalar vector
        fscalar[0] = f;
        vscalar = vec_ld(0, fscalar);
        vscalar = vec_splat(vscalar, 0);
        v0 = (vector float) vec_splat_u32(0);
 
        // Load matrix
        LOAD_ALIGNED_MATRIX(m2, vm2_1, vm2_2, vm2_3, vm2_4);
 
        // Do scaling
        vm1_1 = vec_madd(vm2_1, vscalar, v0);
        vm1_2 = vec_madd(vm2_2, vscalar, v0);
        vm1_3 = vec_madd(vm2_3, vscalar, v0);
        vm1_4 = vec_madd(vm2_4, vscalar, v0);
 
        // Store back the result
        STORE_ALIGNED_MATRIX(m1, vm1_1, vm1_2, vm1_3, vm1_4);
}