32-bit *signed* integer multiplication with AltiVec

While completing Eigen2 AltiVec support (should be almost complete now), I noticed that the 32-bit integer multiplication didn't work correctly all of the time. As AltiVec does not really include any instruction to do 32-bit integer multiplication, I used Apple's routine from the Apple Developer's site. But this didn't work and some results were totally off. With some debugging, I found out that this routine works for unsigned 32-bit integers, where Eigen2 uses signed integers! So, I had to search more, and to my surprise, I found no reference of any similar work. So I had 2 choices: a) ditch AltiVec integer vectorisation from Eigen2 (not acceptable!) b) implement my own method! It is obvious which choice I followed :)
UPDATE: Thanks to Matt Sealey, who noticed I could have used vec_abs() instead of vec_sub() and vec_max(). Duh! :D

Anyway, here is the code with some explanations, but just as a quick explanation, I can just say that there were 3 stages to the algorithm:

  • get the signs of the multiplication using xor, in particular, if
    v1 = | A1 | A2 | A3 | A4 |, v2 = | B1 | B2 | B3 | B4 |
     
    (v1 xor v2) -> | sgn(A1*B1) | sgn(A2*B2) | sgn(A3*B3) | sgn(A4*B4) |

    and we just have to use compare each 32-bit quantity with 0 and produce the needed mask.

  • Get the absolute values of each vector and do the multiplication as per the Apple method for unsigned integers.
  • Change the signs of the required elements, according to the mask, using basic binary arithmetic. A negative number is the two's complement +1: -A = ~A +1. So we just NOR the negative elements, add 1 to them and merge the results back to the final vector.

    Here is the code (taken and modified from the Eigen source).

    vector int vmulws(const vector int&   a, const vector int&   b)
    {
      v4i bswap, low_prod, high_prod, prod, prod_, a1, b1, v1sel;
     
      // Get the absolute values
      a1  = vec_abs(a);
      b1  = vec_abs(b);
     
      // Get the signs using xor
      v4bi sgn = (v4bi) vec_cmplt(vec_xor(a, b), v0i);
     
      // Do the multiplication for the asbolute values.
      bswap = (v4i) vec_rl((v4ui) b1, (v4ui) v16i_ );
      low_prod = vec_mulo((vector short)a1, (vector short)b1);
      high_prod = vec_msum((vector short)a1, (vector short)bswap, v0i);
      high_prod = (v4i) vec_sl((v4ui) high_prod, (v4ui) v16i_);
      prod = vec_add( low_prod, high_prod );
     
      // NOR the product and select only the negative elements according to the sign mask
      prod_ = vec_nor(prod, prod);
      prod_ = vec_sel(v0i, prod_, sgn);
     
      // Add 1 to the result to get the negative numbers
      v1sel = vec_sel(v0i, v1i, sgn);
      prod_ = vec_add(prod_, v1sel);
     
      // Merge the results back to the final vector.
      prod = vec_sel(prod, prod_, sgn);
      return prod;
    }