Though I have been working on SIMD optimizations for many years, both professionally and as a hobby, I never had to mention it explicitly in the past. At the suggestion of fellow professionals, I decided it would be a good idea to state it explicitly on my personal blog.
SIMD optimization can happen in several ways, but here is a typical process:
  1. Profile the software in question and find the routine/method/function/etc. that takes most of the time. The most common candidates are functions that are called millions or billions of times per second, but functions that are called once and take a long time to compute also qualify. Either way, they are candidates for optimization.
  2. Once we have found the culprit function, we move on to evaluating its algorithm.
  3. Evaluate the algorithm used: is vectorization really necessary? Sometimes a huge performance gain is achieved just by switching to a better algorithm.
  4. If there is no better algorithm applicable to this case, is it easily parallelizable/vectorizable? (they are not the same)
  5. Move to the next step, which is to break the operation into smaller discrete steps inside the loop.
  6. Unroll the loops at least as many times as there are elements in the SIMD vector (that is, 4 times if we're dealing with 32-bit floats/ints in a 128-bit vector, etc).
  7. Replace the scalar instructions with their vector equivalents.
  8. Fine-tune: experiment with instruction reordering, etc.
  9. If performance is still not enough, consider rewriting the critical code in assembly (but that takes much longer, is much harder to debug, and has other drawbacks, such as missing out on the compiler's instruction scheduling).
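
As a rough illustration of steps 5 to 7, here is a minimal sketch on a toy loop, y[i] = a * x[i] + y[i]. It is not taken from any real project: the function names (saxpy_scalar, saxpy_unrolled, saxpy_vector) and the v4f type are made up for this example, and it assumes a 128-bit vector holding four 32-bit floats. To keep it compilable on any target, it uses the GCC/Clang vector extensions instead of AltiVec, NEON or SSE intrinsics.

```c
#include <stddef.h>
#include <string.h>

/* Four 32-bit floats in one 128-bit vector (GCC/Clang vector extension). */
typedef float v4f __attribute__((vector_size(16)));

/* Starting point: the plain scalar loop found by profiling. */
void saxpy_scalar(float a, const float *x, float *y, size_t n)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

/* Steps 5 and 6: split the body into load / compute / store and
 * unroll it 4 times, one iteration per future vector element. */
void saxpy_unrolled(float a, const float *x, float *y, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        /* load */
        float x0 = x[i], x1 = x[i + 1], x2 = x[i + 2], x3 = x[i + 3];
        float y0 = y[i], y1 = y[i + 1], y2 = y[i + 2], y3 = y[i + 3];
        /* compute */
        y0 = a * x0 + y0;
        y1 = a * x1 + y1;
        y2 = a * x2 + y2;
        y3 = a * x3 + y3;
        /* store */
        y[i] = y0; y[i + 1] = y1; y[i + 2] = y2; y[i + 3] = y3;
    }
    for (; i < n; i++)              /* leftover elements */
        y[i] = a * x[i] + y[i];
}

/* Step 7: each group of four scalar operations becomes one vector one. */
void saxpy_vector(float a, const float *x, float *y, size_t n)
{
    const v4f va = { a, a, a, a };  /* a copied into all four lanes */
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        v4f vx, vy;
        memcpy(&vx, x + i, sizeof vx);  /* load, no alignment assumed */
        memcpy(&vy, y + i, sizeof vy);
        vy = va * vx + vy;              /* element-wise multiply and add */
        memcpy(y + i, &vy, sizeof vy);  /* store */
    }
    for (; i < n; i++)              /* leftover elements stay scalar */
        y[i] = a * x[i] + y[i];
}
```

With platform intrinsics, the multiply-add line would typically map to something like vec_madd on AltiVec, vmlaq_f32 on NEON, or _mm_add_ps over _mm_mul_ps on SSE. The unrolled version is usually only an intermediate step that makes the one-to-one replacement in step 7 easy to see.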

This is the process I try to follow, with good results most of the time, when I do performance optimization on a particular piece of software. I explain the process in much more detail, with some examples, in my talk here.

If you would be interested in having your software (closed or open source) optimized, I would be willing to give you a free evaluation in 1-2 days: whether it is vectorizable, whether it is worth it, what kind of performance to expect, and how long vectorization would take.

Right now, the platforms I am most proficient in are PowerPC AltiVec/VMX/VSX, ARM NEON and, to a lesser extent, SSE/AVX (I do some occasional SSE coding, but I am not as familiar with it as I am with PowerPC/ARM). However, the process is the same across all architectures.

For a quote, please contact me at markos at freevec dot org