Frequently Asked Questions - libfreevec


libfreevec is a library with many common glibc routines, rewritten and optimized to use the AltiVec vector engine found in the G4/G4+ PowerPC CPUs by Freescale. These processors are most commonly found in older Apple Mac computers that use PowerPC CPUs (not the new MacIntels) and in the Genesi Pegasos II Open Desktop Workstation (ODW). In addition to the glibc routines, it also includes custom routines designed to speed up various other performance-critical tasks. For example, it includes AltiVec-optimized versions of various string and memory functions from libstring, which is part of the MySQL package.

Well, libfreevec already runs on G5, Cell and Power6, but it's not optimized for 64-bit yet. Performance is good, but it can be even better. Basically, a few things have to be done. First, 64-bit ints should be used when manipulating small buffers (too small for AltiVec, that is), instead of the 32-bit ints used now. Secondly, the prefetching method has to change (970, Cell and Power6 don't provide data streams like the G4; dcbz/dcbt has to be used instead). Thirdly, some AltiVec instruction reordering might have to take place for maximum performance. In short, a few things have to be done to fully support 64-bit PowerPC CPUs.
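As a rough illustration of the first point, handling a small buffer in 64-bit words might look like the sketch below. This is not libfreevec code: `small_copy64` is a hypothetical name, and the fixed-size `memcpy` calls are simply the portable way to express an unaligned 8-byte load and store in C.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not a libfreevec function): copy a small buffer
 * eight bytes at a time, then finish the tail byte by byte. */
static void small_copy64(void *dst, const void *src, size_t len)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    assert(dst != NULL && src != NULL);

    while (len >= sizeof(uint64_t)) {
        uint64_t word;
        /* A fixed-size memcpy is a portable unaligned load/store;
         * compilers usually reduce it to a single instruction. */
        memcpy(&word, s, sizeof word);
        memcpy(d, &word, sizeof word);
        s += sizeof word;
        d += sizeof word;
        len -= sizeof word;
    }

    /* Fewer than eight bytes remain. */
    while (len--)
        *d++ = *s++;
}
```

On a 32-bit build the same loop would move four bytes per iteration with `uint32_t`, so the 64-bit variant needs half as many iterations for the same buffer.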

The license chosen is the LGPL. This means that you are free to use it in free software projects, and you're most welcome to use it even in closed-source/proprietary software, albeit with a few reasonable restrictions. Read the license text for details.

Indeed. And much to their credit, I'm still struggling to reach the performance of this library. These guys really knew what they were doing :-)
Now, when I started libfreevec, libmotovec's license was different from what it is now (now it uses a BSD-like license), and that was the reason I started libfreevec. Secondly, no effort had been made to include it in a standard libc (like glibc, uclibc, newlib, etc.), which is where it would make sense. libfreevec was started with that idea in mind: to eventually be incorporated into glibc. Thirdly, libmotovec is written in ppc assembly, which, although great for performance, is really difficult to maintain/debug/whatever. Not many people are that proficient with ppc asm; I know I am not. libfreevec is written in C, using the AltiVec intrinsics available in GCC, with some very small parts written in ppc asm. This means that as GCC gets better instruction scheduling for the G4, libfreevec might benefit from it as well (and actually it has: gcc 4.2.x performs much better than gcc 3.x, which shows in libfreevec benchmarks).

In addition, the way it's written right now allows for easy expansion with new functions, or even optimization of the existing ones, just by fiddling with a couple of macro modules.

Finally, libfreevec offers more AltiVec-optimized functions than libmotovec, plus a consistent cache-prefetching mechanism used in all of the available functions, and if I might say so myself, some of the libfreevec functions are even faster than the libmotovec equivalents. :-)

That's not really a solution. To be truthful, I actually have looked at the source code of the libmotovec functions. But my knowledge of ppc asm is not great, and though I could understand most of the logic behind the functions, there were some parts that evaded me. I'm sure there are developers who will find it as obvious as sunlight, but they are very few (compared to the number of developers who speak x86 assembly, for example). The main issue is maintainability, and one could hardly argue that assembly is more maintainable than C. As an AltiVec guru, Holger Bettag, says quite often, "AltiVec asm might give you an extra 2% performance, but why bother?". Holger, I paraphrased it a little bit, I hope you don't mind!

Actually I intend to, but not with this particular code. The goal of liboil is slightly different: it offers its own API and a whole lot of highly optimized routines to perform various algorithms. I, on the other hand, wanted to optimize existing functions from glibc, libstring (which is heavily used in MySQL), etc. I do plan to write some code for liboil at a later stage, but not at this particular moment.

It depends. If your program does 1 million memcpy() calls of 5 bytes each, the library will not benefit you at all. It might actually even be slower, due to the slightly bigger overhead. In truth, it's quite bad design to do a memcpy() for such a small buffer. On the other hand, depending on what your application does, you might enjoy a significant benefit from using such a library. For example, the AltiVec version of swab() in this library is ~7x faster than the scalar code. But it won't make a difference to your program if you only call it once, in the initialization code, for a 100-byte buffer.

Also, does your app use mainly aligned or unaligned buffers? So far, I can say that the performance hit from unaligned addresses is mostly minimized, but it is of course still a penalty.

Quite true. AltiVec is a very powerful beast, but it needs lots of data to feed on. Throughout the library I try to use AltiVec only where it's useful and needed. Most of the time, I just redesign an algorithm to be more efficient, and above a particular size threshold, AltiVec kicks in. That way I try to get the best of both worlds.
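The size-threshold idea can be sketched roughly as follows. Everything here is an assumption made for illustration: the function names, the cut-over value of 64 bytes, and the 16-byte block loop, which only stands in for a real AltiVec path so that the example compiles on any machine.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed cut-over point, for illustration only; a real value would
 * come from benchmarking on the target CPU. */
#define VECTOR_THRESHOLD 64

/* Scalar path: cheap to enter, good for short buffers. */
static void fill_scalar(unsigned char *p, unsigned char value, size_t len)
{
    while (len--)
        *p++ = value;
}

/* Stand-in for a vector path: works in 16-byte blocks, the width of an
 * AltiVec register. Real code would use AltiVec intrinsics here. */
static void fill_blocks(unsigned char *p, unsigned char value, size_t len)
{
    unsigned char block[16];

    memset(block, value, sizeof block);
    while (len >= sizeof block) {
        memcpy(p, block, sizeof block);
        p += sizeof block;
        len -= sizeof block;
    }
    fill_scalar(p, value, len);   /* leftover tail */
}

/* Hypothetical entry point: choose the path by buffer size. */
static void fill_dispatch(unsigned char *p, unsigned char value, size_t len)
{
    assert(p != NULL || len == 0);

    if (len < VECTOR_THRESHOLD)
        fill_scalar(p, value, len);
    else
        fill_blocks(p, value, len);
}
```

The point of the dispatch is that a short buffer never pays the setup cost of the wide path, while a long buffer spends nearly all of its time in it.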

Well, AltiVec is certainly not used in these cases. Most probably the original algorithm was quite dumb (i.e. not optimized). When I was writing the replacement functions, I always had in mind that for small sizes they have to be at least as fast as the originals, if not faster. It would look bad if a program that uses memcpy(), but only on smaller sizes, became slower for that reason. I tried to achieve this as much as possible, though I might have missed something. That's why user and developer feedback is so important, so send in those patches :-)

Assuming you refer to the memory functions, it's because the data has to be fetched from main memory rather than from the L1 or L2 caches. AltiVec has a 128-bit bus, but only to the L1 and L2 caches, not to main memory. Still, I use cache prefetching in most of the functions, and the performance will still be better than that of the original functions. Don't expect miracles, though: in these cases a 20-30% performance gain is more likely than a 10x one.
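To give a feel for what software prefetching looks like, here is a minimal sketch using GCC's generic `__builtin_prefetch` hint. It is not the mechanism libfreevec itself uses; the function name, the look-ahead distance and the 32-byte cache line size are all assumptions chosen for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE     32    /* assumed cache line size in bytes */
#define PREFETCH_AHEAD 256   /* assumed look-ahead distance; tune per CPU */

/* Hypothetical example: sum a large buffer while hinting upcoming
 * cache lines to the CPU. */
static uint64_t sum_with_prefetch(const unsigned char *buf, size_t len)
{
    uint64_t total = 0;

    assert(buf != NULL || len == 0);

    for (size_t i = 0; i < len; i++) {
        /* One hint per cache line, and only while the target address
         * is still inside the buffer. Arguments: address, 0 = read,
         * 0 = data is not expected to be reused soon. */
        if ((i % CACHE_LINE) == 0 && i + PREFETCH_AHEAD < len)
            __builtin_prefetch(buf + i + PREFETCH_AHEAD, 0, 0);
        total += buf[i];
    }
    return total;
}
```

A prefetch is only a hint: it never changes the result, and whether it helps at all depends on the buffer being large enough to fall out of the caches.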

Try to use aligned data as much as possible, e.g. by using well-known tricks like the following:

 unsigned char buffer[1024] __attribute__ ((aligned(16)));

instead of just declaring a variable (the size of 1024 is only an example). Note that the attribute has to be applied to the buffer itself; on a pointer declaration it would align only the pointer variable, not the data it points to. That way, you'll skip the time spent on handling unaligned data. Also, try to avoid useless invocations of memcpy() or similar functions for tiny buffers. Though GCC is supposed to inline the copying code in some cases, this is not guaranteed and should not be taken for granted. Instead, try to organize your data into bigger structures. It's better in the long run anyway.
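For heap buffers, the same alignment can be requested at allocation time. The sketch below uses C11 `aligned_alloc`; the helper names `alloc_vec_buffer` and `is_vec_aligned` are made up for this example and are not part of libfreevec.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define VEC_ALIGN 16   /* AltiVec registers are 128 bits wide */

/* Hypothetical helper: return a heap buffer whose start address is
 * 16-byte aligned. aligned_alloc (C11) expects the size to be a
 * multiple of the alignment, so the request is rounded up.
 * Release the buffer with free(). */
static unsigned char *alloc_vec_buffer(size_t len)
{
    size_t rounded;

    assert(len <= SIZE_MAX - VEC_ALIGN);   /* guard against overflow */

    rounded = (len + VEC_ALIGN - 1) / VEC_ALIGN * VEC_ALIGN;
    if (rounded == 0)
        rounded = VEC_ALIGN;
    return aligned_alloc(VEC_ALIGN, rounded);
}

/* Check whether a pointer sits on a 16-byte boundary. */
static int is_vec_aligned(const void *p)
{
    return ((uintptr_t)p % VEC_ALIGN) == 0;
}
```

A check like `is_vec_aligned()` is also handy in debug builds, to confirm that buffers coming from other parts of a program really are aligned before assuming the fast path.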

Check this site often as every function will be analyzed thoroughly.

You can check here.

There is a Subversion repository; for the latest version, check out svn://svn.codex.gr/trunk/freevec.