freevec.org

libfreevec NG!!

markos — Tue, 24/03/2009 - 23:24

I'm in the process of rewriting libfreevec and porting it to other SIMD platforms besides AltiVec (which I consider dead or dying, unfortunately, thanks to the Big Powers that decided it's no longer important, along with PowerPC, but that should be another topic). Anyway, the main platforms chosen are AltiVec (of course :), SSE (SSE2, SSE3 and possibly SSE4), ARM NEON and Cell SPU.

The idea behind libfreevec is not restricted to AltiVec anyway. I have proven that glibc, the #1 libc used on Linux, is totally unoptimized even for common platforms (such as x86 and x86_64), and there are performance gains that could and should materialize if someone made the effort. So, I've decided to do exactly that.

First, I'll extend libfreevec to be a full-blown libc and will try to be at least source-compatible with glibc (that's a definite must; ABI compatibility would be a nice plus, but I don't know if I can do it yet, probably not). For this purpose, I'm also rewriting the build system to use CMake. I have to say, so far it has reduced both compile times and debugging times by a factor of 10!! No more messing around with stupid configure/autoconf scripts. Good riddance!

Second, I'm abstracting the actual functions. After all, a memcpy uses the same algorithm no matter what the platform or the SIMD engine is, right? So each function just includes a header file with all the inline functions it needs (I moved away from macros, as inline functions are much easier to debug). The right header is included automatically depending on the SIMD engine selected at compile time (or a scalar version, if no SIMD engine was defined).
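To make the idea concrete, here is a minimal sketch of that kind of abstraction. It is not the actual libfreevec code: the names `simd_vec`, `simd_load`, `simd_store`, `SIMD_WIDTH` and `generic_memcpy` are made up for illustration, and the two backends (SSE2 and scalar) are written into one file here, where a real tree would keep each in its own header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Backend selection. In a real tree each branch would live in its own
 * header (hypothetical names: simd_sse2.h, simd_scalar.h) and a single
 * dispatch header would include the right one. */
#if defined(__SSE2__)
#include <emmintrin.h>
#define SIMD_WIDTH 16
typedef __m128i simd_vec;
static inline simd_vec simd_load(const void *p)
{
    return _mm_loadu_si128((const __m128i *)p);   /* unaligned 16-byte load */
}
static inline void simd_store(void *p, simd_vec v)
{
    _mm_storeu_si128((__m128i *)p, v);            /* unaligned 16-byte store */
}
#else
/* Scalar fallback: move one machine word at a time. */
#define SIMD_WIDTH sizeof(uintptr_t)
typedef uintptr_t simd_vec;
static inline simd_vec simd_load(const void *p)
{
    simd_vec v;
    memcpy(&v, p, sizeof v);
    return v;
}
static inline void simd_store(void *p, simd_vec v)
{
    memcpy(p, &v, sizeof v);
}
#endif

/* The algorithm itself is written once, against the abstract interface. */
static void *generic_memcpy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    assert(d != NULL && s != NULL);
    while (n >= SIMD_WIDTH) {          /* bulk: one vector (or word) per step */
        simd_store(d, simd_load(s));
        d += SIMD_WIDTH;
        s += SIMD_WIDTH;
        n -= SIMD_WIDTH;
    }
    while (n--)                        /* tail: remaining bytes */
        *d++ = *s++;
    return dst;
}
```

The loop body never mentions SSE or any other engine, so adding an AltiVec or NEON backend would only mean adding another branch that defines the same three names. A tuned version would also deal with alignment, which this sketch ignores.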

Also, I've started work on rewriting the IEEE 754 math functions used in glibc/libm. I've often mentioned that these are slow as molasses on ALL platforms, and now I can prove it. Here are some results:

> ./test_trigf
Populated 100000000 floats in the range [0..pi/4]
dt = 2.830000
Glibc       :    35335689.05 calculations of cosf()/sec
<cos(x)> = 0.167772
dt = 2.240000
libfreevec  :    44642857.14 calculations of cosf()/sec
<cos(x)> = 0.167772
vec_cosf fail/tot = 1456251/100000000, maxerror = 0.0000001
dt = 1.710000
Glibc       :    58479532.16 calculations of sinf()/sec
<sin(x)> = 0.167772
dt = 2.390000
libfreevec  :    41841004.18 calculations of sinf()/sec
<sin(x)> = 0.167772
vec_sinf fail/tot = 98844217/100000000, maxerror = 0.0635434
dt = 2.100000
Glibc       :    47619047.62 calculations of tanf()/sec
<tan(x)> = 0.167772
dt = 2.220000
libfreevec  :    45045045.05 calculations of tanf()/sec
<tan(x)> = 0.167772
vec_tanf fail/tot = 125883/100000000, maxerror = 0.0000001
dt = 4.190000
Glibc       :    23866348.45 calculations of coshf()/sec
<cosh(x)> = 0.335544
dt = 1.470000
libfreevec  :    68027210.88 calculations of coshf()/sec
<cosh(x)> = 0.335544
vec_coshf fail/tot = 23772394/100000000, maxerror = 0.0000001
dt = 4.470000
Glibc       :    22371364.65 calculations of sinhf()/sec
<sinh(x)> = 0.167772
dt = 1.380000
libfreevec  :    72463768.12 calculations of sinhf()/sec
<sinh(x)> = 0.167772
vec_sinhf fail/tot = 111824/100000000, maxerror = 0.0000001
dt = 7.980000
Glibc       :    12531328.32 calculations of tanhf()/sec
<tanh(x)> = 0.167772
dt = 1.270000
libfreevec  :    78740157.48 calculations of tanhf()/sec
<tanh(x)> = 0.167772
vec_tanhf fail/tot = 803327/100000000, maxerror = 0.0000001

OK, these are preliminary results, and I can probably do better in terms of both accuracy and speed (I'm especially disappointed with sinf(), which is definitely using a wrong approximant function, but I expect to find the culprit soon). All tests were done on an Athlon X2 @ 2.5GHz (the same one used in previous benchmarks); the glibc version was glibc-2.9-2.11.1 (openSUSE 11.1 package, 64-bit version). The max error reported is the maximum difference between my version and the glibc version, seen in X tests out of 100 million (the fail/tot figure). OK, you have to admit a max error of 1 * 10^(-7) is negligible :)
Also, these are all PLAIN C versions, with no asm or hand optimization. The good thing is that they are ~20 lines of C max for each function, much easier to read than the spaghetti mess in glibc. When I do the custom optimizations per arch, even the functions that are now faster in glibc are going to get totally trounced. Btw, all functions in libfreevec have a consistent speed, whereas most functions in glibc perform well in the 0..pi/4 range (possibly due to the SINCOS asm instruction) but lose a lot of speed when the sample used is [-pi..pi] (which is the generic case), and are in fact slower than libfreevec there.
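For anyone wanting to reproduce the fail/tot and maxerror figures, the comparison could look roughly like the sketch below. This is not the actual test_trigf source. The struct and function names are hypothetical, and it assumes that a "fail" is any result that differs from the reference at all and that the error is an absolute difference.

```c
#include <assert.h>
#include <stddef.h>

/* Summary of one comparison run (hypothetical layout). */
typedef struct {
    size_t fail;       /* results that differ from the reference at all */
    size_t total;      /* number of results compared */
    float  max_error;  /* largest absolute difference seen */
} accuracy_report;

/* Compare precomputed results against reference results, element by element. */
static accuracy_report compare_results(const float *got, const float *ref, size_t n)
{
    accuracy_report r = { 0, n, 0.0f };

    assert(got != NULL && ref != NULL);
    for (size_t i = 0; i < n; i++) {
        float diff = got[i] - ref[i];
        if (diff < 0.0f)
            diff = -diff;              /* absolute difference */
        if (diff != 0.0f)
            r.fail++;                  /* assumption: any mismatch is a fail */
        if (diff > r.max_error)
            r.max_error = diff;
    }
    return r;
}
```

Filling `got` with the new implementation's output and `ref` with glibc's output over the same inputs gives the three numbers printed on each vec_* line.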
Stay tuned...


ggael — Thu, 26/03/2009 - 11:59

great news :)

About the math functions, are you using techniques similar to those of the Cephes library? (http://www.netlib.org/cephes/)

I'm asking because I just added SSE versions of Cephes's sin, cos, exp and log functions (all credit goes to Julien Pommier, http://gruntthepeon.free.fr/ssemath/) and they work pretty well. But if you have something even better to propose...

Also, Cephes's sin and cos routines work very well in the range ~ [-8000 : 8000], while the quality slowly degrades for larger values. Have you experienced similar behavior?


markos — Thu, 26/03/2009 - 14:43

Not certain about this.
I think that Cephes uses Taylor expansions for most of the functions, whereas I use Padé approximants (which are much faster to calculate as they use far fewer terms; the difference in speed is 50-200%). I'll organize the code a little, as it's currently a mess, and will release it shortly. I also have some nice new results from the exp() functions, but I have to fine-tune the polynomial constants in the terms to get full accuracy (right now I get ~3 * 10^(-7), which is OK, but not fully IEEE 754). I'll let you know when I'm done with this.
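To show what a Padé approximant looks like in code, here is a small sketch for tan(x) on a small range. It uses the textbook [3/4] approximant obtained by truncating the continued fraction of tan, which is not necessarily the one (or the coefficients) used in libfreevec. The name `pade_tanf` is made up for this example.

```c
#include <assert.h>

/* [3/4] Pade approximant of tan(x) around 0:
 *
 *     tan(x) ~= x * (105 - 10*x^2) / (105 - 45*x^2 + x^4)
 *
 * Meant only for small |x| (roughly [0, pi/4]); there is no range
 * reduction here, so larger arguments must be reduced by the caller. */
static inline float pade_tanf(float x)
{
    assert(x > -1.0f && x < 1.0f);               /* outside the intended range */

    const float x2  = x * x;
    const float num = x * (105.0f - 10.0f * x2);
    const float den = 105.0f + x2 * (x2 - 45.0f); /* Horner form of the quartic */
    return num / den;
}
```

That is three multiplications, three additions or subtractions and one division on top of computing x^2, and on [0, pi/4] the result stays within about 1e-5 of the true value. Getting closer to full single precision would need a higher-order approximant or tuned constants.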

Copyright (c)2008 by CODEX.