Differences between revisions 1 and 2
Revision 1 as of 2008-08-29 07:22:58 (editor: AgnerFog)
Revision 2 as of 2008-08-29 07:35:01 (editor: AgnerFog)

Agner Fog's Wishlist:

Optimize libc

I am doing research on software optimization. I have compared different compilers and function libraries. My tests show that gcc is one of the best optimizing compilers for x86 and x86-64 platforms, but libc is one of the most poorly optimized function libraries. I would recommend that somebody work on making libc faster on x86 and x86-64. This would include:

  • CPU dispatching. Critical functions should have one version for the latest instruction set and one version that is compatible with the oldest instruction set. In a few cases there would be more than two versions. A general framework for CPU dispatching would be great.
  • Memory functions. Functions like memcpy and memmove are among the most important to optimize. These functions should use XMM registers (or YMM registers when available) to move the largest possible amount of data per iteration. If source and destination are not aligned by 16, then make aligned reads into XMM registers, shift and combine adjacent reads to fit the alignment of the destination, and then write aligned. This method can improve the speed by a factor of 4 to 5 in the unaligned case when data are in the level-1 cache on the newest microprocessors. (This needs different paths for SSE2, AMD SSE5 and Intel AVX.)
  • String functions. Functions like strlen, strcpy, strchr, strcmp etc. can use SSE2 instructions in XMM registers to compare 16 bytes in one instruction. The SSE4.2 instruction set can perhaps provide a little further improvement, but the most important improvement is to use SSE2.
  • Math functions. Use SSE2 if available. Use the full XMM register. Functions that use Taylor expansion can be improved by unrolling the Taylor loop by 4 and computing x^n from x^(n-4) * x^4 in order to use the maximum throughput of the multiplier in the CPU.
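
The unrolling described in the math functions item can be sketched as follows. This is only an illustration, assuming a plain 16-term Taylor series for exp(x) around 0 with no range reduction. The names exp_taylor_simple and exp_taylor_unrolled and the coefficient table are hypothetical and not taken from any library.

```c
/* Coefficients 1/n! for n = 0..15 (Taylor series of exp around 0).
   Table and function names are made up for this sketch. */
static const double coef[16] = {
    1.0, 1.0, 1.0 / 2, 1.0 / 6, 1.0 / 24, 1.0 / 120, 1.0 / 720,
    1.0 / 5040, 1.0 / 40320, 1.0 / 362880, 1.0 / 3628800,
    1.0 / 39916800, 1.0 / 479001600, 1.0 / 6227020800.0,
    1.0 / 87178291200.0, 1.0 / 1307674368000.0
};

/* Simple loop: every power of x depends on the previous one,
   so the multiplications form one long dependency chain. */
double exp_taylor_simple(double x)
{
    double sum = 0.0, p = 1.0;
    for (int n = 0; n < 16; n++) {
        sum += coef[n] * p;
        p *= x;
    }
    return sum;
}

/* Unrolled by 4: x^n is computed as x^(n-4) * x^4, which gives
   four independent chains that the multiplier can overlap. */
double exp_taylor_unrolled(double x)
{
    double x2 = x * x, x3 = x2 * x, x4 = x2 * x2;
    double p0 = 1.0, p1 = x, p2 = x2, p3 = x3;
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    for (int n = 0; n < 16; n += 4) {
        s0 += coef[n]     * p0;
        s1 += coef[n + 1] * p1;
        s2 += coef[n + 2] * p2;
        s3 += coef[n + 3] * p3;
        p0 *= x4; p1 *= x4; p2 *= x4; p3 *= x4;
    }
    return (s0 + s1) + (s2 + s3);
}
```

Both functions compute the same polynomial. The unrolled one sums the terms in a different order, so the results can differ in the last bits. A real implementation would also reduce the argument range first and could hold the four chains in XMM registers.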

  • Vector math functions. Gcc supports automatic vectorization of math functions with the option -mveclibabi, but the corresponding function library is not available. Currently, you have to use libraries from Intel or AMD. It would be nice to have optimal support for both Intel and AMD in the same library.
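
The CPU dispatching and string function items above can be illustrated together with a small sketch. It assumes GCC or Clang on x86 or x86-64, since it relies on the compiler builtins __builtin_cpu_init, __builtin_cpu_supports and __builtin_ctz. The names strlen_generic, strlen_sse2, select_strlen and my_strlen are hypothetical. This is not how libc does it, only one possible shape of the idea.

```c
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>   /* SSE2 intrinsics */

/* Fallback compatible with the oldest instruction set:
   one byte per iteration. */
static size_t strlen_generic(const char *s)
{
    const char *p = s;
    while (*p) p++;
    return (size_t)(p - s);
}

/* SSE2 version: compares 16 bytes against zero per iteration.
   All reads are aligned by 16, so they never cross a page boundary,
   but they may touch bytes outside the string. That is common
   practice in assembly libraries, though not strictly portable C. */
__attribute__((target("sse2")))
static size_t strlen_sse2(const char *s)
{
    const __m128i zero = _mm_setzero_si128();
    /* Round the pointer down to a 16-byte boundary. */
    const char *p = (const char *)((uintptr_t)s & ~(uintptr_t)15);
    unsigned offset = (unsigned)(s - p);
    unsigned mask = (unsigned)_mm_movemask_epi8(
        _mm_cmpeq_epi8(_mm_load_si128((const __m128i *)p), zero));
    mask >>= offset;              /* ignore bytes before the string */
    if (mask) return (size_t)__builtin_ctz(mask);
    for (;;) {
        p += 16;
        mask = (unsigned)_mm_movemask_epi8(
            _mm_cmpeq_epi8(_mm_load_si128((const __m128i *)p), zero));
        if (mask) return (size_t)(p - s) + (size_t)__builtin_ctz(mask);
    }
}

typedef size_t (*strlen_fn)(const char *);

/* Dispatcher: ask the CPU once which version it can run. */
static strlen_fn select_strlen(void)
{
    __builtin_cpu_init();
    return __builtin_cpu_supports("sse2") ? strlen_sse2 : strlen_generic;
}

/* Entry point: the first call picks the version, later calls go
   through the stored pointer. The lazy initialization is a
   simplification and is not synchronized between threads. */
size_t my_strlen(const char *s)
{
    static strlen_fn impl = NULL;
    if (!impl) impl = select_strlen();
    return impl(s);
}
```

A call such as my_strlen("hello") goes through the dispatcher on first use and then directly through the stored function pointer. A general framework would let more versions (for example SSE4.2 or AVX) be added to the selection function without changing the callers.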

Resources

I don't have the time to work on GNU projects myself, but I have provided a number of useful resources:

You are welcome to use my code in libc. Contact me for the necessary copyright assignments etc. (I will not help you with your private projects, though.) You can find my e-mail address at my homepage.

None: AgnerWishlist (last edited 2008-08-29 07:35:01 by AgnerFog)