This is the mail archive of the libc-help at sourceware dot org
mailing list for the glibc project.
Memory and string functions can be improved dramatically on x86 and x86-64
- From: Agner Fog <agner at agner dot org>
- To: libc-help at sourceware dot org
- Date: Fri, 01 Aug 2008 08:22:00 +0200
- Subject: Memory and string functions can be improved dramatically on x86 and x86-64
- Organization: email@example.com
I am doing research on optimization of microprocessors and compilers.
Some of you probably know my optimization manuals (www.agner.org/optimize/).
I have tested many different compilers and compared how well they
optimize C++ code. I have been pleased to observe that gcc has been
improved a lot in the last couple of years. The gcc compiler itself is
now matching the optimizing performance of the Intel compiler and it
beats all other compilers I have tested. The many hard-working
developers deserve credit for this! Unfortunately, libc turns out to be
a weak point in the comparison. The performance of libc on memory and
string functions is poor compared to other function libraries because it
doesn't use the XMM registers. See my test results below.
If somebody would do the job of updating these functions then we would
have the wonderful situation where gcc/libc would be the best optimizing
solution for all x86 and x86-64 platforms.
Test results. Memcpy function on Intel Core 2 processor, core clock
cycles per byte of data:
Function library     aligned by 16     unaligned data
gcc builtin          0.18              1.21
libc 2.7 32 bit      0.18              0.57
libc 2.8 32 bit      0.18              0.58
libc 2.7 64 bit      0.18              0.44
Microsoft            0.12              0.63
CodeGear             0.18              0.75
Intel                0.12              0.18
Mac                  0.11              0.11
My own library       0.11              0.12
As you can see, the speed of memcpy in libc can be improved by a factor
of 4-5 for unaligned data on a Core 2. The gcc builtin version is even
slower. On an AMD K8 CPU there is less difference between the
performance of the different libraries because the K8 has only 64-bit
internal data paths, so it cannot take full advantage of the 128-bit
XMM registers. I expect the AMD K10 to perform similarly to the Intel
Core 2 because it has 128-bit data paths. However, I haven't had the
chance to test this on an AMD K10 yet.
There are significant performance differences on the strlen function and
other functions as well. You can find my complete test results at
http://www.agner.org/optimize/optimizing_cpp.pdf section 2.6.
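To illustrate how the XMM registers can help a function such as strlen, here is a minimal sketch using SSE2 intrinsics. It is not the code from libc, asmlib, or any of the libraries tested; the name strlen_sse2 is hypothetical. It assumes GCC or Clang (for __builtin_ctz) and falls back to a plain byte loop when SSE2 is not enabled at compile time. It rounds the pointer down to a 16-byte boundary and reads whole aligned blocks; such a block never crosses a page boundary, but it can read bytes outside the string itself, which strict C does not sanction and which memory checkers may flag.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>

/* Hypothetical SSE2 strlen: tests 16 bytes per iteration for a zero byte. */
size_t strlen_sse2(const char *s)
{
    const __m128i zero = _mm_setzero_si128();

    /* Round the start address down to a 16-byte boundary. */
    uintptr_t addr = (uintptr_t)s;
    unsigned misalign = (unsigned)(addr & 15);
    const __m128i *p = (const __m128i *)(addr - misalign);

    /* One bit per byte: set where the byte equals zero. */
    unsigned mask = (unsigned)_mm_movemask_epi8(
        _mm_cmpeq_epi8(_mm_load_si128(p), zero));

    /* Ignore the bytes that lie before the start of the string. */
    mask &= ~0u << misalign;

    while (mask == 0) {
        p++;
        mask = (unsigned)_mm_movemask_epi8(
            _mm_cmpeq_epi8(_mm_load_si128(p), zero));
    }

    /* The lowest set bit gives the position of the terminating zero. */
    return (size_t)(((const char *)p - s) + (ptrdiff_t)__builtin_ctz(mask));
}

#else

/* Fallback for targets where SSE2 is not enabled at compile time. */
size_t strlen_sse2(const char *s)
{
    const char *p = s;
    while (*p)
        p++;
    return (size_t)(p - s);
}

#endif
```

A production version would typically be written in assembly and selected at run time according to the CPU's capabilities.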
I would recommend that you implement CPU dispatching for the different
instruction sets in the most important memory and string functions
and take advantage of the newest instruction sets if available. Of
course, old computers without SSE should still be supported, but the
99% of users who have SSE2 or later should not be penalized for the sake
of compatibility with old CPUs.
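The dispatching idea can be sketched roughly as follows. This is only an illustration, not how libc or any of the tested libraries does it: the names my_memcpy, copy_generic, copy_sse2 and resolve_copy are hypothetical, and the CPU check relies on the GCC/Clang builtin __builtin_cpu_supports, which is an assumption about the toolchain. The first call goes through a resolver that picks an implementation once; later calls jump straight to the chosen version. The sketch ignores thread safety of that first call.

```c
#include <stddef.h>

typedef void *(*copy_fn)(void *dst, const void *src, size_t n);

/* Portable byte-by-byte version for CPUs without SSE2. */
static void *copy_generic(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n--)
        *d++ = *s++;
    return dst;
}

#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
#include <emmintrin.h>
#define HAVE_X86_DISPATCH 1

/* SSE2 version: moves 16 bytes at a time through an XMM register,
   using unaligned loads and stores for simplicity. */
__attribute__((target("sse2")))
static void *copy_sse2(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n >= 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)s);
        _mm_storeu_si128((__m128i *)d, v);
        s += 16;
        d += 16;
        n -= 16;
    }
    while (n--)            /* copy the remaining 0-15 bytes */
        *d++ = *s++;
    return dst;
}
#endif

/* Choose the best implementation the running CPU supports. */
static copy_fn resolve_copy(void)
{
#ifdef HAVE_X86_DISPATCH
    __builtin_cpu_init();
    if (__builtin_cpu_supports("sse2"))
        return copy_sse2;
#endif
    return copy_generic;
}

static void *copy_resolve_first(void *dst, const void *src, size_t n);

/* Starts out pointing at the resolver; replaced on the first call. */
static copy_fn copy_impl = copy_resolve_first;

static void *copy_resolve_first(void *dst, const void *src, size_t n)
{
    copy_impl = resolve_copy();
    return copy_impl(dst, src, n);
}

/* Public entry point: one indirect call after the first use. */
void *my_memcpy(void *dst, const void *src, size_t n)
{
    return copy_impl(dst, src, n);
}
```

Further branches for newer instruction sets could be added to resolve_copy in the same way, checked from newest to oldest.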
The work of putting this into libc should not be too big. Open source
optimized code is available in Mac/Xnu, in OpenSolaris, and in my own
function library "asmlib" at www.agner.org/optimize/asmlib.zip
All these have open source licenses, although with various differences.
I don't know if these differences in license conditions cause legal
problems that cannot be solved through negotiation. At least I am
willing to grant the necessary licenses to the GNU libc project if you
want to use my code.
I am not going to join the libc development team because I have lots of
other work to do; I am just offering my advice.