I am doing research on optimization of microprocessors and compilers.
Some of you probably know my optimization manuals
(www.agner.org/optimize/).
I have tested many different compilers and compared how well they
optimize C++ code. I have been pleased to observe that gcc has
improved a lot in the last couple of years. The gcc compiler itself
now matches the optimizing performance of the Intel compiler and it
beats all other compilers I have tested. The many hard-working
developers deserve credit for this! Unfortunately, libc turns out to
be a weak point in the comparison. The performance of libc on memory
and string functions is poor compared to other function libraries
because it doesn't use the XMM registers. See my test results below.
If somebody would do the job of updating these functions then we would
have the wonderful situation where gcc/libc would be the best
optimizing solution for all x86 and x86-64 platforms.
Test results. Memcpy function on Intel Core 2 processor, core clock
cycles per byte of data:
Function library      aligned by 16    unaligned data
---------------------------------------------------
gcc builtin                0.18             1.21
libc 2.7 32 bit            0.18             0.57
libc 2.8 32 bit            0.18             0.58
libc 2.7 64 bit            0.18             0.44
Microsoft                  0.12             0.63
CodeGear                   0.18             0.75
Intel                      0.12             0.18
Mac                        0.11             0.11
My own library             0.11             0.12
---------------------------------------------------
As you can see, the speed of memcpy in libc can be improved by a
factor of 4-5 for unaligned data on a Core 2. The default builtin
version is even slower. On an AMD K8 CPU there is less difference
between the performance of the different libraries because the K8 has
only 64-bit internal data paths, so it cannot take full advantage of
the 128-bit XMM registers. I expect the AMD K10 to perform similarly
to the Intel Core 2 because it has 128-bit data paths. However, I
haven't had the chance to test this on an AMD K10 yet.
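To illustrate what using the XMM registers for unaligned data can look
like, here is a minimal sketch in C with SSE2 intrinsics. It is not the
code of libc or of any library in the table above. The name copy_xmm is
hypothetical, and the strategy (copy single bytes until the destination
is 16-byte aligned, then do unaligned 128-bit loads with aligned 128-bit
stores) is just one simple choice among several. It assumes a compiler
that defines __SSE2__ when SSE2 is enabled (gcc does on x86-64, or with
-msse2 on 32-bit x86) and falls back to a plain byte loop otherwise.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics: __m128i, _mm_loadu_si128, ... */
#endif

/* Hypothetical name, for illustration only. Regions must not overlap. */
void *copy_xmm(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Head: copy single bytes until the destination is 16-byte aligned. */
    while (n > 0 && ((uintptr_t)d & 15) != 0) {
        *d++ = *s++;
        n--;
    }

#if defined(__SSE2__)
    /* Main loop: 16 bytes per iteration through one XMM register.
       The source may still be misaligned, so the load is unaligned;
       the store is aligned because of the head loop above. */
    while (n >= 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)s);
        _mm_store_si128((__m128i *)d, v);
        s += 16;
        d += 16;
        n -= 16;
    }
#endif

    /* Tail: remaining bytes (or everything, if SSE2 is not enabled). */
    while (n > 0) {
        *d++ = *s++;
        n--;
    }
    return dst;
}
```

A tuned implementation would go further, for example by unrolling the
main loop and treating very small and very large blocks separately. The
sketch only shows the basic idea of moving 16 bytes per instruction.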
There are significant performance differences on the strlen function
and other functions as well. You can find my complete test results at
http://www.agner.org/optimize/optimizing_cpp.pdf section 2.6.
I would recommend that you implement CPU dispatching for the different
instruction sets in the most important memory and string functions
and take advantage of the newest instruction sets if available. Of
course the old computers without SSE should still be supported, but
the 99% of users who have SSE2 or later should not be penalized for
the sake of compatibility with old CPUs.
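A minimal sketch of such CPU dispatching in C is shown here. It checks
the CPU once, stores a function pointer, and sends all later calls
through it. This is an illustration under stated assumptions, not the
mechanism used by libc or by any of the libraries mentioned here. The
names copy_generic, copy_sse2, select_copy and my_copy are hypothetical.
The sketch relies on the gcc builtins __builtin_cpu_init and
__builtin_cpu_supports and on the gcc target attribute, so it assumes a
reasonably recent gcc or a compatible compiler.

```c
#include <stddef.h>

#if defined(__x86_64__) || defined(__i386__)
#include <emmintrin.h>   /* SSE2 intrinsics */
#define HAVE_X86 1
#endif

typedef void *(*copy_fn)(void *, const void *, size_t);

/* Fallback for old CPUs without SSE2: plain byte loop. */
static void *copy_generic(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n--)
        *d++ = *s++;
    return dst;
}

#ifdef HAVE_X86
/* SSE2 version: 16 bytes per iteration using unaligned loads and stores.
   The target attribute lets gcc compile it even without -msse2. */
__attribute__((target("sse2")))
static void *copy_sse2(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n >= 16) {
        _mm_storeu_si128((__m128i *)d,
                         _mm_loadu_si128((const __m128i *)s));
        s += 16;
        d += 16;
        n -= 16;
    }
    while (n--)
        *d++ = *s++;
    return dst;
}
#endif

/* Pick the best version the running CPU supports. */
static copy_fn select_copy(void)
{
#ifdef HAVE_X86
    __builtin_cpu_init();
    if (__builtin_cpu_supports("sse2"))
        return copy_sse2;
#endif
    return copy_generic;
}

/* Public entry point: the CPU check runs on the first call only.
   (Not synchronized; concurrent first calls would each select the
   same function, but a real library would do this at load time.) */
void *my_copy(void *dst, const void *src, size_t n)
{
    static copy_fn impl;
    if (!impl)
        impl = select_copy();
    return impl(dst, src, n);
}
```

Further versions for newer instruction sets would be added as extra
branches in select_copy, tested from newest to oldest, so that old CPUs
still get the generic version while newer ones are not held back.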
The work of putting this into libc should not be too big. Open source
optimized code is available in Mac/Xnu, in OpenSolaris, and in my own
function library "asmlib" at www.agner.org/optimize/asmlib.zip
All of these have open source licenses, although the terms differ. I
don't know whether these differences in license conditions would
cause legal problems that cannot be solved through negotiation. For
my part, I am willing to grant the necessary licenses to the GNU libc
project if you want to use my code.
I am not going to join the libc development team because I have lots
of other work to do. I am just offering my advice.