This is the mail archive of the mailing list for the glibc project.

Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: New x86-64 memcpy

On Fri, Feb 16, 2007 at 05:38:46PM -0600, Menezes, Evandro wrote:
> I implemented a new version of memcpy for x86-64 that provides an overall
> performance improvement over the current one on both AMD and Intel
> processors.
> It has several algorithms tuned for specific block size ranges,
> considering the sizes of the cache subsystems.  For instance, it makes
> use of repeated string instructions, software prefetching and streaming
> stores.
> As it uses several algorithms depending on the block size, the code is
> fairly long.  But given that a particular processor doesn't really need
> as many algorithms, at build time a specialized version for it keeps
> only a handful of worthy algorithms.
> In addition to the source-code patches, I also attached the resulting data
> obtained on a 2.4GHz Athlon64 with DDR2-800 RAM and on a 3GHz Core2 with
> DDR2-533.  The file memcpy-opteron-old.txt has the original output of
> string/test-memcpy on the Athlon64 system and the file
> memcpy-opteron-new.txt the output using the new routine.  The files
> memcpy-core2-old.txt and memcpy-core2-new.txt contain the same results but
> on the Core2 system.

I see a few issues:
1) as the l1/l2 cache sizes and prefetchw flag are only used inside
libc.so, there is no point in having those vars (why were they 8 bytes
rather than 4 bytes, btw?) in _rtld_global; they can very well be hidden
inside of libc.so, and therefore accessed like:
movl _x86_64_l1_cache_size_half(%rip), %r8d
which is certainly faster than loading their address from the GOT and then
using a second memory load to read the actual value.  The values can be
initialized in a static routine with the constructor attribute.
2) even for Intel CPUs it is possible to determine the L1 data cache size,
and glibc's sysconf (_SC_LEVEL1_DCACHE_SIZE) already knows how to do it
3) the function didn't have CFI directives, even though it changes %rsp
and saves/restores call-saved registers
4) various formatting issues (spaces instead of tabs, etc.)
5) glibc i?86/x86_64 assembly style uses explicit instruction suffixes
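Point 1 can be sketched in C roughly like this (the variable names mirror the ones in the patch; the sysconf fallback values here are invented for illustration, and inside glibc the constructor would read cpuid directly rather than call sysconf, for the linking reason described below):

```c
#include <unistd.h>

/* Hidden in the DSO, so code in the same library can load them with a
   single %rip-relative mov instead of a GOT indirection.  */
long __attribute__ ((visibility ("hidden"))) _x86_64_l1_cache_size_half;
long __attribute__ ((visibility ("hidden"))) _x86_64_l2_cache_size_half;

/* Runs automatically when the DSO is loaded, before user code.  */
static void __attribute__ ((constructor))
init_cacheinfo (void)
{
  long l1 = -1, l2 = -1;
#ifdef _SC_LEVEL1_DCACHE_SIZE
  l1 = sysconf (_SC_LEVEL1_DCACHE_SIZE);
  l2 = sysconf (_SC_LEVEL2_CACHE_SIZE);
#endif
  /* Hypothetical fallbacks in case the kernel reports nothing.  */
  _x86_64_l1_cache_size_half = (l1 > 0 ? l1 : 32768) / 2;
  _x86_64_l2_cache_size_half = (l2 > 0 ? l2 : 1048576) / 2;
}
```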

So, attached are the two patches combined, with the above things changed.
Initially I thought cacheinfo.c could just call
__sysconf (_SC_LEVEL1_DCACHE_SIZE) and __sysconf (_SC_LEVEL2_CACHE_SIZE);
unfortunately that doesn't work, because the test (the one that determines
which objects from libc_pic.a need to be compiled as rtld-*.os) then
fails to link - the real sysconf just drags in too much from libc_pic.a.
Perhaps even better would be to unify the cacheinfo detection between
i386 and x86_64: basically have one common cacheinfo.h with most of the
routines, but using a cpuid inline routine, and then have separate i386
and x86_64 cacheinfo.c files including it, each defining its own version
of the cpuid inline (on x86_64 we don't need to dance around %ebx); the
i386 cacheinfo.c would include detection of whether the cpuid insn can be
used at all, and the x86_64 cacheinfo.c would include these new
_x86_64_* variables and the constructor.
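The %ebx issue mentioned above can be illustrated with two hypothetical cpuid inlines of the kind the shared cacheinfo.h would expect each port to provide (a sketch, not the actual glibc code):

```c
#if defined __x86_64__
/* On x86-64, %ebx is not the PIC register, so a plain asm works.  */
static inline void
cpuid (unsigned int level, unsigned int *eax, unsigned int *ebx,
       unsigned int *ecx, unsigned int *edx)
{
  __asm__ ("cpuid"
           : "=a" (*eax), "=b" (*ebx), "=c" (*ecx), "=d" (*edx)
           : "0" (level));
}
#elif defined __i386__
/* On i386, PIC code uses %ebx as the GOT pointer, so it must be
   preserved around cpuid - the "dance": swap it out, run cpuid,
   swap the result back into a scratch register.  */
static inline void
cpuid (unsigned int level, unsigned int *eax, unsigned int *ebx,
       unsigned int *ecx, unsigned int *edx)
{
  __asm__ ("xchgl %%ebx, %1\n\tcpuid\n\txchgl %%ebx, %1"
           : "=a" (*eax), "=&S" (*ebx), "=c" (*ecx), "=d" (*edx)
           : "0" (level));
}
#endif
```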

BTW, why do you use push/pop instead of just saving/restoring the values
in the red zone?  That would mean at least simpler unwind info.
Also, for mempcpy, IMHO it is a bad idea to compute the result value early;
I believe that in all code paths the right return value is available in the
%rdi register, so the pushq/popq %rax would be unneeded for mempcpy, and
instead before each rep; retq you'd add:
#if MEMPCPY_P
	movq %rdi, %rax
#endif
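In C terms, the observation is just that mempcpy returns a pointer past the end of the copied block, which is exactly where the destination pointer (%rdi in the asm, advanced by rep movs and the copy loops) already points when the function returns; a minimal sketch (my_mempcpy is a made-up name, not glibc's __mempcpy):

```c
#include <string.h>

/* mempcpy (dst, src, n) copies like memcpy but returns dst + n instead
   of dst, which is handy for chained appends.  Because a copy routine
   typically ends with its destination pointer just past the last byte
   written, that pointer is the mempcpy result for free - no need to
   save the original dst across the whole function.  */
static void *
my_mempcpy (void *dst, const void *src, size_t n)
{
  memcpy (dst, src, n);
  return (char *) dst + n;
}
```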

Looking at the test-memcpy numbers (which, I admit, is certainly not a good
benchmark), I don't see a very visible win on the quad-core Core2, though:
$ ~/timing elf/ --library-path vanilla/ string/test-memcpy --direct > /dev/null
Strip out best and worst realtime result
minimum: 0.858424000 sec real / 0.000017988 sec CPU
maximum: 0.885605000 sec real / 0.000041098 sec CPU
average: 0.862428714 sec real / 0.000019401 sec CPU
stdev  : 0.002703822 sec real / 0.000001518 sec CPU
$ ~/timing elf/ --library-path . string/test-memcpy --direct > /dev/null
Strip out best and worst realtime result
minimum: 0.857600000 sec real / 0.000017456 sec CPU
maximum: 1.162678000 sec real / 0.000036033 sec CPU
average: 0.859858000 sec real / 0.000019178 sec CPU
stdev  : 0.001414669 sec real / 0.000001500 sec CPU
$ ~/timing elf/ --library-path vanilla/ string/test-memcpy --direct > /dev/null
Strip out best and worst realtime result
minimum: 0.858311000 sec real / 0.000017796 sec CPU
maximum: 0.905352000 sec real / 0.000038400 sec CPU
average: 0.861902142 sec real / 0.000019158 sec CPU
stdev  : 0.002512279 sec real / 0.000000852 sec CPU
$ ~/timing elf/ --library-path . string/test-memcpy --direct > /dev/null
Strip out best and worst realtime result
minimum: 0.857419000 sec real / 0.000018074 sec CPU
maximum: 0.870351000 sec real / 0.000032102 sec CPU
average: 0.861215571 sec real / 0.000019397 sec CPU
stdev  : 0.002651920 sec real / 0.000001001 sec CPU
$ ~/timing elf/ --library-path vanilla/ string/test-memcpy --direct > /dev/null
Strip out best and worst realtime result
minimum: 0.858271000 sec real / 0.000017894 sec CPU
maximum: 0.866028000 sec real / 0.000038928 sec CPU
average: 0.862063750 sec real / 0.000019215 sec CPU
stdev  : 0.002647184 sec real / 0.000000988 sec CPU
$ ~/timing elf/ --library-path . string/test-memcpy --direct > /dev/null
Strip out best and worst realtime result
minimum: 0.857654000 sec real / 0.000018043 sec CPU
maximum: 1.393263000 sec real / 0.000036258 sec CPU
average: 0.860786428 sec real / 0.000019350 sec CPU
stdev  : 0.002447096 sec real / 0.000000892 sec CPU

I will certainly retry tonight on an Athlon64 X2 when I get
physical access to it.  In any case, e.g. SPEC numbers would
be interesting too.


Attachment: P
Description: Text document

Attachment: test-memcpy.vanilla
Description: Text document

Attachment: test-memcpy.patched
Description: Text document
