This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: Review decision to inline mempcpy to memcpy.
- From: Wilco Dijkstra <Wilco dot Dijkstra at arm dot com>
- To: Carlos O'Donell <carlos at redhat dot com>, GNU C Library <libc-alpha at sourceware dot org>, Ondrej Bilka <neleai at seznam dot cz>, "Joseph S. Myers" <joseph at codesourcery dot com>, Jakub Jelinek <jakub at redhat dot com>, Jeff Law <law at redhat dot com>
- Cc: nd <nd at arm dot com>
- Date: Fri, 4 Mar 2016 20:20:39 +0000
- Subject: Re: Review decision to inline mempcpy to memcpy.
- References: <56D856F2 dot 4020000 at redhat dot com>,<AM3PR08MB0088D8CBEE224AA54E620F6983BE0 at AM3PR08MB0088 dot eurprd08 dot prod dot outlook dot com>
Hi,
(resend to post to GLIBC list too)
> Were the changes in glibc to optimize mempcpy as memcpy
> originally motivated by performance for ARM?
OK, so the goal behind this was to provide the best possible out-of-the-box performance
in GLIBC without requiring all targets to write a lot of assembler code. For less
frequently used functions that are identical to a standard function except for the
return value, the most obvious and efficient implementation is to inline a call to the most
commonly used version. There are several good reasons to do this for mempcpy:
1. Few targets implement mempcpy.S, so most currently use the slow mempcpy.c veneer.
2. On most targets merging mempcpy into memcpy looks impossible without
slowing down memcpy as a result.
3. Adding a separate mempcpy.S implementation increases I-cache pressure as
now you need to load mempcpy too even if memcpy is already resident in L1/L2 cache.
4. GCC doesn't optimize/inline mempcpy as well as it does memcpy (see below).
> The crux of the argument is that the compiler may be able
> to do a better job of optimizing if it knows the call was
> a mempcpy as opposed to memcpy + addition.
No, unfortunately even GCC 6 optimizes memcpy better than mempcpy:

return __builtin_memcpy(x, y, 32);

        ldp     x4, x5, [x1]
        stp     x4, x5, [x0]
        ldp     x4, x5, [x1, 16]
        stp     x4, x5, [x0, 16]
        ret

return __builtin_mempcpy(x, y, 32);

        mov     x2, 32
        b       mempcpy

return mempcpy(x, y, 32); // using GLIBC 2.23 inline

        mov     x2, x0
        add     x0, x0, 32
        ldp     x4, x5, [x1]
        stp     x4, x5, [x2]
        ldp     x4, x5, [x1, 16]
        stp     x4, x5, [x2, 16]
        ret

So the only case where I can see a clear win for mempcpy is if you can do a good
merged implementation and GCC is fixed to always optimize mempcpy in exactly the
same way as memcpy. In that case just define _HAVE_STRING_ARCH_mempcpy.
If we can get GCC to do the right thing depending on the preference of the target and
library then things would be perfect.
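
For reference, a minimal sketch of the idea discussed here: mempcpy written as an
inline wrapper around memcpy, with a per-target opt-out. This is not the actual
GLIBC header text. The names sketch_mempcpy, sketch_concat and
SKETCH_HAVE_ARCH_MEMPCPY are hypothetical stand-ins (the last one plays the role
of _HAVE_STRING_ARCH_mempcpy).

```c
#include <stddef.h>
#include <string.h>

/* SKETCH_HAVE_ARCH_MEMPCPY is a hypothetical stand-in for
   _HAVE_STRING_ARCH_mempcpy: a target with a good merged
   implementation would define it and skip the inline.  */
#ifndef SKETCH_HAVE_ARCH_MEMPCPY
/* Hypothetical name, not the GLIBC symbol.  memcpy returns dest;
   mempcpy returns the byte just past the last one written, so the
   wrapper only has to add n to memcpy's result.  */
static inline void *
sketch_mempcpy (void *dest, const void *src, size_t n)
{
  return (char *) memcpy (dest, src, n) + n;
}
#else
/* Assumed to be provided by the target in this case.  */
void *sketch_mempcpy (void *dest, const void *src, size_t n);
#endif

/* Typical use: append two buffers back to back without tracking an
   offset by hand.  Returns the total number of bytes written.  */
static size_t
sketch_concat (char *out, const char *a, size_t alen,
               const char *b, size_t blen)
{
  char *p = sketch_mempcpy (out, a, alen);
  p = sketch_mempcpy (p, b, blen);
  return (size_t) (p - out);
}
```

With the wrapper visible to the compiler, the copy goes through the existing
memcpy expansion and only the extra add remains, which matches the third listing
above.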
Cheers,
Wilco