This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.

Re: Review decision to inline mempcpy to memcpy.

From: Wilco Dijkstra <Wilco dot Dijkstra at arm dot com>
To: Carlos O'Donell <carlos at redhat dot com>, GNU C Library <libc-alpha at sourceware dot org>, Ondrej Bilka <neleai at seznam dot cz>, "Joseph S. Myers" <joseph at codesourcery dot com>, Jakub Jelinek <jakub at redhat dot com>, Jeff Law <law at redhat dot com>
Cc: nd <nd at arm dot com>
Date: Fri, 4 Mar 2016 20:20:39 +0000
Subject: Re: Review decision to inline mempcpy to memcpy.
References: <56D856F2 dot 4020000 at redhat dot com>,<AM3PR08MB0088D8CBEE224AA54E620F6983BE0 at AM3PR08MB0088 dot eurprd08 dot prod dot outlook dot com>

Hi,

(resend to post to GLIBC list too)

> Were the changes in glibc to optimize mempcpy as memcpy
> originally motivated by performance for ARM?

OK, so the goal behind this was to provide the best possible out-of-the-box performance
in GLIBC without requiring every target to write a lot of assembler code. For a less
frequently used function that is identical to a standard function except for its
return value, the most obvious and efficient implementation is to inline a call to the
more commonly used version. There are several good reasons to do this for mempcpy:

1. Few targets implement mempcpy.S, so most currently use the slow mempcpy.c veneer.
2. On most targets, merging mempcpy into memcpy looks impossible without
   slowing down memcpy.
3. Adding a separate mempcpy.S implementation increases I-cache pressure, as
   now you need to load mempcpy too even if memcpy is already resident in L1/L2 cache.
4. GCC doesn't optimize/inline mempcpy as well as it does memcpy (see below).
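
To make the "inline a call to the most commonly used version" idea concrete, here is
a minimal sketch. The names sketch_mempcpy and sketch_concat are made up for this
illustration and are not the identifiers GLIBC uses; the only fact relied on is that
mempcpy returns dest + n where memcpy returns dest.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical name, for illustration only.  mempcpy behaves like memcpy
   but returns a pointer one past the last byte written, so it can be
   expressed as memcpy plus an addition.  */
static inline void *
sketch_mempcpy (void *dest, const void *src, size_t n)
{
  return (char *) memcpy (dest, src, n) + n;
}

/* Typical use: chain copies without recomputing the end pointer.
   The caller must supply a buffer large enough for both strings
   and the terminating NUL.  Returns the resulting string length.  */
static size_t
sketch_concat (char *out, const char *a, const char *b)
{
  char *p = out;
  p = sketch_mempcpy (p, a, strlen (a));
  p = sketch_mempcpy (p, b, strlen (b));
  *p = '\0';
  return (size_t) (p - out);
}
```

With a wrapper like this the compiler only ever sees a memcpy call, so whatever
inlining and expansion it already does for memcpy applies unchanged.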

> The crux of the argument is that the compiler may be able
> to do a better job of optimizing if it knows the call was
> a mempcpy as opposed to memcpy + addition.

No, unfortunately even GCC 6 optimizes memcpy better than mempcpy:

return __builtin_memcpy(x, y, 32);

        ldp     x4, x5, [x1]
        stp     x4, x5, [x0]
        ldp     x4, x5, [x1, 16]
        stp     x4, x5, [x0, 16]
        ret

return __builtin_mempcpy(x, y, 32);

        mov     x2, 32
        b       mempcpy

return mempcpy(x, y, 32);  // using GLIBC 2.23 inline

        mov     x2, x0
        add     x0, x0, 32
        ldp     x4, x5, [x1]
        stp     x4, x5, [x2]
        ldp     x4, x5, [x1, 16]
        stp     x4, x5, [x2, 16]
        ret
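
For anyone who wants to repeat the comparison, a test file along the following lines
should do. The function names and the file layout are assumptions made for this
sketch, and the third function only approximates the inline by writing memcpy plus
an addition by hand. Compiling with something like "gcc -O2 -S" for AArch64 and
reading the assembly is the intended use; the exact output will depend on the
compiler version and target.

```c
#include <stddef.h>

/* Case 1: plain memcpy with a small constant size.  */
void *
copy32_memcpy (void *x, const void *y)
{
  return __builtin_memcpy (x, y, 32);
}

/* Case 2: the mempcpy builtin with the same size.  */
void *
copy32_mempcpy (void *x, const void *y)
{
  return __builtin_mempcpy (x, y, 32);
}

/* Case 3: memcpy plus an addition, written out by hand
   (an approximation of what an inline wrapper expands to).  */
void *
copy32_memcpy_plus (void *x, const void *y)
{
  return (char *) __builtin_memcpy (x, y, 32) + 32;
}
```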

So the only case where I can see a clear win for mempcpy is if you can do a good
merged implementation and GCC is fixed to always optimize mempcpy in exactly the
same way as memcpy. In that case just define _HAVE_STRING_ARCH_mempcpy.
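
As a rough sketch of how such an opt-out can work: the header only supplies the
inline wrapper when the target has not defined the macro. Only the macro name
_HAVE_STRING_ARCH_mempcpy comes from this mail; SKETCH_MEMPCPY and
sketch_mempcpy_inline are hypothetical names, and the real header layout may differ.

```c
#include <stddef.h>
#include <string.h>

#ifndef _HAVE_STRING_ARCH_mempcpy
/* Default: no optimized target version, so route through memcpy.
   Hypothetical helper name, for illustration only.  */
static inline void *
sketch_mempcpy_inline (void *dest, const void *src, size_t n)
{
  return (char *) memcpy (dest, src, n) + n;
}
# define SKETCH_MEMPCPY(dest, src, n) sketch_mempcpy_inline (dest, src, n)
#else
/* The target says it has a good mempcpy of its own: call it directly.
   mempcpy is a GNU extension, so it is declared explicitly here.  */
void *mempcpy (void *dest, const void *src, size_t n);
# define SKETCH_MEMPCPY(dest, src, n) mempcpy (dest, src, n)
#endif
```

A target with a good merged implementation would define the macro in its own
string header before this point, and callers would not need to change.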

If we can get GCC to do the right thing depending on the preference of the target and
library, then things would be perfect.

Cheers,
Wilco

References:
- Review decision to inline mempcpy to memcpy.
  - From: Carlos O'Donell
