This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
Re: [PATCH 4/4] S390: Implement mempcpy with help of memcpy. [BZ #19765]
- From: "H.J. Lu" <hjl dot tools at gmail dot com>
- To: Adhemerval Zanella <adhemerval dot zanella at linaro dot org>
- Cc: Wilco Dijkstra <Wilco dot Dijkstra at arm dot com>, nd <nd at arm dot com>, GNU C Library <libc-alpha at sourceware dot org>
- Date: Wed, 4 May 2016 11:20:49 -0700
- Subject: Re: [PATCH 4/4] S390: Implement mempcpy with help of memcpy. [BZ #19765]
- References: <AM3PR08MB00888058CAFD723F21D2342D837B0 at AM3PR08MB0088 dot eurprd08 dot prod dot outlook dot com> <572A3B9C dot 3080803 at linaro dot org>
On Wed, May 4, 2016 at 11:12 AM, Adhemerval Zanella
<adhemerval.zanella@linaro.org> wrote:
>
>
> On 04/05/2016 14:17, Wilco Dijkstra wrote:
>> Adhemerval Zanella wrote:
>>> Right, but I *think* the compiler would be smart enough to avoid the extra spilling.
>>> Take this example for instance [1]: using GCC 5.3 for s390x, I see no difference in
>>> the generated assembly when I compare the strategy I proposed (-DMEMPCPY_TO_MEMCPY)
>>> to the s390-specific one you are suggesting. In the end, I am proposing that
>>> architecture-specific micro-optimizations should be avoided in favor of a more
>>> generic one, especially those that tend to avoid one or two extra spills based on
>>> quite complex macro expansion. [1] http://pastie.org/10824072
>>
>> You need to use something like this to show the difference:
>>
>> return __mempcpy (__mempcpy (__mempcpy (p1, s, len), p2, 1), p3, 16);
>>
>> GCC doesn't even optimize mempcpy of constant size (PR70140), so if you do have
>> an optimized mempcpy like s390 here, you *still* need to use memcpy for small immediate
>> sizes (so they get inlined), and only use mempcpy for unknown or very large sizes.
>>
>> We end up having to do these header tricks because GCC doesn't implement mempcpy
>> as a first-class builtin or allow targets to defer to memcpy.
>>
>> There are similar issues with strchr (s, 0) being used instead of the faster strlen (s) + s.
>>
>> Wilco
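To illustrate the strchr remark quoted above, here is a minimal sketch of the equivalence it relies on. The helper name end_of_string is hypothetical and is not a glibc function; the sketch only shows that s + strlen (s) yields the same pointer as strchr (s, 0), without claiming anything about how glibc or GCC implement either call.

```c
#include <string.h>

/* Hypothetical helper (name chosen for this sketch only): return a
   pointer to the terminating NUL byte of S.  This is the same pointer
   that strchr (s, 0) returns, computed with strlen instead.  */
static const char *
end_of_string (const char *s)
{
  return s + strlen (s);
}
```

Both forms return a pointer to the terminator, so a header or compiler could in principle substitute one for the other.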
>
> But my point is that every architecture which provides an optimized mempcpy does
> so by either 1. jumping directly to the optimized memcpy (the s390 case in this patchset),
> 2. cloning the same memcpy implementation and adjusting the pointers (x86_64) or
X86-64 doesn't do that after

commit c365e615f7429aee302f8af7bf07ae262278febb
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Mon Mar 28 13:13:36 2016 -0700

    Implement x86-64 multiarch mempcpy in memcpy

    Implement x86-64 multiarch mempcpy in memcpy to share most of code.  It
    reduces code size of libc.so.

        [BZ #18858]

> 3. using a similar strategy for both implementations (powerpc).
>
> So for the change I am proposing, compiler support won't be required because both
> memcpy and __mempcpy will be transformed to memcpy + s. Based on the assumption that
> memcpy is as fast as the mempcpy implementation, I think there is no need to add
> this micro-optimization to s390 only; rather, it should be made general.
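For readers following the thread, the generic strategy discussed above (expressing mempcpy as memcpy plus the length) can be sketched roughly as follows. The names generic_mempcpy and build are assumptions made for this illustration; this is not the actual glibc header code or the contents of the patch, and the chained call only mirrors the shape of the example quoted earlier.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for mempcpy: copy N bytes with memcpy and
   return a pointer just past the last byte written.  */
static inline void *
generic_mempcpy (void *dest, const void *src, size_t n)
{
  return (char *) memcpy (dest, src, n) + n;
}

/* Chained use in the shape of the quoted example: a copy of unknown
   length LEN, then 1 byte from P2, then 16 bytes from P3.  Returns a
   pointer past the end of the data written to OUT.  */
static char *
build (char *out, const char *s, size_t len,
       const char *p2, const char *p3)
{
  return generic_mempcpy (generic_mempcpy (generic_mempcpy (out, s, len),
                                           p2, 1),
                          p3, 16);
}
```

With this form, the small constant-size copies are plain memcpy calls that the compiler can inline, while the returned pointer is just an addition on top.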
--
H.J.