This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH] Add x86-64 memmove with unaligned load/store and rep movsb
- From: "H.J. Lu" <hjl dot tools at gmail dot com>
- To: "Carlos O'Donell" <carlos at redhat dot com>
- Cc: GNU C Library <libc-alpha at sourceware dot org>
- Date: Tue, 29 Mar 2016 11:04:26 -0700
- Subject: Re: [PATCH] Add x86-64 memmove with unaligned load/store and rep movsb
- References: <CAMe9rOopQ5rUGgH2vu9Xwe02Qw0UNrVNCNOAakiV7h0ukciMtQ at mail dot gmail dot com> <56FABE63 dot 2040705 at redhat dot com>
On Tue, Mar 29, 2016 at 10:41 AM, Carlos O'Donell <carlos@redhat.com> wrote:
> On 03/29/2016 12:58 PM, H.J. Lu wrote:
>> The goal of this patch is to replace SSE2 memcpy.S,
>> memcpy-avx-unaligned.S and memmove-avx-unaligned.S as well as
>> provide SSE2 memmove with faster alternatives. bench-memcpy and
>> bench-memmove data on various Intel and AMD processors are at
>>
>> https://sourceware.org/bugzilla/show_bug.cgi?id=19776
>>
>> Any comments, feedbacks?
>
> I assume this is a WIP? I don't see how this code replaces the memcpy@GLIBC_2.14
> IFUNC we're currently using, or redirects the IFUNC to use your new functions
> under certain conditions.
>
> For memcpy:
>
> * On ivybridge the new code regresses 9% mean performance versus AVX usage?
Ivy Bridge currently uses __memcpy_sse2_unaligned. The new one
will be __memcpy_sse2_unaligned_erms, not __memcpy_avx_unaligned_erms.
> * On penryn the new code regresses 18% mean performance versus SSE2 usage?
Penryn will stick with __memcpy_ssse3_back.
> * On bulldozer the new code regresses 18% mean performance versus AVX usage,
> and 3% versus SSE2 usage?
Bulldozer will stick with __memcpy_ssse3.
> This means that out of 11 hardware configurations the patch regresses 4
> of those configurations, while progressing 7. If all devices are of equal
> value, then this change is of mixed benefit.
>
> Which is a mean improvement of 14% in the cases which improved, and a mean
> degradation of 12% in the cases which had worse performance.
>
> This seems like a bad change for Ivybridge, Penryn, and Bulldozer.
>
> Can you explain the loss of performance in terms of the hardware that is
> impacted, why did it do worse?
>
> Is it possible to limit the change to those key architectures where the
> optimizations make a difference? Are you trying to avoid the maintenance
> burden of yet another set of optimized routines?
>
As I said, the new one will replace the old one. That is, the new SSE2/AVX
implementations replace the old SSE2/AVX ones. The patch won't change whether
SSE2, SSSE3 or AVX is chosen for a given processor.
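
To illustrate the point, here is a minimal sketch in C. It is not the actual
glibc IFUNC selector: the enum isa_class and both helper functions are
hypothetical names made up for this example, and the assumption that the
_erms variant is what gets picked within the SSE2 and AVX classes is taken
only from the replies above. The class a processor lands in is decided
elsewhere and stays the same; only the routine used inside the SSE2 and AVX
classes changes.

```c
#include <stdio.h>

/* Hypothetical ISA classes.  Which class a given processor falls into
   is decided elsewhere and is NOT changed by the patch.  */
enum isa_class
{
  ISA_SSE2_UNALIGNED,   /* e.g. Ivy Bridge, per the reply above */
  ISA_SSSE3,            /* e.g. Bulldozer */
  ISA_SSSE3_BACK,       /* e.g. Penryn */
  ISA_AVX_UNALIGNED
};

/* Routine name used for each class before the patch.  */
const char *
old_memcpy_name (enum isa_class c)
{
  switch (c)
    {
    case ISA_SSE2_UNALIGNED: return "__memcpy_sse2_unaligned";
    case ISA_SSSE3:          return "__memcpy_ssse3";
    case ISA_SSSE3_BACK:     return "__memcpy_ssse3_back";
    case ISA_AVX_UNALIGNED:  return "__memcpy_avx_unaligned";
    }
  return "unknown";
}

/* Routine name used for each class after the patch (assumed mapping).
   Only the SSE2 and AVX unaligned classes get a new routine; the
   SSSE3 classes fall through to the old choice.  */
const char *
new_memcpy_name (enum isa_class c)
{
  switch (c)
    {
    case ISA_SSE2_UNALIGNED: return "__memcpy_sse2_unaligned_erms";
    case ISA_AVX_UNALIGNED:  return "__memcpy_avx_unaligned_erms";
    default:                 return old_memcpy_name (c);
    }
}

/* Print the before/after routine for one class.  */
void
print_choice (const char *cpu, enum isa_class c)
{
  printf ("%s: %s -> %s\n", cpu, old_memcpy_name (c), new_memcpy_name (c));
}
```

Calling print_choice ("Penryn", ISA_SSSE3_BACK) shows the same name on both
sides, so the SSE2 and AVX comparisons for such processors do not describe
what they would actually run.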
--
H.J.