This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH] [BZ #18880] Wrong selector in x86_64/multiarch/memcpy.S
- From: Ondřej Bílka <neleai at seznam dot cz>
- To: "H.J. Lu" <hjl dot tools at gmail dot com>
- Cc: GNU C Library <libc-alpha at sourceware dot org>
- Date: Sat, 29 Aug 2015 09:52:38 +0200
- Subject: Re: [PATCH] [BZ #18880] Wrong selector in x86_64/multiarch/memcpy.S
- Authentication-results: sourceware.org; auth=none
- References: <20150828130553 dot GA14875 at gmail dot com>
On Fri, Aug 28, 2015 at 06:05:53AM -0700, H.J. Lu wrote:
> For x86-64 memcpy/mempcpy, we choose the best implementation by the
> order:
>
> 1. __memcpy_avx_unaligned if AVX_Fast_Unaligned_Load bit is set.
> 2. __memcpy_sse2_unaligned if Fast_Unaligned_Load bit is set.
> 3. __memcpy_sse2 if SSSE3 isn't available.
> 4. __memcpy_ssse3_back if Fast_Copy_Backward bit is set.
> 5. __memcpy_ssse3
>
> In libc.a and ld.so, we choose __memcpy_sse2_unaligned which is optimized
> for current Intel and AMD x86-64 processors.
>
> OK for master?
>
This patch contains several unrelated changes. The first is moving files to create the new
default, which looks OK but produces a large diff that hides the other changes.
The second is mempcpy support. I also had some patches that add it along with a better
memcpy; I could resend these. This one looks reasonable as it stands, but we
could do better: as mempcpy is rarely used, it should just set up the return
value and jump into memcpy.
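The point above is that mempcpy differs from memcpy only in its return value, so a thin entry that computes the return value and reuses the shared memcpy body avoids duplicating the whole routine. In C the equivalence is simply (a sketch; `my_mempcpy` is a hypothetical name, not a glibc symbol):

```c
#include <string.h>

/* mempcpy semantics: copy n bytes and return a pointer just past the
   end of the destination, i.e. memcpy's result plus n. */
void *my_mempcpy(void *dst, const void *src, size_t n)
{
    return (char *) memcpy(dst, src, n) + n;
}
```

In the assembly implementation the analogous trick is to set the return register in a small mempcpy prologue and then jump into the common memcpy body.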
The third is the ifunc selection. The problem is that what you do here is wrong. I had
a note in my todo list: fix the ssse3 memcpy and remove the ifunc hack.
There were some problems on Atom that I don't recall, but when I look at
the graphs, an sse2 implementation looks better up to around 400 bytes.
http://kam.mff.cuni.cz/~ondra/benchmark_string/atom/memcpy_profile/results_gcc/result.html
http://kam.mff.cuni.cz/~ondra/benchmark_string/atom/memcpy_profile/results_rand/result.html
When I tested it on Core 2, memcpy_ssse3 was slower than even memcpy_sse2 on around
half of the applications, so I wrote a separate patch to fix that performance regression.
So this is somewhat theoretical: while ssse3 is faster on longer inputs, sse2 and
sse2_unaligned are faster on shorter ones. Changing it now would
help some applications that mainly copy long inputs but harm the others.
Those __memcpy/mempcpy variants should just be deleted: we set that bit only
for i3/i5/i7, where we also set Fast_Unaligned_Load, so they are never used.
The same will apply to memmove once we check in my patch that implements an
unaligned memmove.
Finally, when SSSE3 is not available, memcpy_sse2_unaligned is again faster,
as the memcpy_sse2 name lies: it doesn't do sse2 moves, only 8-byte
ones, which makes it around 30% slower on larger inputs on the Phenom II that
I tested and also slower on the gcc workload. Again, I wrote a patch that fixes
this by adding a variant that does sse2 loads/stores with shifts. Then we
could drop the sse2 default.
I did a quick retest here:
http://kam.mff.cuni.cz/~ondra/benchmark_string/memcpy_profile_old.html
with this profiler, where for simplicity I used the memmove variants of the ssse3 routines:
http://kam.mff.cuni.cz/~ondra/benchmark_string/memcpy_profile_old290815.tar.bz2