This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
[PING][PATCH neleai/string-x64] Improve strcpy sse2 and avx2 implementation
- From: Ondřej Bílka <neleai at seznam dot cz>
- To: libc-alpha at sourceware dot org
- Date: Wed, 24 Jun 2015 10:13:31 +0200
- Subject: [PING][PATCH neleai/string-x64] Improve strcpy sse2 and avx2 implementation
- Authentication-results: sourceware.org; auth=none
- References: <20150617180105 dot GA26497 at domone>
On Wed, Jun 17, 2015 at 08:01:05PM +0200, Ondřej Bílka wrote:
> Hi,
>
> I wrote a new strcpy for x64, and for some reason I thought that I had
> committed it and forgot to ping it.
>
> As there are other routines that I could improve, I will use the branch
> neleai/string-x64 to collect them.
>
> Here is a revised version of what I submitted in 2013. The main change is
> that I now target the i7 instead of the core2. That simplifies things, as
> unaligned loads are cheap rather than a bit slower than aligned ones as on
> the core2. That mainly concerns the header: on the core2 you could get
> better performance by aligning loads or stores to 16 bytes after the first
> bytes were read. I do not know which is better; I would need to test it.
>
> That also makes an ssse3 variant less important to support. I could send
> one, but it was an item on my TODO list that has now probably lost
> importance. The problem is that on x64, to align with ssse3 or with sse2
> plus shifts, you need 16 loops, one for each alignment, as there is no
> variable shift. It also needs a jump table, which is very expensive.
> For strcpy that is dubious, as it increases instruction cache pressure
> and most copies are small. You would need to switch from unaligned
> loads to aligned ones, and I would need to do profiling to select the
> correct threshold.
>
> If somebody is interested in optimizing old pentium4 or athlon64 machines,
> I will provide an ssse3 variant that is also 50% faster than the current
> one. That is also the reason why I omitted plotting the current ssse3
> implementation's performance.
>
>
> In this version the header first checks 128 bytes unaligned, unless they
> cross a page boundary. That allows a more effective loop: at the end of
> the loop we can simply write the last 64 bytes instead of special-casing
> to avoid writing before the start.
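The page-boundary guard described above corresponds to the `andl $4095, %edx; cmp $3968, %edx; ja L(cross_page)` sequence in the patch. A minimal C sketch of the idea, assuming 4096-byte pages (the function name is illustrative, not from the patch):

```c
#include <assert.h>
#include <stdint.h>

/* A 128-byte unaligned read starting at s stays within the current
   4096-byte page iff the page offset of s is at most 4096 - 128 = 3968.
   Offsets above 3968 would read past the page, so they take the
   cross-page slow path instead.  */
static int can_read_128_unaligned(const char *s)
{
    return ((uintptr_t) s & 4095) <= 3968;
}
```

This mirrors the assembly exactly: the `and` extracts the page offset, and the `cmp`/`ja` pair branches to `L(cross_page)` only when the offset exceeds 3968.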
>
> I tried several variants of the header; as we first read 16 bytes into
> the xmm0 register, the question is whether they could be reused. I used
> an evolver to select the best variant; there was almost no difference
> in performance between them.
>
> Now I do checks for bytes 0-15, then 16-31, then 32-63, then 64-127.
> There is a possibility of gaining some cycles with a different grouping;
> I will post an improvement later if I find something.
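The 64-127 byte check in the patch combines four `pcmpeqb`/`pmovmskb` results into a single 64-bit mask with shifts, then takes `bsf` to find the terminator. A hedged sketch of that step using SSE2 intrinsics (the function name is illustrative; `__builtin_ctzll` assumes GCC or Clang):

```c
#include <emmintrin.h>
#include <stdint.h>
#include <string.h>

/* Return the index of the first NUL byte in the 64 bytes at p, or -1
   if there is none.  Each pmovmskb contributes 16 bits, shifted into
   its position within a 64-bit mask, so one ctz finds the NUL.  */
static int first_nul_in_64(const unsigned char *p)
{
    __m128i z = _mm_setzero_si128();
    uint64_t m0 = (uint16_t) _mm_movemask_epi8(
        _mm_cmpeq_epi8(_mm_loadu_si128((const __m128i *) p), z));
    uint64_t m1 = (uint16_t) _mm_movemask_epi8(
        _mm_cmpeq_epi8(_mm_loadu_si128((const __m128i *) (p + 16)), z));
    uint64_t m2 = (uint16_t) _mm_movemask_epi8(
        _mm_cmpeq_epi8(_mm_loadu_si128((const __m128i *) (p + 32)), z));
    uint64_t m3 = (uint16_t) _mm_movemask_epi8(
        _mm_cmpeq_epi8(_mm_loadu_si128((const __m128i *) (p + 48)), z));
    uint64_t m = m0 | (m1 << 16) | (m2 << 32) | (m3 << 48);
    return m ? __builtin_ctzll(m) : -1;
}
```

In the patch the same combination appears as the `shl $16/$32/$48` and `or` sequence before `bsf %rdx, %rdx` in the `L(more64bytes)` path.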
>
>
> The first problem was reading ahead. Rereading 8 bytes looked a bit
> faster than a move from an xmm register.
>
> Then I tried deciding when to reuse versus reread. In the 4-7 byte case
> it was faster to reread than to use bit shifts to get the second half.
> For 1-3 bytes I use the following copy, with s[0] and s[1] taken from
> the rdx register with byte shifts.
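The 4-7 byte case rereads with two overlapping 4-byte loads, one from the start and one ending at the terminator, as in the patch's `L(4bytes_from_cross)` path. A C sketch of the trick, assuming the NUL is at index i with 3 <= i <= 6 (the function name is illustrative):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Copy a string of total size i+1 bytes (NUL at index i, 3 <= i <= 6)
   with two possibly overlapping 4-byte moves: bytes [0,4) and bytes
   [i-3, i+1).  Together they cover every byte including the NUL.  */
static void copy_4_7(char *d, const char *s, size_t i)
{
    uint32_t head, tail;
    memcpy(&head, s, 4);
    memcpy(&tail, s + i - 3, 4);
    memcpy(d, &head, 4);
    memcpy(d + i - 3, &tail, 4);
}
```

This matches the assembly's `mov -3(%rsi, %rcx), %esi` / `mov %edx, (%rdi)` / `mov %esi, -3(%rdi, %rcx)` sequence, where %rcx holds the NUL index.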
>
> Test a branch versus this branchless variant that works for i = 0, 1, 2:
> d[i] = 0;
> d[i/2] = s[1];
> d[0] = s[0];
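The three stores above can be checked directly in C. A sketch (the function name is illustrative; reading s[1] when i == 0 is safe in the original assembly because the source bytes were already loaded into rdx):

```c
#include <assert.h>
#include <stddef.h>

/* Branchless copy of a string whose NUL is at index i, for i = 0, 1, 2.
   For i = 0, d[0] is written three times and ends up 0; for i = 1,
   d[0]/d[1] get s[0]/0; for i = 2, all three stores land distinctly.  */
static void copy_upto_3(char *d, const char *s, size_t i)
{
    d[i] = 0;
    d[i / 2] = s[1];
    d[0] = s[0];
}
```

Walking the cases: i = 2 gives d[2]=0, d[1]=s[1], d[0]=s[0]; i = 1 gives d[1]=0, d[0]=s[1] then d[0]=s[0]; i = 0 overwrites d[0] with s[1] and finally with s[0], which is 0.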
>
> I also added an avx2 loop. The reason I did not use avx2 in the header
> was its high latency. I could test whether using it for bytes 64-127
> would give a speedup.
>
> As technical issues go, I needed to move the old strcpy_sse2_unaligned
> implementation into strncpy_sse2_unaligned, as strncpy is a function
> that should be optimized for size, not performance. For now I will keep
> these unchanged.
>
> As for performance, these are 15%-30% faster than the current one for a
> gcc workload on haswell and ivy bridge.
>
> As for the avx2 version, it is currently 6% faster on this workload,
> mainly because the workload is bash and has a lot of large loads, so
> the avx2 loop helps.
>
> I used my profiler to show the improvement; see here:
>
> http://kam.mff.cuni.cz/~ondra/benchmark_string/strcpy_profile.html
>
> and the source is here:
>
> http://kam.mff.cuni.cz/~ondra/benchmark_string/strcpy_profile170615.tar.bz2
>
> Comments?
>
> * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list):
> Add __strcpy_avx2 and __stpcpy_avx2.
> * sysdeps/x86_64/multiarch/Makefile (routines): Add stpcpy-avx2.S and
> strcpy-avx2.S.
> * sysdeps/x86_64/multiarch/stpcpy-avx2.S: New file.
> * sysdeps/x86_64/multiarch/strcpy-avx2.S: Likewise.
> * sysdeps/x86_64/multiarch/stpcpy-sse2-unaligned.S: Refactored
> implementation.
> * sysdeps/x86_64/multiarch/strcpy.S: Updated ifunc.
> * sysdeps/x86_64/multiarch/strncpy.S: Moved from strcpy.S.
> * sysdeps/x86_64/multiarch/strncpy-sse2-unaligned.S: Moved
> strcpy-sse2-unaligned.S here.
> * sysdeps/x86_64/multiarch/strcpy-sse2-unaligned.S: Likewise.
> * sysdeps/x86_64/multiarch/stpncpy-sse2-unaligned.S: Redirect
> from strcpy-sse2-unaligned.S to strncpy-sse2-unaligned.S.
> * sysdeps/x86_64/multiarch/stpncpy.S: Likewise.
> * sysdeps/x86_64/multiarch/strcat-sse2-unaligned.S: Likewise.
>
> ---
> sysdeps/x86_64/multiarch/Makefile | 2 +-
> sysdeps/x86_64/multiarch/ifunc-impl-list.c | 2 +
> sysdeps/x86_64/multiarch/stpcpy-avx2.S | 3 +
> sysdeps/x86_64/multiarch/stpcpy-sse2-unaligned.S | 439 ++++-
> sysdeps/x86_64/multiarch/stpncpy-sse2-unaligned.S | 3 +-
> sysdeps/x86_64/multiarch/stpncpy.S | 5 +-
> sysdeps/x86_64/multiarch/strcat-sse2-unaligned.S | 2 +-
> sysdeps/x86_64/multiarch/strcpy-avx2.S | 4 +
> sysdeps/x86_64/multiarch/strcpy-sse2-unaligned.S | 1890 +-------------------
> sysdeps/x86_64/multiarch/strcpy.S | 22 +-
> sysdeps/x86_64/multiarch/strncpy-sse2-unaligned.S | 1891 ++++++++++++++++++++-
> sysdeps/x86_64/multiarch/strncpy.S | 88 +-
> 14 files changed, 2435 insertions(+), 1921 deletions(-)
> create mode 100644 sysdeps/x86_64/multiarch/stpcpy-avx2.S
> create mode 100644 sysdeps/x86_64/multiarch/strcpy-avx2.S
>
>
> diff --git a/sysdeps/x86_64/multiarch/Makefile b/sysdeps/x86_64/multiarch/Makefile
> index d7002a9..c573744 100644
> --- a/sysdeps/x86_64/multiarch/Makefile
> +++ b/sysdeps/x86_64/multiarch/Makefile
> @@ -29,7 +29,7 @@ CFLAGS-strspn-c.c += -msse4
> endif
>
> ifeq (yes,$(config-cflags-avx2))
> -sysdep_routines += memset-avx2
> +sysdep_routines += memset-avx2 strcpy-avx2 stpcpy-avx2
> endif
> endif
>
> diff --git a/sysdeps/x86_64/multiarch/ifunc-impl-list.c b/sysdeps/x86_64/multiarch/ifunc-impl-list.c
> index b64e4f1..d398e43 100644
> --- a/sysdeps/x86_64/multiarch/ifunc-impl-list.c
> +++ b/sysdeps/x86_64/multiarch/ifunc-impl-list.c
> @@ -88,6 +88,7 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
>
> /* Support sysdeps/x86_64/multiarch/stpcpy.S. */
> IFUNC_IMPL (i, name, stpcpy,
> + IFUNC_IMPL_ADD (array, i, stpcpy, HAS_AVX2, __stpcpy_avx2)
> IFUNC_IMPL_ADD (array, i, stpcpy, HAS_SSSE3, __stpcpy_ssse3)
> IFUNC_IMPL_ADD (array, i, stpcpy, 1, __stpcpy_sse2_unaligned)
> IFUNC_IMPL_ADD (array, i, stpcpy, 1, __stpcpy_sse2))
> @@ -137,6 +138,7 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
>
> /* Support sysdeps/x86_64/multiarch/strcpy.S. */
> IFUNC_IMPL (i, name, strcpy,
> + IFUNC_IMPL_ADD (array, i, strcpy, HAS_AVX2, __strcpy_avx2)
> IFUNC_IMPL_ADD (array, i, strcpy, HAS_SSSE3, __strcpy_ssse3)
> IFUNC_IMPL_ADD (array, i, strcpy, 1, __strcpy_sse2_unaligned)
> IFUNC_IMPL_ADD (array, i, strcpy, 1, __strcpy_sse2))
> diff --git a/sysdeps/x86_64/multiarch/stpcpy-avx2.S b/sysdeps/x86_64/multiarch/stpcpy-avx2.S
> new file mode 100644
> index 0000000..bd30ef6
> --- /dev/null
> +++ b/sysdeps/x86_64/multiarch/stpcpy-avx2.S
> @@ -0,0 +1,3 @@
> +#define USE_AVX2
> +#define STPCPY __stpcpy_avx2
> +#include "stpcpy-sse2-unaligned.S"
> diff --git a/sysdeps/x86_64/multiarch/stpcpy-sse2-unaligned.S b/sysdeps/x86_64/multiarch/stpcpy-sse2-unaligned.S
> index 34231f8..695a236 100644
> --- a/sysdeps/x86_64/multiarch/stpcpy-sse2-unaligned.S
> +++ b/sysdeps/x86_64/multiarch/stpcpy-sse2-unaligned.S
> @@ -1,3 +1,436 @@
> -#define USE_AS_STPCPY
> -#define STRCPY __stpcpy_sse2_unaligned
> -#include "strcpy-sse2-unaligned.S"
> +/* stpcpy with SSE2 and unaligned load
> + Copyright (C) 2015 Free Software Foundation, Inc.
> + This file is part of the GNU C Library.
> +
> + The GNU C Library is free software; you can redistribute it and/or
> + modify it under the terms of the GNU Lesser General Public
> + License as published by the Free Software Foundation; either
> + version 2.1 of the License, or (at your option) any later version.
> +
> + The GNU C Library is distributed in the hope that it will be useful,
> + but WITHOUT ANY WARRANTY; without even the implied warranty of
> + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
> + Lesser General Public License for more details.
> +
> + You should have received a copy of the GNU Lesser General Public
> + License along with the GNU C Library; if not, see
> + <http://www.gnu.org/licenses/>. */
> +
> +#include <sysdep.h>
> +#ifndef STPCPY
> +# define STPCPY __stpcpy_sse2_unaligned
> +#endif
> +
> +ENTRY(STPCPY)
> + mov %esi, %edx
> +#ifdef AS_STRCPY
> + movq %rdi, %rax
> +#endif
> + pxor %xmm4, %xmm4
> + pxor %xmm5, %xmm5
> + andl $4095, %edx
> + cmp $3968, %edx
> + ja L(cross_page)
> +
> + movdqu (%rsi), %xmm0
> + pcmpeqb %xmm0, %xmm4
> + pmovmskb %xmm4, %edx
> + testl %edx, %edx
> + je L(more16bytes)
> + bsf %edx, %ecx
> +#ifndef AS_STRCPY
> + lea (%rdi, %rcx), %rax
> +#endif
> + cmp $7, %ecx
> + movq (%rsi), %rdx
> + jb L(less_8_bytesb)
> +L(8bytes_from_cross):
> + movq -7(%rsi, %rcx), %rsi
> + movq %rdx, (%rdi)
> +#ifdef AS_STRCPY
> + movq %rsi, -7(%rdi, %rcx)
> +#else
> + movq %rsi, -7(%rax)
> +#endif
> + ret
> +
> + .p2align 4
> +L(less_8_bytesb):
> + cmp $2, %ecx
> + jbe L(less_4_bytes)
> +L(4bytes_from_cross):
> + mov -3(%rsi, %rcx), %esi
> + mov %edx, (%rdi)
> +#ifdef AS_STRCPY
> + mov %esi, -3(%rdi, %rcx)
> +#else
> + mov %esi, -3(%rax)
> +#endif
> + ret
> +
> +.p2align 4
> + L(less_4_bytes):
> + /*
> + Test branch vs this branchless that works for i 0,1,2
> + d[i] = 0;
> + d[i/2] = s[1];
> + d[0] = s[0];
> + */
> +#ifdef AS_STRCPY
> + movb $0, (%rdi, %rcx)
> +#endif
> +
> + shr $1, %ecx
> + mov %edx, %esi
> + shr $8, %edx
> + movb %dl, (%rdi, %rcx)
> +#ifndef AS_STRCPY
> + movb $0, (%rax)
> +#endif
> + movb %sil, (%rdi)
> + ret
> +
> +
> +
> +
> +
> + .p2align 4
> +L(more16bytes):
> + pxor %xmm6, %xmm6
> + movdqu 16(%rsi), %xmm1
> + pxor %xmm7, %xmm7
> + pcmpeqb %xmm1, %xmm5
> + pmovmskb %xmm5, %edx
> + testl %edx, %edx
> + je L(more32bytes)
> + bsf %edx, %edx
> +#ifdef AS_STRCPY
> + movdqu 1(%rsi, %rdx), %xmm1
> + movdqu %xmm0, (%rdi)
> + movdqu %xmm1, 1(%rdi, %rdx)
> +#else
> + lea 16(%rdi, %rdx), %rax
> + movdqu 1(%rsi, %rdx), %xmm1
> + movdqu %xmm0, (%rdi)
> + movdqu %xmm1, -15(%rax)
> +#endif
> + ret
> +
> + .p2align 4
> +L(more32bytes):
> + movdqu 32(%rsi), %xmm2
> + movdqu 48(%rsi), %xmm3
> +
> + pcmpeqb %xmm2, %xmm6
> + pcmpeqb %xmm3, %xmm7
> + pmovmskb %xmm7, %edx
> + shl $16, %edx
> + pmovmskb %xmm6, %ecx
> + or %ecx, %edx
> + je L(more64bytes)
> + bsf %edx, %edx
> +#ifndef AS_STRCPY
> + lea 32(%rdi, %rdx), %rax
> +#endif
> + movdqu 1(%rsi, %rdx), %xmm2
> + movdqu 17(%rsi, %rdx), %xmm3
> + movdqu %xmm0, (%rdi)
> + movdqu %xmm1, 16(%rdi)
> +#ifdef AS_STRCPY
> + movdqu %xmm2, 1(%rdi, %rdx)
> + movdqu %xmm3, 17(%rdi, %rdx)
> +#else
> + movdqu %xmm2, -31(%rax)
> + movdqu %xmm3, -15(%rax)
> +#endif
> + ret
> +
> + .p2align 4
> +L(more64bytes):
> + movdqu %xmm0, (%rdi)
> + movdqu %xmm1, 16(%rdi)
> + movdqu %xmm2, 32(%rdi)
> + movdqu %xmm3, 48(%rdi)
> + movdqu 64(%rsi), %xmm0
> + movdqu 80(%rsi), %xmm1
> + movdqu 96(%rsi), %xmm2
> + movdqu 112(%rsi), %xmm3
> +
> + pcmpeqb %xmm0, %xmm4
> + pcmpeqb %xmm1, %xmm5
> + pcmpeqb %xmm2, %xmm6
> + pcmpeqb %xmm3, %xmm7
> + pmovmskb %xmm4, %ecx
> + pmovmskb %xmm5, %edx
> + pmovmskb %xmm6, %r8d
> + pmovmskb %xmm7, %r9d
> + shl $16, %edx
> + or %ecx, %edx
> + shl $32, %r8
> + shl $48, %r9
> + or %r8, %rdx
> + or %r9, %rdx
> + test %rdx, %rdx
> + je L(prepare_loop)
> + bsf %rdx, %rdx
> +#ifndef AS_STRCPY
> + lea 64(%rdi, %rdx), %rax
> +#endif
> + movdqu 1(%rsi, %rdx), %xmm0
> + movdqu 17(%rsi, %rdx), %xmm1
> + movdqu 33(%rsi, %rdx), %xmm2
> + movdqu 49(%rsi, %rdx), %xmm3
> +#ifdef AS_STRCPY
> + movdqu %xmm0, 1(%rdi, %rdx)
> + movdqu %xmm1, 17(%rdi, %rdx)
> + movdqu %xmm2, 33(%rdi, %rdx)
> + movdqu %xmm3, 49(%rdi, %rdx)
> +#else
> + movdqu %xmm0, -63(%rax)
> + movdqu %xmm1, -47(%rax)
> + movdqu %xmm2, -31(%rax)
> + movdqu %xmm3, -15(%rax)
> +#endif
> + ret
> +
> +
> + .p2align 4
> +L(prepare_loop):
> + movdqu %xmm0, 64(%rdi)
> + movdqu %xmm1, 80(%rdi)
> + movdqu %xmm2, 96(%rdi)
> + movdqu %xmm3, 112(%rdi)
> +
> + subq %rsi, %rdi
> + add $64, %rsi
> + andq $-64, %rsi
> + addq %rsi, %rdi
> + jmp L(loop_entry)
> +
> +#ifdef USE_AVX2
> + .p2align 4
> +L(loop):
> + vmovdqu %ymm1, (%rdi)
> + vmovdqu %ymm3, 32(%rdi)
> +L(loop_entry):
> + vmovdqa 96(%rsi), %ymm3
> + vmovdqa 64(%rsi), %ymm1
> + vpminub %ymm3, %ymm1, %ymm2
> + addq $64, %rsi
> + addq $64, %rdi
> + vpcmpeqb %ymm5, %ymm2, %ymm0
> + vpmovmskb %ymm0, %edx
> + test %edx, %edx
> + je L(loop)
> + salq $32, %rdx
> + vpcmpeqb %ymm5, %ymm1, %ymm4
> + vpmovmskb %ymm4, %ecx
> + or %rcx, %rdx
> + bsfq %rdx, %rdx
> +#ifndef AS_STRCPY
> + lea (%rdi, %rdx), %rax
> +#endif
> + vmovdqu -63(%rsi, %rdx), %ymm0
> + vmovdqu -31(%rsi, %rdx), %ymm2
> +#ifdef AS_STRCPY
> + vmovdqu %ymm0, -63(%rdi, %rdx)
> + vmovdqu %ymm2, -31(%rdi, %rdx)
> +#else
> + vmovdqu %ymm0, -63(%rax)
> + vmovdqu %ymm2, -31(%rax)
> +#endif
> + vzeroupper
> + ret
> +#else
> + .p2align 4
> +L(loop):
> + movdqu %xmm1, (%rdi)
> + movdqu %xmm2, 16(%rdi)
> + movdqu %xmm3, 32(%rdi)
> + movdqu %xmm4, 48(%rdi)
> +L(loop_entry):
> + movdqa 96(%rsi), %xmm3
> + movdqa 112(%rsi), %xmm4
> + movdqa %xmm3, %xmm0
> + movdqa 80(%rsi), %xmm2
> + pminub %xmm4, %xmm0
> + movdqa 64(%rsi), %xmm1
> + pminub %xmm2, %xmm0
> + pminub %xmm1, %xmm0
> + addq $64, %rsi
> + addq $64, %rdi
> + pcmpeqb %xmm5, %xmm0
> + pmovmskb %xmm0, %edx
> + test %edx, %edx
> + je L(loop)
> + salq $48, %rdx
> + pcmpeqb %xmm1, %xmm5
> + pcmpeqb %xmm2, %xmm6
> + pmovmskb %xmm5, %ecx
> +#ifdef AS_STRCPY
> + pmovmskb %xmm6, %r8d
> + pcmpeqb %xmm3, %xmm7
> + pmovmskb %xmm7, %r9d
> + sal $16, %r8d
> + or %r8d, %ecx
> +#else
> + pmovmskb %xmm6, %eax
> + pcmpeqb %xmm3, %xmm7
> + pmovmskb %xmm7, %r9d
> + sal $16, %eax
> + or %eax, %ecx
> +#endif
> + salq $32, %r9
> + orq %rcx, %rdx
> + orq %r9, %rdx
> + bsfq %rdx, %rdx
> +#ifndef AS_STRCPY
> + lea (%rdi, %rdx), %rax
> +#endif
> + movdqu -63(%rsi, %rdx), %xmm0
> + movdqu -47(%rsi, %rdx), %xmm1
> + movdqu -31(%rsi, %rdx), %xmm2
> + movdqu -15(%rsi, %rdx), %xmm3
> +#ifdef AS_STRCPY
> + movdqu %xmm0, -63(%rdi, %rdx)
> + movdqu %xmm1, -47(%rdi, %rdx)
> + movdqu %xmm2, -31(%rdi, %rdx)
> + movdqu %xmm3, -15(%rdi, %rdx)
> +#else
> + movdqu %xmm0, -63(%rax)
> + movdqu %xmm1, -47(%rax)
> + movdqu %xmm2, -31(%rax)
> + movdqu %xmm3, -15(%rax)
> +#endif
> + ret
> +#endif
> +
> + .p2align 4
> +L(cross_page):
> + movq %rsi, %rcx
> + pxor %xmm0, %xmm0
> + and $15, %ecx
> + movq %rsi, %r9
> + movq %rdi, %r10
> + subq %rcx, %rsi
> + subq %rcx, %rdi
> + movdqa (%rsi), %xmm1
> + pcmpeqb %xmm0, %xmm1
> + pmovmskb %xmm1, %edx
> + shr %cl, %edx
> + shl %cl, %edx
> + test %edx, %edx
> + jne L(less_32_cross)
> +
> + addq $16, %rsi
> + addq $16, %rdi
> + movdqa (%rsi), %xmm1
> + pcmpeqb %xmm1, %xmm0
> + pmovmskb %xmm0, %edx
> + test %edx, %edx
> + jne L(less_32_cross)
> + movdqu %xmm1, (%rdi)
> +
> + movdqu (%r9), %xmm0
> + movdqu %xmm0, (%r10)
> +
> + mov $8, %rcx
> +L(cross_loop):
> + addq $16, %rsi
> + addq $16, %rdi
> + pxor %xmm0, %xmm0
> + movdqa (%rsi), %xmm1
> + pcmpeqb %xmm1, %xmm0
> + pmovmskb %xmm0, %edx
> + test %edx, %edx
> + jne L(return_cross)
> + movdqu %xmm1, (%rdi)
> + sub $1, %rcx
> + ja L(cross_loop)
> +
> + pxor %xmm5, %xmm5
> + pxor %xmm6, %xmm6
> + pxor %xmm7, %xmm7
> +
> + lea -64(%rsi), %rdx
> + andq $-64, %rdx
> + addq %rdx, %rdi
> + subq %rsi, %rdi
> + movq %rdx, %rsi
> + jmp L(loop_entry)
> +
> + .p2align 4
> +L(return_cross):
> + bsf %edx, %edx
> +#ifdef AS_STRCPY
> + movdqu -15(%rsi, %rdx), %xmm0
> + movdqu %xmm0, -15(%rdi, %rdx)
> +#else
> + lea (%rdi, %rdx), %rax
> + movdqu -15(%rsi, %rdx), %xmm0
> + movdqu %xmm0, -15(%rax)
> +#endif
> + ret
> +
> + .p2align 4
> +L(less_32_cross):
> + bsf %rdx, %rdx
> + lea (%rdi, %rdx), %rcx
> +#ifndef AS_STRCPY
> + mov %rcx, %rax
> +#endif
> + mov %r9, %rsi
> + mov %r10, %rdi
> + sub %rdi, %rcx
> + cmp $15, %ecx
> + jb L(less_16_cross)
> + movdqu (%rsi), %xmm0
> + movdqu -15(%rsi, %rcx), %xmm1
> + movdqu %xmm0, (%rdi)
> +#ifdef AS_STRCPY
> + movdqu %xmm1, -15(%rdi, %rcx)
> +#else
> + movdqu %xmm1, -15(%rax)
> +#endif
> + ret
> +
> +L(less_16_cross):
> + cmp $7, %ecx
> + jb L(less_8_bytes_cross)
> + movq (%rsi), %rdx
> + jmp L(8bytes_from_cross)
> +
> +L(less_8_bytes_cross):
> + cmp $2, %ecx
> + jbe L(3_bytes_cross)
> + mov (%rsi), %edx
> + jmp L(4bytes_from_cross)
> +
> +L(3_bytes_cross):
> + jb L(1_2bytes_cross)
> + movzwl (%rsi), %edx
> + jmp L(_3_bytesb)
> +
> +L(1_2bytes_cross):
> + movb (%rsi), %dl
> + jmp L(0_2bytes_from_cross)
> +
> + .p2align 4
> +L(less_4_bytesb):
> + je L(_3_bytesb)
> +L(0_2bytes_from_cross):
> + movb %dl, (%rdi)
> +#ifdef AS_STRCPY
> + movb $0, (%rdi, %rcx)
> +#else
> + movb $0, (%rax)
> +#endif
> + ret
> +
> + .p2align 4
> +L(_3_bytesb):
> + movw %dx, (%rdi)
> + movb $0, 2(%rdi)
> + ret
> +
> +END(STPCPY)
> diff --git a/sysdeps/x86_64/multiarch/stpncpy-sse2-unaligned.S b/sysdeps/x86_64/multiarch/stpncpy-sse2-unaligned.S
> index 658520f..3f35068 100644
> --- a/sysdeps/x86_64/multiarch/stpncpy-sse2-unaligned.S
> +++ b/sysdeps/x86_64/multiarch/stpncpy-sse2-unaligned.S
> @@ -1,4 +1,3 @@
> #define USE_AS_STPCPY
> -#define USE_AS_STRNCPY
> #define STRCPY __stpncpy_sse2_unaligned
> -#include "strcpy-sse2-unaligned.S"
> +#include "strncpy-sse2-unaligned.S"
> diff --git a/sysdeps/x86_64/multiarch/stpncpy.S b/sysdeps/x86_64/multiarch/stpncpy.S
> index 2698ca6..159604a 100644
> --- a/sysdeps/x86_64/multiarch/stpncpy.S
> +++ b/sysdeps/x86_64/multiarch/stpncpy.S
> @@ -1,8 +1,7 @@
> /* Multiple versions of stpncpy
> All versions must be listed in ifunc-impl-list.c. */
> -#define STRCPY __stpncpy
> +#define STRNCPY __stpncpy
> #define USE_AS_STPCPY
> -#define USE_AS_STRNCPY
> -#include "strcpy.S"
> +#include "strncpy.S"
>
> weak_alias (__stpncpy, stpncpy)
> diff --git a/sysdeps/x86_64/multiarch/strcat-sse2-unaligned.S b/sysdeps/x86_64/multiarch/strcat-sse2-unaligned.S
> index 81f1b40..1faa49d 100644
> --- a/sysdeps/x86_64/multiarch/strcat-sse2-unaligned.S
> +++ b/sysdeps/x86_64/multiarch/strcat-sse2-unaligned.S
> @@ -275,5 +275,5 @@ L(StartStrcpyPart):
> # define USE_AS_STRNCPY
> # endif
>
> -# include "strcpy-sse2-unaligned.S"
> +# include "strncpy-sse2-unaligned.S"
> #endif
> diff --git a/sysdeps/x86_64/multiarch/strcpy-avx2.S b/sysdeps/x86_64/multiarch/strcpy-avx2.S
> new file mode 100644
> index 0000000..a3133a4
> --- /dev/null
> +++ b/sysdeps/x86_64/multiarch/strcpy-avx2.S
> @@ -0,0 +1,4 @@
> +#define USE_AVX2
> +#define AS_STRCPY
> +#define STPCPY __strcpy_avx2
> +#include "stpcpy-sse2-unaligned.S"
> diff --git a/sysdeps/x86_64/multiarch/strcpy-sse2-unaligned.S b/sysdeps/x86_64/multiarch/strcpy-sse2-unaligned.S
> index 8f03d1d..310e4fa 100644
> --- a/sysdeps/x86_64/multiarch/strcpy-sse2-unaligned.S
> +++ b/sysdeps/x86_64/multiarch/strcpy-sse2-unaligned.S
> @@ -1,1887 +1,3 @@
> -/* strcpy with SSE2 and unaligned load
> - Copyright (C) 2011-2015 Free Software Foundation, Inc.
> - Contributed by Intel Corporation.
> - This file is part of the GNU C Library.
> -
> - The GNU C Library is free software; you can redistribute it and/or
> - modify it under the terms of the GNU Lesser General Public
> - License as published by the Free Software Foundation; either
> - version 2.1 of the License, or (at your option) any later version.
> -
> - The GNU C Library is distributed in the hope that it will be useful,
> - but WITHOUT ANY WARRANTY; without even the implied warranty of
> - MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
> - Lesser General Public License for more details.
> -
> - You should have received a copy of the GNU Lesser General Public
> - License along with the GNU C Library; if not, see
> - <http://www.gnu.org/licenses/>. */
> -
> -#if IS_IN (libc)
> -
> -# ifndef USE_AS_STRCAT
> -# include <sysdep.h>
> -
> -# ifndef STRCPY
> -# define STRCPY __strcpy_sse2_unaligned
> -# endif
> -
> -# endif
> -
> -# define JMPTBL(I, B) I - B
> -# define BRANCH_TO_JMPTBL_ENTRY(TABLE, INDEX, SCALE) \
> - lea TABLE(%rip), %r11; \
> - movslq (%r11, INDEX, SCALE), %rcx; \
> - lea (%r11, %rcx), %rcx; \
> - jmp *%rcx
> -
> -# ifndef USE_AS_STRCAT
> -
> -.text
> -ENTRY (STRCPY)
> -# ifdef USE_AS_STRNCPY
> - mov %rdx, %r8
> - test %r8, %r8
> - jz L(ExitZero)
> -# endif
> - mov %rsi, %rcx
> -# ifndef USE_AS_STPCPY
> - mov %rdi, %rax /* save result */
> -# endif
> -
> -# endif
> -
> - and $63, %rcx
> - cmp $32, %rcx
> - jbe L(SourceStringAlignmentLess32)
> -
> - and $-16, %rsi
> - and $15, %rcx
> - pxor %xmm0, %xmm0
> - pxor %xmm1, %xmm1
> -
> - pcmpeqb (%rsi), %xmm1
> - pmovmskb %xmm1, %rdx
> - shr %cl, %rdx
> -
> -# ifdef USE_AS_STRNCPY
> -# if defined USE_AS_STPCPY || defined USE_AS_STRCAT
> - mov $16, %r10
> - sub %rcx, %r10
> - cmp %r10, %r8
> -# else
> - mov $17, %r10
> - sub %rcx, %r10
> - cmp %r10, %r8
> -# endif
> - jbe L(CopyFrom1To16BytesTailCase2OrCase3)
> -# endif
> - test %rdx, %rdx
> - jnz L(CopyFrom1To16BytesTail)
> -
> - pcmpeqb 16(%rsi), %xmm0
> - pmovmskb %xmm0, %rdx
> -
> -# ifdef USE_AS_STRNCPY
> - add $16, %r10
> - cmp %r10, %r8
> - jbe L(CopyFrom1To32BytesCase2OrCase3)
> -# endif
> - test %rdx, %rdx
> - jnz L(CopyFrom1To32Bytes)
> -
> - movdqu (%rsi, %rcx), %xmm1 /* copy 16 bytes */
> - movdqu %xmm1, (%rdi)
> -
> -/* If source address alignment != destination address alignment */
> - .p2align 4
> -L(Unalign16Both):
> - sub %rcx, %rdi
> -# ifdef USE_AS_STRNCPY
> - add %rcx, %r8
> -# endif
> - mov $16, %rcx
> - movdqa (%rsi, %rcx), %xmm1
> - movaps 16(%rsi, %rcx), %xmm2
> - movdqu %xmm1, (%rdi, %rcx)
> - pcmpeqb %xmm2, %xmm0
> - pmovmskb %xmm0, %rdx
> - add $16, %rcx
> -# ifdef USE_AS_STRNCPY
> - sub $48, %r8
> - jbe L(CopyFrom1To16BytesCase2OrCase3)
> -# endif
> - test %rdx, %rdx
> -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> - jnz L(CopyFrom1To16BytesUnalignedXmm2)
> -# else
> - jnz L(CopyFrom1To16Bytes)
> -# endif
> -
> - movaps 16(%rsi, %rcx), %xmm3
> - movdqu %xmm2, (%rdi, %rcx)
> - pcmpeqb %xmm3, %xmm0
> - pmovmskb %xmm0, %rdx
> - add $16, %rcx
> -# ifdef USE_AS_STRNCPY
> - sub $16, %r8
> - jbe L(CopyFrom1To16BytesCase2OrCase3)
> -# endif
> - test %rdx, %rdx
> -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> - jnz L(CopyFrom1To16BytesUnalignedXmm3)
> -# else
> - jnz L(CopyFrom1To16Bytes)
> -# endif
> -
> - movaps 16(%rsi, %rcx), %xmm4
> - movdqu %xmm3, (%rdi, %rcx)
> - pcmpeqb %xmm4, %xmm0
> - pmovmskb %xmm0, %rdx
> - add $16, %rcx
> -# ifdef USE_AS_STRNCPY
> - sub $16, %r8
> - jbe L(CopyFrom1To16BytesCase2OrCase3)
> -# endif
> - test %rdx, %rdx
> -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> - jnz L(CopyFrom1To16BytesUnalignedXmm4)
> -# else
> - jnz L(CopyFrom1To16Bytes)
> -# endif
> -
> - movaps 16(%rsi, %rcx), %xmm1
> - movdqu %xmm4, (%rdi, %rcx)
> - pcmpeqb %xmm1, %xmm0
> - pmovmskb %xmm0, %rdx
> - add $16, %rcx
> -# ifdef USE_AS_STRNCPY
> - sub $16, %r8
> - jbe L(CopyFrom1To16BytesCase2OrCase3)
> -# endif
> - test %rdx, %rdx
> -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> - jnz L(CopyFrom1To16BytesUnalignedXmm1)
> -# else
> - jnz L(CopyFrom1To16Bytes)
> -# endif
> -
> - movaps 16(%rsi, %rcx), %xmm2
> - movdqu %xmm1, (%rdi, %rcx)
> - pcmpeqb %xmm2, %xmm0
> - pmovmskb %xmm0, %rdx
> - add $16, %rcx
> -# ifdef USE_AS_STRNCPY
> - sub $16, %r8
> - jbe L(CopyFrom1To16BytesCase2OrCase3)
> -# endif
> - test %rdx, %rdx
> -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> - jnz L(CopyFrom1To16BytesUnalignedXmm2)
> -# else
> - jnz L(CopyFrom1To16Bytes)
> -# endif
> -
> - movaps 16(%rsi, %rcx), %xmm3
> - movdqu %xmm2, (%rdi, %rcx)
> - pcmpeqb %xmm3, %xmm0
> - pmovmskb %xmm0, %rdx
> - add $16, %rcx
> -# ifdef USE_AS_STRNCPY
> - sub $16, %r8
> - jbe L(CopyFrom1To16BytesCase2OrCase3)
> -# endif
> - test %rdx, %rdx
> -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> - jnz L(CopyFrom1To16BytesUnalignedXmm3)
> -# else
> - jnz L(CopyFrom1To16Bytes)
> -# endif
> -
> - movdqu %xmm3, (%rdi, %rcx)
> - mov %rsi, %rdx
> - lea 16(%rsi, %rcx), %rsi
> - and $-0x40, %rsi
> - sub %rsi, %rdx
> - sub %rdx, %rdi
> -# ifdef USE_AS_STRNCPY
> - lea 128(%r8, %rdx), %r8
> -# endif
> -L(Unaligned64Loop):
> - movaps (%rsi), %xmm2
> - movaps %xmm2, %xmm4
> - movaps 16(%rsi), %xmm5
> - movaps 32(%rsi), %xmm3
> - movaps %xmm3, %xmm6
> - movaps 48(%rsi), %xmm7
> - pminub %xmm5, %xmm2
> - pminub %xmm7, %xmm3
> - pminub %xmm2, %xmm3
> - pcmpeqb %xmm0, %xmm3
> - pmovmskb %xmm3, %rdx
> -# ifdef USE_AS_STRNCPY
> - sub $64, %r8
> - jbe L(UnalignedLeaveCase2OrCase3)
> -# endif
> - test %rdx, %rdx
> - jnz L(Unaligned64Leave)
> -
> -L(Unaligned64Loop_start):
> - add $64, %rdi
> - add $64, %rsi
> - movdqu %xmm4, -64(%rdi)
> - movaps (%rsi), %xmm2
> - movdqa %xmm2, %xmm4
> - movdqu %xmm5, -48(%rdi)
> - movaps 16(%rsi), %xmm5
> - pminub %xmm5, %xmm2
> - movaps 32(%rsi), %xmm3
> - movdqu %xmm6, -32(%rdi)
> - movaps %xmm3, %xmm6
> - movdqu %xmm7, -16(%rdi)
> - movaps 48(%rsi), %xmm7
> - pminub %xmm7, %xmm3
> - pminub %xmm2, %xmm3
> - pcmpeqb %xmm0, %xmm3
> - pmovmskb %xmm3, %rdx
> -# ifdef USE_AS_STRNCPY
> - sub $64, %r8
> - jbe L(UnalignedLeaveCase2OrCase3)
> -# endif
> - test %rdx, %rdx
> - jz L(Unaligned64Loop_start)
> -
> -L(Unaligned64Leave):
> - pxor %xmm1, %xmm1
> -
> - pcmpeqb %xmm4, %xmm0
> - pcmpeqb %xmm5, %xmm1
> - pmovmskb %xmm0, %rdx
> - pmovmskb %xmm1, %rcx
> - test %rdx, %rdx
> - jnz L(CopyFrom1To16BytesUnaligned_0)
> - test %rcx, %rcx
> - jnz L(CopyFrom1To16BytesUnaligned_16)
> -
> - pcmpeqb %xmm6, %xmm0
> - pcmpeqb %xmm7, %xmm1
> - pmovmskb %xmm0, %rdx
> - pmovmskb %xmm1, %rcx
> - test %rdx, %rdx
> - jnz L(CopyFrom1To16BytesUnaligned_32)
> -
> - bsf %rcx, %rdx
> - movdqu %xmm4, (%rdi)
> - movdqu %xmm5, 16(%rdi)
> - movdqu %xmm6, 32(%rdi)
> -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> -# ifdef USE_AS_STPCPY
> - lea 48(%rdi, %rdx), %rax
> -# endif
> - movdqu %xmm7, 48(%rdi)
> - add $15, %r8
> - sub %rdx, %r8
> - lea 49(%rdi, %rdx), %rdi
> - jmp L(StrncpyFillTailWithZero)
> -# else
> - add $48, %rsi
> - add $48, %rdi
> - BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
> -# endif
> -
> -/* If source address alignment == destination address alignment */
> -
> -L(SourceStringAlignmentLess32):
> - pxor %xmm0, %xmm0
> - movdqu (%rsi), %xmm1
> - movdqu 16(%rsi), %xmm2
> - pcmpeqb %xmm1, %xmm0
> - pmovmskb %xmm0, %rdx
> -
> -# ifdef USE_AS_STRNCPY
> -# if defined USE_AS_STPCPY || defined USE_AS_STRCAT
> - cmp $16, %r8
> -# else
> - cmp $17, %r8
> -# endif
> - jbe L(CopyFrom1To16BytesTail1Case2OrCase3)
> -# endif
> - test %rdx, %rdx
> - jnz L(CopyFrom1To16BytesTail1)
> -
> - pcmpeqb %xmm2, %xmm0
> - movdqu %xmm1, (%rdi)
> - pmovmskb %xmm0, %rdx
> -
> -# ifdef USE_AS_STRNCPY
> -# if defined USE_AS_STPCPY || defined USE_AS_STRCAT
> - cmp $32, %r8
> -# else
> - cmp $33, %r8
> -# endif
> - jbe L(CopyFrom1To32Bytes1Case2OrCase3)
> -# endif
> - test %rdx, %rdx
> - jnz L(CopyFrom1To32Bytes1)
> -
> - and $-16, %rsi
> - and $15, %rcx
> - jmp L(Unalign16Both)
> -
> -/*------End of main part with loops---------------------*/
> -
> -/* Case1 */
> -
> -# if (!defined USE_AS_STRNCPY) || (defined USE_AS_STRCAT)
> - .p2align 4
> -L(CopyFrom1To16Bytes):
> - add %rcx, %rdi
> - add %rcx, %rsi
> - bsf %rdx, %rdx
> - BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
> -# endif
> - .p2align 4
> -L(CopyFrom1To16BytesTail):
> - add %rcx, %rsi
> - bsf %rdx, %rdx
> - BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
> -
> - .p2align 4
> -L(CopyFrom1To32Bytes1):
> - add $16, %rsi
> - add $16, %rdi
> -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> - sub $16, %r8
> -# endif
> -L(CopyFrom1To16BytesTail1):
> - bsf %rdx, %rdx
> - BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
> -
> - .p2align 4
> -L(CopyFrom1To32Bytes):
> - bsf %rdx, %rdx
> - add %rcx, %rsi
> - add $16, %rdx
> - sub %rcx, %rdx
> - BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
> -
> - .p2align 4
> -L(CopyFrom1To16BytesUnaligned_0):
> - bsf %rdx, %rdx
> -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> -# ifdef USE_AS_STPCPY
> - lea (%rdi, %rdx), %rax
> -# endif
> - movdqu %xmm4, (%rdi)
> - add $63, %r8
> - sub %rdx, %r8
> - lea 1(%rdi, %rdx), %rdi
> - jmp L(StrncpyFillTailWithZero)
> -# else
> - BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
> -# endif
> -
> - .p2align 4
> -L(CopyFrom1To16BytesUnaligned_16):
> - bsf %rcx, %rdx
> - movdqu %xmm4, (%rdi)
> -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> -# ifdef USE_AS_STPCPY
> - lea 16(%rdi, %rdx), %rax
> -# endif
> - movdqu %xmm5, 16(%rdi)
> - add $47, %r8
> - sub %rdx, %r8
> - lea 17(%rdi, %rdx), %rdi
> - jmp L(StrncpyFillTailWithZero)
> -# else
> - add $16, %rsi
> - add $16, %rdi
> - BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
> -# endif
> -
> - .p2align 4
> -L(CopyFrom1To16BytesUnaligned_32):
> - bsf %rdx, %rdx
> - movdqu %xmm4, (%rdi)
> - movdqu %xmm5, 16(%rdi)
> -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> -# ifdef USE_AS_STPCPY
> - lea 32(%rdi, %rdx), %rax
> -# endif
> - movdqu %xmm6, 32(%rdi)
> - add $31, %r8
> - sub %rdx, %r8
> - lea 33(%rdi, %rdx), %rdi
> - jmp L(StrncpyFillTailWithZero)
> -# else
> - add $32, %rsi
> - add $32, %rdi
> - BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
> -# endif
> -
> -# ifdef USE_AS_STRNCPY
> -# ifndef USE_AS_STRCAT
> - .p2align 4
> -L(CopyFrom1To16BytesUnalignedXmm6):
> - movdqu %xmm6, (%rdi, %rcx)
> - jmp L(CopyFrom1To16BytesXmmExit)
> -
> - .p2align 4
> -L(CopyFrom1To16BytesUnalignedXmm5):
> - movdqu %xmm5, (%rdi, %rcx)
> - jmp L(CopyFrom1To16BytesXmmExit)
> -
> - .p2align 4
> -L(CopyFrom1To16BytesUnalignedXmm4):
> - movdqu %xmm4, (%rdi, %rcx)
> - jmp L(CopyFrom1To16BytesXmmExit)
> -
> - .p2align 4
> -L(CopyFrom1To16BytesUnalignedXmm3):
> - movdqu %xmm3, (%rdi, %rcx)
> - jmp L(CopyFrom1To16BytesXmmExit)
> -
> - .p2align 4
> -L(CopyFrom1To16BytesUnalignedXmm1):
> - movdqu %xmm1, (%rdi, %rcx)
> - jmp L(CopyFrom1To16BytesXmmExit)
> -# endif
> -
> - .p2align 4
> -L(CopyFrom1To16BytesExit):
> - BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
> -
> -/* Case2 */
> -
> - .p2align 4
> -L(CopyFrom1To16BytesCase2):
> - add $16, %r8
> - add %rcx, %rdi
> - add %rcx, %rsi
> - bsf %rdx, %rdx
> - cmp %r8, %rdx
> - jb L(CopyFrom1To16BytesExit)
> - BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
> -
> - .p2align 4
> -L(CopyFrom1To32BytesCase2):
> - add %rcx, %rsi
> - bsf %rdx, %rdx
> - add $16, %rdx
> - sub %rcx, %rdx
> - cmp %r8, %rdx
> - jb L(CopyFrom1To16BytesExit)
> - BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
> -
> -L(CopyFrom1To16BytesTailCase2):
> - add %rcx, %rsi
> - bsf %rdx, %rdx
> - cmp %r8, %rdx
> - jb L(CopyFrom1To16BytesExit)
> - BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
> -
> -L(CopyFrom1To16BytesTail1Case2):
> - bsf %rdx, %rdx
> - cmp %r8, %rdx
> - jb L(CopyFrom1To16BytesExit)
> - BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
> -
> -/* Case2 or Case3, Case3 */
> -
> - .p2align 4
> -L(CopyFrom1To16BytesCase2OrCase3):
> - test %rdx, %rdx
> - jnz L(CopyFrom1To16BytesCase2)
> -L(CopyFrom1To16BytesCase3):
> - add $16, %r8
> - add %rcx, %rdi
> - add %rcx, %rsi
> - BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
> -
> - .p2align 4
> -L(CopyFrom1To32BytesCase2OrCase3):
> - test %rdx, %rdx
> - jnz L(CopyFrom1To32BytesCase2)
> - add %rcx, %rsi
> - BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
> -
> - .p2align 4
> -L(CopyFrom1To16BytesTailCase2OrCase3):
> - test %rdx, %rdx
> - jnz L(CopyFrom1To16BytesTailCase2)
> - add %rcx, %rsi
> - BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
> -
> - .p2align 4
> -L(CopyFrom1To32Bytes1Case2OrCase3):
> - add $16, %rdi
> - add $16, %rsi
> - sub $16, %r8
> -L(CopyFrom1To16BytesTail1Case2OrCase3):
> - test %rdx, %rdx
> - jnz L(CopyFrom1To16BytesTail1Case2)
> - BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
> -
> -# endif
> -
> -/*------------ End of labels for copying 1-16 and 1-32 bytes ------------*/
> -
> - .p2align 4
> -L(Exit1):
> - mov %dh, (%rdi)
> -# ifdef USE_AS_STPCPY
> - lea (%rdi), %rax
> -# endif
> -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> - sub $1, %r8
> - lea 1(%rdi), %rdi
> - jnz L(StrncpyFillTailWithZero)
> -# endif
> - ret
> -
> - .p2align 4
> -L(Exit2):
> - mov (%rsi), %dx
> - mov %dx, (%rdi)
> -# ifdef USE_AS_STPCPY
> - lea 1(%rdi), %rax
> -# endif
> -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> - sub $2, %r8
> - lea 2(%rdi), %rdi
> - jnz L(StrncpyFillTailWithZero)
> -# endif
> - ret
> -
> - .p2align 4
> -L(Exit3):
> - mov (%rsi), %cx
> - mov %cx, (%rdi)
> - mov %dh, 2(%rdi)
> -# ifdef USE_AS_STPCPY
> - lea 2(%rdi), %rax
> -# endif
> -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> - sub $3, %r8
> - lea 3(%rdi), %rdi
> - jnz L(StrncpyFillTailWithZero)
> -# endif
> - ret
> -
> - .p2align 4
> -L(Exit4):
> - mov (%rsi), %edx
> - mov %edx, (%rdi)
> -# ifdef USE_AS_STPCPY
> - lea 3(%rdi), %rax
> -# endif
> -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> - sub $4, %r8
> - lea 4(%rdi), %rdi
> - jnz L(StrncpyFillTailWithZero)
> -# endif
> - ret
> -
> - .p2align 4
> -L(Exit5):
> - mov (%rsi), %ecx
> - mov %dh, 4(%rdi)
> - mov %ecx, (%rdi)
> -# ifdef USE_AS_STPCPY
> - lea 4(%rdi), %rax
> -# endif
> -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> - sub $5, %r8
> - lea 5(%rdi), %rdi
> - jnz L(StrncpyFillTailWithZero)
> -# endif
> - ret
> -
> - .p2align 4
> -L(Exit6):
> - mov (%rsi), %ecx
> - mov 4(%rsi), %dx
> - mov %ecx, (%rdi)
> - mov %dx, 4(%rdi)
> -# ifdef USE_AS_STPCPY
> - lea 5(%rdi), %rax
> -# endif
> -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> - sub $6, %r8
> - lea 6(%rdi), %rdi
> - jnz L(StrncpyFillTailWithZero)
> -# endif
> - ret
> -
> - .p2align 4
> -L(Exit7):
> - mov (%rsi), %ecx
> - mov 3(%rsi), %edx
> - mov %ecx, (%rdi)
> - mov %edx, 3(%rdi)
> -# ifdef USE_AS_STPCPY
> - lea 6(%rdi), %rax
> -# endif
> -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> - sub $7, %r8
> - lea 7(%rdi), %rdi
> - jnz L(StrncpyFillTailWithZero)
> -# endif
> - ret
> -
> - .p2align 4
> -L(Exit8):
> - mov (%rsi), %rdx
> - mov %rdx, (%rdi)
> -# ifdef USE_AS_STPCPY
> - lea 7(%rdi), %rax
> -# endif
> -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> - sub $8, %r8
> - lea 8(%rdi), %rdi
> - jnz L(StrncpyFillTailWithZero)
> -# endif
> - ret
> -
> - .p2align 4
> -L(Exit9):
> - mov (%rsi), %rcx
> - mov %dh, 8(%rdi)
> - mov %rcx, (%rdi)
> -# ifdef USE_AS_STPCPY
> - lea 8(%rdi), %rax
> -# endif
> -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> - sub $9, %r8
> - lea 9(%rdi), %rdi
> - jnz L(StrncpyFillTailWithZero)
> -# endif
> - ret
> -
> - .p2align 4
> -L(Exit10):
> - mov (%rsi), %rcx
> - mov 8(%rsi), %dx
> - mov %rcx, (%rdi)
> - mov %dx, 8(%rdi)
> -# ifdef USE_AS_STPCPY
> - lea 9(%rdi), %rax
> -# endif
> -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> - sub $10, %r8
> - lea 10(%rdi), %rdi
> - jnz L(StrncpyFillTailWithZero)
> -# endif
> - ret
> -
> - .p2align 4
> -L(Exit11):
> - mov (%rsi), %rcx
> - mov 7(%rsi), %edx
> - mov %rcx, (%rdi)
> - mov %edx, 7(%rdi)
> -# ifdef USE_AS_STPCPY
> - lea 10(%rdi), %rax
> -# endif
> -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> - sub $11, %r8
> - lea 11(%rdi), %rdi
> - jnz L(StrncpyFillTailWithZero)
> -# endif
> - ret
> -
> - .p2align 4
> -L(Exit12):
> - mov (%rsi), %rcx
> - mov 8(%rsi), %edx
> - mov %rcx, (%rdi)
> - mov %edx, 8(%rdi)
> -# ifdef USE_AS_STPCPY
> - lea 11(%rdi), %rax
> -# endif
> -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> - sub $12, %r8
> - lea 12(%rdi), %rdi
> - jnz L(StrncpyFillTailWithZero)
> -# endif
> - ret
> -
> - .p2align 4
> -L(Exit13):
> - mov (%rsi), %rcx
> - mov 5(%rsi), %rdx
> - mov %rcx, (%rdi)
> - mov %rdx, 5(%rdi)
> -# ifdef USE_AS_STPCPY
> - lea 12(%rdi), %rax
> -# endif
> -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> - sub $13, %r8
> - lea 13(%rdi), %rdi
> - jnz L(StrncpyFillTailWithZero)
> -# endif
> - ret
> -
> - .p2align 4
> -L(Exit14):
> - mov (%rsi), %rcx
> - mov 6(%rsi), %rdx
> - mov %rcx, (%rdi)
> - mov %rdx, 6(%rdi)
> -# ifdef USE_AS_STPCPY
> - lea 13(%rdi), %rax
> -# endif
> -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> - sub $14, %r8
> - lea 14(%rdi), %rdi
> - jnz L(StrncpyFillTailWithZero)
> -# endif
> - ret
> -
> - .p2align 4
> -L(Exit15):
> - mov (%rsi), %rcx
> - mov 7(%rsi), %rdx
> - mov %rcx, (%rdi)
> - mov %rdx, 7(%rdi)
> -# ifdef USE_AS_STPCPY
> - lea 14(%rdi), %rax
> -# endif
> -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> - sub $15, %r8
> - lea 15(%rdi), %rdi
> - jnz L(StrncpyFillTailWithZero)
> -# endif
> - ret
> -
> - .p2align 4
> -L(Exit16):
> - movdqu (%rsi), %xmm0
> - movdqu %xmm0, (%rdi)
> -# ifdef USE_AS_STPCPY
> - lea 15(%rdi), %rax
> -# endif
> -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> - sub $16, %r8
> - lea 16(%rdi), %rdi
> - jnz L(StrncpyFillTailWithZero)
> -# endif
> - ret
> -
> - .p2align 4
> -L(Exit17):
> - movdqu (%rsi), %xmm0
> - movdqu %xmm0, (%rdi)
> - mov %dh, 16(%rdi)
> -# ifdef USE_AS_STPCPY
> - lea 16(%rdi), %rax
> -# endif
> -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> - sub $17, %r8
> - lea 17(%rdi), %rdi
> - jnz L(StrncpyFillTailWithZero)
> -# endif
> - ret
> -
> - .p2align 4
> -L(Exit18):
> - movdqu (%rsi), %xmm0
> - mov 16(%rsi), %cx
> - movdqu %xmm0, (%rdi)
> - mov %cx, 16(%rdi)
> -# ifdef USE_AS_STPCPY
> - lea 17(%rdi), %rax
> -# endif
> -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> - sub $18, %r8
> - lea 18(%rdi), %rdi
> - jnz L(StrncpyFillTailWithZero)
> -# endif
> - ret
> -
> - .p2align 4
> -L(Exit19):
> - movdqu (%rsi), %xmm0
> - mov 15(%rsi), %ecx
> - movdqu %xmm0, (%rdi)
> - mov %ecx, 15(%rdi)
> -# ifdef USE_AS_STPCPY
> - lea 18(%rdi), %rax
> -# endif
> -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> - sub $19, %r8
> - lea 19(%rdi), %rdi
> - jnz L(StrncpyFillTailWithZero)
> -# endif
> - ret
> -
> - .p2align 4
> -L(Exit20):
> - movdqu (%rsi), %xmm0
> - mov 16(%rsi), %ecx
> - movdqu %xmm0, (%rdi)
> - mov %ecx, 16(%rdi)
> -# ifdef USE_AS_STPCPY
> - lea 19(%rdi), %rax
> -# endif
> -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> - sub $20, %r8
> - lea 20(%rdi), %rdi
> - jnz L(StrncpyFillTailWithZero)
> -# endif
> - ret
> -
> - .p2align 4
> -L(Exit21):
> - movdqu (%rsi), %xmm0
> - mov 16(%rsi), %ecx
> - movdqu %xmm0, (%rdi)
> - mov %ecx, 16(%rdi)
> - mov %dh, 20(%rdi)
> -# ifdef USE_AS_STPCPY
> - lea 20(%rdi), %rax
> -# endif
> -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> - sub $21, %r8
> - lea 21(%rdi), %rdi
> - jnz L(StrncpyFillTailWithZero)
> -# endif
> - ret
> -
> - .p2align 4
> -L(Exit22):
> - movdqu (%rsi), %xmm0
> - mov 14(%rsi), %rcx
> - movdqu %xmm0, (%rdi)
> - mov %rcx, 14(%rdi)
> -# ifdef USE_AS_STPCPY
> - lea 21(%rdi), %rax
> -# endif
> -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> - sub $22, %r8
> - lea 22(%rdi), %rdi
> - jnz L(StrncpyFillTailWithZero)
> -# endif
> - ret
> -
> - .p2align 4
> -L(Exit23):
> - movdqu (%rsi), %xmm0
> - mov 15(%rsi), %rcx
> - movdqu %xmm0, (%rdi)
> - mov %rcx, 15(%rdi)
> -# ifdef USE_AS_STPCPY
> - lea 22(%rdi), %rax
> -# endif
> -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> - sub $23, %r8
> - lea 23(%rdi), %rdi
> - jnz L(StrncpyFillTailWithZero)
> -# endif
> - ret
> -
> - .p2align 4
> -L(Exit24):
> - movdqu (%rsi), %xmm0
> - mov 16(%rsi), %rcx
> - movdqu %xmm0, (%rdi)
> - mov %rcx, 16(%rdi)
> -# ifdef USE_AS_STPCPY
> - lea 23(%rdi), %rax
> -# endif
> -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> - sub $24, %r8
> - lea 24(%rdi), %rdi
> - jnz L(StrncpyFillTailWithZero)
> -# endif
> - ret
> -
> - .p2align 4
> -L(Exit25):
> - movdqu (%rsi), %xmm0
> - mov 16(%rsi), %rcx
> - movdqu %xmm0, (%rdi)
> - mov %rcx, 16(%rdi)
> - mov %dh, 24(%rdi)
> -# ifdef USE_AS_STPCPY
> - lea 24(%rdi), %rax
> -# endif
> -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> - sub $25, %r8
> - lea 25(%rdi), %rdi
> - jnz L(StrncpyFillTailWithZero)
> -# endif
> - ret
> -
> - .p2align 4
> -L(Exit26):
> - movdqu (%rsi), %xmm0
> - mov 16(%rsi), %rdx
> - mov 24(%rsi), %cx
> - movdqu %xmm0, (%rdi)
> - mov %rdx, 16(%rdi)
> - mov %cx, 24(%rdi)
> -# ifdef USE_AS_STPCPY
> - lea 25(%rdi), %rax
> -# endif
> -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> - sub $26, %r8
> - lea 26(%rdi), %rdi
> - jnz L(StrncpyFillTailWithZero)
> -# endif
> - ret
> -
> - .p2align 4
> -L(Exit27):
> - movdqu (%rsi), %xmm0
> - mov 16(%rsi), %rdx
> - mov 23(%rsi), %ecx
> - movdqu %xmm0, (%rdi)
> - mov %rdx, 16(%rdi)
> - mov %ecx, 23(%rdi)
> -# ifdef USE_AS_STPCPY
> - lea 26(%rdi), %rax
> -# endif
> -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> - sub $27, %r8
> - lea 27(%rdi), %rdi
> - jnz L(StrncpyFillTailWithZero)
> -# endif
> - ret
> -
> - .p2align 4
> -L(Exit28):
> - movdqu (%rsi), %xmm0
> - mov 16(%rsi), %rdx
> - mov 24(%rsi), %ecx
> - movdqu %xmm0, (%rdi)
> - mov %rdx, 16(%rdi)
> - mov %ecx, 24(%rdi)
> -# ifdef USE_AS_STPCPY
> - lea 27(%rdi), %rax
> -# endif
> -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> - sub $28, %r8
> - lea 28(%rdi), %rdi
> - jnz L(StrncpyFillTailWithZero)
> -# endif
> - ret
> -
> - .p2align 4
> -L(Exit29):
> - movdqu (%rsi), %xmm0
> - movdqu 13(%rsi), %xmm2
> - movdqu %xmm0, (%rdi)
> - movdqu %xmm2, 13(%rdi)
> -# ifdef USE_AS_STPCPY
> - lea 28(%rdi), %rax
> -# endif
> -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> - sub $29, %r8
> - lea 29(%rdi), %rdi
> - jnz L(StrncpyFillTailWithZero)
> -# endif
> - ret
> -
> - .p2align 4
> -L(Exit30):
> - movdqu (%rsi), %xmm0
> - movdqu 14(%rsi), %xmm2
> - movdqu %xmm0, (%rdi)
> - movdqu %xmm2, 14(%rdi)
> -# ifdef USE_AS_STPCPY
> - lea 29(%rdi), %rax
> -# endif
> -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> - sub $30, %r8
> - lea 30(%rdi), %rdi
> - jnz L(StrncpyFillTailWithZero)
> -# endif
> - ret
> -
> - .p2align 4
> -L(Exit31):
> - movdqu (%rsi), %xmm0
> - movdqu 15(%rsi), %xmm2
> - movdqu %xmm0, (%rdi)
> - movdqu %xmm2, 15(%rdi)
> -# ifdef USE_AS_STPCPY
> - lea 30(%rdi), %rax
> -# endif
> -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> - sub $31, %r8
> - lea 31(%rdi), %rdi
> - jnz L(StrncpyFillTailWithZero)
> -# endif
> - ret
> -
> - .p2align 4
> -L(Exit32):
> - movdqu (%rsi), %xmm0
> - movdqu 16(%rsi), %xmm2
> - movdqu %xmm0, (%rdi)
> - movdqu %xmm2, 16(%rdi)
> -# ifdef USE_AS_STPCPY
> - lea 31(%rdi), %rax
> -# endif
> -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> - sub $32, %r8
> - lea 32(%rdi), %rdi
> - jnz L(StrncpyFillTailWithZero)
> -# endif
> - ret
> -
> -# ifdef USE_AS_STRNCPY
> -
> - .p2align 4
> -L(StrncpyExit0):
> -# ifdef USE_AS_STPCPY
> - mov %rdi, %rax
> -# endif
> -# ifdef USE_AS_STRCAT
> - xor %ch, %ch
> - movb %ch, (%rdi)
> -# endif
> - ret
> -
> - .p2align 4
> -L(StrncpyExit1):
> - mov (%rsi), %dl
> - mov %dl, (%rdi)
> -# ifdef USE_AS_STPCPY
> - lea 1(%rdi), %rax
> -# endif
> -# ifdef USE_AS_STRCAT
> - xor %ch, %ch
> - movb %ch, 1(%rdi)
> -# endif
> - ret
> -
> - .p2align 4
> -L(StrncpyExit2):
> - mov (%rsi), %dx
> - mov %dx, (%rdi)
> -# ifdef USE_AS_STPCPY
> - lea 2(%rdi), %rax
> -# endif
> -# ifdef USE_AS_STRCAT
> - xor %ch, %ch
> - movb %ch, 2(%rdi)
> -# endif
> - ret
> -
> - .p2align 4
> -L(StrncpyExit3):
> - mov (%rsi), %cx
> - mov 2(%rsi), %dl
> - mov %cx, (%rdi)
> - mov %dl, 2(%rdi)
> -# ifdef USE_AS_STPCPY
> - lea 3(%rdi), %rax
> -# endif
> -# ifdef USE_AS_STRCAT
> - xor %ch, %ch
> - movb %ch, 3(%rdi)
> -# endif
> - ret
> -
> - .p2align 4
> -L(StrncpyExit4):
> - mov (%rsi), %edx
> - mov %edx, (%rdi)
> -# ifdef USE_AS_STPCPY
> - lea 4(%rdi), %rax
> -# endif
> -# ifdef USE_AS_STRCAT
> - xor %ch, %ch
> - movb %ch, 4(%rdi)
> -# endif
> - ret
> -
> - .p2align 4
> -L(StrncpyExit5):
> - mov (%rsi), %ecx
> - mov 4(%rsi), %dl
> - mov %ecx, (%rdi)
> - mov %dl, 4(%rdi)
> -# ifdef USE_AS_STPCPY
> - lea 5(%rdi), %rax
> -# endif
> -# ifdef USE_AS_STRCAT
> - xor %ch, %ch
> - movb %ch, 5(%rdi)
> -# endif
> - ret
> -
> - .p2align 4
> -L(StrncpyExit6):
> - mov (%rsi), %ecx
> - mov 4(%rsi), %dx
> - mov %ecx, (%rdi)
> - mov %dx, 4(%rdi)
> -# ifdef USE_AS_STPCPY
> - lea 6(%rdi), %rax
> -# endif
> -# ifdef USE_AS_STRCAT
> - xor %ch, %ch
> - movb %ch, 6(%rdi)
> -# endif
> - ret
> -
> - .p2align 4
> -L(StrncpyExit7):
> - mov (%rsi), %ecx
> - mov 3(%rsi), %edx
> - mov %ecx, (%rdi)
> - mov %edx, 3(%rdi)
> -# ifdef USE_AS_STPCPY
> - lea 7(%rdi), %rax
> -# endif
> -# ifdef USE_AS_STRCAT
> - xor %ch, %ch
> - movb %ch, 7(%rdi)
> -# endif
> - ret
> -
> - .p2align 4
> -L(StrncpyExit8):
> - mov (%rsi), %rdx
> - mov %rdx, (%rdi)
> -# ifdef USE_AS_STPCPY
> - lea 8(%rdi), %rax
> -# endif
> -# ifdef USE_AS_STRCAT
> - xor %ch, %ch
> - movb %ch, 8(%rdi)
> -# endif
> - ret
> -
> - .p2align 4
> -L(StrncpyExit9):
> - mov (%rsi), %rcx
> - mov 8(%rsi), %dl
> - mov %rcx, (%rdi)
> - mov %dl, 8(%rdi)
> -# ifdef USE_AS_STPCPY
> - lea 9(%rdi), %rax
> -# endif
> -# ifdef USE_AS_STRCAT
> - xor %ch, %ch
> - movb %ch, 9(%rdi)
> -# endif
> - ret
> -
> - .p2align 4
> -L(StrncpyExit10):
> - mov (%rsi), %rcx
> - mov 8(%rsi), %dx
> - mov %rcx, (%rdi)
> - mov %dx, 8(%rdi)
> -# ifdef USE_AS_STPCPY
> - lea 10(%rdi), %rax
> -# endif
> -# ifdef USE_AS_STRCAT
> - xor %ch, %ch
> - movb %ch, 10(%rdi)
> -# endif
> - ret
> -
> - .p2align 4
> -L(StrncpyExit11):
> - mov (%rsi), %rcx
> - mov 7(%rsi), %edx
> - mov %rcx, (%rdi)
> - mov %edx, 7(%rdi)
> -# ifdef USE_AS_STPCPY
> - lea 11(%rdi), %rax
> -# endif
> -# ifdef USE_AS_STRCAT
> - xor %ch, %ch
> - movb %ch, 11(%rdi)
> -# endif
> - ret
> -
> - .p2align 4
> -L(StrncpyExit12):
> - mov (%rsi), %rcx
> - mov 8(%rsi), %edx
> - mov %rcx, (%rdi)
> - mov %edx, 8(%rdi)
> -# ifdef USE_AS_STPCPY
> - lea 12(%rdi), %rax
> -# endif
> -# ifdef USE_AS_STRCAT
> - xor %ch, %ch
> - movb %ch, 12(%rdi)
> -# endif
> - ret
> -
> - .p2align 4
> -L(StrncpyExit13):
> - mov (%rsi), %rcx
> - mov 5(%rsi), %rdx
> - mov %rcx, (%rdi)
> - mov %rdx, 5(%rdi)
> -# ifdef USE_AS_STPCPY
> - lea 13(%rdi), %rax
> -# endif
> -# ifdef USE_AS_STRCAT
> - xor %ch, %ch
> - movb %ch, 13(%rdi)
> -# endif
> - ret
> -
> - .p2align 4
> -L(StrncpyExit14):
> - mov (%rsi), %rcx
> - mov 6(%rsi), %rdx
> - mov %rcx, (%rdi)
> - mov %rdx, 6(%rdi)
> -# ifdef USE_AS_STPCPY
> - lea 14(%rdi), %rax
> -# endif
> -# ifdef USE_AS_STRCAT
> - xor %ch, %ch
> - movb %ch, 14(%rdi)
> -# endif
> - ret
> -
> - .p2align 4
> -L(StrncpyExit15):
> - mov (%rsi), %rcx
> - mov 7(%rsi), %rdx
> - mov %rcx, (%rdi)
> - mov %rdx, 7(%rdi)
> -# ifdef USE_AS_STPCPY
> - lea 15(%rdi), %rax
> -# endif
> -# ifdef USE_AS_STRCAT
> - xor %ch, %ch
> - movb %ch, 15(%rdi)
> -# endif
> - ret
> -
> - .p2align 4
> -L(StrncpyExit16):
> - movdqu (%rsi), %xmm0
> - movdqu %xmm0, (%rdi)
> -# ifdef USE_AS_STPCPY
> - lea 16(%rdi), %rax
> -# endif
> -# ifdef USE_AS_STRCAT
> - xor %ch, %ch
> - movb %ch, 16(%rdi)
> -# endif
> - ret
> -
> - .p2align 4
> -L(StrncpyExit17):
> - movdqu (%rsi), %xmm0
> - mov 16(%rsi), %cl
> - movdqu %xmm0, (%rdi)
> - mov %cl, 16(%rdi)
> -# ifdef USE_AS_STPCPY
> - lea 17(%rdi), %rax
> -# endif
> -# ifdef USE_AS_STRCAT
> - xor %ch, %ch
> - movb %ch, 17(%rdi)
> -# endif
> - ret
> -
> - .p2align 4
> -L(StrncpyExit18):
> - movdqu (%rsi), %xmm0
> - mov 16(%rsi), %cx
> - movdqu %xmm0, (%rdi)
> - mov %cx, 16(%rdi)
> -# ifdef USE_AS_STPCPY
> - lea 18(%rdi), %rax
> -# endif
> -# ifdef USE_AS_STRCAT
> - xor %ch, %ch
> - movb %ch, 18(%rdi)
> -# endif
> - ret
> -
> - .p2align 4
> -L(StrncpyExit19):
> - movdqu (%rsi), %xmm0
> - mov 15(%rsi), %ecx
> - movdqu %xmm0, (%rdi)
> - mov %ecx, 15(%rdi)
> -# ifdef USE_AS_STPCPY
> - lea 19(%rdi), %rax
> -# endif
> -# ifdef USE_AS_STRCAT
> - xor %ch, %ch
> - movb %ch, 19(%rdi)
> -# endif
> - ret
> -
> - .p2align 4
> -L(StrncpyExit20):
> - movdqu (%rsi), %xmm0
> - mov 16(%rsi), %ecx
> - movdqu %xmm0, (%rdi)
> - mov %ecx, 16(%rdi)
> -# ifdef USE_AS_STPCPY
> - lea 20(%rdi), %rax
> -# endif
> -# ifdef USE_AS_STRCAT
> - xor %ch, %ch
> - movb %ch, 20(%rdi)
> -# endif
> - ret
> -
> - .p2align 4
> -L(StrncpyExit21):
> - movdqu (%rsi), %xmm0
> - mov 16(%rsi), %ecx
> - mov 20(%rsi), %dl
> - movdqu %xmm0, (%rdi)
> - mov %ecx, 16(%rdi)
> - mov %dl, 20(%rdi)
> -# ifdef USE_AS_STPCPY
> - lea 21(%rdi), %rax
> -# endif
> -# ifdef USE_AS_STRCAT
> - xor %ch, %ch
> - movb %ch, 21(%rdi)
> -# endif
> - ret
> -
> - .p2align 4
> -L(StrncpyExit22):
> - movdqu (%rsi), %xmm0
> - mov 14(%rsi), %rcx
> - movdqu %xmm0, (%rdi)
> - mov %rcx, 14(%rdi)
> -# ifdef USE_AS_STPCPY
> - lea 22(%rdi), %rax
> -# endif
> -# ifdef USE_AS_STRCAT
> - xor %ch, %ch
> - movb %ch, 22(%rdi)
> -# endif
> - ret
> -
> - .p2align 4
> -L(StrncpyExit23):
> - movdqu (%rsi), %xmm0
> - mov 15(%rsi), %rcx
> - movdqu %xmm0, (%rdi)
> - mov %rcx, 15(%rdi)
> -# ifdef USE_AS_STPCPY
> - lea 23(%rdi), %rax
> -# endif
> -# ifdef USE_AS_STRCAT
> - xor %ch, %ch
> - movb %ch, 23(%rdi)
> -# endif
> - ret
> -
> - .p2align 4
> -L(StrncpyExit24):
> - movdqu (%rsi), %xmm0
> - mov 16(%rsi), %rcx
> - movdqu %xmm0, (%rdi)
> - mov %rcx, 16(%rdi)
> -# ifdef USE_AS_STPCPY
> - lea 24(%rdi), %rax
> -# endif
> -# ifdef USE_AS_STRCAT
> - xor %ch, %ch
> - movb %ch, 24(%rdi)
> -# endif
> - ret
> -
> - .p2align 4
> -L(StrncpyExit25):
> - movdqu (%rsi), %xmm0
> - mov 16(%rsi), %rdx
> - mov 24(%rsi), %cl
> - movdqu %xmm0, (%rdi)
> - mov %rdx, 16(%rdi)
> - mov %cl, 24(%rdi)
> -# ifdef USE_AS_STPCPY
> - lea 25(%rdi), %rax
> -# endif
> -# ifdef USE_AS_STRCAT
> - xor %ch, %ch
> - movb %ch, 25(%rdi)
> -# endif
> - ret
> -
> - .p2align 4
> -L(StrncpyExit26):
> - movdqu (%rsi), %xmm0
> - mov 16(%rsi), %rdx
> - mov 24(%rsi), %cx
> - movdqu %xmm0, (%rdi)
> - mov %rdx, 16(%rdi)
> - mov %cx, 24(%rdi)
> -# ifdef USE_AS_STPCPY
> - lea 26(%rdi), %rax
> -# endif
> -# ifdef USE_AS_STRCAT
> - xor %ch, %ch
> - movb %ch, 26(%rdi)
> -# endif
> - ret
> -
> - .p2align 4
> -L(StrncpyExit27):
> - movdqu (%rsi), %xmm0
> - mov 16(%rsi), %rdx
> - mov 23(%rsi), %ecx
> - movdqu %xmm0, (%rdi)
> - mov %rdx, 16(%rdi)
> - mov %ecx, 23(%rdi)
> -# ifdef USE_AS_STPCPY
> - lea 27(%rdi), %rax
> -# endif
> -# ifdef USE_AS_STRCAT
> - xor %ch, %ch
> - movb %ch, 27(%rdi)
> -# endif
> - ret
> -
> - .p2align 4
> -L(StrncpyExit28):
> - movdqu (%rsi), %xmm0
> - mov 16(%rsi), %rdx
> - mov 24(%rsi), %ecx
> - movdqu %xmm0, (%rdi)
> - mov %rdx, 16(%rdi)
> - mov %ecx, 24(%rdi)
> -# ifdef USE_AS_STPCPY
> - lea 28(%rdi), %rax
> -# endif
> -# ifdef USE_AS_STRCAT
> - xor %ch, %ch
> - movb %ch, 28(%rdi)
> -# endif
> - ret
> -
> - .p2align 4
> -L(StrncpyExit29):
> - movdqu (%rsi), %xmm0
> - movdqu 13(%rsi), %xmm2
> - movdqu %xmm0, (%rdi)
> - movdqu %xmm2, 13(%rdi)
> -# ifdef USE_AS_STPCPY
> - lea 29(%rdi), %rax
> -# endif
> -# ifdef USE_AS_STRCAT
> - xor %ch, %ch
> - movb %ch, 29(%rdi)
> -# endif
> - ret
> -
> - .p2align 4
> -L(StrncpyExit30):
> - movdqu (%rsi), %xmm0
> - movdqu 14(%rsi), %xmm2
> - movdqu %xmm0, (%rdi)
> - movdqu %xmm2, 14(%rdi)
> -# ifdef USE_AS_STPCPY
> - lea 30(%rdi), %rax
> -# endif
> -# ifdef USE_AS_STRCAT
> - xor %ch, %ch
> - movb %ch, 30(%rdi)
> -# endif
> - ret
> -
> - .p2align 4
> -L(StrncpyExit31):
> - movdqu (%rsi), %xmm0
> - movdqu 15(%rsi), %xmm2
> - movdqu %xmm0, (%rdi)
> - movdqu %xmm2, 15(%rdi)
> -# ifdef USE_AS_STPCPY
> - lea 31(%rdi), %rax
> -# endif
> -# ifdef USE_AS_STRCAT
> - xor %ch, %ch
> - movb %ch, 31(%rdi)
> -# endif
> - ret
> -
> - .p2align 4
> -L(StrncpyExit32):
> - movdqu (%rsi), %xmm0
> - movdqu 16(%rsi), %xmm2
> - movdqu %xmm0, (%rdi)
> - movdqu %xmm2, 16(%rdi)
> -# ifdef USE_AS_STPCPY
> - lea 32(%rdi), %rax
> -# endif
> -# ifdef USE_AS_STRCAT
> - xor %ch, %ch
> - movb %ch, 32(%rdi)
> -# endif
> - ret
> -
> - .p2align 4
> -L(StrncpyExit33):
> - movdqu (%rsi), %xmm0
> - movdqu 16(%rsi), %xmm2
> - mov 32(%rsi), %cl
> - movdqu %xmm0, (%rdi)
> - movdqu %xmm2, 16(%rdi)
> - mov %cl, 32(%rdi)
> -# ifdef USE_AS_STRCAT
> - xor %ch, %ch
> - movb %ch, 33(%rdi)
> -# endif
> - ret
> -
> -# ifndef USE_AS_STRCAT
> -
> - .p2align 4
> -L(Fill0):
> - ret
> -
> - .p2align 4
> -L(Fill1):
> - mov %dl, (%rdi)
> - ret
> -
> - .p2align 4
> -L(Fill2):
> - mov %dx, (%rdi)
> - ret
> -
> - .p2align 4
> -L(Fill3):
> - mov %edx, -1(%rdi)
> - ret
> -
> - .p2align 4
> -L(Fill4):
> - mov %edx, (%rdi)
> - ret
> -
> - .p2align 4
> -L(Fill5):
> - mov %edx, (%rdi)
> - mov %dl, 4(%rdi)
> - ret
> -
> - .p2align 4
> -L(Fill6):
> - mov %edx, (%rdi)
> - mov %dx, 4(%rdi)
> - ret
> -
> - .p2align 4
> -L(Fill7):
> - mov %rdx, -1(%rdi)
> - ret
> -
> - .p2align 4
> -L(Fill8):
> - mov %rdx, (%rdi)
> - ret
> -
> - .p2align 4
> -L(Fill9):
> - mov %rdx, (%rdi)
> - mov %dl, 8(%rdi)
> - ret
> -
> - .p2align 4
> -L(Fill10):
> - mov %rdx, (%rdi)
> - mov %dx, 8(%rdi)
> - ret
> -
> - .p2align 4
> -L(Fill11):
> - mov %rdx, (%rdi)
> - mov %edx, 7(%rdi)
> - ret
> -
> - .p2align 4
> -L(Fill12):
> - mov %rdx, (%rdi)
> - mov %edx, 8(%rdi)
> - ret
> -
> - .p2align 4
> -L(Fill13):
> - mov %rdx, (%rdi)
> - mov %rdx, 5(%rdi)
> - ret
> -
> - .p2align 4
> -L(Fill14):
> - mov %rdx, (%rdi)
> - mov %rdx, 6(%rdi)
> - ret
> -
> - .p2align 4
> -L(Fill15):
> - movdqu %xmm0, -1(%rdi)
> - ret
> -
> - .p2align 4
> -L(Fill16):
> - movdqu %xmm0, (%rdi)
> - ret
> -
> - .p2align 4
> -L(CopyFrom1To16BytesUnalignedXmm2):
> - movdqu %xmm2, (%rdi, %rcx)
> -
> - .p2align 4
> -L(CopyFrom1To16BytesXmmExit):
> - bsf %rdx, %rdx
> - add $15, %r8
> - add %rcx, %rdi
> -# ifdef USE_AS_STPCPY
> - lea (%rdi, %rdx), %rax
> -# endif
> - sub %rdx, %r8
> - lea 1(%rdi, %rdx), %rdi
> -
> - .p2align 4
> -L(StrncpyFillTailWithZero):
> - pxor %xmm0, %xmm0
> - xor %rdx, %rdx
> - sub $16, %r8
> - jbe L(StrncpyFillExit)
> -
> - movdqu %xmm0, (%rdi)
> - add $16, %rdi
> -
> - mov %rdi, %rsi
> - and $0xf, %rsi
> - sub %rsi, %rdi
> - add %rsi, %r8
> - sub $64, %r8
> - jb L(StrncpyFillLess64)
> -
> -L(StrncpyFillLoopMovdqa):
> - movdqa %xmm0, (%rdi)
> - movdqa %xmm0, 16(%rdi)
> - movdqa %xmm0, 32(%rdi)
> - movdqa %xmm0, 48(%rdi)
> - add $64, %rdi
> - sub $64, %r8
> - jae L(StrncpyFillLoopMovdqa)
> -
> -L(StrncpyFillLess64):
> - add $32, %r8
> - jl L(StrncpyFillLess32)
> - movdqa %xmm0, (%rdi)
> - movdqa %xmm0, 16(%rdi)
> - add $32, %rdi
> - sub $16, %r8
> - jl L(StrncpyFillExit)
> - movdqa %xmm0, (%rdi)
> - add $16, %rdi
> - BRANCH_TO_JMPTBL_ENTRY (L(FillTable), %r8, 4)
> -
> -L(StrncpyFillLess32):
> - add $16, %r8
> - jl L(StrncpyFillExit)
> - movdqa %xmm0, (%rdi)
> - add $16, %rdi
> - BRANCH_TO_JMPTBL_ENTRY (L(FillTable), %r8, 4)
> -
> -L(StrncpyFillExit):
> - add $16, %r8
> - BRANCH_TO_JMPTBL_ENTRY (L(FillTable), %r8, 4)
> -
> -/* end of ifndef USE_AS_STRCAT */
> -# endif
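[Editor's note: the L(Fill*) labels and L(StrncpyFillTailWithZero) being removed above implement the strncpy requirement that when the source string is shorter than n, the remainder of the n-byte destination is padded with zero bytes; the assembly does the padding 16 and 64 bytes at a time with SSE stores. A minimal C reference model of that contract (function name hypothetical):]

```c
#include <stddef.h>

/* Reference model of strncpy semantics: copy src up to and including
   its NUL, then zero-fill the rest of the n-byte destination.  */
static char *strncpy_ref(char *dst, const char *src, size_t n)
{
    size_t i = 0;
    for (; i < n && src[i] != '\0'; i++)
        dst[i] = src[i];
    for (; i < n; i++)          /* tail fill with zeros */
        dst[i] = '\0';
    return dst;
}
```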
> -
> - .p2align 4
> -L(UnalignedLeaveCase2OrCase3):
> - test %rdx, %rdx
> - jnz L(Unaligned64LeaveCase2)
> -L(Unaligned64LeaveCase3):
> - lea 64(%r8), %rcx
> - and $-16, %rcx
> - add $48, %r8
> - jl L(CopyFrom1To16BytesCase3)
> - movdqu %xmm4, (%rdi)
> - sub $16, %r8
> - jb L(CopyFrom1To16BytesCase3)
> - movdqu %xmm5, 16(%rdi)
> - sub $16, %r8
> - jb L(CopyFrom1To16BytesCase3)
> - movdqu %xmm6, 32(%rdi)
> - sub $16, %r8
> - jb L(CopyFrom1To16BytesCase3)
> - movdqu %xmm7, 48(%rdi)
> -# ifdef USE_AS_STPCPY
> - lea 64(%rdi), %rax
> -# endif
> -# ifdef USE_AS_STRCAT
> - xor %ch, %ch
> - movb %ch, 64(%rdi)
> -# endif
> - ret
> -
> - .p2align 4
> -L(Unaligned64LeaveCase2):
> - xor %rcx, %rcx
> - pcmpeqb %xmm4, %xmm0
> - pmovmskb %xmm0, %rdx
> - add $48, %r8
> - jle L(CopyFrom1To16BytesCase2OrCase3)
> - test %rdx, %rdx
> -# ifndef USE_AS_STRCAT
> - jnz L(CopyFrom1To16BytesUnalignedXmm4)
> -# else
> - jnz L(CopyFrom1To16Bytes)
> -# endif
> - pcmpeqb %xmm5, %xmm0
> - pmovmskb %xmm0, %rdx
> - movdqu %xmm4, (%rdi)
> - add $16, %rcx
> - sub $16, %r8
> - jbe L(CopyFrom1To16BytesCase2OrCase3)
> - test %rdx, %rdx
> -# ifndef USE_AS_STRCAT
> - jnz L(CopyFrom1To16BytesUnalignedXmm5)
> -# else
> - jnz L(CopyFrom1To16Bytes)
> -# endif
> -
> - pcmpeqb %xmm6, %xmm0
> - pmovmskb %xmm0, %rdx
> - movdqu %xmm5, 16(%rdi)
> - add $16, %rcx
> - sub $16, %r8
> - jbe L(CopyFrom1To16BytesCase2OrCase3)
> - test %rdx, %rdx
> -# ifndef USE_AS_STRCAT
> - jnz L(CopyFrom1To16BytesUnalignedXmm6)
> -# else
> - jnz L(CopyFrom1To16Bytes)
> -# endif
> -
> - pcmpeqb %xmm7, %xmm0
> - pmovmskb %xmm0, %rdx
> - movdqu %xmm6, 32(%rdi)
> - lea 16(%rdi, %rcx), %rdi
> - lea 16(%rsi, %rcx), %rsi
> - bsf %rdx, %rdx
> - cmp %r8, %rdx
> - jb L(CopyFrom1To16BytesExit)
> - BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
> -
> - .p2align 4
> -L(ExitZero):
> -# ifndef USE_AS_STRCAT
> - mov %rdi, %rax
> -# endif
> - ret
> -
> -# endif
> -
> -# ifndef USE_AS_STRCAT
> -END (STRCPY)
> -# else
> -END (STRCAT)
> -# endif
> - .p2align 4
> - .section .rodata
> -L(ExitTable):
> - .int JMPTBL(L(Exit1), L(ExitTable))
> - .int JMPTBL(L(Exit2), L(ExitTable))
> - .int JMPTBL(L(Exit3), L(ExitTable))
> - .int JMPTBL(L(Exit4), L(ExitTable))
> - .int JMPTBL(L(Exit5), L(ExitTable))
> - .int JMPTBL(L(Exit6), L(ExitTable))
> - .int JMPTBL(L(Exit7), L(ExitTable))
> - .int JMPTBL(L(Exit8), L(ExitTable))
> - .int JMPTBL(L(Exit9), L(ExitTable))
> - .int JMPTBL(L(Exit10), L(ExitTable))
> - .int JMPTBL(L(Exit11), L(ExitTable))
> - .int JMPTBL(L(Exit12), L(ExitTable))
> - .int JMPTBL(L(Exit13), L(ExitTable))
> - .int JMPTBL(L(Exit14), L(ExitTable))
> - .int JMPTBL(L(Exit15), L(ExitTable))
> - .int JMPTBL(L(Exit16), L(ExitTable))
> - .int JMPTBL(L(Exit17), L(ExitTable))
> - .int JMPTBL(L(Exit18), L(ExitTable))
> - .int JMPTBL(L(Exit19), L(ExitTable))
> - .int JMPTBL(L(Exit20), L(ExitTable))
> - .int JMPTBL(L(Exit21), L(ExitTable))
> - .int JMPTBL(L(Exit22), L(ExitTable))
> - .int JMPTBL(L(Exit23), L(ExitTable))
> - .int JMPTBL(L(Exit24), L(ExitTable))
> - .int JMPTBL(L(Exit25), L(ExitTable))
> - .int JMPTBL(L(Exit26), L(ExitTable))
> - .int JMPTBL(L(Exit27), L(ExitTable))
> - .int JMPTBL(L(Exit28), L(ExitTable))
> - .int JMPTBL(L(Exit29), L(ExitTable))
> - .int JMPTBL(L(Exit30), L(ExitTable))
> - .int JMPTBL(L(Exit31), L(ExitTable))
> - .int JMPTBL(L(Exit32), L(ExitTable))
> -# ifdef USE_AS_STRNCPY
> -L(ExitStrncpyTable):
> - .int JMPTBL(L(StrncpyExit0), L(ExitStrncpyTable))
> - .int JMPTBL(L(StrncpyExit1), L(ExitStrncpyTable))
> - .int JMPTBL(L(StrncpyExit2), L(ExitStrncpyTable))
> - .int JMPTBL(L(StrncpyExit3), L(ExitStrncpyTable))
> - .int JMPTBL(L(StrncpyExit4), L(ExitStrncpyTable))
> - .int JMPTBL(L(StrncpyExit5), L(ExitStrncpyTable))
> - .int JMPTBL(L(StrncpyExit6), L(ExitStrncpyTable))
> - .int JMPTBL(L(StrncpyExit7), L(ExitStrncpyTable))
> - .int JMPTBL(L(StrncpyExit8), L(ExitStrncpyTable))
> - .int JMPTBL(L(StrncpyExit9), L(ExitStrncpyTable))
> - .int JMPTBL(L(StrncpyExit10), L(ExitStrncpyTable))
> - .int JMPTBL(L(StrncpyExit11), L(ExitStrncpyTable))
> - .int JMPTBL(L(StrncpyExit12), L(ExitStrncpyTable))
> - .int JMPTBL(L(StrncpyExit13), L(ExitStrncpyTable))
> - .int JMPTBL(L(StrncpyExit14), L(ExitStrncpyTable))
> - .int JMPTBL(L(StrncpyExit15), L(ExitStrncpyTable))
> - .int JMPTBL(L(StrncpyExit16), L(ExitStrncpyTable))
> - .int JMPTBL(L(StrncpyExit17), L(ExitStrncpyTable))
> - .int JMPTBL(L(StrncpyExit18), L(ExitStrncpyTable))
> - .int JMPTBL(L(StrncpyExit19), L(ExitStrncpyTable))
> - .int JMPTBL(L(StrncpyExit20), L(ExitStrncpyTable))
> - .int JMPTBL(L(StrncpyExit21), L(ExitStrncpyTable))
> - .int JMPTBL(L(StrncpyExit22), L(ExitStrncpyTable))
> - .int JMPTBL(L(StrncpyExit23), L(ExitStrncpyTable))
> - .int JMPTBL(L(StrncpyExit24), L(ExitStrncpyTable))
> - .int JMPTBL(L(StrncpyExit25), L(ExitStrncpyTable))
> - .int JMPTBL(L(StrncpyExit26), L(ExitStrncpyTable))
> - .int JMPTBL(L(StrncpyExit27), L(ExitStrncpyTable))
> - .int JMPTBL(L(StrncpyExit28), L(ExitStrncpyTable))
> - .int JMPTBL(L(StrncpyExit29), L(ExitStrncpyTable))
> - .int JMPTBL(L(StrncpyExit30), L(ExitStrncpyTable))
> - .int JMPTBL(L(StrncpyExit31), L(ExitStrncpyTable))
> - .int JMPTBL(L(StrncpyExit32), L(ExitStrncpyTable))
> - .int JMPTBL(L(StrncpyExit33), L(ExitStrncpyTable))
> -# ifndef USE_AS_STRCAT
> - .p2align 4
> -L(FillTable):
> - .int JMPTBL(L(Fill0), L(FillTable))
> - .int JMPTBL(L(Fill1), L(FillTable))
> - .int JMPTBL(L(Fill2), L(FillTable))
> - .int JMPTBL(L(Fill3), L(FillTable))
> - .int JMPTBL(L(Fill4), L(FillTable))
> - .int JMPTBL(L(Fill5), L(FillTable))
> - .int JMPTBL(L(Fill6), L(FillTable))
> - .int JMPTBL(L(Fill7), L(FillTable))
> - .int JMPTBL(L(Fill8), L(FillTable))
> - .int JMPTBL(L(Fill9), L(FillTable))
> - .int JMPTBL(L(Fill10), L(FillTable))
> - .int JMPTBL(L(Fill11), L(FillTable))
> - .int JMPTBL(L(Fill12), L(FillTable))
> - .int JMPTBL(L(Fill13), L(FillTable))
> - .int JMPTBL(L(Fill14), L(FillTable))
> - .int JMPTBL(L(Fill15), L(FillTable))
> - .int JMPTBL(L(Fill16), L(FillTable))
> -# endif
> -# endif
> -#endif
> +#define AS_STRCPY
> +#define STPCPY __strcpy_sse2_unaligned
> +#include "stpcpy-sse2-unaligned.S"
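[Editor's note: each removed L(ExitN)/L(StrncpyExitN) label above copies exactly N bytes using at most a few loads whose ranges may overlap — e.g. Exit22 does a 16-byte movdqu at offset 0 plus an 8-byte mov at offset 14, and Exit29..Exit32 use two overlapping 16-byte movdqu pairs. A hedged C sketch of that trick for 17 <= n <= 32, with memcpy standing in for the unaligned vector loads/stores:]

```c
#include <stddef.h>
#include <string.h>

/* Copy n bytes (17 <= n <= 32) with two 16-byte unaligned, possibly
   overlapping moves: one covering [0,16), one ending exactly at n.
   Mirrors the movdqu pairs in the Exit29..Exit32 labels. */
static void copy_17_32(char *dst, const char *src, size_t n)
{
    char head[16], tail[16];
    memcpy(head, src, 16);           /* bytes [0, 16) */
    memcpy(tail, src + n - 16, 16);  /* bytes [n-16, n), may overlap head */
    memcpy(dst, head, 16);
    memcpy(dst + n - 16, tail, 16);
}
```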
> diff --git a/sysdeps/x86_64/multiarch/strcpy.S b/sysdeps/x86_64/multiarch/strcpy.S
> index 9464ee8..92be04c 100644
> --- a/sysdeps/x86_64/multiarch/strcpy.S
> +++ b/sysdeps/x86_64/multiarch/strcpy.S
> @@ -28,31 +28,18 @@
> #endif
>
> #ifdef USE_AS_STPCPY
> -# ifdef USE_AS_STRNCPY
> -# define STRCPY_SSSE3 __stpncpy_ssse3
> -# define STRCPY_SSE2 __stpncpy_sse2
> -# define STRCPY_SSE2_UNALIGNED __stpncpy_sse2_unaligned
> -# define __GI_STRCPY __GI_stpncpy
> -# define __GI___STRCPY __GI___stpncpy
> -# else
> # define STRCPY_SSSE3 __stpcpy_ssse3
> # define STRCPY_SSE2 __stpcpy_sse2
> +# define STRCPY_AVX2 __stpcpy_avx2
> # define STRCPY_SSE2_UNALIGNED __stpcpy_sse2_unaligned
> # define __GI_STRCPY __GI_stpcpy
> # define __GI___STRCPY __GI___stpcpy
> -# endif
> #else
> -# ifdef USE_AS_STRNCPY
> -# define STRCPY_SSSE3 __strncpy_ssse3
> -# define STRCPY_SSE2 __strncpy_sse2
> -# define STRCPY_SSE2_UNALIGNED __strncpy_sse2_unaligned
> -# define __GI_STRCPY __GI_strncpy
> -# else
> # define STRCPY_SSSE3 __strcpy_ssse3
> +# define STRCPY_AVX2 __strcpy_avx2
> # define STRCPY_SSE2 __strcpy_sse2
> # define STRCPY_SSE2_UNALIGNED __strcpy_sse2_unaligned
> # define __GI_STRCPY __GI_strcpy
> -# endif
> #endif
>
>
> @@ -64,7 +51,10 @@ ENTRY(STRCPY)
> cmpl $0, __cpu_features+KIND_OFFSET(%rip)
> jne 1f
> call __init_cpu_features
> -1: leaq STRCPY_SSE2_UNALIGNED(%rip), %rax
> +1: leaq STRCPY_AVX2(%rip), %rax
> + testl $bit_AVX_Fast_Unaligned_Load, __cpu_features+FEATURE_OFFSET+index_AVX_Fast_Unaligned_Load(%rip)
> + jnz 2f
> + leaq STRCPY_SSE2_UNALIGNED(%rip), %rax
> testl $bit_Fast_Unaligned_Load, __cpu_features+FEATURE_OFFSET+index_Fast_Unaligned_Load(%rip)
> jnz 2f
> leaq STRCPY_SSE2(%rip), %rax
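[Editor's note: the dispatch hunk above picks the AVX2 variant when bit_AVX_Fast_Unaligned_Load is set, then falls back to the SSE2-unaligned and baseline SSE2 versions. The selection order can be sketched in C with a function-pointer resolver; the feature flags and stub implementations below are placeholders — glibc actually reads bits out of its internal __cpu_features structure:]

```c
#include <string.h>

typedef char *(*strcpy_fn)(char *, const char *);

/* Stubs standing in for __strcpy_avx2, __strcpy_sse2_unaligned and
   __strcpy_sse2; all delegate to libc strcpy in this sketch. */
static char *strcpy_avx2_stub(char *d, const char *s)  { return strcpy(d, s); }
static char *strcpy_sse2u_stub(char *d, const char *s) { return strcpy(d, s); }
static char *strcpy_sse2_stub(char *d, const char *s)  { return strcpy(d, s); }

/* Placeholder feature bits (glibc: __cpu_features+FEATURE_OFFSET+...). */
static int has_avx2_fast_unaligned = 0;
static int has_fast_unaligned = 1;

/* Resolver mirroring the patch's order: AVX2 first, then SSE2-unaligned,
   then baseline SSE2. */
static strcpy_fn select_strcpy(void)
{
    if (has_avx2_fast_unaligned)
        return strcpy_avx2_stub;
    if (has_fast_unaligned)
        return strcpy_sse2u_stub;
    return strcpy_sse2_stub;
}
```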
> diff --git a/sysdeps/x86_64/multiarch/strncpy-sse2-unaligned.S b/sysdeps/x86_64/multiarch/strncpy-sse2-unaligned.S
> index fcc23a7..e4c98e7 100644
> --- a/sysdeps/x86_64/multiarch/strncpy-sse2-unaligned.S
> +++ b/sysdeps/x86_64/multiarch/strncpy-sse2-unaligned.S
> @@ -1,3 +1,1888 @@
> -#define USE_AS_STRNCPY
> -#define STRCPY __strncpy_sse2_unaligned
> -#include "strcpy-sse2-unaligned.S"
> +/* strcpy with SSE2 and unaligned load
> + Copyright (C) 2011-2015 Free Software Foundation, Inc.
> + Contributed by Intel Corporation.
> + This file is part of the GNU C Library.
> +
> + The GNU C Library is free software; you can redistribute it and/or
> + modify it under the terms of the GNU Lesser General Public
> + License as published by the Free Software Foundation; either
> + version 2.1 of the License, or (at your option) any later version.
> +
> + The GNU C Library is distributed in the hope that it will be useful,
> + but WITHOUT ANY WARRANTY; without even the implied warranty of
> + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
> + Lesser General Public License for more details.
> +
> + You should have received a copy of the GNU Lesser General Public
> + License along with the GNU C Library; if not, see
> + <http://www.gnu.org/licenses/>. */
> +
> +#if IS_IN (libc)
> +
> +# ifndef USE_AS_STRCAT
> +# include <sysdep.h>
> +
> +# ifndef STRCPY
> +# define STRCPY __strncpy_sse2_unaligned
> +# endif
> +
> +# define USE_AS_STRNCPY
> +# endif
> +
> +# define JMPTBL(I, B) I - B
> +# define BRANCH_TO_JMPTBL_ENTRY(TABLE, INDEX, SCALE) \
> + lea TABLE(%rip), %r11; \
> + movslq (%r11, INDEX, SCALE), %rcx; \
> + lea (%r11, %rcx), %rcx; \
> + jmp *%rcx
> +
> +# ifndef USE_AS_STRCAT
> +
> +.text
> +ENTRY (STRCPY)
> +# ifdef USE_AS_STRNCPY
> + mov %rdx, %r8
> + test %r8, %r8
> + jz L(ExitZero)
> +# endif
> + mov %rsi, %rcx
> +# ifndef USE_AS_STPCPY
> + mov %rdi, %rax /* save result */
> +# endif
> +
> +# endif
> +
> + and $63, %rcx
> + cmp $32, %rcx
> + jbe L(SourceStringAlignmentLess32)
> +
> + and $-16, %rsi
> + and $15, %rcx
> + pxor %xmm0, %xmm0
> + pxor %xmm1, %xmm1
> +
> + pcmpeqb (%rsi), %xmm1
> + pmovmskb %xmm1, %rdx
> + shr %cl, %rdx
> +
> +# ifdef USE_AS_STRNCPY
> +# if defined USE_AS_STPCPY || defined USE_AS_STRCAT
> + mov $16, %r10
> + sub %rcx, %r10
> + cmp %r10, %r8
> +# else
> + mov $17, %r10
> + sub %rcx, %r10
> + cmp %r10, %r8
> +# endif
> + jbe L(CopyFrom1To16BytesTailCase2OrCase3)
> +# endif
> + test %rdx, %rdx
> + jnz L(CopyFrom1To16BytesTail)
> +
> + pcmpeqb 16(%rsi), %xmm0
> + pmovmskb %xmm0, %rdx
> +
> +# ifdef USE_AS_STRNCPY
> + add $16, %r10
> + cmp %r10, %r8
> + jbe L(CopyFrom1To32BytesCase2OrCase3)
> +# endif
> + test %rdx, %rdx
> + jnz L(CopyFrom1To32Bytes)
> +
> + movdqu (%rsi, %rcx), %xmm1 /* copy 16 bytes */
> + movdqu %xmm1, (%rdi)
> +
> +/* If source address alignment != destination address alignment */
> + .p2align 4
> +L(Unalign16Both):
> + sub %rcx, %rdi
> +# ifdef USE_AS_STRNCPY
> + add %rcx, %r8
> +# endif
> + mov $16, %rcx
> + movdqa (%rsi, %rcx), %xmm1
> + movaps 16(%rsi, %rcx), %xmm2
> + movdqu %xmm1, (%rdi, %rcx)
> + pcmpeqb %xmm2, %xmm0
> + pmovmskb %xmm0, %rdx
> + add $16, %rcx
> +# ifdef USE_AS_STRNCPY
> + sub $48, %r8
> + jbe L(CopyFrom1To16BytesCase2OrCase3)
> +# endif
> + test %rdx, %rdx
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> + jnz L(CopyFrom1To16BytesUnalignedXmm2)
> +# else
> + jnz L(CopyFrom1To16Bytes)
> +# endif
> +
> + movaps 16(%rsi, %rcx), %xmm3
> + movdqu %xmm2, (%rdi, %rcx)
> + pcmpeqb %xmm3, %xmm0
> + pmovmskb %xmm0, %rdx
> + add $16, %rcx
> +# ifdef USE_AS_STRNCPY
> + sub $16, %r8
> + jbe L(CopyFrom1To16BytesCase2OrCase3)
> +# endif
> + test %rdx, %rdx
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> + jnz L(CopyFrom1To16BytesUnalignedXmm3)
> +# else
> + jnz L(CopyFrom1To16Bytes)
> +# endif
> +
> + movaps 16(%rsi, %rcx), %xmm4
> + movdqu %xmm3, (%rdi, %rcx)
> + pcmpeqb %xmm4, %xmm0
> + pmovmskb %xmm0, %rdx
> + add $16, %rcx
> +# ifdef USE_AS_STRNCPY
> + sub $16, %r8
> + jbe L(CopyFrom1To16BytesCase2OrCase3)
> +# endif
> + test %rdx, %rdx
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> + jnz L(CopyFrom1To16BytesUnalignedXmm4)
> +# else
> + jnz L(CopyFrom1To16Bytes)
> +# endif
> +
> + movaps 16(%rsi, %rcx), %xmm1
> + movdqu %xmm4, (%rdi, %rcx)
> + pcmpeqb %xmm1, %xmm0
> + pmovmskb %xmm0, %rdx
> + add $16, %rcx
> +# ifdef USE_AS_STRNCPY
> + sub $16, %r8
> + jbe L(CopyFrom1To16BytesCase2OrCase3)
> +# endif
> + test %rdx, %rdx
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> + jnz L(CopyFrom1To16BytesUnalignedXmm1)
> +# else
> + jnz L(CopyFrom1To16Bytes)
> +# endif
> +
> + movaps 16(%rsi, %rcx), %xmm2
> + movdqu %xmm1, (%rdi, %rcx)
> + pcmpeqb %xmm2, %xmm0
> + pmovmskb %xmm0, %rdx
> + add $16, %rcx
> +# ifdef USE_AS_STRNCPY
> + sub $16, %r8
> + jbe L(CopyFrom1To16BytesCase2OrCase3)
> +# endif
> + test %rdx, %rdx
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> + jnz L(CopyFrom1To16BytesUnalignedXmm2)
> +# else
> + jnz L(CopyFrom1To16Bytes)
> +# endif
> +
> + movaps 16(%rsi, %rcx), %xmm3
> + movdqu %xmm2, (%rdi, %rcx)
> + pcmpeqb %xmm3, %xmm0
> + pmovmskb %xmm0, %rdx
> + add $16, %rcx
> +# ifdef USE_AS_STRNCPY
> + sub $16, %r8
> + jbe L(CopyFrom1To16BytesCase2OrCase3)
> +# endif
> + test %rdx, %rdx
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> + jnz L(CopyFrom1To16BytesUnalignedXmm3)
> +# else
> + jnz L(CopyFrom1To16Bytes)
> +# endif
> +
> + movdqu %xmm3, (%rdi, %rcx)
> + mov %rsi, %rdx
> + lea 16(%rsi, %rcx), %rsi
> + and $-0x40, %rsi
> + sub %rsi, %rdx
> + sub %rdx, %rdi
> +# ifdef USE_AS_STRNCPY
> + lea 128(%r8, %rdx), %r8
> +# endif
> +L(Unaligned64Loop):
> + movaps (%rsi), %xmm2
> + movaps %xmm2, %xmm4
> + movaps 16(%rsi), %xmm5
> + movaps 32(%rsi), %xmm3
> + movaps %xmm3, %xmm6
> + movaps 48(%rsi), %xmm7
> + pminub %xmm5, %xmm2
> + pminub %xmm7, %xmm3
> + pminub %xmm2, %xmm3
> + pcmpeqb %xmm0, %xmm3
> + pmovmskb %xmm3, %rdx
> +# ifdef USE_AS_STRNCPY
> + sub $64, %r8
> + jbe L(UnalignedLeaveCase2OrCase3)
> +# endif
> + test %rdx, %rdx
> + jnz L(Unaligned64Leave)
> +
> +L(Unaligned64Loop_start):
> + add $64, %rdi
> + add $64, %rsi
> + movdqu %xmm4, -64(%rdi)
> + movaps (%rsi), %xmm2
> + movdqa %xmm2, %xmm4
> + movdqu %xmm5, -48(%rdi)
> + movaps 16(%rsi), %xmm5
> + pminub %xmm5, %xmm2
> + movaps 32(%rsi), %xmm3
> + movdqu %xmm6, -32(%rdi)
> + movaps %xmm3, %xmm6
> + movdqu %xmm7, -16(%rdi)
> + movaps 48(%rsi), %xmm7
> + pminub %xmm7, %xmm3
> + pminub %xmm2, %xmm3
> + pcmpeqb %xmm0, %xmm3
> + pmovmskb %xmm3, %rdx
> +# ifdef USE_AS_STRNCPY
> + sub $64, %r8
> + jbe L(UnalignedLeaveCase2OrCase3)
> +# endif
> + test %rdx, %rdx
> + jz L(Unaligned64Loop_start)
> +
> +L(Unaligned64Leave):
> + pxor %xmm1, %xmm1
> +
> + pcmpeqb %xmm4, %xmm0
> + pcmpeqb %xmm5, %xmm1
> + pmovmskb %xmm0, %rdx
> + pmovmskb %xmm1, %rcx
> + test %rdx, %rdx
> + jnz L(CopyFrom1To16BytesUnaligned_0)
> + test %rcx, %rcx
> + jnz L(CopyFrom1To16BytesUnaligned_16)
> +
> + pcmpeqb %xmm6, %xmm0
> + pcmpeqb %xmm7, %xmm1
> + pmovmskb %xmm0, %rdx
> + pmovmskb %xmm1, %rcx
> + test %rdx, %rdx
> + jnz L(CopyFrom1To16BytesUnaligned_32)
> +
> + bsf %rcx, %rdx
> + movdqu %xmm4, (%rdi)
> + movdqu %xmm5, 16(%rdi)
> + movdqu %xmm6, 32(%rdi)
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> +# ifdef USE_AS_STPCPY
> + lea 48(%rdi, %rdx), %rax
> +# endif
> + movdqu %xmm7, 48(%rdi)
> + add $15, %r8
> + sub %rdx, %r8
> + lea 49(%rdi, %rdx), %rdi
> + jmp L(StrncpyFillTailWithZero)
> +# else
> + add $48, %rsi
> + add $48, %rdi
> + BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
> +# endif
> +
> +/* Reached when (source & 63) <= 32: read the first 32 bytes with unaligned loads.  */
> +
> +L(SourceStringAlignmentLess32):
> + pxor %xmm0, %xmm0
> + movdqu (%rsi), %xmm1
> + movdqu 16(%rsi), %xmm2
> + pcmpeqb %xmm1, %xmm0
> + pmovmskb %xmm0, %rdx
> +
> +# ifdef USE_AS_STRNCPY
> +# if defined USE_AS_STPCPY || defined USE_AS_STRCAT
> + cmp $16, %r8
> +# else
> + cmp $17, %r8
> +# endif
> + jbe L(CopyFrom1To16BytesTail1Case2OrCase3)
> +# endif
> + test %rdx, %rdx
> + jnz L(CopyFrom1To16BytesTail1)
> +
> + pcmpeqb %xmm2, %xmm0
> + movdqu %xmm1, (%rdi)
> + pmovmskb %xmm0, %rdx
> +
> +# ifdef USE_AS_STRNCPY
> +# if defined USE_AS_STPCPY || defined USE_AS_STRCAT
> + cmp $32, %r8
> +# else
> + cmp $33, %r8
> +# endif
> + jbe L(CopyFrom1To32Bytes1Case2OrCase3)
> +# endif
> + test %rdx, %rdx
> + jnz L(CopyFrom1To32Bytes1)
> +
> + and $-16, %rsi
> + and $15, %rcx
> + jmp L(Unalign16Both)
> +
> +/*------End of main part with loops---------------------*/
> +
> +/* Case1 */
> +
> +# if (!defined USE_AS_STRNCPY) || (defined USE_AS_STRCAT)
> + .p2align 4
> +L(CopyFrom1To16Bytes):
> + add %rcx, %rdi
> + add %rcx, %rsi
> + bsf %rdx, %rdx
> + BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
> +# endif
> + .p2align 4
> +L(CopyFrom1To16BytesTail):
> + add %rcx, %rsi
> + bsf %rdx, %rdx
> + BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
> +
> + .p2align 4
> +L(CopyFrom1To32Bytes1):
> + add $16, %rsi
> + add $16, %rdi
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> + sub $16, %r8
> +# endif
> +L(CopyFrom1To16BytesTail1):
> + bsf %rdx, %rdx
> + BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
> +
> + .p2align 4
> +L(CopyFrom1To32Bytes):
> + bsf %rdx, %rdx
> + add %rcx, %rsi
> + add $16, %rdx
> + sub %rcx, %rdx
> + BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
> +
> + .p2align 4
> +L(CopyFrom1To16BytesUnaligned_0):
> + bsf %rdx, %rdx
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> +# ifdef USE_AS_STPCPY
> + lea (%rdi, %rdx), %rax
> +# endif
> + movdqu %xmm4, (%rdi)
> + add $63, %r8
> + sub %rdx, %r8
> + lea 1(%rdi, %rdx), %rdi
> + jmp L(StrncpyFillTailWithZero)
> +# else
> + BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
> +# endif
> +
> + .p2align 4
> +L(CopyFrom1To16BytesUnaligned_16):
> + bsf %rcx, %rdx
> + movdqu %xmm4, (%rdi)
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> +# ifdef USE_AS_STPCPY
> + lea 16(%rdi, %rdx), %rax
> +# endif
> + movdqu %xmm5, 16(%rdi)
> + add $47, %r8
> + sub %rdx, %r8
> + lea 17(%rdi, %rdx), %rdi
> + jmp L(StrncpyFillTailWithZero)
> +# else
> + add $16, %rsi
> + add $16, %rdi
> + BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
> +# endif
> +
> + .p2align 4
> +L(CopyFrom1To16BytesUnaligned_32):
> + bsf %rdx, %rdx
> + movdqu %xmm4, (%rdi)
> + movdqu %xmm5, 16(%rdi)
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> +# ifdef USE_AS_STPCPY
> + lea 32(%rdi, %rdx), %rax
> +# endif
> + movdqu %xmm6, 32(%rdi)
> + add $31, %r8
> + sub %rdx, %r8
> + lea 33(%rdi, %rdx), %rdi
> + jmp L(StrncpyFillTailWithZero)
> +# else
> + add $32, %rsi
> + add $32, %rdi
> + BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
> +# endif
> +
> +# ifdef USE_AS_STRNCPY
> +# ifndef USE_AS_STRCAT
> + .p2align 4
> +L(CopyFrom1To16BytesUnalignedXmm6):
> + movdqu %xmm6, (%rdi, %rcx)
> + jmp L(CopyFrom1To16BytesXmmExit)
> +
> + .p2align 4
> +L(CopyFrom1To16BytesUnalignedXmm5):
> + movdqu %xmm5, (%rdi, %rcx)
> + jmp L(CopyFrom1To16BytesXmmExit)
> +
> + .p2align 4
> +L(CopyFrom1To16BytesUnalignedXmm4):
> + movdqu %xmm4, (%rdi, %rcx)
> + jmp L(CopyFrom1To16BytesXmmExit)
> +
> + .p2align 4
> +L(CopyFrom1To16BytesUnalignedXmm3):
> + movdqu %xmm3, (%rdi, %rcx)
> + jmp L(CopyFrom1To16BytesXmmExit)
> +
> + .p2align 4
> +L(CopyFrom1To16BytesUnalignedXmm1):
> + movdqu %xmm1, (%rdi, %rcx)
> + jmp L(CopyFrom1To16BytesXmmExit)
> +# endif
> +
> + .p2align 4
> +L(CopyFrom1To16BytesExit):
> + BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
> +
> +/* Case2 */
> +
> + .p2align 4
> +L(CopyFrom1To16BytesCase2):
> + add $16, %r8
> + add %rcx, %rdi
> + add %rcx, %rsi
> + bsf %rdx, %rdx
> + cmp %r8, %rdx
> + jb L(CopyFrom1To16BytesExit)
> + BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
> +
> + .p2align 4
> +L(CopyFrom1To32BytesCase2):
> + add %rcx, %rsi
> + bsf %rdx, %rdx
> + add $16, %rdx
> + sub %rcx, %rdx
> + cmp %r8, %rdx
> + jb L(CopyFrom1To16BytesExit)
> + BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
> +
> +L(CopyFrom1To16BytesTailCase2):
> + add %rcx, %rsi
> + bsf %rdx, %rdx
> + cmp %r8, %rdx
> + jb L(CopyFrom1To16BytesExit)
> + BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
> +
> +L(CopyFrom1To16BytesTail1Case2):
> + bsf %rdx, %rdx
> + cmp %r8, %rdx
> + jb L(CopyFrom1To16BytesExit)
> + BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
> +
> +/* Case2 or Case3, Case3 */
> +
> + .p2align 4
> +L(CopyFrom1To16BytesCase2OrCase3):
> + test %rdx, %rdx
> + jnz L(CopyFrom1To16BytesCase2)
> +L(CopyFrom1To16BytesCase3):
> + add $16, %r8
> + add %rcx, %rdi
> + add %rcx, %rsi
> + BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
> +
> + .p2align 4
> +L(CopyFrom1To32BytesCase2OrCase3):
> + test %rdx, %rdx
> + jnz L(CopyFrom1To32BytesCase2)
> + add %rcx, %rsi
> + BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
> +
> + .p2align 4
> +L(CopyFrom1To16BytesTailCase2OrCase3):
> + test %rdx, %rdx
> + jnz L(CopyFrom1To16BytesTailCase2)
> + add %rcx, %rsi
> + BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
> +
> + .p2align 4
> +L(CopyFrom1To32Bytes1Case2OrCase3):
> + add $16, %rdi
> + add $16, %rsi
> + sub $16, %r8
> +L(CopyFrom1To16BytesTail1Case2OrCase3):
> + test %rdx, %rdx
> + jnz L(CopyFrom1To16BytesTail1Case2)
> + BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
> +
> +# endif
> +
> +/*------------End of labels for copying 1-16 and 1-32 bytes------------*/
> +
> + .p2align 4
> +L(Exit1):
> + mov %dh, (%rdi)
> +# ifdef USE_AS_STPCPY
> + lea (%rdi), %rax
> +# endif
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> + sub $1, %r8
> + lea 1(%rdi), %rdi
> + jnz L(StrncpyFillTailWithZero)
> +# endif
> + ret
> +
> + .p2align 4
> +L(Exit2):
> + mov (%rsi), %dx
> + mov %dx, (%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 1(%rdi), %rax
> +# endif
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> + sub $2, %r8
> + lea 2(%rdi), %rdi
> + jnz L(StrncpyFillTailWithZero)
> +# endif
> + ret
> +
> + .p2align 4
> +L(Exit3):
> + mov (%rsi), %cx
> + mov %cx, (%rdi)
> + mov %dh, 2(%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 2(%rdi), %rax
> +# endif
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> + sub $3, %r8
> + lea 3(%rdi), %rdi
> + jnz L(StrncpyFillTailWithZero)
> +# endif
> + ret
> +
> + .p2align 4
> +L(Exit4):
> + mov (%rsi), %edx
> + mov %edx, (%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 3(%rdi), %rax
> +# endif
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> + sub $4, %r8
> + lea 4(%rdi), %rdi
> + jnz L(StrncpyFillTailWithZero)
> +# endif
> + ret
> +
> + .p2align 4
> +L(Exit5):
> + mov (%rsi), %ecx
> + mov %dh, 4(%rdi)
> + mov %ecx, (%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 4(%rdi), %rax
> +# endif
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> + sub $5, %r8
> + lea 5(%rdi), %rdi
> + jnz L(StrncpyFillTailWithZero)
> +# endif
> + ret
> +
> + .p2align 4
> +L(Exit6):
> + mov (%rsi), %ecx
> + mov 4(%rsi), %dx
> + mov %ecx, (%rdi)
> + mov %dx, 4(%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 5(%rdi), %rax
> +# endif
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> + sub $6, %r8
> + lea 6(%rdi), %rdi
> + jnz L(StrncpyFillTailWithZero)
> +# endif
> + ret
> +
> + .p2align 4
> +L(Exit7):
> + mov (%rsi), %ecx
> + mov 3(%rsi), %edx
> + mov %ecx, (%rdi)
> + mov %edx, 3(%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 6(%rdi), %rax
> +# endif
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> + sub $7, %r8
> + lea 7(%rdi), %rdi
> + jnz L(StrncpyFillTailWithZero)
> +# endif
> + ret
> +
> + .p2align 4
> +L(Exit8):
> + mov (%rsi), %rdx
> + mov %rdx, (%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 7(%rdi), %rax
> +# endif
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> + sub $8, %r8
> + lea 8(%rdi), %rdi
> + jnz L(StrncpyFillTailWithZero)
> +# endif
> + ret
> +
> + .p2align 4
> +L(Exit9):
> + mov (%rsi), %rcx
> + mov %dh, 8(%rdi)
> + mov %rcx, (%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 8(%rdi), %rax
> +# endif
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> + sub $9, %r8
> + lea 9(%rdi), %rdi
> + jnz L(StrncpyFillTailWithZero)
> +# endif
> + ret
> +
> + .p2align 4
> +L(Exit10):
> + mov (%rsi), %rcx
> + mov 8(%rsi), %dx
> + mov %rcx, (%rdi)
> + mov %dx, 8(%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 9(%rdi), %rax
> +# endif
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> + sub $10, %r8
> + lea 10(%rdi), %rdi
> + jnz L(StrncpyFillTailWithZero)
> +# endif
> + ret
> +
> + .p2align 4
> +L(Exit11):
> + mov (%rsi), %rcx
> + mov 7(%rsi), %edx
> + mov %rcx, (%rdi)
> + mov %edx, 7(%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 10(%rdi), %rax
> +# endif
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> + sub $11, %r8
> + lea 11(%rdi), %rdi
> + jnz L(StrncpyFillTailWithZero)
> +# endif
> + ret
> +
> + .p2align 4
> +L(Exit12):
> + mov (%rsi), %rcx
> + mov 8(%rsi), %edx
> + mov %rcx, (%rdi)
> + mov %edx, 8(%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 11(%rdi), %rax
> +# endif
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> + sub $12, %r8
> + lea 12(%rdi), %rdi
> + jnz L(StrncpyFillTailWithZero)
> +# endif
> + ret
> +
> + .p2align 4
> +L(Exit13):
> + mov (%rsi), %rcx
> + mov 5(%rsi), %rdx
> + mov %rcx, (%rdi)
> + mov %rdx, 5(%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 12(%rdi), %rax
> +# endif
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> + sub $13, %r8
> + lea 13(%rdi), %rdi
> + jnz L(StrncpyFillTailWithZero)
> +# endif
> + ret
> +
> + .p2align 4
> +L(Exit14):
> + mov (%rsi), %rcx
> + mov 6(%rsi), %rdx
> + mov %rcx, (%rdi)
> + mov %rdx, 6(%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 13(%rdi), %rax
> +# endif
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> + sub $14, %r8
> + lea 14(%rdi), %rdi
> + jnz L(StrncpyFillTailWithZero)
> +# endif
> + ret
> +
> + .p2align 4
> +L(Exit15):
> + mov (%rsi), %rcx
> + mov 7(%rsi), %rdx
> + mov %rcx, (%rdi)
> + mov %rdx, 7(%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 14(%rdi), %rax
> +# endif
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> + sub $15, %r8
> + lea 15(%rdi), %rdi
> + jnz L(StrncpyFillTailWithZero)
> +# endif
> + ret
> +
> + .p2align 4
> +L(Exit16):
> + movdqu (%rsi), %xmm0
> + movdqu %xmm0, (%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 15(%rdi), %rax
> +# endif
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> + sub $16, %r8
> + lea 16(%rdi), %rdi
> + jnz L(StrncpyFillTailWithZero)
> +# endif
> + ret
> +
> + .p2align 4
> +L(Exit17):
> + movdqu (%rsi), %xmm0
> + movdqu %xmm0, (%rdi)
> + mov %dh, 16(%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 16(%rdi), %rax
> +# endif
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> + sub $17, %r8
> + lea 17(%rdi), %rdi
> + jnz L(StrncpyFillTailWithZero)
> +# endif
> + ret
> +
> + .p2align 4
> +L(Exit18):
> + movdqu (%rsi), %xmm0
> + mov 16(%rsi), %cx
> + movdqu %xmm0, (%rdi)
> + mov %cx, 16(%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 17(%rdi), %rax
> +# endif
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> + sub $18, %r8
> + lea 18(%rdi), %rdi
> + jnz L(StrncpyFillTailWithZero)
> +# endif
> + ret
> +
> + .p2align 4
> +L(Exit19):
> + movdqu (%rsi), %xmm0
> + mov 15(%rsi), %ecx
> + movdqu %xmm0, (%rdi)
> + mov %ecx, 15(%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 18(%rdi), %rax
> +# endif
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> + sub $19, %r8
> + lea 19(%rdi), %rdi
> + jnz L(StrncpyFillTailWithZero)
> +# endif
> + ret
> +
> + .p2align 4
> +L(Exit20):
> + movdqu (%rsi), %xmm0
> + mov 16(%rsi), %ecx
> + movdqu %xmm0, (%rdi)
> + mov %ecx, 16(%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 19(%rdi), %rax
> +# endif
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> + sub $20, %r8
> + lea 20(%rdi), %rdi
> + jnz L(StrncpyFillTailWithZero)
> +# endif
> + ret
> +
> + .p2align 4
> +L(Exit21):
> + movdqu (%rsi), %xmm0
> + mov 16(%rsi), %ecx
> + movdqu %xmm0, (%rdi)
> + mov %ecx, 16(%rdi)
> + mov %dh, 20(%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 20(%rdi), %rax
> +# endif
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> + sub $21, %r8
> + lea 21(%rdi), %rdi
> + jnz L(StrncpyFillTailWithZero)
> +# endif
> + ret
> +
> + .p2align 4
> +L(Exit22):
> + movdqu (%rsi), %xmm0
> + mov 14(%rsi), %rcx
> + movdqu %xmm0, (%rdi)
> + mov %rcx, 14(%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 21(%rdi), %rax
> +# endif
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> + sub $22, %r8
> + lea 22(%rdi), %rdi
> + jnz L(StrncpyFillTailWithZero)
> +# endif
> + ret
> +
> + .p2align 4
> +L(Exit23):
> + movdqu (%rsi), %xmm0
> + mov 15(%rsi), %rcx
> + movdqu %xmm0, (%rdi)
> + mov %rcx, 15(%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 22(%rdi), %rax
> +# endif
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> + sub $23, %r8
> + lea 23(%rdi), %rdi
> + jnz L(StrncpyFillTailWithZero)
> +# endif
> + ret
> +
> + .p2align 4
> +L(Exit24):
> + movdqu (%rsi), %xmm0
> + mov 16(%rsi), %rcx
> + movdqu %xmm0, (%rdi)
> + mov %rcx, 16(%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 23(%rdi), %rax
> +# endif
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> + sub $24, %r8
> + lea 24(%rdi), %rdi
> + jnz L(StrncpyFillTailWithZero)
> +# endif
> + ret
> +
> + .p2align 4
> +L(Exit25):
> + movdqu (%rsi), %xmm0
> + mov 16(%rsi), %rcx
> + movdqu %xmm0, (%rdi)
> + mov %rcx, 16(%rdi)
> + mov %dh, 24(%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 24(%rdi), %rax
> +# endif
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> + sub $25, %r8
> + lea 25(%rdi), %rdi
> + jnz L(StrncpyFillTailWithZero)
> +# endif
> + ret
> +
> + .p2align 4
> +L(Exit26):
> + movdqu (%rsi), %xmm0
> + mov 16(%rsi), %rdx
> + mov 24(%rsi), %cx
> + movdqu %xmm0, (%rdi)
> + mov %rdx, 16(%rdi)
> + mov %cx, 24(%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 25(%rdi), %rax
> +# endif
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> + sub $26, %r8
> + lea 26(%rdi), %rdi
> + jnz L(StrncpyFillTailWithZero)
> +# endif
> + ret
> +
> + .p2align 4
> +L(Exit27):
> + movdqu (%rsi), %xmm0
> + mov 16(%rsi), %rdx
> + mov 23(%rsi), %ecx
> + movdqu %xmm0, (%rdi)
> + mov %rdx, 16(%rdi)
> + mov %ecx, 23(%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 26(%rdi), %rax
> +# endif
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> + sub $27, %r8
> + lea 27(%rdi), %rdi
> + jnz L(StrncpyFillTailWithZero)
> +# endif
> + ret
> +
> + .p2align 4
> +L(Exit28):
> + movdqu (%rsi), %xmm0
> + mov 16(%rsi), %rdx
> + mov 24(%rsi), %ecx
> + movdqu %xmm0, (%rdi)
> + mov %rdx, 16(%rdi)
> + mov %ecx, 24(%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 27(%rdi), %rax
> +# endif
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> + sub $28, %r8
> + lea 28(%rdi), %rdi
> + jnz L(StrncpyFillTailWithZero)
> +# endif
> + ret
> +
> + .p2align 4
> +L(Exit29):
> + movdqu (%rsi), %xmm0
> + movdqu 13(%rsi), %xmm2
> + movdqu %xmm0, (%rdi)
> + movdqu %xmm2, 13(%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 28(%rdi), %rax
> +# endif
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> + sub $29, %r8
> + lea 29(%rdi), %rdi
> + jnz L(StrncpyFillTailWithZero)
> +# endif
> + ret
> +
> + .p2align 4
> +L(Exit30):
> + movdqu (%rsi), %xmm0
> + movdqu 14(%rsi), %xmm2
> + movdqu %xmm0, (%rdi)
> + movdqu %xmm2, 14(%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 29(%rdi), %rax
> +# endif
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> + sub $30, %r8
> + lea 30(%rdi), %rdi
> + jnz L(StrncpyFillTailWithZero)
> +# endif
> + ret
> +
> + .p2align 4
> +L(Exit31):
> + movdqu (%rsi), %xmm0
> + movdqu 15(%rsi), %xmm2
> + movdqu %xmm0, (%rdi)
> + movdqu %xmm2, 15(%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 30(%rdi), %rax
> +# endif
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> + sub $31, %r8
> + lea 31(%rdi), %rdi
> + jnz L(StrncpyFillTailWithZero)
> +# endif
> + ret
> +
> + .p2align 4
> +L(Exit32):
> + movdqu (%rsi), %xmm0
> + movdqu 16(%rsi), %xmm2
> + movdqu %xmm0, (%rdi)
> + movdqu %xmm2, 16(%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 31(%rdi), %rax
> +# endif
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> + sub $32, %r8
> + lea 32(%rdi), %rdi
> + jnz L(StrncpyFillTailWithZero)
> +# endif
> + ret
> +
> +# ifdef USE_AS_STRNCPY
> +
> + .p2align 4
> +L(StrncpyExit0):
> +# ifdef USE_AS_STPCPY
> + mov %rdi, %rax
> +# endif
> +# ifdef USE_AS_STRCAT
> + xor %ch, %ch
> + movb %ch, (%rdi)
> +# endif
> + ret
> +
> + .p2align 4
> +L(StrncpyExit1):
> + mov (%rsi), %dl
> + mov %dl, (%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 1(%rdi), %rax
> +# endif
> +# ifdef USE_AS_STRCAT
> + xor %ch, %ch
> + movb %ch, 1(%rdi)
> +# endif
> + ret
> +
> + .p2align 4
> +L(StrncpyExit2):
> + mov (%rsi), %dx
> + mov %dx, (%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 2(%rdi), %rax
> +# endif
> +# ifdef USE_AS_STRCAT
> + xor %ch, %ch
> + movb %ch, 2(%rdi)
> +# endif
> + ret
> +
> + .p2align 4
> +L(StrncpyExit3):
> + mov (%rsi), %cx
> + mov 2(%rsi), %dl
> + mov %cx, (%rdi)
> + mov %dl, 2(%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 3(%rdi), %rax
> +# endif
> +# ifdef USE_AS_STRCAT
> + xor %ch, %ch
> + movb %ch, 3(%rdi)
> +# endif
> + ret
> +
> + .p2align 4
> +L(StrncpyExit4):
> + mov (%rsi), %edx
> + mov %edx, (%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 4(%rdi), %rax
> +# endif
> +# ifdef USE_AS_STRCAT
> + xor %ch, %ch
> + movb %ch, 4(%rdi)
> +# endif
> + ret
> +
> + .p2align 4
> +L(StrncpyExit5):
> + mov (%rsi), %ecx
> + mov 4(%rsi), %dl
> + mov %ecx, (%rdi)
> + mov %dl, 4(%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 5(%rdi), %rax
> +# endif
> +# ifdef USE_AS_STRCAT
> + xor %ch, %ch
> + movb %ch, 5(%rdi)
> +# endif
> + ret
> +
> + .p2align 4
> +L(StrncpyExit6):
> + mov (%rsi), %ecx
> + mov 4(%rsi), %dx
> + mov %ecx, (%rdi)
> + mov %dx, 4(%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 6(%rdi), %rax
> +# endif
> +# ifdef USE_AS_STRCAT
> + xor %ch, %ch
> + movb %ch, 6(%rdi)
> +# endif
> + ret
> +
> + .p2align 4
> +L(StrncpyExit7):
> + mov (%rsi), %ecx
> + mov 3(%rsi), %edx
> + mov %ecx, (%rdi)
> + mov %edx, 3(%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 7(%rdi), %rax
> +# endif
> +# ifdef USE_AS_STRCAT
> + xor %ch, %ch
> + movb %ch, 7(%rdi)
> +# endif
> + ret
> +
> + .p2align 4
> +L(StrncpyExit8):
> + mov (%rsi), %rdx
> + mov %rdx, (%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 8(%rdi), %rax
> +# endif
> +# ifdef USE_AS_STRCAT
> + xor %ch, %ch
> + movb %ch, 8(%rdi)
> +# endif
> + ret
> +
> + .p2align 4
> +L(StrncpyExit9):
> + mov (%rsi), %rcx
> + mov 8(%rsi), %dl
> + mov %rcx, (%rdi)
> + mov %dl, 8(%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 9(%rdi), %rax
> +# endif
> +# ifdef USE_AS_STRCAT
> + xor %ch, %ch
> + movb %ch, 9(%rdi)
> +# endif
> + ret
> +
> + .p2align 4
> +L(StrncpyExit10):
> + mov (%rsi), %rcx
> + mov 8(%rsi), %dx
> + mov %rcx, (%rdi)
> + mov %dx, 8(%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 10(%rdi), %rax
> +# endif
> +# ifdef USE_AS_STRCAT
> + xor %ch, %ch
> + movb %ch, 10(%rdi)
> +# endif
> + ret
> +
> + .p2align 4
> +L(StrncpyExit11):
> + mov (%rsi), %rcx
> + mov 7(%rsi), %edx
> + mov %rcx, (%rdi)
> + mov %edx, 7(%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 11(%rdi), %rax
> +# endif
> +# ifdef USE_AS_STRCAT
> + xor %ch, %ch
> + movb %ch, 11(%rdi)
> +# endif
> + ret
> +
> + .p2align 4
> +L(StrncpyExit12):
> + mov (%rsi), %rcx
> + mov 8(%rsi), %edx
> + mov %rcx, (%rdi)
> + mov %edx, 8(%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 12(%rdi), %rax
> +# endif
> +# ifdef USE_AS_STRCAT
> + xor %ch, %ch
> + movb %ch, 12(%rdi)
> +# endif
> + ret
> +
> + .p2align 4
> +L(StrncpyExit13):
> + mov (%rsi), %rcx
> + mov 5(%rsi), %rdx
> + mov %rcx, (%rdi)
> + mov %rdx, 5(%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 13(%rdi), %rax
> +# endif
> +# ifdef USE_AS_STRCAT
> + xor %ch, %ch
> + movb %ch, 13(%rdi)
> +# endif
> + ret
> +
> + .p2align 4
> +L(StrncpyExit14):
> + mov (%rsi), %rcx
> + mov 6(%rsi), %rdx
> + mov %rcx, (%rdi)
> + mov %rdx, 6(%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 14(%rdi), %rax
> +# endif
> +# ifdef USE_AS_STRCAT
> + xor %ch, %ch
> + movb %ch, 14(%rdi)
> +# endif
> + ret
> +
> + .p2align 4
> +L(StrncpyExit15):
> + mov (%rsi), %rcx
> + mov 7(%rsi), %rdx
> + mov %rcx, (%rdi)
> + mov %rdx, 7(%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 15(%rdi), %rax
> +# endif
> +# ifdef USE_AS_STRCAT
> + xor %ch, %ch
> + movb %ch, 15(%rdi)
> +# endif
> + ret
> +
> + .p2align 4
> +L(StrncpyExit16):
> + movdqu (%rsi), %xmm0
> + movdqu %xmm0, (%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 16(%rdi), %rax
> +# endif
> +# ifdef USE_AS_STRCAT
> + xor %ch, %ch
> + movb %ch, 16(%rdi)
> +# endif
> + ret
> +
> + .p2align 4
> +L(StrncpyExit17):
> + movdqu (%rsi), %xmm0
> + mov 16(%rsi), %cl
> + movdqu %xmm0, (%rdi)
> + mov %cl, 16(%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 17(%rdi), %rax
> +# endif
> +# ifdef USE_AS_STRCAT
> + xor %ch, %ch
> + movb %ch, 17(%rdi)
> +# endif
> + ret
> +
> + .p2align 4
> +L(StrncpyExit18):
> + movdqu (%rsi), %xmm0
> + mov 16(%rsi), %cx
> + movdqu %xmm0, (%rdi)
> + mov %cx, 16(%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 18(%rdi), %rax
> +# endif
> +# ifdef USE_AS_STRCAT
> + xor %ch, %ch
> + movb %ch, 18(%rdi)
> +# endif
> + ret
> +
> + .p2align 4
> +L(StrncpyExit19):
> + movdqu (%rsi), %xmm0
> + mov 15(%rsi), %ecx
> + movdqu %xmm0, (%rdi)
> + mov %ecx, 15(%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 19(%rdi), %rax
> +# endif
> +# ifdef USE_AS_STRCAT
> + xor %ch, %ch
> + movb %ch, 19(%rdi)
> +# endif
> + ret
> +
> + .p2align 4
> +L(StrncpyExit20):
> + movdqu (%rsi), %xmm0
> + mov 16(%rsi), %ecx
> + movdqu %xmm0, (%rdi)
> + mov %ecx, 16(%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 20(%rdi), %rax
> +# endif
> +# ifdef USE_AS_STRCAT
> + xor %ch, %ch
> + movb %ch, 20(%rdi)
> +# endif
> + ret
> +
> + .p2align 4
> +L(StrncpyExit21):
> + movdqu (%rsi), %xmm0
> + mov 16(%rsi), %ecx
> + mov 20(%rsi), %dl
> + movdqu %xmm0, (%rdi)
> + mov %ecx, 16(%rdi)
> + mov %dl, 20(%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 21(%rdi), %rax
> +# endif
> +# ifdef USE_AS_STRCAT
> + xor %ch, %ch
> + movb %ch, 21(%rdi)
> +# endif
> + ret
> +
> + .p2align 4
> +L(StrncpyExit22):
> + movdqu (%rsi), %xmm0
> + mov 14(%rsi), %rcx
> + movdqu %xmm0, (%rdi)
> + mov %rcx, 14(%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 22(%rdi), %rax
> +# endif
> +# ifdef USE_AS_STRCAT
> + xor %ch, %ch
> + movb %ch, 22(%rdi)
> +# endif
> + ret
> +
> + .p2align 4
> +L(StrncpyExit23):
> + movdqu (%rsi), %xmm0
> + mov 15(%rsi), %rcx
> + movdqu %xmm0, (%rdi)
> + mov %rcx, 15(%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 23(%rdi), %rax
> +# endif
> +# ifdef USE_AS_STRCAT
> + xor %ch, %ch
> + movb %ch, 23(%rdi)
> +# endif
> + ret
> +
> + .p2align 4
> +L(StrncpyExit24):
> + movdqu (%rsi), %xmm0
> + mov 16(%rsi), %rcx
> + movdqu %xmm0, (%rdi)
> + mov %rcx, 16(%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 24(%rdi), %rax
> +# endif
> +# ifdef USE_AS_STRCAT
> + xor %ch, %ch
> + movb %ch, 24(%rdi)
> +# endif
> + ret
> +
> + .p2align 4
> +L(StrncpyExit25):
> + movdqu (%rsi), %xmm0
> + mov 16(%rsi), %rdx
> + mov 24(%rsi), %cl
> + movdqu %xmm0, (%rdi)
> + mov %rdx, 16(%rdi)
> + mov %cl, 24(%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 25(%rdi), %rax
> +# endif
> +# ifdef USE_AS_STRCAT
> + xor %ch, %ch
> + movb %ch, 25(%rdi)
> +# endif
> + ret
> +
> + .p2align 4
> +L(StrncpyExit26):
> + movdqu (%rsi), %xmm0
> + mov 16(%rsi), %rdx
> + mov 24(%rsi), %cx
> + movdqu %xmm0, (%rdi)
> + mov %rdx, 16(%rdi)
> + mov %cx, 24(%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 26(%rdi), %rax
> +# endif
> +# ifdef USE_AS_STRCAT
> + xor %ch, %ch
> + movb %ch, 26(%rdi)
> +# endif
> + ret
> +
> + .p2align 4
> +L(StrncpyExit27):
> + movdqu (%rsi), %xmm0
> + mov 16(%rsi), %rdx
> + mov 23(%rsi), %ecx
> + movdqu %xmm0, (%rdi)
> + mov %rdx, 16(%rdi)
> + mov %ecx, 23(%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 27(%rdi), %rax
> +# endif
> +# ifdef USE_AS_STRCAT
> + xor %ch, %ch
> + movb %ch, 27(%rdi)
> +# endif
> + ret
> +
> + .p2align 4
> +L(StrncpyExit28):
> + movdqu (%rsi), %xmm0
> + mov 16(%rsi), %rdx
> + mov 24(%rsi), %ecx
> + movdqu %xmm0, (%rdi)
> + mov %rdx, 16(%rdi)
> + mov %ecx, 24(%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 28(%rdi), %rax
> +# endif
> +# ifdef USE_AS_STRCAT
> + xor %ch, %ch
> + movb %ch, 28(%rdi)
> +# endif
> + ret
> +
> + .p2align 4
> +L(StrncpyExit29):
> + movdqu (%rsi), %xmm0
> + movdqu 13(%rsi), %xmm2
> + movdqu %xmm0, (%rdi)
> + movdqu %xmm2, 13(%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 29(%rdi), %rax
> +# endif
> +# ifdef USE_AS_STRCAT
> + xor %ch, %ch
> + movb %ch, 29(%rdi)
> +# endif
> + ret
> +
> + .p2align 4
> +L(StrncpyExit30):
> + movdqu (%rsi), %xmm0
> + movdqu 14(%rsi), %xmm2
> + movdqu %xmm0, (%rdi)
> + movdqu %xmm2, 14(%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 30(%rdi), %rax
> +# endif
> +# ifdef USE_AS_STRCAT
> + xor %ch, %ch
> + movb %ch, 30(%rdi)
> +# endif
> + ret
> +
> + .p2align 4
> +L(StrncpyExit31):
> + movdqu (%rsi), %xmm0
> + movdqu 15(%rsi), %xmm2
> + movdqu %xmm0, (%rdi)
> + movdqu %xmm2, 15(%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 31(%rdi), %rax
> +# endif
> +# ifdef USE_AS_STRCAT
> + xor %ch, %ch
> + movb %ch, 31(%rdi)
> +# endif
> + ret
> +
> + .p2align 4
> +L(StrncpyExit32):
> + movdqu (%rsi), %xmm0
> + movdqu 16(%rsi), %xmm2
> + movdqu %xmm0, (%rdi)
> + movdqu %xmm2, 16(%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 32(%rdi), %rax
> +# endif
> +# ifdef USE_AS_STRCAT
> + xor %ch, %ch
> + movb %ch, 32(%rdi)
> +# endif
> + ret
> +
> + .p2align 4
> +L(StrncpyExit33):
> + movdqu (%rsi), %xmm0
> + movdqu 16(%rsi), %xmm2
> + mov 32(%rsi), %cl
> + movdqu %xmm0, (%rdi)
> + movdqu %xmm2, 16(%rdi)
> + mov %cl, 32(%rdi)
> +# ifdef USE_AS_STRCAT
> + xor %ch, %ch
> + movb %ch, 33(%rdi)
> +# endif
> + ret
> +
> +# ifndef USE_AS_STRCAT
> +
> + .p2align 4
> +L(Fill0):
> + ret
> +
> + .p2align 4
> +L(Fill1):
> + mov %dl, (%rdi)
> + ret
> +
> + .p2align 4
> +L(Fill2):
> + mov %dx, (%rdi)
> + ret
> +
> + .p2align 4
> +L(Fill3):
> + mov %edx, -1(%rdi)
> + ret
> +
> + .p2align 4
> +L(Fill4):
> + mov %edx, (%rdi)
> + ret
> +
> + .p2align 4
> +L(Fill5):
> + mov %edx, (%rdi)
> + mov %dl, 4(%rdi)
> + ret
> +
> + .p2align 4
> +L(Fill6):
> + mov %edx, (%rdi)
> + mov %dx, 4(%rdi)
> + ret
> +
> + .p2align 4
> +L(Fill7):
> + mov %rdx, -1(%rdi)
> + ret
> +
> + .p2align 4
> +L(Fill8):
> + mov %rdx, (%rdi)
> + ret
> +
> + .p2align 4
> +L(Fill9):
> + mov %rdx, (%rdi)
> + mov %dl, 8(%rdi)
> + ret
> +
> + .p2align 4
> +L(Fill10):
> + mov %rdx, (%rdi)
> + mov %dx, 8(%rdi)
> + ret
> +
> + .p2align 4
> +L(Fill11):
> + mov %rdx, (%rdi)
> + mov %edx, 7(%rdi)
> + ret
> +
> + .p2align 4
> +L(Fill12):
> + mov %rdx, (%rdi)
> + mov %edx, 8(%rdi)
> + ret
> +
> + .p2align 4
> +L(Fill13):
> + mov %rdx, (%rdi)
> + mov %rdx, 5(%rdi)
> + ret
> +
> + .p2align 4
> +L(Fill14):
> + mov %rdx, (%rdi)
> + mov %rdx, 6(%rdi)
> + ret
> +
> + .p2align 4
> +L(Fill15):
> + movdqu %xmm0, -1(%rdi)
> + ret
> +
> + .p2align 4
> +L(Fill16):
> + movdqu %xmm0, (%rdi)
> + ret
> +
> + .p2align 4
> +L(CopyFrom1To16BytesUnalignedXmm2):
> + movdqu %xmm2, (%rdi, %rcx)
> +
> + .p2align 4
> +L(CopyFrom1To16BytesXmmExit):
> + bsf %rdx, %rdx
> + add $15, %r8
> + add %rcx, %rdi
> +# ifdef USE_AS_STPCPY
> + lea (%rdi, %rdx), %rax
> +# endif
> + sub %rdx, %r8
> + lea 1(%rdi, %rdx), %rdi
> +
> + .p2align 4
> +L(StrncpyFillTailWithZero):
> + pxor %xmm0, %xmm0
> + xor %rdx, %rdx
> + sub $16, %r8
> + jbe L(StrncpyFillExit)
> +
> + movdqu %xmm0, (%rdi)
> + add $16, %rdi
> +
> + mov %rdi, %rsi
> + and $0xf, %rsi
> + sub %rsi, %rdi
> + add %rsi, %r8
> + sub $64, %r8
> + jb L(StrncpyFillLess64)
> +
> +L(StrncpyFillLoopMovdqa):
> + movdqa %xmm0, (%rdi)
> + movdqa %xmm0, 16(%rdi)
> + movdqa %xmm0, 32(%rdi)
> + movdqa %xmm0, 48(%rdi)
> + add $64, %rdi
> + sub $64, %r8
> + jae L(StrncpyFillLoopMovdqa)
> +
> +L(StrncpyFillLess64):
> + add $32, %r8
> + jl L(StrncpyFillLess32)
> + movdqa %xmm0, (%rdi)
> + movdqa %xmm0, 16(%rdi)
> + add $32, %rdi
> + sub $16, %r8
> + jl L(StrncpyFillExit)
> + movdqa %xmm0, (%rdi)
> + add $16, %rdi
> + BRANCH_TO_JMPTBL_ENTRY (L(FillTable), %r8, 4)
> +
> +L(StrncpyFillLess32):
> + add $16, %r8
> + jl L(StrncpyFillExit)
> + movdqa %xmm0, (%rdi)
> + add $16, %rdi
> + BRANCH_TO_JMPTBL_ENTRY (L(FillTable), %r8, 4)
> +
> +L(StrncpyFillExit):
> + add $16, %r8
> + BRANCH_TO_JMPTBL_ENTRY (L(FillTable), %r8, 4)
> +
> +/* end of ifndef USE_AS_STRCAT */
> +# endif
> +
> + .p2align 4
> +L(UnalignedLeaveCase2OrCase3):
> + test %rdx, %rdx
> + jnz L(Unaligned64LeaveCase2)
> +L(Unaligned64LeaveCase3):
> + lea 64(%r8), %rcx
> + and $-16, %rcx
> + add $48, %r8
> + jl L(CopyFrom1To16BytesCase3)
> + movdqu %xmm4, (%rdi)
> + sub $16, %r8
> + jb L(CopyFrom1To16BytesCase3)
> + movdqu %xmm5, 16(%rdi)
> + sub $16, %r8
> + jb L(CopyFrom1To16BytesCase3)
> + movdqu %xmm6, 32(%rdi)
> + sub $16, %r8
> + jb L(CopyFrom1To16BytesCase3)
> + movdqu %xmm7, 48(%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 64(%rdi), %rax
> +# endif
> +# ifdef USE_AS_STRCAT
> + xor %ch, %ch
> + movb %ch, 64(%rdi)
> +# endif
> + ret
> +
> + .p2align 4
> +L(Unaligned64LeaveCase2):
> + xor %rcx, %rcx
> + pcmpeqb %xmm4, %xmm0
> + pmovmskb %xmm0, %rdx
> + add $48, %r8
> + jle L(CopyFrom1To16BytesCase2OrCase3)
> + test %rdx, %rdx
> +# ifndef USE_AS_STRCAT
> + jnz L(CopyFrom1To16BytesUnalignedXmm4)
> +# else
> + jnz L(CopyFrom1To16Bytes)
> +# endif
> + pcmpeqb %xmm5, %xmm0
> + pmovmskb %xmm0, %rdx
> + movdqu %xmm4, (%rdi)
> + add $16, %rcx
> + sub $16, %r8
> + jbe L(CopyFrom1To16BytesCase2OrCase3)
> + test %rdx, %rdx
> +# ifndef USE_AS_STRCAT
> + jnz L(CopyFrom1To16BytesUnalignedXmm5)
> +# else
> + jnz L(CopyFrom1To16Bytes)
> +# endif
> +
> + pcmpeqb %xmm6, %xmm0
> + pmovmskb %xmm0, %rdx
> + movdqu %xmm5, 16(%rdi)
> + add $16, %rcx
> + sub $16, %r8
> + jbe L(CopyFrom1To16BytesCase2OrCase3)
> + test %rdx, %rdx
> +# ifndef USE_AS_STRCAT
> + jnz L(CopyFrom1To16BytesUnalignedXmm6)
> +# else
> + jnz L(CopyFrom1To16Bytes)
> +# endif
> +
> + pcmpeqb %xmm7, %xmm0
> + pmovmskb %xmm0, %rdx
> + movdqu %xmm6, 32(%rdi)
> + lea 16(%rdi, %rcx), %rdi
> + lea 16(%rsi, %rcx), %rsi
> + bsf %rdx, %rdx
> + cmp %r8, %rdx
> + jb L(CopyFrom1To16BytesExit)
> + BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
> +
> + .p2align 4
> +L(ExitZero):
> +# ifndef USE_AS_STRCAT
> + mov %rdi, %rax
> +# endif
> + ret
> +
> +# endif
> +
> +# ifndef USE_AS_STRCAT
> +END (STRCPY)
> +# else
> +END (STRCAT)
> +# endif
> + .p2align 4
> + .section .rodata
> +L(ExitTable):
> + .int JMPTBL(L(Exit1), L(ExitTable))
> + .int JMPTBL(L(Exit2), L(ExitTable))
> + .int JMPTBL(L(Exit3), L(ExitTable))
> + .int JMPTBL(L(Exit4), L(ExitTable))
> + .int JMPTBL(L(Exit5), L(ExitTable))
> + .int JMPTBL(L(Exit6), L(ExitTable))
> + .int JMPTBL(L(Exit7), L(ExitTable))
> + .int JMPTBL(L(Exit8), L(ExitTable))
> + .int JMPTBL(L(Exit9), L(ExitTable))
> + .int JMPTBL(L(Exit10), L(ExitTable))
> + .int JMPTBL(L(Exit11), L(ExitTable))
> + .int JMPTBL(L(Exit12), L(ExitTable))
> + .int JMPTBL(L(Exit13), L(ExitTable))
> + .int JMPTBL(L(Exit14), L(ExitTable))
> + .int JMPTBL(L(Exit15), L(ExitTable))
> + .int JMPTBL(L(Exit16), L(ExitTable))
> + .int JMPTBL(L(Exit17), L(ExitTable))
> + .int JMPTBL(L(Exit18), L(ExitTable))
> + .int JMPTBL(L(Exit19), L(ExitTable))
> + .int JMPTBL(L(Exit20), L(ExitTable))
> + .int JMPTBL(L(Exit21), L(ExitTable))
> + .int JMPTBL(L(Exit22), L(ExitTable))
> + .int JMPTBL(L(Exit23), L(ExitTable))
> + .int JMPTBL(L(Exit24), L(ExitTable))
> + .int JMPTBL(L(Exit25), L(ExitTable))
> + .int JMPTBL(L(Exit26), L(ExitTable))
> + .int JMPTBL(L(Exit27), L(ExitTable))
> + .int JMPTBL(L(Exit28), L(ExitTable))
> + .int JMPTBL(L(Exit29), L(ExitTable))
> + .int JMPTBL(L(Exit30), L(ExitTable))
> + .int JMPTBL(L(Exit31), L(ExitTable))
> + .int JMPTBL(L(Exit32), L(ExitTable))
> +# ifdef USE_AS_STRNCPY
> +L(ExitStrncpyTable):
> + .int JMPTBL(L(StrncpyExit0), L(ExitStrncpyTable))
> + .int JMPTBL(L(StrncpyExit1), L(ExitStrncpyTable))
> + .int JMPTBL(L(StrncpyExit2), L(ExitStrncpyTable))
> + .int JMPTBL(L(StrncpyExit3), L(ExitStrncpyTable))
> + .int JMPTBL(L(StrncpyExit4), L(ExitStrncpyTable))
> + .int JMPTBL(L(StrncpyExit5), L(ExitStrncpyTable))
> + .int JMPTBL(L(StrncpyExit6), L(ExitStrncpyTable))
> + .int JMPTBL(L(StrncpyExit7), L(ExitStrncpyTable))
> + .int JMPTBL(L(StrncpyExit8), L(ExitStrncpyTable))
> + .int JMPTBL(L(StrncpyExit9), L(ExitStrncpyTable))
> + .int JMPTBL(L(StrncpyExit10), L(ExitStrncpyTable))
> + .int JMPTBL(L(StrncpyExit11), L(ExitStrncpyTable))
> + .int JMPTBL(L(StrncpyExit12), L(ExitStrncpyTable))
> + .int JMPTBL(L(StrncpyExit13), L(ExitStrncpyTable))
> + .int JMPTBL(L(StrncpyExit14), L(ExitStrncpyTable))
> + .int JMPTBL(L(StrncpyExit15), L(ExitStrncpyTable))
> + .int JMPTBL(L(StrncpyExit16), L(ExitStrncpyTable))
> + .int JMPTBL(L(StrncpyExit17), L(ExitStrncpyTable))
> + .int JMPTBL(L(StrncpyExit18), L(ExitStrncpyTable))
> + .int JMPTBL(L(StrncpyExit19), L(ExitStrncpyTable))
> + .int JMPTBL(L(StrncpyExit20), L(ExitStrncpyTable))
> + .int JMPTBL(L(StrncpyExit21), L(ExitStrncpyTable))
> + .int JMPTBL(L(StrncpyExit22), L(ExitStrncpyTable))
> + .int JMPTBL(L(StrncpyExit23), L(ExitStrncpyTable))
> + .int JMPTBL(L(StrncpyExit24), L(ExitStrncpyTable))
> + .int JMPTBL(L(StrncpyExit25), L(ExitStrncpyTable))
> + .int JMPTBL(L(StrncpyExit26), L(ExitStrncpyTable))
> + .int JMPTBL(L(StrncpyExit27), L(ExitStrncpyTable))
> + .int JMPTBL(L(StrncpyExit28), L(ExitStrncpyTable))
> + .int JMPTBL(L(StrncpyExit29), L(ExitStrncpyTable))
> + .int JMPTBL(L(StrncpyExit30), L(ExitStrncpyTable))
> + .int JMPTBL(L(StrncpyExit31), L(ExitStrncpyTable))
> + .int JMPTBL(L(StrncpyExit32), L(ExitStrncpyTable))
> + .int JMPTBL(L(StrncpyExit33), L(ExitStrncpyTable))
> +# ifndef USE_AS_STRCAT
> + .p2align 4
> +L(FillTable):
> + .int JMPTBL(L(Fill0), L(FillTable))
> + .int JMPTBL(L(Fill1), L(FillTable))
> + .int JMPTBL(L(Fill2), L(FillTable))
> + .int JMPTBL(L(Fill3), L(FillTable))
> + .int JMPTBL(L(Fill4), L(FillTable))
> + .int JMPTBL(L(Fill5), L(FillTable))
> + .int JMPTBL(L(Fill6), L(FillTable))
> + .int JMPTBL(L(Fill7), L(FillTable))
> + .int JMPTBL(L(Fill8), L(FillTable))
> + .int JMPTBL(L(Fill9), L(FillTable))
> + .int JMPTBL(L(Fill10), L(FillTable))
> + .int JMPTBL(L(Fill11), L(FillTable))
> + .int JMPTBL(L(Fill12), L(FillTable))
> + .int JMPTBL(L(Fill13), L(FillTable))
> + .int JMPTBL(L(Fill14), L(FillTable))
> + .int JMPTBL(L(Fill15), L(FillTable))
> + .int JMPTBL(L(Fill16), L(FillTable))
> +# endif
> +# endif
> +#endif
> diff --git a/sysdeps/x86_64/multiarch/strncpy.S b/sysdeps/x86_64/multiarch/strncpy.S
> index 6d87a0b..afbd870 100644
> --- a/sysdeps/x86_64/multiarch/strncpy.S
> +++ b/sysdeps/x86_64/multiarch/strncpy.S
> @@ -1,5 +1,85 @@
> -/* Multiple versions of strncpy
> - All versions must be listed in ifunc-impl-list.c. */
> -#define STRCPY strncpy
> +/* Multiple versions of strncpy
> + All versions must be listed in ifunc-impl-list.c.
> + Copyright (C) 2009-2015 Free Software Foundation, Inc.
> + Contributed by Intel Corporation.
> + This file is part of the GNU C Library.
> +
> + The GNU C Library is free software; you can redistribute it and/or
> + modify it under the terms of the GNU Lesser General Public
> + License as published by the Free Software Foundation; either
> + version 2.1 of the License, or (at your option) any later version.
> +
> + The GNU C Library is distributed in the hope that it will be useful,
> + but WITHOUT ANY WARRANTY; without even the implied warranty of
> + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
> + Lesser General Public License for more details.
> +
> + You should have received a copy of the GNU Lesser General Public
> + License along with the GNU C Library; if not, see
> + <http://www.gnu.org/licenses/>. */
> +
> +#include <sysdep.h>
> +#include <init-arch.h>
> +
> #define USE_AS_STRNCPY
> -#include "strcpy.S"
> +#ifndef STRNCPY
> +#define STRNCPY strncpy
> +#endif
> +
> +#ifdef USE_AS_STPCPY
> +# define STRNCPY_SSSE3 __stpncpy_ssse3
> +# define STRNCPY_SSE2 __stpncpy_sse2
> +# define STRNCPY_SSE2_UNALIGNED __stpncpy_sse2_unaligned
> +# define __GI_STRNCPY __GI_stpncpy
> +# define __GI___STRNCPY __GI___stpncpy
> +#else
> +# define STRNCPY_SSSE3 __strncpy_ssse3
> +# define STRNCPY_SSE2 __strncpy_sse2
> +# define STRNCPY_SSE2_UNALIGNED __strncpy_sse2_unaligned
> +# define __GI_STRNCPY __GI_strncpy
> +#endif
> +
> +
> +/* Define multiple versions only for the definition in libc. */
> +#if IS_IN (libc)
> + .text
> +ENTRY(STRNCPY)
> + .type STRNCPY, @gnu_indirect_function
> + cmpl $0, __cpu_features+KIND_OFFSET(%rip)
> + jne 1f
> + call __init_cpu_features
> +1: leaq STRNCPY_SSE2_UNALIGNED(%rip), %rax
> + testl $bit_Fast_Unaligned_Load, __cpu_features+FEATURE_OFFSET+index_Fast_Unaligned_Load(%rip)
> + jnz 2f
> + leaq STRNCPY_SSE2(%rip), %rax
> + testl $bit_SSSE3, __cpu_features+CPUID_OFFSET+index_SSSE3(%rip)
> + jz 2f
> + leaq STRNCPY_SSSE3(%rip), %rax
> +2: ret
> +END(STRNCPY)
> +
> +# undef ENTRY
> +# define ENTRY(name) \
> + .type STRNCPY_SSE2, @function; \
> + .align 16; \
> + .globl STRNCPY_SSE2; \
> + .hidden STRNCPY_SSE2; \
> + STRNCPY_SSE2: cfi_startproc; \
> + CALL_MCOUNT
> +# undef END
> +# define END(name) \
> + cfi_endproc; .size STRNCPY_SSE2, .-STRNCPY_SSE2
> +# undef libc_hidden_builtin_def
> +/* It doesn't make sense to send libc-internal strncpy calls through a PLT.
> +   The speedup we get from using SSSE3 instructions is likely eaten away
> +   by the indirect call in the PLT.  */
> +# define libc_hidden_builtin_def(name) \
> + .globl __GI_STRNCPY; __GI_STRNCPY = STRNCPY_SSE2
> +# undef libc_hidden_def
> +# define libc_hidden_def(name) \
> + .globl __GI___STRNCPY; __GI___STRNCPY = STRNCPY_SSE2
> +#endif
> +
> +#ifndef USE_AS_STRNCPY
> +#include "../strcpy.S"
> +#endif
> --
> 1.8.4.rc3