This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH V1] x86-64: Optimize strcat/strncat, strcpy/strncpy and stpcpy/stpncpy with AVX2
- From: "H.J. Lu" <hjl dot tools at gmail dot com>
- To: Leonardo Sandoval <leonardo dot sandoval dot gonzalez at linux dot intel dot com>
- Cc: GNU C Library <libc-alpha at sourceware dot org>
- Date: Thu, 10 Jan 2019 10:29:18 -0800
- Subject: Re: [PATCH V1] x86-64: Optimize strcat/strncat, strcpy/strncpy and stpcpy/stpncpy with AVX2
- References: <20190107202122.23215-1-leonardo.sandoval.gonzalez@linux.intel.com>
On Mon, Jan 7, 2019 at 12:21 PM
<leonardo.sandoval.gonzalez@linux.intel.com> wrote:
>
> From: Leonardo Sandoval <leonardo.sandoval.gonzalez@linux.intel.com>
>
> Optimize x86-64 strcat/strncat, strcpy/strncpy and stpcpy/stpncpy with AVX2.
> The new routines use vector comparison as much as possible. In general, the
> longer the source string, the greater the performance gain observed, with
> speedups reaching 1.6x over the SSE2 unaligned routines. Select AVX2
> strcat/strncat, strcpy/strncpy and stpcpy/stpncpy on AVX2 machines where
> vzeroupper is preferred and AVX unaligned load is fast.
>
> * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
> strcat-avx2, strncat-avx2, strcpy-avx2, strncpy-avx2,
> stpcpy-avx2 and stpncpy-avx2.
> * sysdeps/x86_64/multiarch/ifunc-impl-list.c:
> (__libc_ifunc_impl_list): Add tests for __strcat_avx2,
> __strncat_avx2, __strcpy_avx2, __strncpy_avx2, __stpcpy_avx2
> and __stpncpy_avx2.
> * sysdeps/x86_64/multiarch/{ifunc-unaligned-ssse3.h =>
> ifunc-strcpy.h}: Rename header to a more generic name.
> * sysdeps/x86_64/multiarch/ifunc-strcpy.h:
> (IFUNC_SELECTOR): Return OPTIMIZE (avx2) on AVX2 machines if
> AVX unaligned load is fast and vzeroupper is preferred.
> * sysdeps/x86_64/multiarch/stpcpy-avx2.S: New file.
> * sysdeps/x86_64/multiarch/stpncpy-avx2.S: Likewise.
> * sysdeps/x86_64/multiarch/strcat-avx2.S: Likewise.
> * sysdeps/x86_64/multiarch/strcpy-avx2.S: Likewise.
> * sysdeps/x86_64/multiarch/strncat-avx2.S: Likewise.
> * sysdeps/x86_64/multiarch/strncpy-avx2.S: Likewise.
> ---
>
> NOTE: This patch is the same as [V1] (so I am effectively resending it; sorry for the noise). The intention
> is to make things clear and see whether it can now be merged into master. As suggested in the comments on V1,
> I implemented V2 [V2] using generic routines, but found that its performance is not as good as V1's; see
> [3]. The plots there show the speedup of V1 over SSE2 Unaligned and of V2 over SSE2 Unaligned,
> so values >1 are speedups, not regressions. The V1 curve is >=1 most of the time and lies above the V2 curve,
> meaning V1 performs better than both SSE2 Unaligned and V2.
>
> [V1] https://patchwork.ozlabs.org/patch/980578/
> [V2] https://patchwork.ozlabs.org/patch/1008490/
> [3] https://github.com/lsandoval/strcpy-comparison/blob/master/speedup-v1-v2.png
>
LGTM.
Thanks.
--
H.J.
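
The selection rule stated in the quoted patch description (use the AVX2
routines on AVX2 machines where vzeroupper is preferred and AVX unaligned
load is fast) can be sketched in plain C as below. This is only an
illustration of the stated condition, not the code from the patch: the
struct, its field names, the enum and the function name are all
hypothetical, and the fallback stands for whichever non-AVX2 routine
would otherwise be chosen.

```c
#include <stdbool.h>

/* Hypothetical description of the CPU properties the patch text mentions.
   These field names are illustrative, not glibc's internal feature bits.  */
struct cpu_props
{
  bool avx2_usable;              /* AVX2 instructions can be used.  */
  bool avx_fast_unaligned_load;  /* Unaligned AVX loads are fast.  */
  bool vzeroupper_preferred;     /* vzeroupper is preferred on this CPU.  */
};

/* Hypothetical labels for the implementation that gets picked.  */
enum strcpy_impl
{
  IMPL_AVX2,      /* The AVX2 routines added by the patch.  */
  IMPL_FALLBACK   /* Whatever non-AVX2 routine would be chosen otherwise.  */
};

/* Pick AVX2 only when all three stated conditions hold.  */
enum strcpy_impl
select_strcpy_impl (struct cpu_props p)
{
  if (p.avx2_usable && p.avx_fast_unaligned_load && p.vzeroupper_preferred)
    return IMPL_AVX2;
  return IMPL_FALLBACK;
}
```

In this sketch, a machine that has AVX2 but does not prefer vzeroupper
gets the fallback, which matches the wording of the description.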