This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH V1] x86-64: Optimize strcat/strncat, strcpy/strncpy and stpcpy/stpncpy with AVX2


On Mon, Jan 7, 2019 at 12:21 PM
<leonardo.sandoval.gonzalez@linux.intel.com> wrote:
>
> From: Leonardo Sandoval <leonardo.sandoval.gonzalez@linux.intel.com>
>
> Optimize x86-64 strcat/strncat, strcpy/strncpy and stpcpy/stpncpy with AVX2.
> The new routines use vector comparison as much as possible. In general, the larger the
> source string, the greater the performance gain observed, reaching speedups of
> 1.6x compared to the SSE2 unaligned routines. Select AVX2 strcat/strncat,
> strcpy/strncpy and stpcpy/stpncpy on AVX2 machines where vzeroupper is
> preferred and AVX unaligned load is fast.
>
>         * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
>         strcat-avx2, strncat-avx2, strcpy-avx2, strncpy-avx2,
>         stpcpy-avx2 and stpncpy-avx2.
>         * sysdeps/x86_64/multiarch/ifunc-impl-list.c:
>         (__libc_ifunc_impl_list): Add tests for __strcat_avx2,
>         __strncat_avx2, __strcpy_avx2, __strncpy_avx2, __stpcpy_avx2
>         and __stpncpy_avx2.
>         * sysdeps/x86_64/multiarch/{ifunc-unaligned-ssse3.h =>
>         ifunc-strcpy.h}: Rename header to a more generic name.
>         * sysdeps/x86_64/multiarch/ifunc-strcpy.h:
>         (IFUNC_SELECTOR): Return OPTIMIZE (avx2) on AVX2 machines if
>         AVX unaligned load is fast and vzeroupper is preferred.
>         * sysdeps/x86_64/multiarch/stpcpy-avx2.S: New file.
>         * sysdeps/x86_64/multiarch/stpncpy-avx2.S: Likewise.
>         * sysdeps/x86_64/multiarch/strcat-avx2.S: Likewise.
>         * sysdeps/x86_64/multiarch/strcpy-avx2.S: Likewise.
>         * sysdeps/x86_64/multiarch/strncat-avx2.S: Likewise.
>         * sysdeps/x86_64/multiarch/strncpy-avx2.S: Likewise.
> ---
>
> NOTE: This patch is the same as [V1] (so I am effectively resending it; sorry for the spam), but the intention
> is to make things clear and see whether it can now be merged into master. As suggested in V1's comments,
> I implemented V2 [V2] using generic routines, but found that its performance is not as good as V1's; see
> [3], which plots the speedups of V1 and of V2, each relative to SSE2 Unaligned.
> Numbers >1 are speedups, not regressions. The V1 curve is >=1 most of the time and above the V2 curve,
> meaning better performance than both SSE2 Unaligned and V2.
>
> [V1] https://patchwork.ozlabs.org/patch/980578/
> [V2] https://patchwork.ozlabs.org/patch/1008490/
> [3] https://github.com/lsandoval/strcpy-comparison/blob/master/speedup-v1-v2.png
>
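
For readers following the selection rule quoted above (use the AVX2
routines when AVX2 is usable, AVX unaligned load is fast and vzeroupper
is preferred), here is a minimal sketch of that decision in plain C.
It is only an illustration and not the code from the patch: struct
cpu_flags, its fields, enum strcpy_variant, select_strcpy_variant and
check_selector_examples are hypothetical names, and "vzeroupper is
preferred" is modelled, as an assumption, as the absence of a
"prefer no vzeroupper" hint.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the CPU feature bits a selector consults.  */
struct cpu_flags
{
  bool avx2_usable;             /* AVX2 is available and usable.  */
  bool avx_fast_unaligned_load; /* Unaligned AVX loads are fast.  */
  bool prefer_no_vzeroupper;    /* CPU prefers code without vzeroupper.  */
};

enum strcpy_variant
{
  VARIANT_AVX2,  /* The AVX2 routines described in the quoted text.  */
  VARIANT_OTHER  /* Any other, non-AVX2 choice.  */
};

/* Pick the AVX2 variant only when all three conditions hold.  */
enum strcpy_variant
select_strcpy_variant (const struct cpu_flags *f)
{
  if (f->avx2_usable
      && f->avx_fast_unaligned_load
      && !f->prefer_no_vzeroupper)
    return VARIANT_AVX2;
  return VARIANT_OTHER;
}

/* Small usage example: one machine that qualifies, one that does not.  */
void
check_selector_examples (void)
{
  struct cpu_flags qualifies = { true, true, false };
  struct cpu_flags slow_loads = { true, false, false };

  assert (select_strcpy_variant (&qualifies) == VARIANT_AVX2);
  assert (select_strcpy_variant (&slow_loads) == VARIANT_OTHER);
}
```

The point of the sketch is that all three conditions must hold at
once; a machine that fails any one of them keeps a non-AVX2 routine.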

LGTM.

Thanks.

-- 
H.J.

