This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.

Re: [PATCH V1] x86-64: Optimize strcat/strncat, strcpy/strncpy and stpcpy/stpncpy with AVX2

From: "H.J. Lu" <hjl dot tools at gmail dot com>
To: Leonardo Sandoval <leonardo dot sandoval dot gonzalez at linux dot intel dot com>
Cc: GNU C Library <libc-alpha at sourceware dot org>
Date: Thu, 10 Jan 2019 10:29:18 -0800
Subject: Re: [PATCH V1] x86-64: Optimize strcat/strncat, strcpy/strncpy and stpcpy/stpncpy with AVX2
References: <20190107202122.23215-1-leonardo.sandoval.gonzalez@linux.intel.com>

On Mon, Jan 7, 2019 at 12:21 PM
<leonardo.sandoval.gonzalez@linux.intel.com> wrote:
>
> From: Leonardo Sandoval <leonardo.sandoval.gonzalez@linux.intel.com>
>
> Optimize x86-64 strcat/strncat, strcpy/strncpy and stpcpy/stpncpy with AVX2.
> It uses vector comparison as much as possible. In general, the larger the
> source string, the greater performance gain observed, reaching speedups of
> 1.6x compared to SSE2 unaligned routines. Select AVX2 strcat/strncat,
> strcpy/strncpy and stpcpy/stpncpy on AVX2 machines where vzeroupper is
> preferred and AVX unaligned load is fast.
>
>         * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
>         strcat-avx2, strncat-avx2, strcpy-avx2, strncpy-avx2,
>         stpcpy-avx2 and stpncpy-avx2.
>         * sysdeps/x86_64/multiarch/ifunc-impl-list.c:
>         (__libc_ifunc_impl_list): Add tests for __strcat_avx2,
>         __strncat_avx2, __strcpy_avx2, __strncpy_avx2, __stpcpy_avx2
>         and __stpncpy_avx2.
>         * sysdeps/x86_64/multiarch/{ifunc-unaligned-ssse3.h =>
>         ifunc-strcpy.h}: Rename header to a more generic name.
>         * sysdeps/x86_64/multiarch/ifunc-strcpy.h:
>         (IFUNC_SELECTOR): Return OPTIMIZE (avx2) on AVX2 machines if
>         AVX unaligned load is fast and vzeroupper is preferred.
>         * sysdeps/x86_64/multiarch/stpcpy-avx2.S: New file.
>         * sysdeps/x86_64/multiarch/stpncpy-avx2.S: Likewise.
>         * sysdeps/x86_64/multiarch/strcat-avx2.S: Likewise.
>         * sysdeps/x86_64/multiarch/strcpy-avx2.S: Likewise.
>         * sysdeps/x86_64/multiarch/strncat-avx2.S: Likewise.
>         * sysdeps/x86_64/multiarch/strncpy-avx2.S: Likewise.
> ---
>
> NOTE: This patch is the same as [V1] (so I am effectively resending it, sorry for the spam), but the intention
> is to make things clear and see if it can now be merged into master. As suggested in V1's comments,
> I implemented V2 [V2] using generic routines, but found that its performance is not as good as V1's; see
> the plots in [3], which show the speedup of V1 over SSE2 Unaligned and of V2 over SSE2 Unaligned.
> Numbers >1 are therefore speedups, not regressions. The V1 curve is >=1 most of the time and above V2, meaning
> better performance compared to both SSE2 Unaligned and V2.
>
> [V1] https://patchwork.ozlabs.org/patch/980578/
> [V2] https://patchwork.ozlabs.org/patch/1008490/
> [3] https://github.com/lsandoval/strcpy-comparison/blob/master/speedup-v1-v2.png
>

LGTM.

Thanks.

-- 
H.J.

References:
- [PATCH V1] x86-64: Optimize strcat/strncat, strcpy/strncpy and stpcpy/stpncpy with AVX2
  - From: leonardo . sandoval . gonzalez
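
Note for archive readers: the selection rule stated in the ChangeLog entry for ifunc-strcpy.h (pick the AVX2 variant only when AVX2 is available, AVX unaligned loads are fast, and vzeroupper is preferred) can be sketched in plain C as below. This is a minimal illustration, not the code from the patch: `struct cpu_caps`, `enum strcpy_impl` and `select_strcpy_impl` are hypothetical names, and everything other than the AVX2 choice is collapsed into a single fallback value because the message does not describe the fallback order.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the CPU feature bits the selector consults.
   The field names are illustrative, not glibc's.  */
struct cpu_caps
{
  bool avx2_usable;             /* AVX2 instructions can be used.  */
  bool avx_fast_unaligned_load; /* Unaligned AVX loads are fast.  */
  bool prefer_vzeroupper;       /* vzeroupper is the preferred way to
                                   leave AVX code on this machine.  */
};

/* Hypothetical result type: the patch itself returns a function
   pointer via OPTIMIZE (avx2); an enum keeps the sketch simple.  */
enum strcpy_impl
{
  IMPL_FALLBACK, /* Any pre-existing, non-AVX2 variant (assumption).  */
  IMPL_AVX2
};

/* Return IMPL_AVX2 only when all three conditions named in the
   ChangeLog entry hold; otherwise fall back.  */
static enum strcpy_impl
select_strcpy_impl (const struct cpu_caps *caps)
{
  if (caps->avx2_usable
      && caps->avx_fast_unaligned_load
      && caps->prefer_vzeroupper)
    return IMPL_AVX2;

  return IMPL_FALLBACK;
}
```

With all three fields set to true the sketch returns IMPL_AVX2; clearing any one of them yields IMPL_FALLBACK, which mirrors the "AVX2 machines where vzeroupper is preferred and AVX unaligned load is fast" wording of the commit message.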
