This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH] x86-64: Optimize strcat/strncat, strcpy/strncpy and stpcpy/stpncpy with AVX2
- From: Leonardo Sandoval <leonardo dot sandoval dot gonzalez at linux dot intel dot com>
- To: Adhemerval Zanella <adhemerval dot zanella at linaro dot org>, libc-alpha at sourceware dot org
- Cc: dalias at libc dot org, fweimer at redhat dot com
- Date: Mon, 17 Dec 2018 10:08:01 -0600
- Subject: Re: [PATCH] x86-64: Optimize strcat/strncat, strcpy/strncpy and stpcpy/stpncpy with AVX2
- References: <20181008135950.9113-1-leonardo.sandoval.gonzalez@linux.intel.com> <2e43a120-bd68-7581-4b1e-889d5713b2a6@linaro.org> <d32a7600d7514176b4ba62eb27c3a988c9aa5c07.camel@linux.intel.com> <fec40b72-aa40-0442-a8d7-78587c87c5a1@linaro.org>
This patch went to a V2, but I found that the later version is
not as fast as V1 [1].
Is this ready for merging?
[1] https://sourceware.org/ml/libc-alpha/2018-12/msg00261.html
On Wed, 2018-10-31 at 17:59 -0300, Adhemerval Zanella wrote:
>
> On 31/10/2018 17:23, Leonardo Sandoval wrote:
> > On Wed, 2018-10-31 at 15:36 -0300, Adhemerval Zanella wrote:
> > >
> > > > diff --git a/sysdeps/x86_64/multiarch/strcat-avx2.S
> > > > b/sysdeps/x86_64/multiarch/strcat-avx2.S
> > > > new file mode 100644
> > > > index 00000000000..b0623564276
> > > > --- /dev/null
> > > > +++ b/sysdeps/x86_64/multiarch/strcat-avx2.S
> > > > @@ -0,0 +1,275 @@
> > > > +/* strcat with AVX2
> > >
> > > > Is this really a gain in real-world usage compared to a generic
> > > > strcat (strcpy (dest + strlen (dest), src)), assuming optimized
> > > > strcpy/strlen?
> >
> > > The speedup is briefly summarized in the commit description, but the
> > > comparison was done between SSE2 Unaligned and AVX2 (current code).
> > > Both SSE2 and AVX2 are much faster than the generic strcat (I can
> > > provide numbers).
> >
>
> As Rich put it, this strcat optimization seems to use the very same
> strategy as strcpy/strlen and basically brings a constant gain, with
> the downside of i-cache overhead plus a maintainability cost (since a
> possible bug in the strcpy/strlen implementation will need to be
> checked against strcat as well).
>
> I also tend to agree with Rich that strcat is an antipattern, and I
> would prefer we use a simpler generic optimization for such cases.
>
> > > Wouldn't it be simpler and more i-cache friendly to use a custom
> > > generic implementation that calls AVX2 strcpy/strlen (as powerpc64
> > > does)?
> >
> > I believe this is what we have right now through IFUNC_SELECTOR.
> >
>
> What I meant was to use something like:
>
> char *strcat_avx2 (char *dest, const char *src)
> {
>   __strcpy_avx2 (dest + __strlen_avx2 (dest), src);
>   return dest;
> }
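
For anyone who wants to try the composed approach outside the glibc
tree, here is a minimal, self-contained sketch of the same idea. It is
only an illustration: the name strcat_composed is hypothetical, and the
standard strlen/strcpy stand in for __strlen_avx2/__strcpy_avx2, which
are glibc-internal symbols and not callable from user code.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical name, for illustration only.  The standard strlen and
   strcpy stand in for the glibc-internal __strlen_avx2 and
   __strcpy_avx2 (an assumption made so the sketch builds anywhere).  */
char *
strcat_composed (char *dest, const char *src)
{
  assert (dest != NULL && src != NULL);

  /* Locate the terminating NUL of dest, then copy src (including its
     own terminating NUL) starting at that position.  */
  strcpy (dest + strlen (dest), src);

  /* Like strcat, return the original destination pointer.  */
  return dest;
}
```

As with strcat itself, the caller must make sure dest has room for both
strings plus the terminating NUL. In this form any speedup comes
entirely from the strlen and strcpy implementations being called, so
there is no separate strcat body to keep in sync with them.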