This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
RE: [PATCH] Improve performance of strncpy
- From: "Wilco Dijkstra" <wdijkstr at arm dot com>
- To: 'Ondřej Bílka' <neleai at seznam dot cz>
- Cc: "'Rich Felker'" <dalias at libc dot org>, "Florian Weimer" <fweimer at redhat dot com>, <azanella at linux dot vnet dot ibm dot com>, <libc-alpha at sourceware dot org>
- Date: Fri, 12 Sep 2014 12:04:07 +0100
- Subject: RE: [PATCH] Improve performance of strncpy
- Authentication-results: sourceware.org; auth=none
- References: <001301cfcd0a$f0b62670$d2227350$ at com> <54108BB0 dot 90902 at redhat dot com> <20140910180144 dot GK23797 at brightrain dot aerifal dot cx> <002501cfcdf7$cc046510$640d2f30$ at com> <20140912062203 dot GB19287 at domone>
> Ondřej Bílka wrote:
> On Thu, Sep 11, 2014 at 08:37:17PM +0100, Wilco Dijkstra wrote:
> > I did a quick experiment with strcpy as it's simpler. Replacing it
> > with memcpy (d, s, strlen (s) + 1) is 3 times faster even on strings
> > of 16Mbytes! Perhaps more surprisingly, it has similar performance on
> > these huge strings as an optimized strcpy.
> >
> What architecture? This could also happen because memcpy has special
> case to handle large strings that speeds this up. Its something that I
> tried in one-pass strcpy but it harms performance as overhead of checking
> size is bigger than benefit of larger size.
The 3x happens on all 3 ISAs I tried. On ARM the memcpy/strlen variant
even beats the optimized strcmp case for most sizes, on x64 it runs at
about 80% of the optimized strcpy for sizes above 4KB.
> > So the results are pretty clear, if you don't have a super optimized
> > strcpy, then strlen+memcpy is the best way to do it.
> >
> It is not that clear as you spend considerable amount of time on small
> lenghts, what is important is constant overhead of strcpy startup.
> However this needs platform specific tricks to decide which alternative
> is fastest.
The overheads are relatively small on modern cores. The memcpy/strlen
is always faster than the single loop for lengths larger than 8-16.
Wilco