This is the mail archive of the
mailing list for the glibc project.
Re: [Patch, AArch64] Optimized strcpy
- From: Richard Earnshaw <rearnsha at arm dot com>
- To: Adhemerval Zanella <azanella at linux dot vnet dot ibm dot com>, "libc-alpha at sourceware dot org" <libc-alpha at sourceware dot org>
- Date: Thu, 18 Dec 2014 12:45:12 +0000
- Subject: Re: [Patch, AArch64] Optimized strcpy
- Authentication-results: sourceware.org; auth=none
- References: <54917329 dot 4090601 at arm dot com> <20141218010555 dot GA914 at domone> <54923160 dot 90600 at linux dot vnet dot ibm dot com>
On 18/12/14 01:44, Adhemerval Zanella wrote:
> On 17-12-2014 23:05, OndÅej BÃlka wrote:
>> On Wed, Dec 17, 2014 at 12:12:25PM +0000, Richard Earnshaw wrote:
>>> This patch contains an optimized implementation of strcpy for AArch64
>>> systems. Benchmarking shows that it is approximately 20-25% faster than
>>> the generic implementation across the board.
>> I looked quickly for patch, I found two microoptimizations below and
>> probable performance problem.
>> Handing sizes 1-8 is definitely not slow path, its hot path. My profiler
>> shows that 88.36% of calls use less than 16 bytes and 1-8 byte range is
>> more likely than 9-16 bytes so you should optimize that case well.
>> See number of calls graph in strcpy function at
>> My main three strcpy users are mutt, firefox, bash/sh. As first two are
>> interactive its hard to do direct benchmark. So you should try how that
>> affects bash.
>> You could try to measure bash running time directly as it uses strcpy
>> relatively often. I measured following bash script and if you LD_PRELOAD
>> a byte-by-byte loop [A] then it decreases performance by 10%.
> I gave a try to this benchmark and, at least on powerpc64, I could not see *any*
> real strcpy usage running using bash (version 4.3.11). And profilers also do
> not show strcpy being a hotspots. Also, and I don't know if it is intentional,
> it is issuing 'gnuplot2'.
> And I see your characterized strcpy being direct from desktop workloads, which
> is a limited usage behavior. For powerpc64, as example, 'firefox' and 'mutt'
> is far from the default workloads running on it.
> So the focus is not that '1-8 *are* hot path' or if you 'profiler' shows that
> mostly of calls are less than 16 bytes, but what kind of workloads aarch64
> implementation are trying to solve here. It would be good if the patch
> proposal disclaimer better what kind of tests it did, like if he only ran
> the GLIBC benchtests of if he is trying to optimize for a real case usage.
I'm not targeting a specific workload; the code is intended to be
generic. On that basis I made the following assumptions:
String lengths would likely follow a negative exponential distribution
over a large set of workloads. Individual workloads would undoubtedly
not follow such a distribution.
Sequential calls to the string routine would not have a predictable
alignment or length of the source string, so branch predictors would be
unlikely to be able to accurately predict the outcome of most branches.
Source strings might be aligned to an 8-byte boundary but are unlikely
to be specifically aligned to a 16-byte boundary (ie most likely ~50% of
8-byte aligned strings will also be 16-byte aligned). It's unlikely
that proportion of 8-byte aligned strings would significantly dominate
the profile to the extent that a branch predictor would be able to make
real use of this property.
The start address of source strings will be randomly distributed across
a page (making a page-cross check very predictable: ~99.5%).
Running the glibc strcmp benchmark showed that the new code was 20-25%
faster than the existing C code for *all* cases measured, not just at
specific points; so overall this is a notable improvement. Note that
the C code uses strlen and memcpy, so is already using similar
techniques and is pretty good now.
Note, the version of the patch that I intended to submit is this one
(not the version first posted):
It's worth noting that if, say 75% or more of strings are <= 16 bytes in
length it's likely that the main path through code will be reasonably
predictable since we avoid any loops in that case. The only
hard-to-predict branches will be those in the final decision tree needed
to handle copies of < 16 bytes. In those cases we split the cases into
1 byte ; 2-3 bytes ; 4-7 bytes; 8-16 bytes
and binary chop between them.
>> Using bash only differs bit from aggegated case, there is additional peak
>> at ~500 bytes caused by copying PATH variable. Otherwise it does things
>> like copying each command name ten times before it runs it etc. Graph is
>> char *
>> strcpy (char *dest, const char *src)
>> char *destp = dest;
>> while (*dest)
>> *dest++ = *src++;
>> *dest = '\0';
>> return destp;
>> and comments:
>>> +ENTRY_ALIGN (strcpy,6)
>>> + mov zeroones, #REP8_01
>>> + mov dst, dstin
>>> + ands tmp1, src, #15
>>> + b.ne L(misaligned)
>>> + /* NUL detection works on the principle that (X - 1) & (~X) & 0x80
>>> + (=> (X - 1) & ~(X | 0x7f)) is non-zero iff a byte is zero, and
>>> + can be done in parallel across the entire word. */
>>> + /* The inner loop deals with two Dwords at a time. This has a
>>> + slightly higher start-up cost, but we should win quite quickly,
>>> + especially on cores with a high number of issue slots per
>>> + cycle, as we get much better parallelism out of the operations. */
>>> + b L(first_pass)
>> Why do you use branch instead moving it here?
>> Is this aligned because you calculated size of instruction above or its
>> just missing?
>>> + /* Slow path. We can't be sure we've moved at least 8 bytes, so
>>> + fall back to a slow byte-by byte store of the bits already
>>> + loaded.
>>> + The worst case when coming through this path is that we've had
>>> + to copy seven individual bytes to get to alignment and we then
>>> + have to copy another seven (eight for big-endian) again here.
>>> + We could try to detect that case (and any case where more than
>>> + eight bytes have to be copied), but it really doesn't seem
>>> + worth it. */
>> which happens frequently, see above.