This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
Re: [Patch, AArch64] Optimized strcpy
- From: Richard Earnshaw <rearnsha at arm dot com>
- To: Ondřej Bílka <neleai at seznam dot cz>
- Cc: Glibc Development List <libc-alpha at sourceware dot org>
- Date: Thu, 18 Dec 2014 10:55:25 +0000
- Subject: Re: [Patch, AArch64] Optimized strcpy
- Authentication-results: sourceware.org; auth=none
- References: <54917329 dot 4090601 at arm dot com> <20141218010555 dot GA914 at domone>
On 18/12/14 01:05, Ondřej Bílka wrote:
> On Wed, Dec 17, 2014 at 12:12:25PM +0000, Richard Earnshaw wrote:
>> This patch contains an optimized implementation of strcpy for AArch64
>> systems. Benchmarking shows that it is approximately 20-25% faster than
>> the generic implementation across the board.
> I looked quickly at the patch; I found two micro-optimizations below and a
> probable performance problem.
Thanks for looking at this. Unfortunately, you've looked at the wrong
version -- the version I accidentally posted first. The correct version
was a near-complete rewrite following my own benchmarking; I posted that
as a follow-up.
> Handling sizes 1-8 is definitely not the slow path, it's the hot path. My profiler
> shows that 88.36% of calls use less than 16 bytes, and the 1-8 byte range is
> more likely than 9-16 bytes, so you should optimize that case well.
I'd expect that. But the key question is 'how much more likely'?
Having an inner loop dealing with 16 bytes at a time probably only costs
about 5-10% more time *per iteration* than an inner loop dealing with 8
bytes, since much of the cost is in waiting for the load instruction to
return data from the cache or memory and most of the other instructions
will dual-issue on any reasonable implementation of the architecture.
So to win in the overall game of performance we'd need to show that
short (< 8 byte) strings were significantly more likely than 8-16 byte
strings. That seems quite unlikely to me, though I admit I don't have
hard numbers; looking at your data suggests that (approximately) each
larger block size is ~20% of the size of the previous block (I'm
guessing these are 8-byte blocks, but I can't immediately see a
definition), which suggests to me that we have a net win with preferring
16 bytes over 8 bytes per iteration.
Note that starting off with 8-byte blocks and later switching to 16-byte
blocks would most likely involve another unpredictable branch as we
inserted the late realignment check.
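For illustration only -- this is a rough C sketch, not the AArch64 code in the
patch -- the zero-byte test described in the comment quoted below and the
two-dwords-per-iteration shape look something like this (REP8_01 comes from the
patch; REP8_7F, has_nul and strcpy_sketch are just names invented for the
sketch):

#include <stdint.h>
#include <string.h>

#define REP8_01 0x0101010101010101ULL
#define REP8_7F 0x7f7f7f7f7f7f7f7fULL

/* Non-zero iff at least one byte of X is zero:
   (X - 1) & ~X & 0x80... == (X - 1) & ~(X | 0x7f...).  */
static inline uint64_t
has_nul (uint64_t x)
{
  return (x - REP8_01) & ~(x | REP8_7F);
}

/* Copy 16 bytes per iteration until one of the two dwords contains a
   NUL, then finish byte by byte.  memcpy stands in for the loads and
   stores the real code keeps in registers.  NOTE: this sketch reads up
   to 15 bytes beyond the terminator; the real code aligns SRC first so
   those reads can never cross a page boundary.  */
char *
strcpy_sketch (char *dst, const char *src)
{
  char *const ret = dst;
  uint64_t d1, d2;

  for (;;)
    {
      memcpy (&d1, src, 8);
      memcpy (&d2, src + 8, 8);
      if (has_nul (d1) | has_nul (d2))
        break;
      memcpy (dst, src, 16);
      src += 16;
      dst += 16;
    }

  /* Tail: at most 16 bytes including the terminator.  */
  while ((*dst++ = *src++) != '\0')
    continue;
  return ret;
}

The point is that the second dword's check mostly dual-issues alongside the
loads, which is why the per-iteration cost stays close to that of an 8-byte
loop while covering twice the data.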
> See the number-of-calls graph for the strcpy function at
> My main three strcpy users are mutt, firefox and bash/sh. As the first two are
> interactive, it's hard to do a direct benchmark, so you should try how that
> affects bash.
> You could try to measure bash running time directly, as it uses strcpy
> relatively often. I measured the following bash script, and if you LD_PRELOAD
> a byte-by-byte loop [A] then it decreases performance by 10%.
> Using bash only differs a bit from the aggregated case; there is an additional peak
> at ~500 bytes caused by copying the PATH variable. Otherwise it does things
> like copying each command name ten times before it runs it, etc. Graph is
> /* [A] Naive byte-by-byte strcpy used for the comparison.  */
> char *
> strcpy (char *dest, const char *src)
> {
>   char *destp = dest;
>
>   while (*src)
>     *dest++ = *src++;
>   *dest = '\0';
>   return destp;
> }
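For anyone who wants to reproduce that kind of measurement: the sketch below is
a hypothetical LD_PRELOAD shim in the same spirit -- the file name, script name
and bucket layout are all invented here, not taken from the mail.  It does the
byte-by-byte copy, so it doubles as the [A] comparison, and it also tallies copy
lengths so the size distribution of the workload can be seen.

/* strcpy_prof.c -- hypothetical LD_PRELOAD shim: byte-by-byte strcpy
   plus a histogram of copy lengths, dumped to stderr at exit.
   Build and run, with assumed names:
     gcc -O2 -fPIC -shared strcpy_prof.c -o strcpy_prof.so
     LD_PRELOAD=$PWD/strcpy_prof.so bash ./some-script.sh
   Only calls that reach the dynamic symbol strcpy are seen; calls the
   compiler turned into __strcpy_chk or inline copies are not.  Not
   thread-safe, which is fine for bash.  */
#include <stdio.h>

#define NBUCKETS 64                    /* 8-byte buckets: 0-7, 8-15, ...  */
static unsigned long buckets[NBUCKETS];

static void __attribute__ ((destructor))
dump (void)
{
  for (int i = 0; i < NBUCKETS; i++)
    if (buckets[i] != 0)
      fprintf (stderr, "%3d-%3d bytes: %lu\n",
               i * 8, i * 8 + 7, buckets[i]);
}

char *
strcpy (char *dest, const char *src)
{
  char *destp = dest;
  size_t len = 0;

  while (*src)
    {
      *dest++ = *src++;
      len++;
    }
  *dest = '\0';

  buckets[len / 8 < NBUCKETS ? len / 8 : NBUCKETS - 1]++;
  return destp;
}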
> and comments:
>> +ENTRY_ALIGN (strcpy,6)
>> + mov zeroones, #REP8_01
>> + mov dst, dstin
>> + ands tmp1, src, #15
>> + b.ne L(misaligned)
>> + /* NUL detection works on the principle that (X - 1) & (~X) & 0x80
>> + (=> (X - 1) & ~(X | 0x7f)) is non-zero iff a byte is zero, and
>> + can be done in parallel across the entire word. */
>> + /* The inner loop deals with two Dwords at a time. This has a
>> + slightly higher start-up cost, but we should win quite quickly,
>> + especially on cores with a high number of issue slots per
>> + cycle, as we get much better parallelism out of the operations. */
>> + b L(first_pass)
> Why do you use a branch instead of moving it here?
> Is this aligned because you calculated the size of the instructions above, or is it
> just missing?
>> + /* Slow path. We can't be sure we've moved at least 8 bytes, so
>> + fall back to a slow byte-by-byte store of the bits already
>> + loaded.
>> + The worst case when coming through this path is that we've had
>> + to copy seven individual bytes to get to alignment and we then
>> + have to copy another seven (eight for big-endian) again here.
>> + We could try to detect that case (and any case where more than
>> + eight bytes have to be copied), but it really doesn't seem
>> + worth it. */
> which happens frequently, see above.
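To make the trade-off described in that comment a little more concrete, here is
a rough C illustration of the misaligned entry idea (again not the patch
itself; ALIGN and align_prologue are invented names): copy single bytes until
the source pointer is aligned, bailing out early if the terminator shows up,
and only then let the word-wise loop take over.

#include <stdint.h>

#define ALIGN 8                 /* With ALIGN == 8 the worst case is the
                                   seven single-byte copies the comment
                                   above mentions.  */

/* Returns 1 if the string ended inside the prologue (nothing left to
   do), 0 once SRC has reached ALIGN-byte alignment and the fast loop
   can continue from the updated pointers.  */
static int
align_prologue (char **dstp, const char **srcp)
{
  char *dst = *dstp;
  const char *src = *srcp;

  while (((uintptr_t) src & (ALIGN - 1)) != 0)
    {
      char c = *src++;
      *dst++ = c;
      if (c == '\0')
        return 1;
    }

  *dstp = dst;
  *srcp = src;
  return 0;
}

Whether it is worth detecting and special-casing the "copied twice" situation
the comment describes comes down to how often short misaligned strings occur,
which is exactly the distribution question discussed above.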