This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
Re: [Patch, AArch64] Optimized strcpy
- From: Richard Earnshaw <rearnsha at arm dot com>
- To: Ondřej Bílka <neleai at seznam dot cz>
- Cc: Glibc Development List <libc-alpha at sourceware dot org>
- Date: Thu, 18 Dec 2014 10:55:25 +0000
- Subject: Re: [Patch, AArch64] Optimized strcpy
- Authentication-results: sourceware.org; auth=none
- References: <54917329 dot 4090601 at arm dot com> <20141218010555 dot GA914 at domone>
On 18/12/14 01:05, Ondřej Bílka wrote:
> On Wed, Dec 17, 2014 at 12:12:25PM +0000, Richard Earnshaw wrote:
>> This patch contains an optimized implementation of strcpy for AArch64
>> systems. Benchmarking shows that it is approximately 20-25% faster than
>> the generic implementation across the board.
> I looked quickly at the patch; I found two micro-optimizations below and a
> probable performance problem.
Thanks for looking at this. Unfortunately, you've looked at the wrong
version -- the version I accidentally posted first. The correct version
was a near-complete rewrite following my own benchmarking; I posted that
as a follow-up.
> Handling sizes 1-8 is definitely not the slow path, it's the hot path. My profiler
> shows that 88.36% of calls use less than 16 bytes, and the 1-8 byte range is
> more likely than 9-16 bytes, so you should optimize that case well.
I'd expect that. But the key question is 'how much more likely'?
Having an inner loop dealing with 16 bytes at a time probably only costs
about 5-10% more time *per iteration* than an inner loop dealing with 8
bytes, since much of the cost is in waiting for the load instruction to
return data from the cache or memory and most of the other instructions
will dual-issue on any reasonable implementation of the architecture.
So to win in the overall game of performance we'd need to show that
short (< 8 byte) strings were significantly more likely than 8-16 byte
strings. That seems quite unlikely to me, though I admit I don't have
hard numbers; looking at your data suggests that (approximately) each
larger block size is ~20% of the size of the previous block (I'm
guessing these are 8-byte blocks, but I can't immediately see a
definition), which suggests to me that we have a net win with preferring
16 bytes over 8 bytes per iteration.
Note that starting off with 8-byte blocks and later switching to 16-byte
blocks would most likely involve another unpredictable branch as we
inserted the late realignment check.
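For illustration only -- this is a rough C sketch, not the AArch64 code in the
patch -- the zero-byte test described in the comment quoted below and the
two-dwords-per-iteration shape look something like this (REP8_01 comes from the
patch; REP8_7F, has_nul and strcpy_sketch are just names invented for the
sketch):

#include <stdint.h>
#include <string.h>

#define REP8_01 0x0101010101010101ULL
#define REP8_7F 0x7f7f7f7f7f7f7f7fULL

/* Non-zero iff at least one byte of X is zero:
   (X - 1) & ~X & 0x80... == (X - 1) & ~(X | 0x7f...).  */
static inline uint64_t
has_nul (uint64_t x)
{
  return (x - REP8_01) & ~(x | REP8_7F);
}

/* Copy 16 bytes per iteration until one of the two dwords contains a
   NUL, then finish byte by byte.  memcpy stands in for the loads and
   stores the real code keeps in registers.  NOTE: this sketch reads up
   to 15 bytes beyond the terminator; the real code aligns SRC first so
   those reads can never cross a page boundary.  */
char *
strcpy_sketch (char *dst, const char *src)
{
  char *const ret = dst;
  uint64_t d1, d2;

  for (;;)
    {
      memcpy (&d1, src, 8);
      memcpy (&d2, src + 8, 8);
      if (has_nul (d1) | has_nul (d2))
        break;
      memcpy (dst, src, 16);
      src += 16;
      dst += 16;
    }

  /* Tail: at most 16 bytes including the terminator.  */
  while ((*dst++ = *src++) != '\0')
    continue;
  return ret;
}

The point is that the second dword's check mostly dual-issues alongside the
loads, which is why the per-iteration cost stays close to that of an 8-byte
loop while covering twice the data.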
> See the number-of-calls graph for the strcpy function at
> My main three strcpy users are mutt, firefox and bash/sh. As the first two are
> interactive, it's hard to do a direct benchmark, so you should try how that
> affects bash.
> You could try to measure bash running time directly, as it uses strcpy
> relatively often. I measured the following bash script, and if you LD_PRELOAD
> a byte-by-byte loop [A] then it decreases performance by 10%.
> Using bash only differs a bit from the aggregated case; there is an additional peak
> at ~500 bytes caused by copying the PATH variable. Otherwise it does things
> like copying each command name ten times before it runs it, etc. Graph is
> /* [A] Naive byte-by-byte strcpy used for the comparison.  */
> char *
> strcpy (char *dest, const char *src)
> {
>   char *destp = dest;
>
>   while (*src)
>     *dest++ = *src++;
>   *dest = '\0';
>   return destp;
> }
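For anyone who wants to reproduce that kind of measurement: the sketch below is
a hypothetical LD_PRELOAD shim in the same spirit -- the file name, script name
and bucket layout are all invented here, not taken from the mail.  It does the
byte-by-byte copy, so it doubles as the [A] comparison, and it also tallies copy
lengths so the size distribution of the workload can be seen.

/* strcpy_prof.c -- hypothetical LD_PRELOAD shim: byte-by-byte strcpy
   plus a histogram of copy lengths, dumped to stderr at exit.
   Build and run, with assumed names:
     gcc -O2 -fPIC -shared strcpy_prof.c -o strcpy_prof.so
     LD_PRELOAD=$PWD/strcpy_prof.so bash ./some-script.sh
   Only calls that reach the dynamic symbol strcpy are seen; calls the
   compiler turned into __strcpy_chk or inline copies are not.  Not
   thread-safe, which is fine for bash.  */
#include <stdio.h>

#define NBUCKETS 64                    /* 8-byte buckets: 0-7, 8-15, ...  */
static unsigned long buckets[NBUCKETS];

static void __attribute__ ((destructor))
dump (void)
{
  for (int i = 0; i < NBUCKETS; i++)
    if (buckets[i] != 0)
      fprintf (stderr, "%3d-%3d bytes: %lu\n",
               i * 8, i * 8 + 7, buckets[i]);
}

char *
strcpy (char *dest, const char *src)
{
  char *destp = dest;
  size_t len = 0;

  while (*src)
    {
      *dest++ = *src++;
      len++;
    }
  *dest = '\0';

  buckets[len / 8 < NBUCKETS ? len / 8 : NBUCKETS - 1]++;
  return destp;
}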
> and comments:
>> +ENTRY_ALIGN (strcpy,6)
>> + mov zeroones, #REP8_01
>> + mov dst, dstin
>> + ands tmp1, src, #15
>> + b.ne L(misaligned)
>> + /* NUL detection works on the principle that (X - 1) & (~X) & 0x80
>> + (=> (X - 1) & ~(X | 0x7f)) is non-zero iff a byte is zero, and
>> + can be done in parallel across the entire word. */
>> + /* The inner loop deals with two Dwords at a time. This has a
>> + slightly higher start-up cost, but we should win quite quickly,
>> + especially on cores with a high number of issue slots per
>> + cycle, as we get much better parallelism out of the operations. */
>> + b L(first_pass)
> Why do you use a branch instead of moving it here?
> Is this aligned because you calculated the size of the instructions above, or is it
> just missing?
>> + /* Slow path. We can't be sure we've moved at least 8 bytes, so
>> + fall back to a slow byte-by-byte store of the bits already
>> + loaded.
>> + The worst case when coming through this path is that we've had
>> + to copy seven individual bytes to get to alignment and we then
>> + have to copy another seven (eight for big-endian) again here.
>> + We could try to detect that case (and any case where more than
>> + eight bytes have to be copied), but it really doesn't seem
>> + worth it. */
> which happens frequently, see above.
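To make the trade-off described in that comment a little more concrete, here is
a rough C illustration of the misaligned entry idea (again not the patch
itself; ALIGN and align_prologue are invented names): copy single bytes until
the source pointer is aligned, bailing out early if the terminator shows up,
and only then let the word-wise loop take over.

#include <stdint.h>

#define ALIGN 8                 /* With ALIGN == 8 the worst case is the
                                   seven single-byte copies the comment
                                   above mentions.  */

/* Returns 1 if the string ended inside the prologue (nothing left to
   do), 0 once SRC has reached ALIGN-byte alignment and the fast loop
   can continue from the updated pointers.  */
static int
align_prologue (char **dstp, const char **srcp)
{
  char *dst = *dstp;
  const char *src = *srcp;

  while (((uintptr_t) src & (ALIGN - 1)) != 0)
    {
      char c = *src++;
      *dst++ = c;
      if (c == '\0')
        return 1;
    }

  *dstp = dst;
  *srcp = src;
  return 0;
}

Whether it is worth detecting and special-casing the "copied twice" situation
the comment describes comes down to how often short misaligned strings occur,
which is exactly the distribution question discussed above.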