Re: [Patch, AArch64] Optimized strcpy


On Wed, Dec 17, 2014 at 12:12:25PM +0000, Richard Earnshaw wrote:
> This patch contains an optimized implementation of strcpy for AArch64
> systems.  Benchmarking shows that it is approximately 20-25% faster than
> the generic implementation across the board.
> 
I had a quick look at the patch; I found two possible micro-optimizations
below and a probable performance problem.

Handling sizes 1-8 is definitely not a slow path, it is the hot path.
My profiler shows that 88.36% of calls copy less than 16 bytes, and the
1-8 byte range is more common than 9-16 bytes, so you should optimize
that case well.

See the number-of-calls graph for the strcpy function at

http://kam.mff.cuni.cz/~ondra/benchmark_string/profile/results/result.html

My three main strcpy users are mutt, firefox and bash/sh. As the first
two are interactive, it is hard to benchmark them directly, so you
should check how the patch affects bash.

You could measure bash running time directly, as it uses strcpy
relatively often. I measured the following bash script; LD_PRELOADing
a byte-by-byte loop [A] decreases performance by 10%.

http://kam.mff.cuni.cz/~ondra/bashtest.sh

The bash-only profile differs only a bit from the aggregated case; there
is an additional peak at ~500 bytes caused by copying the PATH variable.
Otherwise bash does things like copying each command name ten times
before running it, etc. The graph is here:

http://kam.mff.cuni.cz/~ondra/benchmark_string/i7_ivy_bridge/strcpy_profile/results_gcc/result.html


[A]

char *
strcpy (char *dest, const char *src)
{
  char *destp = dest;

  /* Copy bytes until the terminating NUL in SRC is reached.  */
  while (*src)
    *dest++ = *src++;

  *dest = '\0';

  return destp;
}
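
To reproduce the measurement, something along these lines should work
(a sketch only; the file names are examples, not taken from the original
message, and -fno-builtin is just a precaution so gcc does not apply its
built-in strcpy assumptions while compiling the replacement):

gcc -O2 -fno-builtin -fPIC -shared strcpy_byte.c -o strcpy_byte.so
time LD_PRELOAD=$PWD/strcpy_byte.so bash bashtest.sh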


And comments on the patch itself:
> 
> +ENTRY_ALIGN (strcpy,6)
> +	mov	zeroones, #REP8_01
> +	mov	dst, dstin
> +	ands	tmp1, src, #15
> +	b.ne	L(misaligned)
> +	/* NUL detection works on the principle that (X - 1) & (~X) & 0x80
> +	   (=> (X - 1) & ~(X | 0x7f)) is non-zero iff a byte is zero, and
> +	   can be done in parallel across the entire word.  */
> +	/* The inner loop deals with two Dwords at a time.  This has a
> +	   slightly higher start-up cost, but we should win quite quickly,
> +	   especially on cores with a high number of issue slots per
> +	   cycle, as we get much better parallelism out of the operations.  */
> +	b	L(first_pass)

Why do you use a branch instead of moving that code here?
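
As an aside, for anyone less used to the trick described in the comment
above, it corresponds to roughly the following C (a sketch only; the
helper name is mine, REP8_01 matches the constant loaded into zeroones
and REP8_80 is the companion high-bit mask):

#include <stdint.h>

#define REP8_01 0x0101010101010101ULL
#define REP8_80 0x8080808080808080ULL

/* Non-zero iff at least one byte of X is zero: per byte,
   (b - 1) & ~b & 0x80 is 0x80 exactly when b == 0, and the REP8
   constants apply that test to all eight bytes of the word at once.  */
static inline uint64_t
has_nul_byte (uint64_t x)
{
  return (x - REP8_01) & ~x & REP8_80;
}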

> +L(main_loop):

Is this label aligned because you calculated the size of the instructions
above, or is the alignment just missing?

...
> +L(nul_in_data1):
> +	/* Slow path.  We can't be sure we've moved at least 8 bytes, so
> +	   fall back to a slow byte-by byte store of the bits already
> +	   loaded.
> +
> +	   The worst case when coming through this path is that we've had
> +	   to copy seven individual bytes to get to alignment and we then
> +	   have to copy another seven (eight for big-endian) again here.
> +	   We could try to detect that case (and any case where more than
> +	   eight bytes have to be copied), but it really doesn't seem
> +	   worth it.  */

That case happens frequently; see the profile data above.


