This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
Re: [Patch, AArch64] Optimized strcpy
- From: Ondřej Bílka <neleai at seznam dot cz>
- To: Richard Earnshaw <rearnsha at arm dot com>
- Cc: Glibc Development List <libc-alpha at sourceware dot org>
- Date: Thu, 18 Dec 2014 16:15:33 +0100
- Subject: Re: [Patch, AArch64] Optimized strcpy
- Authentication-results: sourceware.org; auth=none
- References: <54917329 dot 4090601 at arm dot com> <5491759B dot 4020704 at arm dot com>
On Wed, Dec 17, 2014 at 12:22:51PM +0000, Richard Earnshaw wrote:
> On 17/12/14 12:12, Richard Earnshaw wrote:
> > This patch contains an optimized implementation of strcpy for AArch64
> > systems. Benchmarking shows that it is approximately 20-25% faster than
> > the generic implementation across the board.
> >
> > R.
> >
> > <date> Richard Earnshaw <rearnsha@arm.com>
> >
> > * sysdeps/aarch64/strcpy.S: New file.
> >
> >
>
> Er, sorry. That's the wrong version of the patch.
>
> Here's the correct one.
>
> R.
Micro-optimizations I promised:
> + ldp data1, data2, [srcin]
> + add src, srcin, #16
> + sub tmp1, data1, zeroones
> + orr tmp2, data1, #REP8_7f
> + sub tmp3, data2, zeroones
> + orr tmp4, data2, #REP8_7f
> + bic has_nul1, tmp1, tmp2
> + bics has_nul2, tmp3, tmp4
> + ccmp has_nul1, #0, #0, eq /* NZCV = 0000 */
> + b.ne L(early_end_found)
Flip the branch and move a copy of early_end_found here: this path is the
likely one, so falling through would reduce the instruction cache footprint
of the hot path.
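A rough sketch of what I mean (untested; L(no_early_end) is just an
illustrative label):

        b.eq    L(no_early_end)
        /* Inline a copy of the L(early_end_found) handling for the
           first 16 bytes here, so the common short-string case falls
           straight through instead of taking a branch.  */
        ...
L(no_early_end):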
> + ldp data1a, data2a, [srcin]
> + stp data1a, data2a, [dst], #16
> + sub dst, dst, to_align
> + /* Everything is now set up, so we can just fall into the bulk
> + copy loop. */
> + /* The inner loop deals with two Dwords at a time. This has a
> + slightly higher start-up cost, but we should win quite quickly,
> + especially on cores with a high number of issue slots per
> + cycle, as we get much better parallelism out of the operations. */
> +L(main_loop):
Again, check whether aligning the loop helps. Sometimes it makes no
difference, but sometimes a loop runs twice as slowly just because of
misalignment.
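For example, put

        .p2align 4      /* or 6; measure which alignment pays off */

just before L(main_loop).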
> + ldp data1, data2, [src], #16
> + sub tmp1, data1, zeroones
> + orr tmp2, data1, #REP8_7f
> + sub tmp3, data2, zeroones
> + orr tmp4, data2, #REP8_7f
> + bic has_nul1, tmp1, tmp2
> + bics has_nul2, tmp3, tmp4
> + ccmp has_nul1, #0, #0, eq /* NZCV = 0000 */
> + b.ne L(early_end_found)
This check is unnecessary; it is better to jump back and resume the main
check, e.g. by replacing it with

        b       L(could_read_crosspage)

where the new label goes here:

        tbnz    tmp2, #MIN_PAGE_P2, L(page_cross)
#endif
L(could_read_crosspage):

You will check some bytes twice, which is ok since this branch is almost
never executed.
> +
> + /* The string is short (<32 bytes). We don't know exactly how
> + short though, yet. Work out the exact length so that we can
> + quickly select the optimal copy strategy. */
> +L(early_end_found):
> + cmp has_nul1, #0
> +#ifdef __AARCH64EB__
> + /* For big-endian, carry propagation (if the final byte in the
> + string is 0x01) means we cannot use has_nul directly. The
> + easiest way to get the correct byte is to byte-swap the data
> + and calculate the syndrome a second time. */
> + csel data1, data1, data2, ne
> + rev data1, data1
> + sub tmp1, data1, zeroones
> + orr tmp2, data1, #REP8_7f
> + bic has_nul1, tmp1, tmp2
> +#else
> + csel has_nul1, has_nul1, has_nul2, ne
> +#endif
Just use a branch. You need to decide whether the string is at least 8 bytes
long anyway, so there is no additional misprediction (unless you are
optimizing for size).
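Roughly (untested; the label is just illustrative and the big-endian
recomputation would go on the corresponding path):

        cbnz    has_nul1, L(nul_in_first_dword)
        /* nul is somewhere in data2: get its position from has_nul2
           and pick the copy strategy for that case directly.  */
        ...
L(nul_in_first_dword):
        /* nul is in data1: get its position from has_nul1.  */
        ...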
> +L(lt16):
> + /* 8->15 bytes to copy. */
> + ldr data1, [srcin]
These loads are unnecessary in the likely case where there is no page
crossing; you already read these bytes at the start.
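A sketch, assuming data1 still holds the entry load of [srcin] on this
path (little-endian, no page cross):

        /* The first ldr can simply be dropped.  */
        str     data1, [dstin]
        ldr     data2, [src, #-8]
        str     data2, [dst, #-8]
        ret

(Avoiding the second load as well would need a variable-shift extract
across data1/data2, which may not be a win.)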
> + ldr data2, [src, #-8]
> + str data1, [dstin]
> + str data2, [dst, #-8]
> + ret
> +L(lt8):
> + cmp len, #4
> + b.lt L(lt4)
> + /* 4->7 bytes to copy. */
> + ldr data1w, [srcin]
> + ldr data2w, [src, #-4]
Same comment as before. You could also create data2w from data1 by
bit-shifting; test whether that is faster than a load on ARM.
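Little-endian sketch (untested; tmp1 is whatever scratch register is
free, and data1 is again assumed to still hold the entry load of
[srcin]):

        sub     tmp1, len, #4
        lsl     tmp1, tmp1, #3          /* 8 * (len - 4)  */
        lsr     data2, data1, tmp1      /* bytes [len-4, len) of data1  */
        str     data1w, [dstin]
        str     data2w, [dst, #-4]
        ret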
> + str data1w, [dstin]
> + str data2w, [dst, #-4]
> + ret
> +L(lt4):
> + cmp len, #2
> + b.lt L(lt2)
> + /* 2->3 bytes to copy. */
> + ldrh data1w, [srcin]
> + strh data1w, [dstin]
> + /* Fall-through, one byte (max) to go. */
> +L(lt2):
> + /* Null-terminated string. Last character must be zero! */
> + strb wzr, [dst, #-1]
> + ret
> +END (strcpy)
> +libc_hidden_builtin_def (strcpy)