This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH v4] aarch64: thunderx2 memcpy optimizations for ext-based code path
- From: Wilco Dijkstra <Wilco dot Dijkstra at arm dot com>
- To: Anton Youdkevitch <anton dot youdkevitch at bell-sw dot com>, "libc-alpha at sourceware dot org" <libc-alpha at sourceware dot org>
- Cc: nd <nd at arm dot com>
- Date: Tue, 26 Mar 2019 20:40:34 +0000
- Subject: Re: [PATCH v4] aarch64: thunderx2 memcpy optimizations for ext-based code path
- References: <5C994D96.6000303@bell-sw.com>
Hi Anton,
> I appreciate your comments very much. Here is the patch
> considering the points you made.
>
> 1. Always taken conditional branch at the beginning is
> removed.
>
> 2. Epilogue code is placed after the end of the loop to
> reduce the number of branches.
>
> 3. The redundant "mov" instructions inside the loop are
> gone due to the changed order of the registers in the ext
> instructions inside the loop.
>
> 4. Invariant code in the loop epilogue is no longer
> repeated for each ext chunk.
That looks much better indeed! The alignment can still be improved
though:
819d0: 6e037840 ext v0.16b, v2.16b, v3.16b, #15
819d4: 6e047861 ext v1.16b, v3.16b, v4.16b, #15
819d8: 6e057887 ext v7.16b, v4.16b, v5.16b, #15
819dc: ac810460 stp q0, q1, [x3], #32
819e0: f9814021 prfm pldl1strm, [x1, #640]
819e4: acc10c22 ldp q2, q3, [x1], #32
819e8: 6e0678b0 ext v16.16b, v5.16b, v6.16b, #15
819ec: ac814067 stp q7, q16, [x3], #32
819f0: 6e0278c0 ext v0.16b, v6.16b, v2.16b, #15
819f4: 6e037841 ext v1.16b, v2.16b, v3.16b, #15
819f8: acc11825 ldp q5, q6, [x1], #32
819fc: 6e057867 ext v7.16b, v3.16b, v5.16b, #15
81a00: f1010042 subs x2, x2, #0x40
81a04: 54fffeca b.ge 819dc <__GI___memcpy_thunderx2+0x27c>
So rather than aligning the first instruction as currently done:
#define EXT_CHUNK(shft) \
.p2align 4 ;\
Align the loop instead. If you also add 2 nops after the bx instruction then
everything should work out perfectly.
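To make the alignment point concrete, here is a small arithmetic sketch in Python using only the addresses from the disassembly above. It checks where the first ext and the loop head fall relative to a 16-byte boundary (the boundary .p2align 4 produces), and counts how many 16-byte windows the loop body touches now versus with an aligned loop head. The aligned address 0x819e0 and the helper name windows_spanned are hypothetical, chosen only for illustration, and treating the window count as the thing that matters on this core is an assumption, not a measured result.

```python
# Addresses copied from the disassembly in this message.
FIRST_EXT = 0x819D0    # first ext of the chunk (what .p2align 4 currently aligns)
LOOP_HEAD = 0x819DC    # target of the b.ge, i.e. the real loop entry
LOOP_BRANCH = 0x81A04  # address of the b.ge that closes the loop
INSN_BYTES = 4         # every AArch64 instruction is 4 bytes
ALIGN = 16             # .p2align 4 means 2**4 = 16-byte alignment


def windows_spanned(first_insn, last_insn, align=ALIGN):
    """Count the align-sized windows touched by the code from first_insn to last_insn."""
    last_byte = last_insn + INSN_BYTES - 1
    return last_byte // align - first_insn // align + 1


loop_bytes = LOOP_BRANCH + INSN_BYTES - LOOP_HEAD   # size of the loop body in bytes
first_ext_offset = FIRST_EXT % ALIGN                # offset of the first ext in its window
loop_head_offset = LOOP_HEAD % ALIGN                # offset of the loop head in its window

current_windows = windows_spanned(LOOP_HEAD, LOOP_BRANCH)

# Hypothetical layout: the same loop body starting on a 16-byte boundary.
ALIGNED_HEAD = 0x819E0
aligned_windows = windows_spanned(ALIGNED_HEAD, ALIGNED_HEAD + loop_bytes - INSN_BYTES)

print(f"loop body: {loop_bytes} bytes ({loop_bytes // INSN_BYTES} instructions)")
print(f"first ext offset: {first_ext_offset}, loop head offset: {loop_head_offset}")
print(f"16-byte windows touched: current={current_windows}, aligned={aligned_windows}")
```

With the current layout the aligned instruction is the first ext, which runs only once per chunk, while the loop head sits 12 bytes into a window. Aligning the loop head instead lets the 44-byte loop body fit in three 16-byte windows rather than four.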
Cheers,
Wilco