This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH v4] aarch64: thunderx2 memcpy optimizations for ext-based code path


Wilco,

On 3/26/2019 23:40, Wilco Dijkstra wrote:
Hi Anton,

I appreciate your comments very much. Here is the patch
considering the points you made.

1. The always-taken conditional branch at the beginning is
removed.

2. Epilogue code is placed after the end of the loop to
reduce the number of branches.

3. The redundant "mov" instructions inside the loop are
gone due to the changed order of the registers in the ext
instructions inside the loop.

4. Invariant code in the loop epilogue is no longer
repeated for each ext chunk.

That looks much better indeed! The alignment can still be improved
though:

    819d0:       6e037840        ext     v0.16b, v2.16b, v3.16b, #15
    819d4:       6e047861        ext     v1.16b, v3.16b, v4.16b, #15
    819d8:       6e057887        ext     v7.16b, v4.16b, v5.16b, #15
    819dc:       ac810460        stp     q0, q1, [x3], #32
    819e0:       f9814021        prfm    pldl1strm, [x1, #640]
    819e4:       acc10c22        ldp     q2, q3, [x1], #32
    819e8:       6e0678b0        ext     v16.16b, v5.16b, v6.16b, #15
    819ec:       ac814067        stp     q7, q16, [x3], #32
    819f0:       6e0278c0        ext     v0.16b, v6.16b, v2.16b, #15
    819f4:       6e037841        ext     v1.16b, v2.16b, v3.16b, #15
    819f8:       acc11825        ldp     q5, q6, [x1], #32
    819fc:       6e057867        ext     v7.16b, v3.16b, v5.16b, #15
    81a00:       f1010042        subs    x2, x2, #0x40
    81a04:       54fffeca        b.ge    819dc <__GI___memcpy_thunderx2+0x27c>

So rather than aligning the first instruction as currently done:

#define EXT_CHUNK(shft) \
.p2align 4 ;\

Align the loop instead. If you also add 2 nops after the bx instruction then
everything should work out perfectly.
OK, right, aligning the loop body makes more sense than aligning the
prologue. Thanks!

But why add nops at the end (if by "bx" you meant the branch), given
that we do not care about prologue alignment? Unless this is about how
the next chunk is aligned, of course.

--
  Thanks,
  Anton

