This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
On 07/02/2017 11:01, Siddhesh Poyarekar wrote:
> On Tuesday 07 February 2017 06:12 PM, Wilco Dijkstra wrote:
>> I agree we want to avoid using conditional compilation as much as possible.
>> On the other hand duplication is a bad idea too, I've seen too many cases where
>> bugs were only fixed in one of the N duplicates.
>
> Sure, but then in that case the de-duplication must be done by
> identifying a logical code block and make that into a macro to override
> and not just arbitrarily inject hunks of code.  So in this case it could
> be alternate implementations of copy_long that is sufficient so #define
> COPY_LONG in both memcpy_generic and memcpy_thunderx and have the parent
> (memcpy.S) use that macro.  In fact, that might even end up making the
> code a bit nicer to read.
>
>> However I'm actually wondering whether we need an ifunc for this case.
>> For large copies from L2 I think adding a prefetch should be benign even on
>> cores that don't need it, so if the benchmarks confirm this we should consider
>> updating the generic memcpy.
>
> That is a call that ARM maintainers can take and is also another reason
> to separate the IFUNC infrastructure code from the thunderx change.

I checked only the memcpy change on an APM X-Gene 1 and the results seem to
show improvements for aligned input, at least for sizes shorter than 4 MB.
I would like to check on more ARMv8 chips, but it does seem to be a nice
improvement over the generic implementation.
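As a rough, standalone illustration of the IFUNC mechanism being discussed
(this is not glibc's internal multiarch machinery; the function and resolver
names below are made up), the idea is that the dynamic loader runs a resolver
once and binds all later calls to whichever implementation it returned:

/* Minimal ifunc sketch, NOT the glibc-internal IFUNC infrastructure.
   Names are hypothetical; needs GCC/binutils on an ELF target such as
   aarch64-linux-gnu.  */
#include <stddef.h>
#include <string.h>

/* Two candidate implementations.  */
static void *my_memcpy_generic (void *dst, const void *src, size_t n)
{
  return memcpy (dst, src, n);
}

static void *my_memcpy_prefetch (void *dst, const void *src, size_t n)
{
  /* Imagine a variant tuned with software prefetch here.  */
  return memcpy (dst, src, n);
}

/* Resolver: run once by the dynamic loader; its return value becomes the
   target of every later call to my_memcpy.  A real resolver would consult
   CPU identification instead of a constant.  */
static void *(*resolve_my_memcpy (void)) (void *, const void *, size_t)
{
  int cpu_wants_prefetch = 0;	/* placeholder for a real CPU check */
  return cpu_wants_prefetch ? my_memcpy_prefetch : my_memcpy_generic;
}

/* The ifunc symbol itself: no per-call branch, the choice is made at
   load time.  */
void *my_memcpy (void *dst, const void *src, size_t n)
  __attribute__ ((ifunc ("resolve_my_memcpy")));

glibc's own multiarch code wires this up through its internal macros and CPU
feature checks rather than the plain attribute, but the load-time selection
idea is the same.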
Attachment:
bench-memcpy-large.out
Description: Text document
Attachment:
bench-memcpy-large.patched
Description: Text document
diff --git a/sysdeps/aarch64/memcpy.S b/sysdeps/aarch64/memcpy.S
index 29af8b1..4742a01 100644
--- a/sysdeps/aarch64/memcpy.S
+++ b/sysdeps/aarch64/memcpy.S
@@ -158,10 +158,13 @@ L(copy96):
 
 	.p2align 4
 L(copy_long):
+	cmp	count, #32768
+	b.lo	L(copy_long_without_prefetch)
 	and	tmp1, dstin, 15
 	bic	dst, dstin, 15
 	ldp	D_l, D_h, [src]
 	sub	src, src, tmp1
+	prfm	pldl1strm, [src, 384]
 	add	count, count, tmp1	/* Count is now 16 too large.  */
 	ldp	A_l, A_h, [src, 16]
 	stp	D_l, D_h, [dstin]
@@ -169,7 +172,10 @@ L(copy_long):
 	ldp	C_l, C_h, [src, 48]
 	ldp	D_l, D_h, [src, 64]!
 	subs	count, count, 128 + 16	/* Test and readjust count.  */
-	b.ls	2f
+
+L(prefetch_loop64):
+	tbz	src, #6, 1f
+	prfm	pldl1strm, [src, 512]
 1:
 	stp	A_l, A_h, [dst, 16]
 	ldp	A_l, A_h, [src, 16]
@@ -180,12 +186,39 @@ L(copy_long):
 	stp	D_l, D_h, [dst, 64]!
 	ldp	D_l, D_h, [src, 64]!
 	subs	count, count, 64
-	b.hi	1b
+	b.hi	L(prefetch_loop64)
+	b	L(last64)
+
+L(copy_long_without_prefetch):
+
+	and	tmp1, dstin, 15
+	bic	dst, dstin, 15
+	ldp	D_l, D_h, [src]
+	sub	src, src, tmp1
+	add	count, count, tmp1	/* Count is now 16 too large.  */
+	ldp	A_l, A_h, [src, 16]
+	stp	D_l, D_h, [dstin]
+	ldp	B_l, B_h, [src, 32]
+	ldp	C_l, C_h, [src, 48]
+	ldp	D_l, D_h, [src, 64]!
+	subs	count, count, 128 + 16	/* Test and readjust count.  */
+	b.ls	L(last64)
+L(loop64):
+	stp	A_l, A_h, [dst, 16]
+	ldp	A_l, A_h, [src, 16]
+	stp	B_l, B_h, [dst, 32]
+	ldp	B_l, B_h, [src, 32]
+	stp	C_l, C_h, [dst, 48]
+	ldp	C_l, C_h, [src, 48]
+	stp	D_l, D_h, [dst, 64]!
+	ldp	D_l, D_h, [src, 64]!
+	subs	count, count, 64
+	b.hi	L(loop64)
 
 	/* Write the last full set of 64 bytes.  The remainder is at most 64
 	   bytes, so it is safe to always copy 64 bytes from the end even if
 	   there is just 1 byte left.  */
-2:
+L(last64):
 	ldp	E_l, E_h, [srcend, -64]
 	stp	A_l, A_h, [dst, 16]
 	ldp	A_l, A_h, [srcend, -48]
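For readers not fluent in AArch64 assembly, a rough C paraphrase of what the
patched L(copy_long) path does (hypothetical names, not the actual glibc
code) looks something like this:

/* Sketch only: copies of at least 32 KiB prefetch the source stream a few
   hundred bytes ahead, roughly once per 128-byte chunk; smaller "long"
   copies take the plain 64-byte-per-iteration loop.  */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PREFETCH_MIN	32768	/* only prefetch for copies >= 32 KiB */
#define PREFETCH_DIST	512	/* stay ~512 bytes ahead of the loads */

void
copy_long_sketch (void *dst, const void *src, size_t n)
{
  const unsigned char *s = src;
  unsigned char *d = dst;
  int prefetch = n >= PREFETCH_MIN;

  while (n >= 64)
    {
      /* Issue a streaming prefetch every other 64-byte iteration,
	 mirroring the "tbz src, #6" test in the assembly.  */
      if (prefetch && (((uintptr_t) s >> 6) & 1))
	__builtin_prefetch (s + PREFETCH_DIST, /* rw */ 0, /* locality */ 0);

      memcpy (d, s, 64);	/* stands in for the ldp/stp block */
      s += 64;
      d += 64;
      n -= 64;
    }
  if (n > 0)
    memcpy (d, s, n);		/* copy the tail */
}

The 32 KiB threshold and the 384/512-byte prefetch distances are the values
from the patch above; the once-per-128-bytes pacing corresponds to the tbz
test on bit 6 of the source pointer.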