Re: [PATCH] aarch64: optimize the unaligned case of memcmp
- From: Sebastian Pop <s.pop@samsung.com>
- To: libc-alpha@sourceware.org, Wilco Dijkstra <Wilco.Dijkstra@arm.com>
- Cc: Marcus.Shawcroft@arm.com, maxim.kuvyrkov@linaro.org, ramana.radhakrishnan@arm.com, ryan.arnold@linaro.org, adhemerval.zanella@linaro.org, sebpop@gmail.com
- Date: Fri, 23 Jun 2017 14:52:24 -0500
- Subject: Re: [PATCH] aarch64: optimize the unaligned case of memcmp
Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:
> Huge memcmp's are very rare, so like most string functions the small case
> is more important than the large case.
Agreed.
> So I think you could just skip the complex
> alignment code and get better results in real applications.
If I remove all the alignment code, I get lower performance on the HiKey
A53 board.
With this patch:
@@ -142,9 +143,23 @@ ENTRY(memcmp)
 	.p2align 6
 .Lmisaligned8:
+
+	cmp	limit, #8
+	b.lo	.LmisalignedLt8
+
+	.p2align 5
+.Lloop_part_aligned:
+	ldr	data1, [src1], #8
+	ldr	data2, [src2], #8
+	subs	limit_wd, limit_wd, #1
+.Lstart_part_realigned:
+	eor	diff, data1, data2	/* Non-zero if differences found. */
+	cbnz	diff, .Lnot_limit
+	b.ne	.Lloop_part_aligned
+
+.LmisalignedLt8:
 	sub	limit, limit, #1
 1:
-	/* Perhaps we can do better than this. */
 	ldrb	data1w, [src1], #1
 	ldrb	data2w, [src2], #1
 	subs	limit, limit, #1
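In C++ terms, the path this hunk adds does roughly the following (a sketch
with my own names; it uses memcpy for the unaligned 8-byte loads and, when a
word differs, simply falls back to the byte loop instead of extracting the
differing byte the way the .Lnot_limit tail does):

#include <cstddef>
#include <cstdint>
#include <cstring>

/* Sketch only: word-at-a-time compare for the misaligned case.  */
static int memcmp_misaligned_sketch(const unsigned char *s1,
                                    const unsigned char *s2,
                                    std::size_t n)
{
    /* .Lloop_part_aligned: compare 8 bytes per iteration using
       unaligned loads (ldr data1/data2 with post-increment).  */
    while (n >= 8) {
        std::uint64_t d1, d2;
        std::memcpy(&d1, s1, 8);
        std::memcpy(&d2, s2, 8);
        if (d1 != d2)        /* eor diff, ... / cbnz diff, .Lnot_limit */
            break;           /* let the byte loop locate the byte */
        s1 += 8;
        s2 += 8;
        n -= 8;
    }
    /* .LmisalignedLt8: finish (or handle n < 8) one byte at a time.  */
    while (n--) {
        if (*s1 != *s2)
            return *s1 - *s2;
        ++s1;
        ++s2;
    }
    return 0;
}

With it, the numbers on the same board are: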
Benchmark                             Time         CPU  Iterations
----------------------------------------------------------------------
BM_string_memcmp_unaligned/8        497 ns      497 ns     1324622   15.3553MB/s
BM_string_memcmp_unaligned/64       884 ns      883 ns      875735   69.0874MB/s
BM_string_memcmp_unaligned/512     3061 ns     3061 ns      284222   159.507MB/s
BM_string_memcmp_unaligned/1024    4185 ns     4185 ns      167230   233.341MB/s
BM_string_memcmp_unaligned/8k     32286 ns    32282 ns       21677   242.006MB/s
BM_string_memcmp_unaligned/16k    65948 ns    65948 ns       10599    236.93MB/s
BM_string_memcmp_unaligned/32k   131212 ns   131203 ns        5199    238.18MB/s
BM_string_memcmp_unaligned/64k   281103 ns   281033 ns        2490   222.394MB/s
With the patch I submitted for review yesterday, I was getting
consistently better numbers:
BM_string_memcmp_unaligned/8        287 ns      287 ns     2441787   26.6141MB/s
BM_string_memcmp_unaligned/64       556 ns      556 ns     1257709   109.764MB/s
BM_string_memcmp_unaligned/512     2167 ns     2166 ns      323159   225.443MB/s
BM_string_memcmp_unaligned/1024    4041 ns     4039 ns      173282   241.797MB/s
BM_string_memcmp_unaligned/8k     32234 ns    32221 ns       21645   242.464MB/s
BM_string_memcmp_unaligned/16k    65715 ns    65684 ns       10573   237.882MB/s
BM_string_memcmp_unaligned/32k   133390 ns   133348 ns        5350   234.349MB/s
BM_string_memcmp_unaligned/64k   264506 ns   264401 ns        2644   236.383MB/s
> Btw why do the max
> alignment thing? - that's a lot of code for very little benefit...
Agreed, I get consistently better performance without the max computation:
Benchmark                             Time         CPU  Iterations
----------------------------------------------------------------------
BM_string_memcmp_unaligned/8        282 ns      282 ns     2482953    27.062MB/s
BM_string_memcmp_unaligned/64       542 ns      541 ns     1298317    112.77MB/s
BM_string_memcmp_unaligned/512     2152 ns     2152 ns      325267   226.915MB/s
BM_string_memcmp_unaligned/1024    4025 ns     4025 ns      173904   242.622MB/s
BM_string_memcmp_unaligned/8k     32276 ns    32271 ns       21818    242.09MB/s
BM_string_memcmp_unaligned/16k    65970 ns    65970 ns       10554   236.851MB/s
BM_string_memcmp_unaligned/32k   131241 ns   131242 ns        5129    238.11MB/s
BM_string_memcmp_unaligned/64k   266159 ns   266160 ns        2661   234.821MB/s
The numbers are from the same HiKey A53 machine, with this patch applied
on top of yesterday's patch:
--- a/libc/arch-arm64/generic/bionic/memcmp.S
+++ b/libc/arch-arm64/generic/bionic/memcmp.S
@@ -161,12 +161,7 @@ ENTRY(memcmp)
 	and	tmp1, src1, #0x7
 	orr	tmp3, xzr, #0x8
-	and	tmp2, src2, #0x7
-	sub	tmp1, tmp3, tmp1
-	sub	tmp2, tmp3, tmp2
-	cmp	tmp1, tmp2
-	/* Choose the maximum. */
-	csel	pos, tmp1, tmp2, hi
+	sub	pos, tmp3, tmp1
 	/* Increment SRC pointers by POS so one of the SRC pointers is
 	   word-aligned. */
 	add	src1, src1, pos
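In C++ terms, the arithmetic this hunk simplifies is roughly the following
(a sketch, names mine): the old code advanced both pointers by the larger of
the two distances to an 8-byte boundary so that at least one source becomes
aligned, while the new code always uses src1's distance, which aligns src1
and saves four instructions on the path.

#include <cstdint>

/* Sketch only: how far to advance both pointers before the word loop.  */
static inline std::uint64_t realign_offset(std::uintptr_t src1,
                                           std::uintptr_t src2,
                                           bool choose_max)
{
    std::uint64_t off1 = 8 - (src1 & 0x7);   /* sub tmp1, tmp3, tmp1 */
    std::uint64_t off2 = 8 - (src2 & 0x7);   /* sub tmp2, tmp3, tmp2 */
    if (choose_max)                          /* old: csel pos, tmp1, tmp2, hi */
        return off1 > off2 ? off1 : off2;
    return off1;                             /* new: sub pos, tmp3, tmp1 */
}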
And the performance looks about the same when aligning to src2 instead
of src1.
With the extra patch:
--- a/libc/arch-arm64/generic/bionic/memcmp.S
+++ b/libc/arch-arm64/generic/bionic/memcmp.S
@@ -159,7 +159,7 @@ ENTRY(memcmp)
 	/* Sources are not aligned align one of the sources find max offset
 	   from aligned boundary. */
-	and	tmp1, src1, #0x7
+	and	tmp1, src2, #0x7
 	orr	tmp3, xzr, #0x8
 	sub	pos, tmp3, tmp1
Benchmark                             Time         CPU  Iterations
----------------------------------------------------------------------
BM_string_memcmp_unaligned/8        282 ns      282 ns     2482524   27.0603MB/s
BM_string_memcmp_unaligned/64       542 ns      542 ns     1292480   112.677MB/s
BM_string_memcmp_unaligned/512     2151 ns     2151 ns      325450   227.021MB/s
BM_string_memcmp_unaligned/1024    4026 ns     4025 ns      172599   242.643MB/s
BM_string_memcmp_unaligned/8k     32251 ns    32239 ns       21508   242.332MB/s
BM_string_memcmp_unaligned/16k    65828 ns    65800 ns       10649    237.46MB/s
BM_string_memcmp_unaligned/32k   131058 ns   131002 ns        5254   238.545MB/s
BM_string_memcmp_unaligned/64k   265991 ns   265792 ns        2651   235.146MB/s
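For reference, the tables above are Google-Benchmark-style output: the value
after the '/' is the buffer size, and the MB/s column is the kind of
throughput counter SetBytesProcessed produces. A minimal sketch of such a
driver (not bionic's actual benchmark; the buffer setup here is only
illustrative) would be:

#include <benchmark/benchmark.h>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

static void BM_string_memcmp_unaligned(benchmark::State& state) {
  const std::size_t n = static_cast<std::size_t>(state.range(0));
  /* Offset one buffer by a byte so the two sources stay mutually
     misaligned, which is the case being optimized.  */
  std::vector<char> a(n + 1, 'x'), b(n + 1, 'x');
  for (auto _ : state)
    benchmark::DoNotOptimize(std::memcmp(a.data() + 1, b.data(), n));
  state.SetBytesProcessed(static_cast<std::int64_t>(state.iterations()) * n);
}
BENCHMARK(BM_string_memcmp_unaligned)
    ->Arg(8)->Arg(64)->Arg(512)->Arg(1024)
    ->Arg(8 << 10)->Arg(16 << 10)->Arg(32 << 10)->Arg(64 << 10);
BENCHMARK_MAIN();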
Thanks, Wilco, for your feedback,
Sebastian