Re: [PATCH] aarch64: optimize the unaligned case of memcmp
- From: Sebastian Pop <s.pop@samsung.com>
- To: libc-alpha@sourceware.org, Wilco Dijkstra <Wilco.Dijkstra@arm.com>
- Cc: Marcus.Shawcroft@arm.com, maxim.kuvyrkov@linaro.org, ramana.radhakrishnan@arm.com, ryan.arnold@linaro.org, adhemerval.zanella@linaro.org, sebpop@gmail.com
- Date: Fri, 23 Jun 2017 14:52:24 -0500
- Subject: Re: [PATCH] aarch64: optimize the unaligned case of memcmp
Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:
> Huge memcmp's are very rare, so like most string functions the small case
> is more important than the large case.
Agreed.
> So I think you could just skip the complex
> alignment code and get better results in real applications.
If I remove all the alignment code, I get lower performance on the HiKey
A53 board.
With this patch:
@@ -142,9 +143,23 @@ ENTRY(memcmp)
 	.p2align 6
 .Lmisaligned8:
+
+	cmp	limit, #8
+	b.lo	.LmisalignedLt8
+
+	.p2align 5
+.Lloop_part_aligned:
+	ldr	data1, [src1], #8
+	ldr	data2, [src2], #8
+	subs	limit_wd, limit_wd, #1
+.Lstart_part_realigned:
+	eor	diff, data1, data2	/* Non-zero if differences found. */
+	cbnz	diff, .Lnot_limit
+	b.ne	.Lloop_part_aligned
+
+.LmisalignedLt8:
 	sub	limit, limit, #1
 1:
-	/* Perhaps we can do better than this. */
 	ldrb	data1w, [src1], #1
 	ldrb	data2w, [src2], #1
 	subs	limit, limit, #1
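In C++ terms, the path this hunk adds does roughly the following (a sketch
with my own names; it uses memcpy for the unaligned 8-byte loads and, when a
word differs, simply falls back to the byte loop instead of extracting the
differing byte the way the .Lnot_limit tail does):

#include <cstddef>
#include <cstdint>
#include <cstring>

/* Sketch only: word-at-a-time compare for the misaligned case.  */
static int memcmp_misaligned_sketch(const unsigned char *s1,
                                    const unsigned char *s2,
                                    std::size_t n)
{
    /* .Lloop_part_aligned: compare 8 bytes per iteration using
       unaligned loads (ldr data1/data2 with post-increment).  */
    while (n >= 8) {
        std::uint64_t d1, d2;
        std::memcpy(&d1, s1, 8);
        std::memcpy(&d2, s2, 8);
        if (d1 != d2)        /* eor diff, ... / cbnz diff, .Lnot_limit */
            break;           /* let the byte loop locate the byte */
        s1 += 8;
        s2 += 8;
        n -= 8;
    }
    /* .LmisalignedLt8: finish (or handle n < 8) one byte at a time.  */
    while (n--) {
        if (*s1 != *s2)
            return *s1 - *s2;
        ++s1;
        ++s2;
    }
    return 0;
}

With it, the numbers on the same board are: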
Benchmark                             Time         CPU  Iterations
----------------------------------------------------------------------
BM_string_memcmp_unaligned/8        497 ns      497 ns     1324622   15.3553MB/s
BM_string_memcmp_unaligned/64       884 ns      883 ns      875735   69.0874MB/s
BM_string_memcmp_unaligned/512     3061 ns     3061 ns      284222   159.507MB/s
BM_string_memcmp_unaligned/1024    4185 ns     4185 ns      167230   233.341MB/s
BM_string_memcmp_unaligned/8k     32286 ns    32282 ns       21677   242.006MB/s
BM_string_memcmp_unaligned/16k    65948 ns    65948 ns       10599    236.93MB/s
BM_string_memcmp_unaligned/32k   131212 ns   131203 ns        5199    238.18MB/s
BM_string_memcmp_unaligned/64k   281103 ns   281033 ns        2490   222.394MB/s
With the patch I submitted for review yesterday, I was getting
consistently better numbers:
BM_string_memcmp_unaligned/8        287 ns      287 ns     2441787   26.6141MB/s
BM_string_memcmp_unaligned/64       556 ns      556 ns     1257709   109.764MB/s
BM_string_memcmp_unaligned/512     2167 ns     2166 ns      323159   225.443MB/s
BM_string_memcmp_unaligned/1024    4041 ns     4039 ns      173282   241.797MB/s
BM_string_memcmp_unaligned/8k     32234 ns    32221 ns       21645   242.464MB/s
BM_string_memcmp_unaligned/16k    65715 ns    65684 ns       10573   237.882MB/s
BM_string_memcmp_unaligned/32k   133390 ns   133348 ns        5350   234.349MB/s
BM_string_memcmp_unaligned/64k   264506 ns   264401 ns        2644   236.383MB/s
> Btw why do the max
> alignment thing? - that's a lot of code for very little benefit...
Agreed, I get consistently better performance without the max computation:
Benchmark                             Time         CPU  Iterations
----------------------------------------------------------------------
BM_string_memcmp_unaligned/8        282 ns      282 ns     2482953    27.062MB/s
BM_string_memcmp_unaligned/64       542 ns      541 ns     1298317    112.77MB/s
BM_string_memcmp_unaligned/512     2152 ns     2152 ns      325267   226.915MB/s
BM_string_memcmp_unaligned/1024    4025 ns     4025 ns      173904   242.622MB/s
BM_string_memcmp_unaligned/8k     32276 ns    32271 ns       21818    242.09MB/s
BM_string_memcmp_unaligned/16k    65970 ns    65970 ns       10554   236.851MB/s
BM_string_memcmp_unaligned/32k   131241 ns   131242 ns        5129    238.11MB/s
BM_string_memcmp_unaligned/64k   266159 ns   266160 ns        2661   234.821MB/s
The numbers are from the same HiKey A53 machine, with this patch applied
on top of yesterday's patch:
--- a/libc/arch-arm64/generic/bionic/memcmp.S
+++ b/libc/arch-arm64/generic/bionic/memcmp.S
@@ -161,12 +161,7 @@ ENTRY(memcmp)
 	and	tmp1, src1, #0x7
 	orr	tmp3, xzr, #0x8
-	and	tmp2, src2, #0x7
-	sub	tmp1, tmp3, tmp1
-	sub	tmp2, tmp3, tmp2
-	cmp	tmp1, tmp2
-	/* Choose the maximum. */
-	csel	pos, tmp1, tmp2, hi
+	sub	pos, tmp3, tmp1
 	/* Increment SRC pointers by POS so one of the SRC pointers is
 	   word-aligned. */
 	add	src1, src1, pos
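In C++ terms, the arithmetic this hunk simplifies is roughly the following
(a sketch, names mine): the old code advanced both pointers by the larger of
the two distances to an 8-byte boundary so that at least one source becomes
aligned, while the new code always uses src1's distance, which aligns src1
and saves four instructions on the path.

#include <cstdint>

/* Sketch only: how far to advance both pointers before the word loop.  */
static inline std::uint64_t realign_offset(std::uintptr_t src1,
                                           std::uintptr_t src2,
                                           bool choose_max)
{
    std::uint64_t off1 = 8 - (src1 & 0x7);   /* sub tmp1, tmp3, tmp1 */
    std::uint64_t off2 = 8 - (src2 & 0x7);   /* sub tmp2, tmp3, tmp2 */
    if (choose_max)                          /* old: csel pos, tmp1, tmp2, hi */
        return off1 > off2 ? off1 : off2;
    return off1;                             /* new: sub pos, tmp3, tmp1 */
}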
And the performance looks about the same when aligning to src2 instead
of src1.
With the extra patch:
--- a/libc/arch-arm64/generic/bionic/memcmp.S
+++ b/libc/arch-arm64/generic/bionic/memcmp.S
@@ -159,7 +159,7 @@ ENTRY(memcmp)
 	/* Sources are not aligned align one of the sources find max offset
 	   from aligned boundary. */
-	and	tmp1, src1, #0x7
+	and	tmp1, src2, #0x7
 	orr	tmp3, xzr, #0x8
 	sub	pos, tmp3, tmp1
Benchmark                             Time         CPU  Iterations
----------------------------------------------------------------------
BM_string_memcmp_unaligned/8        282 ns      282 ns     2482524   27.0603MB/s
BM_string_memcmp_unaligned/64       542 ns      542 ns     1292480   112.677MB/s
BM_string_memcmp_unaligned/512     2151 ns     2151 ns      325450   227.021MB/s
BM_string_memcmp_unaligned/1024    4026 ns     4025 ns      172599   242.643MB/s
BM_string_memcmp_unaligned/8k     32251 ns    32239 ns       21508   242.332MB/s
BM_string_memcmp_unaligned/16k    65828 ns    65800 ns       10649    237.46MB/s
BM_string_memcmp_unaligned/32k   131058 ns   131002 ns        5254   238.545MB/s
BM_string_memcmp_unaligned/64k   265991 ns   265792 ns        2651   235.146MB/s
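For reference, the tables above are Google-Benchmark-style output: the value
after the '/' is the buffer size, and the MB/s column is the kind of
throughput counter SetBytesProcessed produces. A minimal sketch of such a
driver (not bionic's actual benchmark; the buffer setup here is only
illustrative) would be:

#include <benchmark/benchmark.h>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

static void BM_string_memcmp_unaligned(benchmark::State& state) {
  const std::size_t n = static_cast<std::size_t>(state.range(0));
  /* Offset one buffer by a byte so the two sources stay mutually
     misaligned, which is the case being optimized.  */
  std::vector<char> a(n + 1, 'x'), b(n + 1, 'x');
  for (auto _ : state)
    benchmark::DoNotOptimize(std::memcmp(a.data() + 1, b.data(), n));
  state.SetBytesProcessed(static_cast<std::int64_t>(state.iterations()) * n);
}
BENCHMARK(BM_string_memcmp_unaligned)
    ->Arg(8)->Arg(64)->Arg(512)->Arg(1024)
    ->Arg(8 << 10)->Arg(16 << 10)->Arg(32 << 10)->Arg(64 << 10);
BENCHMARK_MAIN();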
Thanks, Wilco, for your feedback,
Sebastian