This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.

Re: [PATCH] aarch64: optimize the unaligned case of memcmp



Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:
> Huge memcmps are very rare, so like most string functions the small case
> is more important than the large case.

Agreed.

> So I think you could just skip the complex
> alignment code and get better results in real applications.

If I remove all of the alignment code, I get lower performance on the HiKey A53 board.
With this patch:

@@ -142,9 +143,23 @@ ENTRY(memcmp)

        .p2align 6
 .Lmisaligned8:
+
+       cmp     limit, #8
+       b.lo    .LmisalignedLt8
+
+       .p2align 5
+.Lloop_part_aligned:
+       ldr     data1, [src1], #8
+       ldr     data2, [src2], #8
+       subs    limit_wd, limit_wd, #1
+.Lstart_part_realigned:
+       eor     diff, data1, data2      /* Non-zero if differences found. */
+       cbnz    diff, .Lnot_limit
+       b.ne    .Lloop_part_aligned
+
+.LmisalignedLt8:
        sub     limit, limit, #1
 1:
-       /* Perhaps we can do better than this.  */
        ldrb    data1w, [src1], #1
        ldrb    data2w, [src2], #1
        subs    limit, limit, #1

Benchmark                                Time           CPU Iterations
----------------------------------------------------------------------
BM_string_memcmp_unaligned/8         497 ns        497 ns    1324622   15.3553MB/s
BM_string_memcmp_unaligned/64        884 ns        883 ns     875735   69.0874MB/s
BM_string_memcmp_unaligned/512      3061 ns       3061 ns     284222   159.507MB/s
BM_string_memcmp_unaligned/1024     4185 ns       4185 ns     167230   233.341MB/s
BM_string_memcmp_unaligned/8k      32286 ns      32282 ns      21677   242.006MB/s
BM_string_memcmp_unaligned/16k     65948 ns      65948 ns      10599    236.93MB/s
BM_string_memcmp_unaligned/32k    131212 ns     131203 ns       5199    238.18MB/s
BM_string_memcmp_unaligned/64k    281103 ns     281033 ns       2490   222.394MB/s
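
For readers who don't speak A64 fluently, the new misaligned path is roughly
the following C sketch (illustrative names only, not the actual
implementation; it leans on AArch64's cheap unaligned 8-byte loads):

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* Sketch of the loop above: compare 8 bytes per iteration while at
       least 8 bytes remain (ldr/ldr/eor/cbnz), then fall back to the
       byte-by-byte tail (.LmisalignedLt8).  */
    static int memcmp_misaligned_sketch(const unsigned char *src1,
                                        const unsigned char *src2,
                                        size_t limit)
    {
        while (limit >= 8) {
            uint64_t data1, data2;
            memcpy(&data1, src1, 8);    /* unaligned load, like ldr */
            memcpy(&data2, src2, 8);
            if (data1 != data2)         /* like eor + cbnz */
                break;                  /* byte loop pinpoints the diff */
            src1 += 8;
            src2 += 8;
            limit -= 8;
        }
        while (limit-- > 0) {           /* byte-by-byte fallback */
            int diff = *src1++ - *src2++;
            if (diff != 0)
                return diff;
        }
        return 0;
    }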

With the patch I submitted for review yesterday, I was getting consistently better numbers:
BM_string_memcmp_unaligned/8         287 ns        287 ns    2441787   26.6141MB/s
BM_string_memcmp_unaligned/64        556 ns        556 ns    1257709   109.764MB/s
BM_string_memcmp_unaligned/512      2167 ns       2166 ns     323159   225.443MB/s
BM_string_memcmp_unaligned/1024     4041 ns       4039 ns     173282   241.797MB/s
BM_string_memcmp_unaligned/8k      32234 ns      32221 ns      21645   242.464MB/s
BM_string_memcmp_unaligned/16k     65715 ns      65684 ns      10573   237.882MB/s
BM_string_memcmp_unaligned/32k    133390 ns     133348 ns       5350   234.349MB/s
BM_string_memcmp_unaligned/64k    264506 ns     264401 ns       2644   236.383MB/s

> Btw why do the max
> alignment thing? - that's a lot of code for very little benefit...

Agreed; I get consistently better performance without the max computation:
Benchmark                                Time           CPU Iterations
----------------------------------------------------------------------
BM_string_memcmp_unaligned/8         282 ns        282 ns    2482953    27.062MB/s
BM_string_memcmp_unaligned/64        542 ns        541 ns    1298317    112.77MB/s
BM_string_memcmp_unaligned/512      2152 ns       2152 ns     325267   226.915MB/s
BM_string_memcmp_unaligned/1024     4025 ns       4025 ns     173904   242.622MB/s
BM_string_memcmp_unaligned/8k      32276 ns      32271 ns      21818    242.09MB/s
BM_string_memcmp_unaligned/16k     65970 ns      65970 ns      10554   236.851MB/s
BM_string_memcmp_unaligned/32k    131241 ns     131242 ns       5129    238.11MB/s
BM_string_memcmp_unaligned/64k    266159 ns     266160 ns       2661   234.821MB/s

These numbers are from the same HiKey A53 machine, with this patch applied on top of yesterday's patch:

--- a/libc/arch-arm64/generic/bionic/memcmp.S
+++ b/libc/arch-arm64/generic/bionic/memcmp.S
@@ -161,12 +161,7 @@ ENTRY(memcmp)

        and     tmp1, src1, #0x7
        orr     tmp3, xzr, #0x8
-       and     tmp2, src2, #0x7
-       sub     tmp1, tmp3, tmp1
-       sub     tmp2, tmp3, tmp2
-       cmp     tmp1, tmp2
-       /* Choose the maximum. */
-       csel    pos, tmp1, tmp2, hi
+       sub     pos, tmp3, tmp1

        /* Increment SRC pointers by POS so one of the SRC pointers is word-aligned. */
        add     src1, src1, pos

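In C terms, the removed instructions computed each source's distance to its
next 8-byte boundary and advanced both pointers by the larger of the two; the
simplified version just uses src1's distance. A rough sketch (illustrative,
not the real code):

    #include <stdint.h>

    /* Bytes until p reaches the next 8-byte boundary; yields 8 when p is
       already aligned, matching "orr tmp3, xzr, #0x8; sub pos, tmp3, tmp1". */
    static inline uint64_t dist_to_align(const void *p)
    {
        return 8 - ((uintptr_t)p & 0x7);
    }

    /* Before: pos = max(dist_to_align(src1), dist_to_align(src2)), so the
       pointer with the larger distance ends up aligned.
       After:  pos = dist_to_align(src1), always aligning src1 and saving
       the and/sub/cmp/csel sequence.  */

Either way, at least one of the two sources ends up word-aligned afterwards,
which is all the main loop needs.
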
And the performance looks about the same when aligning to src2 instead of src1.
With this extra patch:

--- a/libc/arch-arm64/generic/bionic/memcmp.S
+++ b/libc/arch-arm64/generic/bionic/memcmp.S
@@ -159,7 +159,7 @@ ENTRY(memcmp)
        /* Sources are not aligned; align one of the sources and find the
           max offset from the aligned boundary. */

-       and     tmp1, src1, #0x7
+       and     tmp1, src2, #0x7
        orr     tmp3, xzr, #0x8
        sub     pos, tmp3, tmp1


Benchmark                                Time           CPU Iterations
----------------------------------------------------------------------
BM_string_memcmp_unaligned/8         282 ns        282 ns    2482524   27.0603MB/s
BM_string_memcmp_unaligned/64        542 ns        542 ns    1292480   112.677MB/s
BM_string_memcmp_unaligned/512      2151 ns       2151 ns     325450   227.021MB/s
BM_string_memcmp_unaligned/1024     4026 ns       4025 ns     172599   242.643MB/s
BM_string_memcmp_unaligned/8k      32251 ns      32239 ns      21508   242.332MB/s
BM_string_memcmp_unaligned/16k     65828 ns      65800 ns      10649    237.46MB/s
BM_string_memcmp_unaligned/32k    131058 ns     131002 ns       5254   238.545MB/s
BM_string_memcmp_unaligned/64k    265991 ns     265792 ns       2651   235.146MB/s

Thanks, Wilco, for your feedback,
Sebastian

