This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH] aarch64: optimize the unaligned case of memcmp




On 06/23/2017 04:28 PM, Wilco Dijkstra wrote:
Sebastian Pop wrote:

If I remove all the alignment code, I get less performance on the hikey
A53 board.
With this patch:
@@ -142,9 +143,23 @@ ENTRY(memcmp)

          .p2align 6
   .Lmisaligned8:
+
+       cmp     limit, #8
+       b.lo    .LmisalignedLt8
+
+       .p2align 5
+.Lloop_part_aligned:
+       ldr     data1, [src1], #8
+       ldr     data2, [src2], #8
+       subs    limit_wd, limit_wd, #1
+.Lstart_part_realigned:
+       eor     diff, data1, data2      /* Non-zero if differences found. */
+       cbnz    diff, .Lnot_limit
+       b.ne    .Lloop_part_aligned
+
+.LmisalignedLt8:
          sub     limit, limit, #1
   1:
-       /* Perhaps we can do better than this.  */
          ldrb    data1w, [src1], #1
          ldrb    data2w, [src2], #1
          subs    limit, limit, #1

Where is the setup of limit_wd and limit???

You are right, my patch was not quite correct: I was missing the initialization of limit_wd, like so:

lsr     limit_wd, limit, #3

limit is the number of bytes to compare, passed in as the third parameter to memcmp.
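For reference, the intent of the loop with limit_wd initialized can be sketched in C. This is only an illustration under assumptions, not the glibc code: the names memcmp_words_sketch and compare_bytes are made up here, the unaligned 8-byte loads are modeled with memcpy, and the handling of the trailing limit & 7 bytes is an assumed byte-by-byte tail that the quoted assembly fragment does not show.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not in memcmp.S): byte-by-byte comparison,
   equivalent to the old .Lmisaligned8 loop. */
static int compare_bytes(const unsigned char *s1, const unsigned char *s2,
                         size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (s1[i] != s2[i])
            return s1[i] < s2[i] ? -1 : 1;
    }
    return 0;
}

/* Sketch of a word-at-a-time compare on possibly unaligned inputs. */
int memcmp_words_sketch(const void *src1, const void *src2, size_t limit)
{
    const unsigned char *p1 = src1;
    const unsigned char *p2 = src2;
    size_t limit_wd = limit >> 3;   /* lsr limit_wd, limit, #3 */

    while (limit_wd != 0) {
        uint64_t data1, data2;
        /* memcpy stands in for the unaligned ldr of 8 bytes. */
        memcpy(&data1, p1, sizeof data1);
        memcpy(&data2, p2, sizeof data2);
        /* eor diff, data1, data2: non-zero if the words differ. */
        if ((data1 ^ data2) != 0)
            return compare_bytes(p1, p2, sizeof data1);
        p1 += 8;
        p2 += 8;
        limit_wd--;
    }
    /* Assumed tail: the remaining limit & 7 bytes, one at a time. */
    return compare_bytes(p1, p2, limit & 7);
}
```

On a mismatch the sketch re-scans the differing word byte by byte to get the sign of the result, which avoids depending on endianness; the assembly resolves this differently at .Lnot_limit.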

With this extra statement I am still seeing the same low performance on A53 hikey:

Benchmark                              Time           CPU Iterations
--------------------------------------------------------------------
BM_string_memcmp_unaligned/8         345 ns        345 ns    2026483   22.0879MB/s
BM_string_memcmp_unaligned/16        539 ns        539 ns    1298687   28.3159MB/s
BM_string_memcmp_unaligned/20        613 ns        613 ns    1142076   31.1222MB/s
BM_string_memcmp_unaligned/30        794 ns        794 ns     881596   36.0357MB/s
BM_string_memcmp_unaligned/42        957 ns        957 ns     731746   41.8753MB/s
BM_string_memcmp_unaligned/55       1208 ns       1207 ns     579591   43.4525MB/s
BM_string_memcmp_unaligned/60       1231 ns       1231 ns     568372   46.4756MB/s
BM_string_memcmp_unaligned/64       1312 ns       1312 ns     475862   46.5316MB/s

The baseline, with no patch applied to memcmp.S (byte-by-byte memcmp):

Benchmark                              Time           CPU Iterations
--------------------------------------------------------------------
BM_string_memcmp_unaligned/8         339 ns        339 ns    2066820   22.5274MB/s
BM_string_memcmp_unaligned/16        536 ns        536 ns    1306265   28.4901MB/s
BM_string_memcmp_unaligned/20        612 ns        612 ns    1146573   31.1479MB/s
BM_string_memcmp_unaligned/30        789 ns        789 ns     886755   36.2472MB/s
BM_string_memcmp_unaligned/42       1009 ns       1009 ns     693760      39.7MB/s
BM_string_memcmp_unaligned/55       1233 ns       1233 ns     567719   42.5469MB/s
BM_string_memcmp_unaligned/60       1322 ns       1322 ns     529511   43.2804MB/s
BM_string_memcmp_unaligned/64       1392 ns       1392 ns     502817   43.8426MB/s

And with the patch submitted for review without computing max and aligning on src1:

Benchmark                              Time           CPU Iterations
--------------------------------------------------------------------
BM_string_memcmp_unaligned/8         282 ns        282 ns    2482713    27.061MB/s
BM_string_memcmp_unaligned/16        304 ns        304 ns    2300275   50.1401MB/s
BM_string_memcmp_unaligned/20        322 ns        322 ns    2176437   59.2469MB/s
BM_string_memcmp_unaligned/30        352 ns        352 ns    1988315    81.328MB/s
BM_string_memcmp_unaligned/42        412 ns        412 ns    1699818    97.317MB/s
BM_string_memcmp_unaligned/55        503 ns        503 ns    1393029   104.382MB/s
BM_string_memcmp_unaligned/60        522 ns        522 ns    1340682   109.619MB/s
BM_string_memcmp_unaligned/64        541 ns        541 ns    1297637   112.891MB/s
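To put the last two tables side by side, the ratio of baseline time to patched time can be computed per size. The snippet below is a small sketch for that arithmetic only; the array and function names are made up for illustration, and the times are the Time column values copied from the baseline run and the run with the patch submitted for review.

```c
#include <stddef.h>

/* Buffer sizes used by BM_string_memcmp_unaligned. */
static const int sizes[8] = { 8, 16, 20, 30, 42, 55, 60, 64 };

/* Time column (ns) of the baseline run, no patch applied. */
static const double base_ns[8] = {
    339, 536, 612, 789, 1009, 1233, 1322, 1392
};

/* Time column (ns) of the run with the patch submitted for review. */
static const double patch_ns[8] = {
    282, 304, 322, 352, 412, 503, 522, 541
};

/* Speedup of the patch over the baseline for entry i (0..7). */
double speedup(size_t i)
{
    return base_ns[i] / patch_ns[i];
}

/* Size in bytes for entry i (0..7). */
int size_at(size_t i)
{
    return sizes[i];
}
```

With these numbers the ratio grows with the buffer size, from roughly 1.2x at 8 bytes to roughly 2.6x at 64 bytes.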


