This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]
Other format:	[Raw text]

[PATCH] aarch64: optimize the unaligned case of memcmp

From: Sebastian Pop <s dot pop at samsung dot com>
To: libc-alpha at sourceware dot org
Cc: Marcus dot Shawcroft at arm dot com, maxim dot kuvyrkov at linaro dot org, ramana dot radhakrishnan at arm dot com, ryan dot arnold at linaro dot org, adhemerval dot zanella at linaro dot org, sebpop at gmail dot com, Sebastian Pop <s dot pop at samsung dot com>
Date: Thu, 22 Jun 2017 18:30:26 -0500
Subject: [PATCH] aarch64: optimize the unaligned case of memcmp
Authentication-results: sourceware.org; auth=none

This brings to glibc a performance improvement that we developed in Bionic libc.
That change has been submitted for review to Bionic libc:
https://android-review.googlesource.com/418279

This patch has been tested on glibc master with "configure; make; make check" on
an aarch64-linux Juno-r0 with no new fails.  We would appreciate help to test
the performance and correctness of this change.

Patch written by Vikas Sinha and Sebastian Pop.  Both Vikas and I are working
for Samsung Austin R&D Center who has a copyright assignment on file with the
FSF for work in glibc.

The performance was measured on the bionic-benchmarks on a hikey (aarch64 8xA53)
board. There was no performance change to the existing benchmark
and a performance improvement on the new benchmark for memcmp
on the unaligned side. The new benchmark has been submitted for
review at https://android-review.googlesource.com/414860

The overall performance improves by 18% for the small data set 8
and the performance improves by 450% for the large data set 64k.

The base is with the libc from /system/lib64. The bionic libc
with this patch is in /data.

hikey:/data # export LD_LIBRARY_PATH=/system/lib64
hikey:/data # ./bionic-benchmarks --benchmark_filter='BM_string_memcmp*'
Run on (8 X 2.4 MHz CPU s)
Benchmark                                Time           CPU Iterations
----------------------------------------------------------------------
BM_string_memcmp/8                      30 ns         30 ns   22955680    251.07MB/s
BM_string_memcmp/64                     57 ns         57 ns   12349184   1076.99MB/s
BM_string_memcmp/512                   305 ns        305 ns    2297163   1.56496GB/s
BM_string_memcmp/1024                  571 ns        571 ns    1225211   1.66912GB/s
BM_string_memcmp/8k                   4307 ns       4306 ns     162562   1.77177GB/s
BM_string_memcmp/16k                  8676 ns       8675 ns      80676   1.75887GB/s
BM_string_memcmp/32k                 19233 ns      19230 ns      36394   1.58695GB/s
BM_string_memcmp/64k                 36986 ns      36984 ns      18952   1.65029GB/s
BM_string_memcmp_aligned/8             199 ns        199 ns    3519166   38.3336MB/s
BM_string_memcmp_aligned/64            386 ns        386 ns    1810734   158.073MB/s
BM_string_memcmp_aligned/512          1735 ns       1734 ns     403981   281.525MB/s
BM_string_memcmp_aligned/1024         3200 ns       3200 ns     218838   305.151MB/s
BM_string_memcmp_aligned/8k          25084 ns      25080 ns      28180   311.507MB/s
BM_string_memcmp_aligned/16k         51730 ns      51729 ns      13521   302.057MB/s
BM_string_memcmp_aligned/32k        103228 ns     103228 ns       6782   302.727MB/s
BM_string_memcmp_aligned/64k        207117 ns     207087 ns       3450   301.806MB/s
BM_string_memcmp_unaligned/8           339 ns        339 ns    2070998   22.5302MB/s
BM_string_memcmp_unaligned/64         1392 ns       1392 ns     502796   43.8454MB/s
BM_string_memcmp_unaligned/512        9194 ns       9194 ns      76133   53.1104MB/s
BM_string_memcmp_unaligned/1024      18325 ns      18323 ns      38206   53.2963MB/s
BM_string_memcmp_unaligned/8k       148579 ns     148574 ns       4713   52.5831MB/s
BM_string_memcmp_unaligned/16k      298169 ns     298120 ns       2344   52.4118MB/s
BM_string_memcmp_unaligned/32k      598813 ns     598797 ns       1085    52.188MB/s
BM_string_memcmp_unaligned/64k     1196079 ns    1196083 ns        540   52.2539MB/s

hikey:/data # export LD_LIBRARY_PATH=/data
hikey:/data # ./bionic-benchmarks --benchmark_filter='BM_string_memcmp*'
Run on (8 X 2.4 MHz CPU s)
Benchmark                                Time           CPU Iterations
----------------------------------------------------------------------
BM_string_memcmp/8                      30 ns         30 ns   23209918   252.802MB/s
BM_string_memcmp/64                     57 ns         57 ns   12348447   1076.95MB/s
BM_string_memcmp/512                   305 ns        305 ns    2296878   1.56471GB/s
BM_string_memcmp/1024                  572 ns        571 ns    1224426    1.6689GB/s
BM_string_memcmp/8k                   4309 ns       4308 ns     162491   1.77109GB/s
BM_string_memcmp/16k                  9348 ns       9345 ns      74894   1.63285GB/s
BM_string_memcmp/32k                 18329 ns      18322 ns      38249    1.6656GB/s
BM_string_memcmp/64k                 36992 ns      36981 ns      18952   1.65045GB/s
BM_string_memcmp_aligned/8             199 ns        199 ns    3513925   38.3162MB/s
BM_string_memcmp_aligned/64            386 ns        386 ns    1814038   158.192MB/s
BM_string_memcmp_aligned/512          1735 ns       1735 ns     402279   281.502MB/s
BM_string_memcmp_aligned/1024         3204 ns       3202 ns     218761   304.941MB/s
BM_string_memcmp_aligned/8k          25577 ns      25569 ns      27406   305.548MB/s
BM_string_memcmp_aligned/16k         52143 ns      52123 ns      13522   299.769MB/s
BM_string_memcmp_aligned/32k        105169 ns     105127 ns       6637    297.26MB/s
BM_string_memcmp_aligned/64k        206508 ns     206383 ns       3417   302.835MB/s
BM_string_memcmp_unaligned/8           287 ns        287 ns    2441787   26.6141MB/s
BM_string_memcmp_unaligned/64          556 ns        556 ns    1257709   109.764MB/s
BM_string_memcmp_unaligned/512        2167 ns       2166 ns     323159   225.443MB/s
BM_string_memcmp_unaligned/1024       4041 ns       4039 ns     173282   241.797MB/s
BM_string_memcmp_unaligned/8k        32234 ns      32221 ns      21645   242.464MB/s
BM_string_memcmp_unaligned/16k       65715 ns      65684 ns      10573   237.882MB/s
BM_string_memcmp_unaligned/32k      133390 ns     133348 ns       5350   234.349MB/s
BM_string_memcmp_unaligned/64k      264506 ns     264401 ns       2644   236.383MB/s
---
 sysdeps/aarch64/memcmp.S | 59 +++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 58 insertions(+), 1 deletion(-)

diff --git a/sysdeps/aarch64/memcmp.S b/sysdeps/aarch64/memcmp.S
index 4cfcb89..d259831 100644
--- a/sysdeps/aarch64/memcmp.S
+++ b/sysdeps/aarch64/memcmp.S
@@ -138,9 +138,66 @@ L(ret0):
 
 	.p2align 6
 L(misaligned8):
+	cmp	limit, #8
+	b.lo	L(misalignedLt8)
+
+L(unalignedGe8):
+
+	/* Load the first dword with both src potentially unaligned.  */
+	ldr	data1, [src1]
+	ldr	data2, [src2]
+
+	eor	diff, data1, data2	/* Non-zero if differences found.  */
+	cbnz	diff, L(not_limit)
+
+	/* Sources are not aligned align one of the sources find max offset
+	   from aligned boundary.  */
+
+	and	tmp1, src1, #0x7
+	orr	tmp3, xzr, #0x8
+	and	tmp2, src2, #0x7
+	sub	tmp1, tmp3, tmp1
+	sub	tmp2, tmp3, tmp2
+	cmp	tmp1, tmp2
+	/* Choose the maximum.  */
+	csel	pos, tmp1, tmp2, hi
+
+	/* Increment SRC pointers by POS so one of the SRC pointers is word-aligned.  */
+	add	src1, src1, pos
+	add	src2, src2, pos
+
+	sub	limit, limit, pos
+	lsr	limit_wd, limit, #3
+
+	cmp limit_wd, #0
+
+	/* Save #bytes to go back to be able to read 8byte at end
+	   pos=negative offset position to read 8 bytes when len%8 != 0.  */
+	and	limit, limit, #7
+	sub	pos, limit, #8
+
+	b	L(start_part_realigned)
+
+	.p2align 5
+L(loop_part_aligned):
+	ldr	data1, [src1], #8
+	ldr	data2, [src2], #8
+	subs	limit_wd, limit_wd, #1
+L(start_part_realigned):
+	eor	diff, data1, data2	/* Non-zero if differences found.  */
+	cbnz	diff, L(not_limit)
+	b.ne	L(loop_part_aligned)
+
+	/* Process leftover bytes: read the leftover bytes, starting with
+	   negative offset - so we can load 8 bytes.  */
+	ldr	data1, [src1, pos]
+	ldr	data2, [src2, pos]
+	eor	diff, data1, data2	/* Non-zero if differences found.  */
+	b	L(not_limit)
+
+L(misalignedLt8):
 	sub	limit, limit, #1
 1:
-	/* Perhaps we can do better than this.  */
 	ldrb	data1w, [src1], #1
 	ldrb	data2w, [src2], #1
 	subs	limit, limit, #1
-- 
2.6.3

Follow-Ups:
- Re: [PATCH] aarch64: optimize the unaligned case of memcmp
  - From: Sebastian Pop
- Re: [PATCH] aarch64: optimize the unaligned case of memcmp
  - From: Ramana Radhakrishnan

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]