This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
[PATCH][AArch64] Adjust writeback in non-zero memset
- From: Wilco Dijkstra <Wilco dot Dijkstra at arm dot com>
- To: 'GNU C Library' <libc-alpha at sourceware dot org>
- Cc: nd <nd at arm dot com>
- Date: Wed, 7 Nov 2018 18:09:14 +0000
- Subject: [PATCH][AArch64] Adjust writeback in non-zero memset
This fixes an inefficiency in the non-zero memset. Delaying the writeback
until the end of the loop is slightly faster on some cores - this shows
~5% performance gain on Cortex-A53 when doing large non-zero memsets.
Tested against the GLIBC testsuite, OK for commit?
ChangeLog:
2018-11-07 Wilco Dijkstra <wdijkstr@arm.com>
* sysdeps/aarch64/memset.S (MEMSET): Improve non-zero memset loop.
---
diff --git a/sysdeps/aarch64/memset.S b/sysdeps/aarch64/memset.S
index 4a454593618f78e22c55520d56737fab5d8f63a4..2eefc62fc1eeccf736f627a7adfe5485aff9bca9 100644
--- a/sysdeps/aarch64/memset.S
+++ b/sysdeps/aarch64/memset.S
@@ -89,10 +89,10 @@ L(set_long):
b.eq L(try_zva)
L(no_zva):
sub count, dstend, dst /* Count is 16 too large. */
- add dst, dst, 16
+ sub dst, dst, 16 /* Dst is biased by -32. */
sub count, count, 64 + 16 /* Adjust count and bias for loop. */
-1: stp q0, q0, [dst], 64
- stp q0, q0, [dst, -32]
+1: stp q0, q0, [dst, 32]
+ stp q0, q0, [dst, 64]!
L(tail64):
subs count, count, 64
b.hi 1b
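As a sanity check on the address arithmetic, the byte ranges written by the old and new loop bodies can be modeled in a small Python sketch (an assumption-laden model, not part of the patch: the function names are hypothetical, the head store and the ZVA path are not modeled, and only the loop plus the final tail stores are simulated):

```python
def old_loop(dst, dstend):
    """Model: add dst,dst,16 ; stp q0,q0,[dst],64 ; stp q0,q0,[dst,-32]."""
    written = set()
    count = dstend - dst                           # "Count is 16 too large."
    dst += 16
    count -= 64 + 16                               # bias for the loop
    while True:
        written.update(range(dst, dst + 32))       # stp q0, q0, [dst], 64
        dst += 64                                  # post-index writeback
        written.update(range(dst - 32, dst))       # stp q0, q0, [dst, -32]
        count -= 64                                # L(tail64): subs / b.hi
        if count <= 0:
            break
    written.update(range(dstend - 64, dstend))     # final tail stores
    return written

def new_loop(dst, dstend):
    """Model: sub dst,dst,16 ; stp q0,q0,[dst,32] ; stp q0,q0,[dst,64]!."""
    written = set()
    count = dstend - dst                           # "Count is 16 too large."
    dst -= 16                                      # "Dst is biased by -32."
    count -= 64 + 16
    while True:
        written.update(range(dst + 32, dst + 64))  # stp q0, q0, [dst, 32]
        dst += 64                                  # pre-index writeback ...
        written.update(range(dst, dst + 32))       # ... in the last store
        count -= 64
        if count <= 0:
            break
    written.update(range(dstend - 64, dstend))
    return written

# Both variants should cover exactly the same bytes for sizes on this path.
for total in (96, 128, 160, 256, 512):
    assert old_loop(0, total) == new_loop(0, total)
```

The model only confirms the two variants touch identical byte ranges; the performance claim (moving the writeback into the final store of the loop body) is about scheduling on cores like Cortex-A53, not correctness.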