[PATCH] MIPS memcpy performance improvement
- From: "Steve Ellcey" <sellcey at imgtec dot com>
- To: <libc-alpha at sourceware dot org>
- Date: Fri, 16 Oct 2015 12:21:48 -0700
- Subject: [PATCH] MIPS memcpy performance improvement
It was brought to my attention that the MIPS N32 (and N64) memcpy was slower
than the MIPS O32 memcpy for small (less than 16 bytes) aligned memcpy's.
This is because for sizes of 8 to 15 bytes, the O32 memcpy would do two
or three word copies followed by byte copies but the N32 version would do
all byte copies. Basically, the N32 version did not 'fall back' to doing
word copies when it could not do double-word copies.
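Roughly, in C terms (this is only an illustration with made-up names, not code
from the patch), the pre-patch size dispatch behaved like the sketch below,
where NSIZE mirrors the assembly's copy unit:

  #include <stddef.h>
  #include <string.h>

  /* Illustrative model only; NSIZE is 4 bytes for O32 and 8 bytes for
     N32/N64, where USE_DOUBLE is defined.  */
  #ifdef USE_DOUBLE
  # define NSIZE 8
  #else
  # define NSIZE 4
  #endif

  static void *
  old_dispatch_model (void *dst, const void *src, size_t n)
  {
    unsigned char *d = dst;
    const unsigned char *s = src;

    if (n < 2 * NSIZE)
      {
        /* Pre-patch: everything below two copy units went straight to byte
           copies, so an aligned 8..15 byte N32/N64 memcpy was done one byte
           at a time, while O32 (NSIZE == 4) still used word copies.  */
        while (n-- != 0)
          *d++ = *s++;
        return dst;
      }
    /* Otherwise fall through to the aligned (d)word copy loops (elided;
       plain memcpy stands in for them here).  */
    memcpy (d, s, n);
    return dst;
  }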
This patch addresses the problem with two changes. One is actually for
large memcpy's on N32. After doing as many double-word copies as possible,
the N32 version will try to do at least one word copy before going to byte
copies.
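In C-like terms (again just a sketch with invented names, assuming USE_DOUBLE
so the main loop copies 8 bytes at a time and leaves 0-7 bytes behind), this
new step does:

  #include <stddef.h>
  #include <stdint.h>
  #include <string.h>

  /* Sketch of the new L(lastw) path: d/s are still aligned after the
     double-word loop and n is the 0..7 remaining bytes, so at most one
     4-byte word can be copied before the byte loop.  */
  static void
  copy_tail_after_doublewords (unsigned char *d, const unsigned char *s,
                               size_t n)
  {
    if (n >= 4)
      {
        uint32_t w;
        memcpy (&w, s, 4);   /* lw REG3,0(a1) */
        memcpy (d, &w, 4);   /* sw REG3,0(a0) */
        d += 4;
        s += 4;
        n -= 4;
      }
    while (n-- != 0)         /* L(lastb): the last 0..3 bytes */
      *d++ = *s++;
  }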
The other change is that after determining that a memcpy is small (less than
8 bytes for O32 ABI, less than 16 bytes for N32 or N64 ABI), instead of just
doing byte copies, it will check the size and alignment of the inputs and,
if possible, do word copies (followed by byte copies if needed). If it is
not possible to do word copies due to size or alignment, it drops back to byte
copies as before.
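A C sketch of this small-size path (the L(lasts) label in the patch; the C
names here are made up), for a copy already known to be smaller than
2 * NSIZE:

  #include <stddef.h>
  #include <stdint.h>
  #include <string.h>

  /* Use word copies only when both pointers are 4-byte aligned and at
     least one full word remains; otherwise fall back to byte copies
     exactly as before.  */
  static void
  small_copy (unsigned char *d, const unsigned char *s, size_t n)
  {
    size_t tail = n & 3;     /* bytes left over after whole words */

    if (n != tail            /* at least one full word to copy */
        && ((uintptr_t) d & 3) == 0
        && ((uintptr_t) s & 3) == 0)
      {
        size_t words = (n - tail) / 4;
        while (words-- != 0)
          {
            uint32_t w;
            memcpy (&w, s, 4);   /* lw REG3,0(a1) */
            memcpy (d, &w, 4);   /* sw REG3,-4(a0) after the increment */
            d += 4;
            s += 4;
          }
        n = tail;
      }
    while (n-- != 0)             /* byte copies for whatever is left */
      *d++ = *s++;
  }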
The glibc memcpy benchmark does not have any tests that catch the first
case (though my own testing showed a small improvement), but it does test
the second case. In these cases, for inputs of length 4 to 15 bytes
(depending on the ABI), the tests are slower for unaligned memcpy's and
faster for aligned ones. There is also a slowdown for memcpy's of less
than 4 bytes regardless of alignment.
For example with O32:
Original:
memcpy builtin_memcpy simple_memcpy
Length 1, alignment 0/ 0: 54.3906 54.0156 37.9062
Length 4, alignment 0/ 0: 66.8438 70.6562 65.2969
Length 4, alignment 2/ 0: 65.8438 71.1406 65.25
Length 8, alignment 0/ 0: 73.7344 82.2656 74.2969
Length 8, alignment 3/ 0: 74.3906 76.6875 74.25
With change:
memcpy builtin_memcpy simple_memcpy
Length 1, alignment 0/ 0: 61.7031 51.8125 37.2656
Length 4, alignment 0/ 0: 50.1094 54.5 66.7344
Length 4, alignment 2/ 0: 72.0312 77.0156 65.1719
Length 8, alignment 0/ 0: 72.6406 76.4531 74.125
Length 8, alignment 3/ 0: 80.9375 84.2969 74.125
Or with N32:
Original:
memcpy builtin_memcpy simple_memcpy
Length 1, alignment 0/ 0: 57.7188 52.5156 35.687
Length 4, alignment 0/ 0: 66.1719 75.9531 63.4531
Length 4, alignment 2/ 0: 66.7344 75.4531 64.1719
Length 8, alignment 0/ 0: 76.7656 85.5469 72.625
Length 8, alignment 3/ 0: 75.6094 84.9062 73.7031
New:
memcpy builtin_memcpy simple_memcpy
Length 1, alignment 0/ 0: 64.3594 54.2344 35.4219
Length 4, alignment 0/ 0: 49.125 59.3281 64.7031
Length 4, alignment 2/ 0: 74.5469 77.3906 63.6562
Length 8, alignment 0/ 0: 57.25 69.0312 73.2188
Length 8, alignment 3/ 0: 94.5 97.9688 73.7031
I have the complete benchmark runs if anyone wants them, but this
shows you the overall pattern. I also ran the correctness tests
and verified that there are no regressions.
OK to checkin?
Steve Ellcey
sellcey@imgtec.com
2015-10-16 Steve Ellcey <sellcey@imgtec.com>
* sysdeps/mips/memcpy.S (memcpy): Add word copies for small aligned
data.
diff --git a/sysdeps/mips/memcpy.S b/sysdeps/mips/memcpy.S
index c85935b..6f63405 100644
--- a/sysdeps/mips/memcpy.S
+++ b/sysdeps/mips/memcpy.S
@@ -295,7 +295,7 @@ L(memcpy):
* size, copy dst pointer to v0 for the return value.
*/
slti t2,a2,(2 * NSIZE)
- bne t2,zero,L(lastb)
+ bne t2,zero,L(lasts)
#if defined(RETURN_FIRST_PREFETCH) || defined(RETURN_LAST_PREFETCH)
move v0,zero
#else
@@ -546,7 +546,7 @@ L(chkw):
*/
L(chk1w):
andi a2,t8,(NSIZE-1) /* a2 is the reminder past one (d)word chunks */
- beq a2,t8,L(lastb)
+ beq a2,t8,L(lastw)
PTR_SUBU a3,t8,a2 /* a3 is count of bytes in one (d)word chunks */
PTR_ADDU a3,a0,a3 /* a3 is the dst address after loop */
@@ -558,6 +558,20 @@ L(wordCopy_loop):
bne a0,a3,L(wordCopy_loop)
C_ST REG3,UNIT(-1)(a0)
+/* If we have been copying double words, see if we can copy a single word
+ before doing byte copies. We can have, at most, one word to copy. */
+
+L(lastw):
+#ifdef USE_DOUBLE
+ andi t8,a2,3 /* a2 is the remainder past 4 byte chunks. */
+ beq t8,a2,L(lastb)
+ lw REG3,0(a1)
+ sw REG3,0(a0)
+ PTR_ADDIU a0,a0,4
+ PTR_ADDIU a1,a1,4
+ move a2,t8
+#endif
+
/* Copy the last 8 (or 16) bytes */
L(lastb):
blez a2,L(leave)
@@ -572,6 +586,33 @@ L(leave):
j ra
nop
+/* We jump here with a memcpy of less than 8 or 16 bytes, depending on
+ whether or not USE_DOUBLE is defined. Instead of just doing byte
+ copies, check the alignment and size and use lw/sw if possible.
+ Otherwise, do byte copies. */
+
+L(lasts):
+ andi t8,a2,3
+ beq t8,a2,L(lastb)
+
+ andi t9,a0,3
+ bne t9,zero,L(lastb)
+ andi t9,a1,3
+ bne t9,zero,L(lastb)
+
+ PTR_SUBU a3,a2,t8
+ PTR_ADDU a3,a0,a3
+
+L(wcopy_loop):
+ lw REG3,0(a1)
+ PTR_ADDIU a0,a0,4
+ PTR_ADDIU a1,a1,4
+ bne a0,a3,L(wcopy_loop)
+ sw REG3,-4(a0)
+
+ b L(lastb)
+ move a2,t8
+
#ifndef R6_CODE
/*
* UNALIGNED case, got here with a3 = "negu a0"