This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [PATCH 4/4] S390: Implement mempcpy with help of memcpy. [BZ #19765]




On 05/12/2016 04:10 PM, Adhemerval Zanella wrote:


On 09/05/2016 11:15, Stefan Liebler wrote:
On 05/05/2016 06:36 PM, H.J. Lu wrote:
On Thu, May 5, 2016 at 9:34 AM, Adhemerval Zanella
<adhemerval.zanella@linaro.org> wrote:


On May 5, 2016, at 11:45, H.J. Lu <hjl.tools@gmail.com> wrote:

On Thu, May 5, 2016 at 7:15 AM, Adhemerval Zanella
<adhemerval.zanella@linaro.org> wrote:


On 05/05/2016 10:37, H.J. Lu wrote:
On Wed, May 4, 2016 at 1:58 PM, Adhemerval Zanella
<adhemerval.zanella@linaro.org> wrote:


On 04/05/2016 17:51, Wilco Dijkstra wrote:
Adhemerval Zanella wrote:

But my point is all the architectures which provide an optimized mempcpy is
though either 1. jump directly to optimized memcpy (s390 case for this patchset),
2. clonning the same memcpy implementation and adjusting the pointers (x86_64) or
3. using a similar strategy for both implementations (powerpc).

Indeed, which of those are used doesn't matter much.

So for this change I am proposing compiler support won't be required because both
memcpy and __mempcpy will be transformed to memcpy + s.  Based on assumption that
memcpy is fast as mempcpy implementation I think there is no need to just add
this micro-optimization to only s390, but rather make is general.

GLIBC already has this optimization in the generic string header, it's just that s390 wants
to do something different again. As long as GCC isn't fixed this isn't possible to support
s390 without this header workaround. And we need GCC to improve so things work
better for all the other C libraries...

But the current one at string/string.h is only enabled with !defined _HAVE_STRING_ARCH_mempcpy,
so if a port actually adds a mempcpy one it won't be enabled.  What I am trying to argue it
to just remove the !defined _HAVE_STRING_ARCH_mempcpy and enable it as default for all
ports.

Please don't enable it for x86.  Calling memcpy means we have to
save and restore 2 registers for no good reasons.

Yes, direct call will require save and restore the size for further add
and this is true for most architectures.  My question is if does this
really matter in currently GLIBC internal usage and on programs that
might use it compared against the burden of keeping the various
string*.h header in check for multiple architectures or adding this
logic (mempcpy transformation to memcpy) on compiler.

What burden? There is nothing to do in glibc for x86.  GCC can
inline mempcpy for x86.

In fact I am objecting all the bits GLIBC added on string*.h that only adds complexity for some micro-optimizations. For x86 I do agree that transforming mempcpy to memcpy is no the best strategy.

My rationale is to avoid add even more arch-specific bits in installed headers to add such optimizations.

I believe most of those micro-optimizations belong to GCC, not glibc.
Of course, we should keep the existing ones for older GCCs.  We
should avoid adding new ones.


Does this mean, the proposed way is to not add a macro for mempcpy in sysdeps/s390/bits/string.h or e.g. for another architecture?

If _HAVE_STRING_ARCH_mempcpy is not defined for s390, then it will always use the macro defined in string/string.h, which inlines memcpy + n and the mempcpy-function is not called. The memcpy is transformed to mvc, mvhi for e.g. constant lengths. Otherwise, the memcpy-function is called and the length will be added after the call.

According to PR70140 (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70140), the macro in string/string.h will be removed after this bug is fixed in GCC?
Then I think the decision, if mempcpy or memcpy is called or an inline version is emitted, will be done per architecture backend?

If this is all true, then the mempcpy-function will be called in future if it makes sense!?


If _HAVE_STRING_ARCH_mempcpy will be defined for s390, then mempcpy-function is always called - even for cases with constant length where mvc or mvhi is emitted.
This is definitely not the intention of this patch.
After fixing PR70140, GCC will correctly handle the constant length cases!?

What I am proposing is to avoid add more arch-specific optimization on installed
string*.h headers and instead work on adding them on compiler side.  My view is
we should cleanup as much as possible the string headers and only add optimization
that are more architecture neutral.

Related to patch, my understanding is s390x does not really provide an optimized
mempcpy (it uses the default mempcpy.c) and I think a better approach would be
to add an optimized mempcpy like x88_64 (c365e615f7429aee302f8af7bf07ae262278febb).
The mempcpy won't be optimized directly to mvc instruction, but at least it will
call the optimized memcpy.  The full optimization will be done by correctly
handling the transformation on compiler size (the PR#70140 as you referred).



Okay. Here is the updated patch. It implements mempcpy with memcpy as before, but does not change the s390-specific string.h and _HAVE_STRING_ARCH_mempcpy is not defined. Thus at the moment mempcpy() won't be called, but instead memcpy + n is inlined by the mempcpy macro in string/string.h. After fxing PR70140, either the mempcpy macro in string/string.h has to be removed or _HAVE_STRING_ARCH_mempcpy has to be defined for S390.

Okay to commit?
commit 4017e270314135561e309170ddb547857f9cc16b
Author: Stefan Liebler <stli@linux.vnet.ibm.com>
Date:   Tue Apr 26 13:11:23 2016 +0200

    S390: Implement mempcpy with help of memcpy. [BZ #19765]
    
    There exist optimized memcpy functions on s390, but no optimized mempcpy.
    This patch adds mempcpy entry points in memcpy.S files, which
    use the memcpy implementation. Now mempcpy itself is also an IFUNC function
    as memcpy is and the variants are listed in ifunc-impl-list.c.
    
    The s390 string.h does not define _HAVE_STRING_ARCH_mempcpy.
    Instead mempcpy string/string.h inlines memcpy() + n.
    If n is constant and small enough, GCC emits instructions like mvi or mvc
    and avoids the function call to memcpy.
    If n is not constant, then memcpy is called and n is added afterwards.
    If _HAVE_STRING_ARCH_mempcpy would be defined, mempcpy would be called in
    every case.
    
    According to PR70140 "Inefficient expansion of __builtin_mempcpy"
    (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70140) GCC should handle a
    call to mempcpy in the same way as memcpy. Then either the mempcpy macro
    in string/string.h has to be removed or _HAVE_STRING_ARCH_mempcpy has to
    be defined for S390.
    
    ChangeLog:
    
    	[BZ #19765]
    	* sysdeps/s390/mempcpy.S: New File.
    	* sysdeps/s390/multiarch/mempcpy.c: Likewise.
    	* sysdeps/s390/multiarch/Makefile (sysdep_routines): Add mempcpy.
    	* sysdeps/s390/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list):
    	Add mempcpy variants.
    	* sysdeps/s390/s390-32/memcpy.S: Add mempcpy entry point.
    	(memcpy): Adjust to be usable from mempcpy entry point.
    	(__memcpy_mvcle): Likewise.
    	* sysdeps/s390/s390-64/memcpy.S: Likewise.
    	* sysdeps/s390/s390-32/multiarch/memcpy-s390.S: Add entry points
    	____mempcpy_z196, ____mempcpy_z10 and add __GI_ symbols for mempcpy.
    	(__memcpy_z196): Adjust to be usable from mempcpy entry point.
    	(__memcpy_z10): Likewise.
    	* sysdeps/s390/s390-64/multiarch/memcpy-s390x.S: Likewise.

diff --git a/sysdeps/s390/mempcpy.S b/sysdeps/s390/mempcpy.S
new file mode 100644
index 0000000..c62074b
--- /dev/null
+++ b/sysdeps/s390/mempcpy.S
@@ -0,0 +1,19 @@
+/* CPU specific mempcpy without multiarch - 32/64 bit S/390 version.
+   Copyright (C) 2016 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+/* mempcpy is implemented in memcpy.S.  */
diff --git a/sysdeps/s390/multiarch/Makefile b/sysdeps/s390/multiarch/Makefile
index 0805b07..89324ca 100644
--- a/sysdeps/s390/multiarch/Makefile
+++ b/sysdeps/s390/multiarch/Makefile
@@ -18,7 +18,8 @@ sysdep_routines += strlen strlen-vx strlen-c \
 		   memchr memchr-vx \
 		   rawmemchr rawmemchr-vx rawmemchr-c \
 		   memccpy memccpy-vx memccpy-c \
-		   memrchr memrchr-vx memrchr-c
+		   memrchr memrchr-vx memrchr-c \
+		   mempcpy
 endif
 
 ifeq ($(subdir),wcsmbs)
diff --git a/sysdeps/s390/multiarch/ifunc-impl-list.c b/sysdeps/s390/multiarch/ifunc-impl-list.c
index 62a4359..89898ec 100644
--- a/sysdeps/s390/multiarch/ifunc-impl-list.c
+++ b/sysdeps/s390/multiarch/ifunc-impl-list.c
@@ -69,6 +69,13 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
 			      S390_IS_Z10 (stfle_bits), __memcpy_z10)
 	      IFUNC_IMPL_ADD (array, i, memcpy, 1, __memcpy_default))
 
+  IFUNC_IMPL (i, name, mempcpy,
+	      IFUNC_IMPL_ADD (array, i, mempcpy,
+			      S390_IS_Z196 (stfle_bits), ____mempcpy_z196)
+	      IFUNC_IMPL_ADD (array, i, mempcpy,
+			      S390_IS_Z10 (stfle_bits), ____mempcpy_z10)
+	      IFUNC_IMPL_ADD (array, i, mempcpy, 1, ____mempcpy_default))
+
 #endif /* SHARED */
 
 #ifdef HAVE_S390_VX_ASM_SUPPORT
diff --git a/sysdeps/s390/multiarch/mempcpy.c b/sysdeps/s390/multiarch/mempcpy.c
new file mode 100644
index 0000000..34d8329
--- /dev/null
+++ b/sysdeps/s390/multiarch/mempcpy.c
@@ -0,0 +1,26 @@
+/* Multiple versions of mempcpy.
+   Copyright (C) 2016 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+
+#if defined SHARED && IS_IN (libc)
+# include <ifunc-resolve.h>
+s390_libc_ifunc (__mempcpy)
+
+__asm__ (".weak mempcpy\n\t"
+	 ".set mempcpy,__mempcpy\n\t");
+#endif
diff --git a/sysdeps/s390/s390-32/memcpy.S b/sysdeps/s390/s390-32/memcpy.S
index 2ac51ab..6be5104 100644
--- a/sysdeps/s390/s390-32/memcpy.S
+++ b/sysdeps/s390/s390-32/memcpy.S
@@ -25,12 +25,23 @@
      %r3 = address of source memory area
      %r4 = number of bytes to copy.  */
 
-#ifdef USE_MULTIARCH
-ENTRY(__memcpy_default)
-#else
-ENTRY(memcpy)
+       .text
+ENTRY(__mempcpy)
+	.machine "g5"
+	lr      %r1,%r2             # Use as dest
+	la      %r2,0(%r4,%r2)      # Return dest + n
+	j	.L_G5_start
+END(__mempcpy)
+#ifndef USE_MULTIARCH
+libc_hidden_def (__mempcpy)
+weak_alias (__mempcpy, mempcpy)
+libc_hidden_builtin_def (mempcpy)
 #endif
+
+ENTRY(memcpy)
 	.machine "g5"
+	lr      %r1,%r2             # r1: Use as dest ; r2: Return dest
+.L_G5_start:
 	st      %r13,52(%r15)
 	.cfi_offset 13, -44
 	basr    %r13,0
@@ -41,14 +52,13 @@ ENTRY(memcpy)
 	lr      %r5,%r4
 	srl     %r5,8
 	ltr     %r5,%r5
-	lr      %r1,%r2
 	jne     .L_G5_13
 	ex      %r4,.L_G5_17-.L_G5_16(%r13)
 .L_G5_4:
 	l       %r13,52(%r15)
 	br      %r14
 .L_G5_13:
-	chi	%r5,4096             # Switch to mvcle for copies >1MB
+	chi	%r5,4096            # Switch to mvcle for copies >1MB
 	jh	__memcpy_mvcle
 .L_G5_12:
 	mvc     0(256,%r1),0(%r3)
@@ -59,24 +69,24 @@ ENTRY(memcpy)
 	j       .L_G5_4
 .L_G5_17:
 	mvc     0(1,%r1),0(%r3)
-#ifdef USE_MULTIARCH
-END(__memcpy_default)
-#else
 END(memcpy)
+#ifndef USE_MULTIARCH
 libc_hidden_builtin_def (memcpy)
 #endif
 
 ENTRY(__memcpy_mvcle)
-       # Using as standalone function will result in unexpected
-       # results since the length field is incremented by 1 in order to
-       # compensate the changes already done in the functions above.
-       ahi     %r4,1               # length + 1
-       lr      %r5,%r4             # source length
-       lr      %r4,%r3             # source address
-       lr      %r3,%r5             # destination length = source length
+	# Using as standalone function will result in unexpected
+	# results since the length field is incremented by 1 in order to
+	# compensate the changes already done in the functions above.
+	lr      %r0,%r2             # backup return dest [ + n ]
+	ahi     %r4,1               # length + 1
+	lr      %r5,%r4             # source length
+	lr      %r4,%r3             # source address
+	lr      %r2,%r1             # destination address
+	lr      %r3,%r5             # destination length = source length
 .L_MVCLE_1:
-       mvcle   %r2,%r4,0           # thats it, MVCLE is your friend
-       jo      .L_MVCLE_1
-       lr      %r2,%r1             # return destination address
-       br      %r14
+	mvcle   %r2,%r4,0           # thats it, MVCLE is your friend
+	jo      .L_MVCLE_1
+	lr      %r2,%r0             # return destination address
+	br      %r14
 END(__memcpy_mvcle)
diff --git a/sysdeps/s390/s390-32/multiarch/memcpy-s390.S b/sysdeps/s390/s390-32/multiarch/memcpy-s390.S
index 92ffaea..297a894 100644
--- a/sysdeps/s390/s390-32/multiarch/memcpy-s390.S
+++ b/sysdeps/s390/s390-32/multiarch/memcpy-s390.S
@@ -29,14 +29,23 @@
 
 #if defined SHARED && IS_IN (libc)
 
+ENTRY(____mempcpy_z196)
+	.machine "z196"
+	.machinemode "zarch_nohighgprs"
+	lr      %r1,%r2         # Use as dest
+	la      %r2,0(%r4,%r2)  # Return dest + n
+	j	.L_Z196_start
+END(____mempcpy_z196)
+
 ENTRY(__memcpy_z196)
 	.machine "z196"
 	.machinemode "zarch_nohighgprs"
+	lr      %r1,%r2         # r1: Use as dest ; r2: Return dest
+.L_Z196_start:
 	llgfr   %r4,%r4
 	ltgr    %r4,%r4
 	je      .L_Z196_4
 	aghi    %r4,-1
-	lr      %r1,%r2
 	srlg    %r5,%r4,8
 	ltgr    %r5,%r5
 	jne     .L_Z196_5
@@ -60,13 +69,22 @@ ENTRY(__memcpy_z196)
 	mvc     0(1,%r1),0(%r3)
 END(__memcpy_z196)
 
+ENTRY(____mempcpy_z10)
+	.machine "z10"
+	.machinemode "zarch_nohighgprs"
+	lr      %r1,%r2         # Use as dest
+	la      %r2,0(%r4,%r2)  # Return dest + n
+	j	.L_Z10_start
+END(____mempcpy_z10)
+
 ENTRY(__memcpy_z10)
 	.machine "z10"
 	.machinemode "zarch_nohighgprs"
+	lr      %r1,%r2         # r1: Use as dest ; r2: Return dest
+.L_Z10_start:
 	llgfr   %r4,%r4
 	cgije   %r4,0,.L_Z10_4
 	aghi    %r4,-1
-	lr      %r1,%r2
 	srlg    %r5,%r4,8
 	cgijlh  %r5,0,.L_Z10_13
 .L_Z10_3:
@@ -88,14 +106,23 @@ ENTRY(__memcpy_z10)
 	mvc     0(1,%r1),0(%r3)
 END(__memcpy_z10)
 
+# define __mempcpy ____mempcpy_default
 #endif /* SHARED && IS_IN (libc) */
 
+#define memcpy __memcpy_default
 #include "../memcpy.S"
+#undef memcpy
 
 #if defined SHARED && IS_IN (libc)
 .globl   __GI_memcpy
 .set     __GI_memcpy,__memcpy_default
+.globl   __GI_mempcpy
+.set     __GI_mempcpy,____mempcpy_default
+.globl   __GI___mempcpy
+.set     __GI___mempcpy,____mempcpy_default
 #else
 .globl   memcpy
 .set     memcpy,__memcpy_default
+.weak    mempcpy
+.set     mempcpy,__mempcpy
 #endif
diff --git a/sysdeps/s390/s390-64/memcpy.S b/sysdeps/s390/s390-64/memcpy.S
index 9d60a14..e3ace0c 100644
--- a/sysdeps/s390/s390-64/memcpy.S
+++ b/sysdeps/s390/s390-64/memcpy.S
@@ -27,19 +27,27 @@
 
 
        .text
+ENTRY(__mempcpy)
+	.machine "z900"
+	lgr     %r1,%r2             # Use as dest
+	la      %r2,0(%r4,%r2)      # Return dest + n
+	j	.L_Z900_start
+END(__mempcpy)
+#ifndef USE_MULTIARCH
+libc_hidden_def (__mempcpy)
+weak_alias (__mempcpy, mempcpy)
+libc_hidden_builtin_def (mempcpy)
+#endif
 
-#ifdef USE_MULTIARCH
-ENTRY(__memcpy_default)
-#else
 ENTRY(memcpy)
-#endif
 	.machine "z900"
+	lgr     %r1,%r2             # r1: Use as dest ; r2: Return dest
+.L_Z900_start:
 	ltgr    %r4,%r4
 	je      .L_Z900_4
 	aghi    %r4,-1
 	srlg    %r5,%r4,8
 	ltgr    %r5,%r5
-	lgr     %r1,%r2
 	jne     .L_Z900_13
 .L_Z900_3:
 	larl    %r5,.L_Z900_15
@@ -57,25 +65,24 @@ ENTRY(memcpy)
 	j       .L_Z900_3
 .L_Z900_15:
 	mvc     0(1,%r1),0(%r3)
-
-#ifdef USE_MULTIARCH
-END(__memcpy_default)
-#else
 END(memcpy)
+#ifndef USE_MULTIARCH
 libc_hidden_builtin_def (memcpy)
 #endif
 
 ENTRY(__memcpy_mvcle)
-       # Using as standalone function will result in unexpected
-       # results since the length field is incremented by 1 in order to
-       # compensate the changes already done in the functions above.
-       aghi    %r4,1               # length + 1
-       lgr     %r5,%r4             # source length
-       lgr     %r4,%r3             # source address
-       lgr     %r3,%r5             # destination length = source length
+	# Using as standalone function will result in unexpected
+	# results since the length field is incremented by 1 in order to
+	# compensate the changes already done in the functions above.
+	lgr     %r0,%r2             # backup return dest [ + n ]
+	aghi    %r4,1               # length + 1
+	lgr     %r5,%r4             # source length
+	lgr     %r4,%r3             # source address
+	lgr     %r2,%r1             # destination address
+	lgr     %r3,%r5             # destination length = source length
 .L_MVCLE_1:
-       mvcle   %r2,%r4,0           # thats it, MVCLE is your friend
-       jo      .L_MVCLE_1
-       lgr     %r2,%r1             # return destination address
-       br      %r14
+	mvcle   %r2,%r4,0           # thats it, MVCLE is your friend
+	jo      .L_MVCLE_1
+	lgr     %r2,%r0             # return destination address
+	br      %r14
 END(__memcpy_mvcle)
diff --git a/sysdeps/s390/s390-64/multiarch/memcpy-s390x.S b/sysdeps/s390/s390-64/multiarch/memcpy-s390x.S
index 8f54526..0f5a36e 100644
--- a/sysdeps/s390/s390-64/multiarch/memcpy-s390x.S
+++ b/sysdeps/s390/s390-64/multiarch/memcpy-s390x.S
@@ -29,12 +29,20 @@
 
 #if defined SHARED && IS_IN (libc)
 
+ENTRY(____mempcpy_z196)
+	.machine "z196"
+	lgr     %r1,%r2         # Use as dest
+	la      %r2,0(%r4,%r2)  # Return dest + n
+	j	.L_Z196_start
+END(____mempcpy_z196)
+
 ENTRY(__memcpy_z196)
 	.machine "z196"
+	lgr     %r1,%r2         # r1: Use as dest ; r2: Return dest
+.L_Z196_start:
 	ltgr    %r4,%r4
 	je      .L_Z196_4
 	aghi    %r4,-1
-	lgr     %r1,%r2
 	srlg    %r5,%r4,8
 	ltgr    %r5,%r5
 	jne     .L_Z196_5
@@ -58,11 +66,19 @@ ENTRY(__memcpy_z196)
 	mvc     0(1,%r1),0(%r3)
 END(__memcpy_z196)
 
+ENTRY(____mempcpy_z10)
+	.machine "z10"
+	lgr     %r1,%r2         # Use as dest
+	la      %r2,0(%r4,%r2)  # Return dest + n
+	j	.L_Z10_start
+END(____mempcpy_z10)
+
 ENTRY(__memcpy_z10)
 	.machine "z10"
+	lgr     %r1,%r2         # r1: Use as dest ; r2: Return dest
+.L_Z10_start:
 	cgije   %r4,0,.L_Z10_4
 	aghi    %r4,-1
-	lgr     %r1,%r2
 	srlg    %r5,%r4,8
 	cgijlh  %r5,0,.L_Z10_13
 .L_Z10_3:
@@ -84,14 +100,23 @@ ENTRY(__memcpy_z10)
 	mvc     0(1,%r1),0(%r3)
 END(__memcpy_z10)
 
+# define __mempcpy ____mempcpy_default
 #endif /* SHARED && IS_IN (libc) */
 
+#define memcpy __memcpy_default
 #include "../memcpy.S"
+#undef memcpy
 
 #if defined SHARED && IS_IN (libc)
 .globl   __GI_memcpy
 .set     __GI_memcpy,__memcpy_default
+.globl   __GI_mempcpy
+.set     __GI_mempcpy,____mempcpy_default
+.globl   __GI___mempcpy
+.set     __GI___mempcpy,____mempcpy_default
 #else
 .globl   memcpy
 .set     memcpy,__memcpy_default
+.weak    mempcpy
+.set     mempcpy,__mempcpy
 #endif

Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]