This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [PATCH] mips/o32: fix internal_syscall5/6/7



On 17/08/2017 13:17, Maciej W. Rozycki wrote:
> On Wed, 16 Aug 2017, Joseph Myers wrote:
> 
>>>  If the answer to any of these questions is "yes", then would factoring 
>>> out the syscall `asm' along with the associated VLA declaration to a 
>>> helper `always_inline' function help or would it not?
>>
>> I don't think that would help.  An asm can never make assumptions about 
>> which parts of the stack are used for what, only use its operands.
> 
>  There may be ABI restrictions however, which could provide guarantees 
> beyond those resulting from the lone `asm' operands.  And it would be 
> enough if we could prove that a certain arrangement has to be done in 
> order not to break the ABI.  I can't think of anything right now though 
> and if neither you nor anyone else can, then we'll have to live with what 
> we have right now.
> 
>>>  I mean it is a tiny optimisation, but some syscalls are frequently 
>>> called, so if we can avoid a waste of resources, then why not?
>>
>> Are any 5/6/7-argument syscalls frequently called?
> 
>  Good question, however I have no data available.
> 
>  Anyway, here's my counter-proposal implementing the approach previously 
> outlined.  I have passed it through regular MIPS o32 testing with these 
> changes in test outputs resulting:
> 
> @@ -2575,7 +2575,7 @@
>  PASS: nptl/tst-cond22
>  PASS: nptl/tst-cond23
>  PASS: nptl/tst-cond24
> -FAIL: nptl/tst-cond25
> +PASS: nptl/tst-cond25
>  PASS: nptl/tst-cond3
>  PASS: nptl/tst-cond4
>  PASS: nptl/tst-cond5
> @@ -2704,7 +2704,7 @@
>  PASS: nptl/tst-rwlock12
>  PASS: nptl/tst-rwlock13
>  PASS: nptl/tst-rwlock14
> -FAIL: nptl/tst-rwlock15
> +PASS: nptl/tst-rwlock15
>  PASS: nptl/tst-rwlock16
>  PASS: nptl/tst-rwlock17
>  PASS: nptl/tst-rwlock18
> 
>  The drawback is it adds a bit to code generated, e.g. `__libc_pwrite' 
> (from nptl/pwrite.o and nptl/pwrite.os) grows by 4 and 6 instructions 
> respectively for non-PIC and PIC code respectively, and the whole 
> libraries:
> 
>    text    data     bss     dec     hex filename
> 1483315   21129   11560 1516004  1721e4 libc.so
>  105482     960    8448  114890   1c0ca nptl/libpthread.so
> 
> vs:
> 
>    text    data     bss     dec     hex filename
> 1484295   21133   11560 1516988  1725bc libc.so
>  105974     960    8448  115382   1c2b6 nptl/libpthread.so
> 
> due to the insertion of the VLA size calculation (although GCC is smart 
> enough to reuse a value of 0 already available, e.g.:
> 
>   38:	7c03e83b 	rdhwr	v1,$29
>   3c:	8c638b70 	lw	v1,-29840(v1)
>   40:	14600018 	bnez	v1,a4 <__libc_pwrite+0xa4>
>   44:	000787c3 	sra	s0,a3,0x1f
>   48:	000318c0 	sll	v1,v1,0x3
>   4c:	03a08825 	move	s1,sp
>   50:	03a3e823 	subu	sp,sp,v1
> 
> and save an isntruction) and the use of an extra register to preserve the 
> value of $sp across the block containing the VLA (as also seen with $s1 in 
> the disassembly above) even though it could use $fp that holds the same 
> value instead (e.g. continuing from the above:
> 
>   74:	0220e825 	move	sp,s1
>   78:	03c0e825 	move	sp,s8
> 
> ).  It would be good to know how this compares to Adhemerval's proposal.

My point is I think we should aim for compiler optimization safeness
(to avoid code breakage over compiler defined default flags) and taking
as base current approach to *avoid* VLA on GLIBC I do not think it is
good approach to use it as a bridge to force GCC to generate the expected
code.

I still thinking trying to optimize for 5/6/7 syscall argument is over
engineering in this *specific* case.  As I put in my last message,
5/6/7 argument syscalls are used for 

pread, pwrite, lseek, llseek, ppoll, posix_fadvice, posix_fallocate, 
sync_file_range, fallocate, preadv, pwritev, preadv2, pwritev2, select,
pselect, mmap, readahead, epoll_pwait, splice, recvfrom, sendto, recvmmsg,
msgsnd, msgrcv, msgget, msgctl, semop, semget, semctl, semtimedop, shmat,
shmdt, shmget, and shmctl. 

Which are the one generated from C implementation (some are still auto
generated).  The majority of them are blocking syscalls, so both context
switch plus the required work for syscall completion itself will taking
proportionally all the required time.  So trying to squeeze some cycles
don't really pay off comparing to code maintainability (just all this
discussion of which C construct would be safe enough to generate the 
correct stack spill plus the current issue should indicate we should
aim for correctness first).
 


> 
>   Maciej
> 
> 	* sysdeps/unix/sysv/linux/mips/mips32/sysdep.h 
> 	(FORCE_FRAME_POINTER): Remove macro.
> 	(internal_syscall5): Use a variable-length array to force the
> 	use of a frame pointer.
> 	(internal_syscall6): Likewise.
> 	(internal_syscall7): Likewise.
> ---
>  sysdeps/unix/sysv/linux/mips/mips32/sysdep.h |   24 +++++++++++++++++-------
>  1 file changed, 17 insertions(+), 7 deletions(-)
> 
> glibc-mips-o32-syscall-stack.diff
> Index: glibc/sysdeps/unix/sysv/linux/mips/mips32/sysdep.h
> ===================================================================
> --- glibc.orig/sysdeps/unix/sysv/linux/mips/mips32/sysdep.h	2017-04-11 21:27:25.000000000 +0100
> +++ glibc/sysdeps/unix/sysv/linux/mips/mips32/sysdep.h	2017-08-16 20:49:15.758029215 +0100
> @@ -264,18 +264,20 @@
>  
>  /* We need to use a frame pointer for the functions in which we
>     adjust $sp around the syscall, or debug information and unwind
> -   information will be $sp relative and thus wrong during the syscall.  As
> -   of GCC 4.7, this is sufficient.  */
> -#define FORCE_FRAME_POINTER						\
> -  void *volatile __fp_force __attribute__ ((unused)) = alloca (4)
> +   information will be $sp relative and thus wrong during the syscall.
> +   We use a variable-length array to persuade GCC to use $fp.  */
>  
>  #define internal_syscall5(v0_init, input, number, err,			\
>  			  arg1, arg2, arg3, arg4, arg5)			\
>  ({									\
>  	long _sys_result;						\
>  									\
> -	FORCE_FRAME_POINTER;						\
> +	size_t s = 0;							\
> +	asm ("" : "+r" (s));						\
>  	{								\
> +	char vla[s << 3];						\
> +	asm ("" : : "p" (vla));						\
> +									\
>  	register long __s0 asm ("$16") __attribute__ ((unused))		\
>  	  = (number);							\
>  	register long __v0 asm ("$2");					\
> @@ -306,8 +308,12 @@
>  ({									\
>  	long _sys_result;						\
>  									\
> -	FORCE_FRAME_POINTER;						\
> +	size_t s = 0;							\
> +	asm ("" : "+r" (s));						\
>  	{								\
> +	char vla[s << 3];						\
> +	asm ("" : : "p" (vla));						\
> +									\
>  	register long __s0 asm ("$16") __attribute__ ((unused))		\
>  	  = (number);							\
>  	register long __v0 asm ("$2");					\
> @@ -339,8 +345,12 @@
>  ({									\
>  	long _sys_result;						\
>  									\
> -	FORCE_FRAME_POINTER;						\
> +	size_t s = 0;							\
> +	asm ("" : "+r" (s));						\
>  	{								\
> +	char vla[s << 3];						\
> +	asm ("" : : "p" (vla));						\
> +									\
>  	register long __s0 asm ("$16") __attribute__ ((unused))		\
>  	  = (number);							\
>  	register long __v0 asm ("$2");					\
> 


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]