Bug 10354 - posix_spawn should use vfork() in more cases than presently
Summary: posix_spawn should use vfork() in more cases than presently
Alias: None
Product: glibc
Classification: Unclassified
Component: libc (show other bugs)
Version: unspecified
: P2 normal
Target Milestone: 2.24
Assignee: Ulrich Drepper
Depends on: 20178
  Show dependency treegraph
Reported: 2009-06-30 23:45 UTC by Martin Buchholz
Modified: 2016-05-30 14:39 UTC (History)
7 users (show)

See Also:
Last reconfirmed:
fweimer: security-


Note You need to log in before you can comment on or make changes to this bug.
Description Martin Buchholz 2009-06-30 23:45:24 UTC
glibc posix_spawn uses vfork() in some cases, fork() in others.
Currently it is rather conservative in this regard.
For example, if there are any file actions, vfork() is avoided.
This restriction can be lifted, I think,
especially for the common case of closing file descriptors.
Comment 1 Ulrich Drepper 2009-10-30 06:32:11 UTC
No.  All file operations are cancellation points.  This cannot be done after vfork.
Comment 2 Rich Felker 2011-10-10 03:08:20 UTC
The excuse for not fixing this does not make sense. The syscalls could be used directly (without the wrapper code for cancellation), or cancellation could simply be disabled by the parent prior to vfork and reenabled by the parent after vfork returns (which won't happen until the child calls exec).
Comment 3 Ulrich Drepper 2011-10-15 13:47:29 UTC
(In reply to comment #2)
> The excuse for not fixing this does not make sense.

What doesn't make sense is you.  The function is defined in terms of using close, open etc.  The semantics is fixed.
Comment 4 Rich Felker 2011-10-15 20:52:20 UTC
It's defined in terms of open, close, etc. happening in the context of the child process, where cancellation cannot play any role since there's no way to obtain a pthread_t value referring to a thread in the child. Of course implementing posix_spawn using vfork causes the cancellation state to be shared with the parent, which would be problematic unless special care is made to protect the state. I suggested several ways this could be done to allow vfork to be used for posix_spawn in all cases.
Comment 5 Rich Felker 2012-10-21 16:59:58 UTC
This bug was marked "fixed" for no apparent reason. The issues reported still exist; vfork is not used if any non-default attributes or file actions are present. This makes it impossible to use posix_spawn from extremely large processes due to commit charge exhaustion.
Comment 6 Jackie Rosen 2014-02-16 17:45:53 UTC Comment hidden (spam)
Comment 7 Pádraig Brady 2014-05-27 22:36:13 UTC
It's important that posix_spawn() avoids fork() where possible as that's the whole reason for having this interface. I see that one can force using
vfork() with the POSIX_SPAWN_USEVFORK flag as mentioned in bug #378
though then you're on your own wrt safety.

So I'm wondering what specifically callers of posix_spawn() would
have to be careful to do?
  * Not setup pthread_atfork() handlers
  * pthread_setcancelstate(PTHREAD_CANCEL_DISABLE,  NULL); first (and enable after)
  * anything else?

Also I was wondering whether some restrictions could be lifted in
the implementation by using clone() rather than vfork().
I.E. vfork() where possible to support MMU less systems,
but then fall back to clone(..., CLONE_VM | SIGCHLD, ...);
to at least have an otherwise efficient implementation.
Comment 8 Rich Felker 2014-05-28 04:24:02 UTC
Using clone with CLONE_VM and a new stack/start-function solves most of the insurmountable problems with vfork that make it unsafe, but there are still a number of considerations that come from sharing memory. In particular, you must ensure that no signal handlers run in the child while it's sharing memory with the parent, and the parent must be responsible for freeing the child's stack, and thus must synchronize with the execve call in the child.

I have a working implementation in musl which solves these problems (it uses clone with CLONE_VM for all posix_spawn operations) if anyone is interested in looking at it and doing something similar for glibc.
Comment 9 Carlos O'Donell 2014-09-20 04:22:33 UTC
(In reply to Rich Felker from comment #8)
> Using clone with CLONE_VM and a new stack/start-function solves most of the
> insurmountable problems with vfork that make it unsafe, but there are still
> a number of considerations that come from sharing memory. In particular, you
> must ensure that no signal handlers run in the child while it's sharing
> memory with the parent, and the parent must be responsible for freeing the
> child's stack, and thus must synchronize with the execve call in the child.
> I have a working implementation in musl which solves these problems (it uses
> clone with CLONE_VM for all posix_spawn operations) if anyone is interested
> in looking at it and doing something similar for glibc.

I agree this issue should be fixed.

posix_spawn() could use vfork() in many more cases than it does today, and cancellation is no excuse not to fix this. I agree that cancellation in the context of the child doesn't make any sense. The child needs to just not touch the cancel state, and with signals blocked it will never act on a SIGCANCEL request anyway.

We can perhaps revisit this after we fix cancellation to work correctly.
Comment 10 Pádraig Brady 2015-05-25 19:57:45 UTC
preliminary timing of @musllibc's posix_spawn vs fork+exec shows
it ~25x faster for large parent processes. (~360µs vs 9ms for 40MB in small mappings). For a trivial staic binary the difference is 280µs vs 450µs
Comment 11 Rich Felker 2015-05-25 20:18:48 UTC
Comment 10 is quoting some of my tweets, starting with: https://twitter.com/RichFelker/status/602313644026761216
Comment 12 cvs-commit@gcc.gnu.org 2016-03-07 04:59:12 UTC
This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "GNU C Library master sources".

The branch, master has been updated
       via  9ff72da471a509a8c19791efe469f47fa6977410 (commit)
       via  1eb8930608705702d5746e5491bab4e4429fcb83 (commit)
       via  f83bb9b8e97656ae0d3e2a31e859363e2d4d5832 (commit)
      from  fee9eb6200f0e44a4b684903bc47fde36d46f1a5 (commit)

Those revisions listed above that are new to this repository have
not appeared on any other notification email; so we list those
revisions in full, below.

- Log -----------------------------------------------------------------

commit 9ff72da471a509a8c19791efe469f47fa6977410
Author: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Date:   Tue Jan 19 17:33:32 2016 -0200

    posix: New Linux posix_spawn{p} implementation
    This patch implements a new posix_spawn{p} implementation for Linux.  The main
    difference is it uses the clone syscall directly with CLONE_VM and CLONE_VFORK
    flags and a direct allocated stack.  The new stack and start function solves
    most the vfork limitation (possible parent clobber due stack spilling).  The
    remaning issue are related to signal handling:
      1. That no signal handlers must run in child context, to avoid corrupt
         parent's state.
      2. Child must synchronize with parent to enforce stack deallocation and
         to possible return execv issues.
    The first one is solved by blocking all signals in child, even NPTL-internal
    ones (SIGCANCEL and SIGSETXID).  The second issue is done by a stack allocation
    in parent and a synchronization with using a pipe or waitpid (in case or error).
    The pipe has the advantage of allowing the child signal an exec error (checked
    with new tst-spawn2 test).
    There is an inherent race condition in pipe2 usage for architectures that do not
    support the syscall directly.  In such cases the a pipe plus fctnl is used
    instead and it may lead to file descriptor leak in parent (as decribed by fcntl
    The child process stack is allocate with a mmap with MAP_STACK flag using
    default architecture stack size.  Although it is slower than use a stack buffer
    from parent, it allows some slack for the compatibility code to run scripts
    with no shebang (which may use a buffer with size depending of argument list
    Performance should be similar to the vfork default posix implementation and
    way faster than fork path (vfork on mostly linux ports are basically
    clone with CLONE_VM plus CLONE_VFORK).  The only difference is the syscalls
    required for the stack allocation/deallocation.
    It fixes BZ#10354, BZ#14750, and BZ#18433.
    Tested on i386, x86_64, powerpc64le, and aarch64.
    	[BZ #14750]
    	[BZ #10354]
    	[BZ #18433]
    	* include/sched.h (__clone): Add hidden prototype.
    	(__clone2): Likewise.
    	* include/unistd.h (__dup): Likewise.
    	* posix/Makefile (tests): Add tst-spawn2.
    	* posix/tst-spawn2.c: New file.
    	* sysdeps/posix/dup.c (__dup): Add hidden definition.
    	* sysdeps/unix/sysv/linux/aarch64/clone.S (__clone): Likewise.
    	* sysdeps/unix/sysv/linux/alpha/clone.S (__clone): Likewise.
    	* sysdeps/unix/sysv/linux/arm/clone.S (__clone): Likewise.
    	* sysdeps/unix/sysv/linux/hppa/clone.S (__clone): Likewise.
    	* sysdeps/unix/sysv/linux/i386/clone.S (__clone): Likewise.
    	* sysdeps/unix/sysv/linux/ia64/clone2.S (__clone): Likewise.
    	* sysdeps/unix/sysv/linux/m68k/clone.S (__clone): Likewise.
    	* sysdeps/unix/sysv/linux/microblaze/clone.S (__clone): Likewise.
    	* sysdeps/unix/sysv/linux/mips/clone.S (__clone): Likewise.
    	* sysdeps/unix/sysv/linux/nios2/clone.S (__clone): Likewise.
    	* sysdeps/unix/sysv/linux/powerpc/powerpc32/clone.S (__clone):
    	* sysdeps/unix/sysv/linux/powerpc/powerpc64/clone.S (__clone):
    	* sysdeps/unix/sysv/linux/s390/s390-32/clone.S (__clone): Likewise.
    	* sysdeps/unix/sysv/linux/s390/s390-64/clone.S (__clone): Likewise.
    	* sysdeps/unix/sysv/linux/sh/clone.S (__clone): Likewise.
    	* sysdeps/unix/sysv/linux/sparc/sparc32/clone.S (__clone): Likewise.
    	* sysdeps/unix/sysv/linux/sparc/sparc64/clone.S (__clone): Likewise.
    	* sysdeps/unix/sysv/linux/tile/clone.S (__clone): Likewise.
    	* sysdeps/unix/sysv/linux/x86_64/clone.S (__clone): Likewise.
    	* sysdeps/unix/sysv/linux/nptl-signals.h
    	(____nptl_is_internal_signal): New function.
    	* sysdeps/unix/sysv/linux/spawni.c: New file.


commit 1eb8930608705702d5746e5491bab4e4429fcb83
Author: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Date:   Fri Jan 22 09:58:49 2016 -0200

    posix: execvpe cleanup
    This patch removes all the dynamic allocation on execvpe code and
    instead use direct stack allocation.  This is QoI approach to make
    it possible use in scenarios where memory is shared with parent
    (vfork or clone with CLONE_VM).
    For default process spawn (script file without a shebang), stack
    allocation is bounded by NAME_MAX plus PATH_MAX plus 1.  Large
    file arguments returns an error (ENAMETOOLONG).  This differs than
    current GLIBC pratice in general, but it used to limit stack
    allocation for large inputs.  Also, path in PATH environment variable
    larger than PATH_MAX are ignored.
    The shell direct execution exeception, where execve returns ENOEXEC,
    might requires a large stack allocation due large input argument list.
    Tested on i686, x86_64, powerpc64le, and aarch64.
    	* posix/execvpe.c (__execvpe): Remove dynamic allocation.
    	* posix/Makefile (tests): Add tst-execvpe{1,2,3,4,5,6}.
    	* posix/tst-execvp1.c (do_test): Use a macro to call execvp.
    	* posix/tst-execvp2.c (do_test): Likewise.
    	* posix/tst-execvp3.c (do_test): Likewise.
    	* posix/tst-execvp4.c (do_test): Likewise.
    	* posix/tst-execvpe1.c: New file.
    	* posix/tst-execvpe2.c: Likewise.
    	* posix/tst-execvpe3.c: Likewise.
    	* posix/tst-execvpe4.c: Likewise.
    	* posix/tst-execvpe5.c: Likewise.
    	* posix/tst-execvpe6.c: Likewise.


commit f83bb9b8e97656ae0d3e2a31e859363e2d4d5832
Author: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Date:   Fri Jan 29 11:43:40 2016 -0200

    posix: Remove dynamic memory allocation from execl{e,p}
    GLIBC execl{e,p} implementation might use malloc if the total number of
    arguments exceed initial assumption size (1024).  This might lead to
    issues in two situations:
    1. execl/execle is stated to be async-signal-safe by POSIX [1].  However
       if execl is used in a signal handler with a large argument set (that
       may call malloc internally) and if the resulting call fails it might
       lead malloc in the program in a bad state.
    2. If the functions are used in a vfork/clone(VFORK) situation it also
       might issue malloc internal bad state.
    This patch fixes it by using stack allocation instead.  It also fixes
    Tested on x86_64.
    [1] http://pubs.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html
    	[BZ #19534]
    	* posix/execl.c (execl): Remove dynamic memory allocation.
    	* posix/execle.c (execle): Likewise.
    	* posix/execlp.c (execlp): Likewise.


Summary of changes:
 ChangeLog                                         |   54 +++
 include/sched.h                                   |    2 +
 include/unistd.h                                  |    1 +
 posix/Makefile                                    |    5 +-
 posix/execl.c                                     |   68 ++---
 posix/execle.c                                    |   70 ++---
 posix/execlp.c                                    |   66 ++--
 posix/execvpe.c                                   |  255 ++++++--------
 posix/tst-execvp1.c                               |    6 +-
 posix/tst-execvp2.c                               |    5 +-
 posix/tst-execvp3.c                               |    5 +-
 posix/tst-execvp4.c                               |    6 +-
 posix/tst-execvpe1.c                              |   20 +
 posix/tst-execvpe2.c                              |   20 +
 posix/tst-execvpe3.c                              |   20 +
 posix/tst-execvpe4.c                              |   20 +
 posix/tst-execvpe5.c                              |  157 ++++++++
 posix/tst-execvpe6.c                              |  150 ++++++++
 posix/tst-spawn2.c                                |   72 ++++
 sysdeps/posix/dup.c                               |    2 +-
 sysdeps/unix/sysv/linux/aarch64/clone.S           |    1 +
 sysdeps/unix/sysv/linux/alpha/clone.S             |    1 +
 sysdeps/unix/sysv/linux/arm/clone.S               |    1 +
 sysdeps/unix/sysv/linux/hppa/clone.S              |    1 +
 sysdeps/unix/sysv/linux/i386/clone.S              |    1 +
 sysdeps/unix/sysv/linux/ia64/clone2.S             |    2 +
 sysdeps/unix/sysv/linux/m68k/clone.S              |    1 +
 sysdeps/unix/sysv/linux/microblaze/clone.S        |    1 +
 sysdeps/unix/sysv/linux/mips/clone.S              |    1 +
 sysdeps/unix/sysv/linux/nios2/clone.S             |    1 +
 sysdeps/unix/sysv/linux/nptl-signals.h            |   10 +
 sysdeps/unix/sysv/linux/powerpc/powerpc32/clone.S |    1 +
 sysdeps/unix/sysv/linux/powerpc/powerpc64/clone.S |    1 +
 sysdeps/unix/sysv/linux/s390/s390-32/clone.S      |    2 +
 sysdeps/unix/sysv/linux/s390/s390-64/clone.S      |    2 +
 sysdeps/unix/sysv/linux/sh/clone.S                |    1 +
 sysdeps/unix/sysv/linux/sparc/sparc32/clone.S     |    1 +
 sysdeps/unix/sysv/linux/sparc/sparc64/clone.S     |    1 +
 sysdeps/unix/sysv/linux/spawni.c                  |  404 +++++++++++++++++++++
 sysdeps/unix/sysv/linux/tile/clone.S              |    1 +
 sysdeps/unix/sysv/linux/x86_64/clone.S            |    1 +
 41 files changed, 1162 insertions(+), 278 deletions(-)
 create mode 100644 posix/tst-execvpe1.c
 create mode 100644 posix/tst-execvpe2.c
 create mode 100644 posix/tst-execvpe3.c
 create mode 100644 posix/tst-execvpe4.c
 create mode 100644 posix/tst-execvpe5.c
 create mode 100644 posix/tst-execvpe6.c
 create mode 100644 posix/tst-spawn2.c
 create mode 100644 sysdeps/unix/sysv/linux/spawni.c
Comment 13 Adhemerval Zanella 2016-03-07 05:37:49 UTC
Fixed by 9ff72da471a509a8c19791efe469f47fa6977410.