Bug 10311

Summary: clone(CLONE_VM) fails with pthread_getattr_np on i386
Product: glibc Reporter: Martin Buchholz <martinrb>
Component: nptl Assignee: Ulrich Drepper <drepper.fsp>
Status: RESOLVED WONTFIX    
Severity: normal CC: bugdal, fweimer, glibc-bugs, lfarkas, nszabolcs
Priority: P2 Flags: fweimer: security-
Version: 2.8   
Target Milestone: ---   
Host: x86_64-unknown-linux-gnu Target:
Build: Last reconfirmed:

Description Martin Buchholz 2009-06-22 19:23:09 UTC
I'm using clone() with flags CLONE_VM, but not CLONE_THREAD.
(background: I'm trying to solve the ancient overcommit failure
when spawning a small Unix process from a big process).

The act of calling clone appears to mess up the pthread library,
but only on i386, not on x86_64, using glibc version 2.7
(The bugzilla Version drop-down does not allow one to specify 2.7;
y'all should fix that)

Here's a shell transcript containing a program 
that demonstrates the problem, and shows that
the problem does not occur when running in 64-bit mode
on 64-bit Linux.  (The problem also occurs when running in 32-bit mode
on 32-bit Linux).

A program like this would be a fine addition to the glibc test suite.

$ set -x; for flag in -m32 -m64; do gcc $flag -lpthread ./clone_bug.c &&
./a.out; done; cat clone_bug.c; uname -a; getconf GNU_LIBPTHREAD_VERSION;
getconf GNU_LIBC_VERSION
+zsh:1464> set -x
+zsh:1464> flag=-m32
+zsh:1464> gcc -m32 -lpthread ./clone_bug.c
+zsh:1464> ./a.out
count=2, pthread_getattr_np failed with errno = "No such process"
+zsh:1464> flag=-m64
+zsh:1464> gcc -m64 -lpthread ./clone_bug.c
+zsh:1464> ./a.out
+zsh:1464> cat clone_bug.c
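#define _GNU_SOURCE   /* for clone() and pthread_getattr_np() */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>   /* for strerror() */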
#include <stdarg.h>
#include <stddef.h>
#include <sys/types.h>
#include <wait.h>
#include <errno.h>
#include <unistd.h>
#include <pthread.h>
#include <syscall.h>
#include <sched.h>

static void
debugPrint(char *format, ...) {
  FILE *tty = fopen("/dev/tty", "w");
  va_list ap;
  va_start(ap, format);
  vfprintf(tty, format, ap);
  va_end(ap);
  fclose(tty);
}

static void debugPids(void) {
//   debugPrint("getpid()=%d gettid()=%d, syscall(getpid)=%d pthread_self=%d\n",
//              getpid(), syscall(SYS_gettid), syscall(SYS_getpid), pthread_self());
  static int count = 0;
  pthread_attr_t attr;
  int result;
  ++count;
  if ((result = pthread_getattr_np(pthread_self(), &attr)) != 0)
    debugPrint("count=%d, pthread_getattr_np failed with errno = \"%s\"\n",
               count, strerror(result));
}

static int childProcess(void *ignored) {
  _exit(0);
  // debugPrint("child\n");
  // execve("/bin/true", NULL, NULL);
  // perror("execve");
}

// I'm sure there's a better way to do this,
// but pthread_join ain't it - we can't trust it.
volatile int done = 0;

void* run(void *x) {
  const int stack_size = 1024 * 1024;
  void *clone_stack = malloc(2 * stack_size);
  int status;
  debugPids();
  int pid = clone(childProcess, clone_stack + stack_size,
                  CLONE_VM | SIGCHLD, NULL);
  waitpid(pid, &status, 0);
  debugPids();
  done = 1;
  pthread_exit(0);
  return NULL;
}

int main(int argc, char *argv[]) {
  pthread_attr_t attr;
  pthread_t tid;

  pthread_attr_init(&attr);
  pthread_create(&tid, &attr, (void* (*)(void*)) run, NULL);
  // pthread_join(tid, NULL);
  while (! done)
    ;
}
+zsh:1464> uname -a
Linux spraggett.mtv.corp.google.com 2.6.24-gg23-generic #1 SMP Fri Jan 30
14:07:49 PST 2009 x86_64 GNU/Linux
+zsh:1464> getconf GNU_LIBPTHREAD_VERSION
NPTL 2.7
+zsh:1464> getconf GNU_LIBC_VERSION
glibc 2.7
Comment 1 Martin Buchholz 2009-06-22 19:32:28 UTC
I'm not sure whether this is
- a glibc bug in the implementation of clone()
- a kernel bug in the implementation of the clone syscall
- or simply an unsupported combination of FLAGS.

The fact that the various versions of clone.S explicitly test for
CLONE_VM and CLONE_THREAD and take different actions suggests that
what I'm doing in the test case should work.

I am suspecting a bug in 
sysdeps/unix/sysv/linux/i386/clone.S
but this is deep magic x86 assembly language,
and I'm not competent to debug it.
Comment 2 Ulrich Drepper 2009-06-22 19:35:54 UTC
If you use clone() you're on your own.
Comment 3 Levente Farkas 2012-09-24 15:04:24 UTC
Are you sure that such an attitude is the right way?

Is it valid C code? If yes, and it fails, then it's a bug in glibc.
Comment 4 H.J. Lu 2012-09-27 16:14:44 UTC
CLONE_VM is tricky. See PR 11214.
Comment 5 Levente Farkas 2012-09-27 16:32:37 UTC
where Ulrich also added such a wonderful and useful comment :-)

My simple question: is the above code correct? If yes, it's a glibc bug, in either the implementation or the documentation.
Comment 6 Rich Felker 2012-09-27 18:15:42 UTC
The code is not correct. Basically, there's nothing you can safely do with CLONE_VM unless the child restricts itself to pure computation and direct syscalls (via sys/syscall.h). If you use any of the standard library, you risk the parent and child clobbering each other's internal states. You also have issues like the fact that glibc caches the pid/tid in userspace, and the fact that glibc expects to always have a valid thread pointer which your call to clone is unable to initialize correctly because it does not know (and should not know) the internal implementation of threads.

I know my warning not to call even async-signal-safe functions in libc, and to make all syscalls manually, seems extreme, but I don't see any way around it given the above issues. Perhaps glibc could document a set of "clone_vm-safe" functions that can be used in the child after cloning with CLONE_VM without having to worry that they will access internal libc state or need a valid thread pointer.

However, I think it's probably better to just refrain from abusing clone and use pthread_create the way it was intended to be used, possibly with unshare() afterwards if you want some threads to have their own signal/fd/etc. namespaces.
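
As a rough illustration of the restriction described above, the sketch below runs a CLONE_VM child that does nothing except issue a raw exit system call. The helper names (raw_child, run_raw_child) are hypothetical, and adding CLONE_VFORK is an assumption made here so that the parent stays suspended until the child exits and the stack can be freed safely. Note that syscall() is itself a libc function that writes errno on failure, so even this is not entirely free of thread-pointer concerns; treat it as a sketch, not a supported pattern.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Child entry point: no stdio, no malloc, just a raw exit system call. */
static int raw_child(void *arg) {
  int code = *(int *)arg;
  syscall(SYS_exit_group, code);
  return code;  /* not reached */
}

/* Hypothetical helper: returns the child's exit status, or -1 on error. */
static int run_raw_child(int code) {
  const size_t stack_size = 64 * 1024;
  char *stack = malloc(stack_size);
  int status;
  pid_t pid;

  if (stack == NULL)
    return -1;
  /* The stack grows down on x86, so pass the top of the allocation.
     CLONE_VFORK (an assumption of this sketch) suspends the parent until
     the child exits, so the two never run in the shared memory at once. */
  pid = clone(raw_child, stack + stack_size,
              CLONE_VM | CLONE_VFORK | SIGCHLD, &code);
  if (pid < 0 || waitpid(pid, &status, 0) < 0) {
    free(stack);
    return -1;
  }
  free(stack);
  return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling run_raw_child(7) should report the child's exit status; anything beyond a raw exit in raw_child runs into the problems listed above.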
Comment 7 Florian Weimer 2012-09-27 18:28:37 UTC
(In reply to comment #6)
> I know my warning not to call even async-signal-safe functions in libc, and to
> make all syscalls manually, seems extreme, but I don't see any way around it
> given the above issues. Perhaps glibc could document a set of "clone_vm-safe"
> functions that can be used in the child after cloning with CLONE_VM without
> having to worry that they will access internal libc state or need a valid
> thread pointer.

errno and cancellation support require working TLS, so the set of safe functions would be quite small and probably not that useful.

If this functionality is desired, an interface for setting up a working libc environment in the quasi-subprocess would likely be more useful.
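
To make the TLS dependency concrete: on glibc, errno expands to a per-thread location, so each thread needs a valid thread pointer just to report an error. The small sketch below (helper names are hypothetical) compares the address of errno as seen by two ordinary threads; a CLONE_VM child created without its own TLS setup has no such private location.

```c
#include <errno.h>
#include <pthread.h>
#include <stddef.h>

/* Thread body: report where this thread's errno lives. */
static void *errno_address(void *unused) {
  (void)unused;
  return &errno;
}

/* Returns 1 if a second thread sees a different errno location than the
   caller, 0 if it sees the same one, and -1 if the thread could not run. */
static int errno_is_per_thread(void) {
  pthread_t thread;
  void *other = NULL;

  if (pthread_create(&thread, NULL, errno_address, NULL) != 0)
    return -1;
  if (pthread_join(thread, &other) != 0)
    return -1;
  return other != (void *)&errno;
}
```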
Comment 8 Martin Buchholz 2012-09-28 00:57:46 UTC
The GPL code to allow a (big) Java process to start (small) subprocesses
without a momentary doubling of memory requirements can be found here:
http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/d94613ac03d8/src/solaris/native/java/lang/UNIXProcess_md.c

vfork seems to have worked well for years.  As I wrote:

/*
 * There are 3 possible strategies we might use to "fork":
 *
 * - fork(2).  Very portable and reliable but subject to
 *   failure due to overcommit (see the documentation on
 *   /proc/sys/vm/overcommit_memory in Linux proc(5)).
 *   This is the ancient problem of spurious failure whenever a large
 *   process starts a small subprocess.
 *
 * - vfork().  Using this is scary because all relevant man pages
 *   contain dire warnings, e.g. Linux vfork(2).  But at least it's
 *   documented in the glibc docs and is standardized by XPG4.
 *   http://www.opengroup.org/onlinepubs/000095399/functions/vfork.html
 *   On Linux, one might think that vfork() would be implemented using
 *   the clone system call with flag CLONE_VFORK, but in fact vfork is
 *   a separate system call (which is a good sign, suggesting that
 *   vfork will continue to be supported at least on Linux).
 *   Another good sign is that glibc implements posix_spawn using
 *   vfork whenever possible.  Note that we cannot use posix_spawn
 *   ourselves because there's no reliable way to close all inherited
 *   file descriptors.
 *
 * - clone() with flags CLONE_VM but not CLONE_THREAD.  clone() is
 *   Linux-specific, but this ought to work - at least the glibc
 *   sources contain code to handle different combinations of CLONE_VM
 *   and CLONE_THREAD.  However, when this was implemented, it
 *   appeared to fail on 32-bit i386 (but not 64-bit x86_64) Linux with
 *   the simple program
 *     Runtime.getRuntime().exec("/bin/true").waitFor();
 *   with:
 *     #  Internal Error (os_linux_x86.cpp:683), pid=19940, tid=2934639536
 *     #  Error: pthread_getattr_np failed with errno = 3 (ESRCH)
 *   We believe this is a glibc bug, reported here:
 *     http://sources.redhat.com/bugzilla/show_bug.cgi?id=10311
 *   but the glibc maintainers closed it as WONTFIX.
 *
 * Based on the above analysis, we are currently using vfork() on
 * Linux and fork() on other Unix systems, but the code to use clone()
 * remains.
 */
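
The vfork() strategy that the comment settles on can be sketched roughly as follows. This is a minimal illustration, not the OpenJDK code: the helper name spawn_with_vfork is hypothetical, and the child deliberately does nothing except execve() and _exit(), since it borrows the parent's address space until one of those calls takes effect.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run path with argv/envp and return its exit status.
   Returns 127 if execve() failed in the child, -1 on vfork/waitpid errors
   or if the child was killed by a signal. */
static int spawn_with_vfork(const char *path, char *const argv[],
                            char *const envp[]) {
  int status;
  pid_t pid = vfork();

  if (pid < 0)
    return -1;
  if (pid == 0) {
    /* Child: no stdio, no malloc, no return from this function. */
    execve(path, argv, envp);
    _exit(127);
  }
  /* Parent resumes only after the child has exec'ed or exited. */
  if (waitpid(pid, &status, 0) < 0)
    return -1;
  return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```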



Comment 9 Martin Buchholz 2018-09-10 16:48:53 UTC
2018 update ...

Java has been happily using vfork on Linux, and posix_spawn on other Unix systems for many years. Code to try using clone() on Linux has never gotten past prototype stage and we have no plans to try again.  Documentation on clone() should make it clear what is supported and what is not.  Why is clone() a public API if there is no safe way to call it?

We currently implement close_from by hand and it would be great if glibc could support it (in a way that would be safe to call between vfork and exec).

Here's the current implementation comment from 
http://hg.openjdk.java.net/jdk/jdk/file/tip/src/java.base/unix/native/libjava/ProcessImpl_md.c

 * There are 4 possible strategies we might use to "fork":
 *
 * - fork(2).  Very portable and reliable but subject to
 *   failure due to overcommit (see the documentation on
 *   /proc/sys/vm/overcommit_memory in Linux proc(5)).
 *   This is the ancient problem of spurious failure whenever a large
 *   process starts a small subprocess.
 *
 * - vfork().  Using this is scary because all relevant man pages
 *   contain dire warnings, e.g. Linux vfork(2).  But at least it's
 *   documented in the glibc docs and is standardized by XPG4.
 *   http://www.opengroup.org/onlinepubs/000095399/functions/vfork.html
 *   On Linux, one might think that vfork() would be implemented using
 *   the clone system call with flag CLONE_VFORK, but in fact vfork is
 *   a separate system call (which is a good sign, suggesting that
 *   vfork will continue to be supported at least on Linux).
 *   Another good sign is that glibc implements posix_spawn using
 *   vfork whenever possible.  Note that we cannot use posix_spawn
 *   ourselves because there's no reliable way to close all inherited
 *   file descriptors.
 *
 * - clone() with flags CLONE_VM but not CLONE_THREAD.  clone() is
 *   Linux-specific, but this ought to work - at least the glibc
 *   sources contain code to handle different combinations of CLONE_VM
 *   and CLONE_THREAD.  However, when this was implemented, it
 *   appeared to fail on 32-bit i386 (but not 64-bit x86_64) Linux with
 *   the simple program
 *     Runtime.getRuntime().exec("/bin/true").waitFor();
 *   with:
 *     #  Internal Error (os_linux_x86.cpp:683), pid=19940, tid=2934639536
 *     #  Error: pthread_getattr_np failed with errno = 3 (ESRCH)
 *   We believe this is a glibc bug, reported here:
 *     http://sources.redhat.com/bugzilla/show_bug.cgi?id=10311
 *   but the glibc maintainers closed it as WONTFIX.
 *
 * - posix_spawn(). While posix_spawn() is a fairly elaborate and
 *   complicated system call, it can't quite do everything that the old
 *   fork()/exec() combination can do, so the only feasible way to do
 *   this, is to use posix_spawn to launch a new helper executable
 *   "jprochelper", which in turn execs the target (after cleaning
 *   up file-descriptors etc.) The end result is the same as before,
 *   a child process linked to the parent in the same way, but it
 *   avoids the problem of duplicating the parent (VM) process
 *   address space temporarily, before launching the target command.
 *
 * Based on the above analysis, we are currently using vfork() on
 * Linux and posix_spawn() on other Unix systems.
Comment 10 Martin Buchholz 2018-09-10 16:49:59 UTC
(Repost of comment 9; the text is identical.)
Comment 11 Rich Felker 2018-09-10 19:10:18 UTC
As far as I can tell, clone() is safe to use without CLONE_VM. This limitation should probably be documented. With CLONE_VM, it's not clear to me what should happen. As noted before, there are strong reasons it won't work and is unsafe. There might be ways it could be made safe to use (or conditions under which it's already safe to use) if you also use CLONE_VFORK or arrange for the same sort of wait operation to take place in some other way.

It's plausible that CLONE_VM with the public clone() function should automatically create a valid TCB in the provided stack area (it doesn't do this now, but perhaps could be modified to do so), but I'm not sure at what point it would make sense to stop. Would it also need to reserve sufficient storage for TLS? Only for TLS that existed at the time of the clone call, or for dynamic TLS added later too? Would it be expected that the child be restricted to an async signal context, or could it interact with libc locks (malloc?) in the parent?
Comment 12 Martin Buchholz 2018-09-12 19:56:29 UTC
glibc's spawni.c does use CLONE_VM but there is other implementation magic in that file that keeps mere mortals from using the same strategy.

./sysdeps/unix/sysv/linux/spawni.c:361:		   CLONE_VM | CLONE_VFORK | SIGCHLD, &args);

See also discussion

"""system and popen fail in case of big application"""
Comment 13 Szabolcs Nagy 2018-09-18 12:01:08 UTC
(In reply to Martin Buchholz from comment #10)
> 2018 update ...
> 
> Java has been happily using vfork on Linux, and posix_spawn on other Unix
> systems for many years. Code to try using clone() on Linux has never gotten
> past prototype stage and we have no plans to try again.  Documentation on
> clone() should make it clear what is supported and what is not.  Why is
> clone() a public API if there is no safe way to call it?

unfortunately the linux manuals mix the system call
(linux behaviour) and libc api (glibc behaviour)
in the same man page in general, and mainly focus
on the linux behaviour, not on the c api semantics.

there are safer and less safe flags to clone when
you call it from c, but this is not documented.

>  *   vfork whenever possible.  Note that we cannot use posix_spawn
>  *   ourselves because there's no reliable way to close all inherited
>  *   file descriptors.

why is it more reliable to close fds on other systems?
is it only linux that lacks closefrom?

>  * - clone() with flags CLONE_VM but not CLONE_THREAD.  clone() is
>  *   Linux-specific, but this ought to work - at least the glibc
>  *   sources contain code to handle different combinations of CLONE_VM
>  *   and CLONE_THREAD.  However, when this was implemented, it
>  *   appeared to fail on 32-bit i386 (but not 64-bit x86_64) Linux with

clone syscall with CLONE_VM is not usable from hosted c code.
(i don't think it was ever intended to be, there is no
reasonable way to specify the c language semantics)

>  * - posix_spawn(). While posix_spawn() is a fairly elaborate and
>  *   complicated system call, it can't quite do everything that the old
>  *   fork()/exec() combination can do, so the only feasible way to do
>  *   this, is to use posix_spawn to launch a new helper executable

what's wrong with using a helper executable?
(it can be /bin/sh, so you don't need a new
executable on the filesystem for this)

>  * Based on the above analysis, we are currently using vfork() on
>  * Linux and posix_spawn() on other Unix systems.

why?

glibc posix_spawn bugs should now be fixed and the close
on exec situation should be the same across unix systems.
Comment 14 Rich Felker 2018-09-18 14:33:24 UTC
To elaborate on what Szabolcs Nagy said in #13 about using the shell, it can be done safely with no need for quoting on the caller side via something like (example for cd):

char *argv[] = { "sh", "-c", "cd \"$1\" && shift && exec \"$@\"", "sh", dir, prog, argv1, argv2, ..., 0 }
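
Spelled out a little further, the argv trick above could be wrapped as in the following sketch. The helper name spawn_in_dir is hypothetical, and finding sh through PATH with posix_spawnp is an assumption of this example. The directory and the program words are passed as separate arguments, so nothing needs shell quoting on the caller side.

```c
#include <spawn.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

extern char **environ;

/* Hypothetical helper: run the NULL-terminated prog_argv with dir as its
   working directory. Returns the exit status, or -1 on failure. */
static int spawn_in_dir(const char *dir, char *const prog_argv[]) {
  static const char script[] = "cd \"$1\" && shift && exec \"$@\"";
  size_t n = 0, i;
  char **argv;
  pid_t pid;
  int rc, status;

  while (prog_argv[n] != NULL)
    n++;
  /* Layout: "sh", "-c", script, "sh" ($0), dir ($1), program words, NULL */
  argv = malloc((n + 6) * sizeof *argv);
  if (argv == NULL)
    return -1;
  argv[0] = "sh";
  argv[1] = "-c";
  argv[2] = (char *)script;
  argv[3] = "sh";
  argv[4] = (char *)dir;
  for (i = 0; i < n; i++)
    argv[5 + i] = prog_argv[i];
  argv[5 + n] = NULL;

  rc = posix_spawnp(&pid, "sh", NULL, NULL, argv, environ);
  free(argv);
  if (rc != 0)
    return -1;
  if (waitpid(pid, &status, 0) < 0)
    return -1;
  return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

If the cd fails, the shell never reaches exec and the helper returns the shell's nonzero status.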
Comment 15 Martin Buchholz 2018-09-18 18:12:08 UTC
(In reply to Rich Felker from comment #14)
> To elaborate on what Szabolcs Nagy said in #13 about using the shell, it can
> be done safely with no need for quoting on the caller side via something
> like (example for cd):
> 
> char *argv[] = { "sh", "-c", "cd \"$1\" && shift && exec \"$@\"", "sh", dir,
> prog, argv1, argv2, ..., 0 }

Using sh seems risky and insufficient.  Are you finding sh via user's PATH?  Where is the POSIX version of sh?  Maybe it's /usr/xpg4/bin/sh ?!  Is cd subject to the CDPATH environment variable?  What about closing file descriptors? etc ...
Comment 16 Martin Buchholz 2018-09-18 18:20:34 UTC
(In reply to Szabolcs Nagy from comment #13)
> (In reply to Martin Buchholz from comment #10)
> 
> >  *   vfork whenever possible.  Note that we cannot use posix_spawn
> >  *   ourselves because there's no reliable way to close all inherited
> >  *   file descriptors.
> 
> why is it more reliable to close fds on other systems?
> is it only linux that lacks closefrom?

closefrom is non-standard and not available on Linux.
IIRC I wrote that before the strategy of exec'ing a helper program was implemented.

> what's wrong with using a helper executable?
> (it can be /bin/sh, so you don't need a new
> executable on the filesystem for this)
> 
> >  * Based on the above analysis, we are currently using vfork() on
> >  * Linux and posix_spawn() on other Unix systems.
> 
> why?

Historical - I implemented the vfork-based solution, targeting Linux; then others implemented the posix_spawn-based solution for other Unix systems.

> 
> glibc posix_spawn bugs should now be fixed and the close
> on exec situation should be the same across unix systems.

We are considering moving from vfork to posix_spawn on Linux as you suggest!
Comment 17 Rich Felker 2018-09-18 18:31:37 UTC
CDPATH is only processed if the argument does not begin with /; I'm assuming you would pass an absolute pathname.

Normally you would also just assume sh is in the PATH, and that it at least minimally resembles a POSIX shell, and use posix_spawnp. For specialized needs like suid of course it would make sense to use an absolute pathname and a specialized helper program.

The shell can perform file descriptor closing if you want (exec %d<&-). I'm in the "closeall is a bogus operation" camp but of course this is controversial.
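
A minimal sketch of generating such a script from C follows; the helper name close_fd_script is hypothetical. One caveat worth stating as an assumption: POSIX only requires shells to accept descriptors 0 through 9 in redirections, so larger numbers may not work with every sh.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: write into buf a script that closes fd and then
   execs the remaining arguments. Returns 0, or -1 if buf is too small. */
static int close_fd_script(char *buf, size_t size, int fd) {
  int n = snprintf(buf, size, "exec %d<&- && exec \"$@\"", fd);
  return (n >= 0 && (size_t)n < size) ? 0 : -1;
}
```

The resulting string would be passed as the argument after "-c", in the same way as the cd example in comment 14.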