Bug 17214

Summary: Expose a clone variant that shares stacks instead of jumping to a new one
Product: glibc Reporter: Steven Stewart-Gallus <sstewartgallus00>
Component: nptlAssignee: Not yet assigned to anyone <unassigned>
Status: RESOLVED OBSOLETE    
Severity: normal CC: adhemerval.zanella, bugdal, carlos, drepper.fsp, jld, rickyz
Priority: P2 Flags: fweimer: security-
Version: unspecified   
Target Milestone: ---   
Host: Target:
Build: Last reconfirmed:

Description Steven Stewart-Gallus 2014-07-29 21:23:44 UTC
GLibc caches PIDs. Much has been said about this. Regardless, some people have to use the low level clone system call which does not update the PID cache (for making new processes only, trying to make one's own threads that interact with NPTL would be a mess). I would like a way to reset the PID cache so that the low level clone system call is usable.
Comment 1 Carlos O'Donell 2014-07-30 04:07:40 UTC
(In reply to Steven Stewart-Gallus from comment #0)
> GLibc caches PIDs. Much has been said about this. Regardless, some people
> have to use the low level clone system call which does not update the PID
> cache (for making new processes only, trying to make one's own threads that
> interact with NPTL would be a mess). I would like a way to reset the PID
> cache so that the low level clone system call is usable.

Can you help me understand what you use clone for in this case?

What clone flags are you using?
Comment 2 Steven Stewart-Gallus 2014-07-30 17:39:06 UTC
Sure I'd love to explain more about my problem.

First I'm experimenting with sandboxing my program by using clone and
CLONE_NEWUSER and CLONE_NEWPID.

Doing unshare(CLONE_NEWPID | CLONE_NEWUSER) and then a fork doesn't
work because it leaves the parent process unable to use multiple
threads afterwards. It also can't use multiple threads in
initialization because of a race condition where pthread_join doesn't
wait until a thread is completely and totally destroyed (only almost
completely I think). I think I could loop at this point until I get a
succesful fork but this is also kind of ugly.

As a result, I'm now virtualized as having really strong privileges
with possibly the wrong PID in the cache and I need to setup my
sandbox by mounting and unsharing a bunch of stuff. I actually wanted
to exec (which would also reset the PID cache obviously) at this point
to a utility program to setup my sandbox but unfortunately
capabilities get dropped after an exec so I have to keep executing
after the clone. So now I need to do general purpose init or systemd
like code after the clone to setup my sandbox. I could hard code
process 1 in all of my code but I need to call into library code which
does things like use getpid (there are a few threading things which
use getpid).

Currently I work around this by forking again and just having PID 1 be
a simple init and PID 2 being systemd or OpenRC. Unfortunately, this
is hacky because it opens up the security problem of an attacker
ptracing PID 1 (which hasn't sandboxed itself because it is just a
simple init) and escaping the sandbox. I can then work around that by
setting PID 1 not dumpable but I dislike losing my ability to ptrace
things. Also it might not be a smart idea to have my monitor process
(PID 2) able to receive signals by attackers.

Another possible workaround is to use /proc/self/uid_map and
/proc/self/gid_map and setuid and setgid to different privileges after
init which seems like the right approach to me. Unfortunately, my user
space still hasn't rolled out the changes needed for me to have
multiple users in my VM (unless I start out as root which is another
barrel of problems).

My code works without having a method to reset the PID cache but I
think with such a method it would be much simpler and more robust.
Comment 3 Steven Stewart-Gallus 2014-07-30 17:41:17 UTC
I didn't mean PID 2 was literally systemd or OpenRC only that it did things similar to them.
Comment 4 Carlos O'Donell 2014-07-31 23:25:17 UTC
(In reply to Steven Stewart-Gallus from comment #2)
> Doing unshare(CLONE_NEWPID | CLONE_NEWUSER) and then a fork doesn't
> work because it leaves the parent process unable to use multiple
> threads afterwards. It also can't use multiple threads in
> initialization because of a race condition where pthread_join doesn't
> wait until a thread is completely and totally destroyed (only almost
> completely I think). I think I could loop at this point until I get a
> succesful fork but this is also kind of ugly.

The problem is that you're using clone directly. From the perspective of the implementation you've just looked behind that curtain, and we don't support user code calling clone.

Having said that, we are trying hard to understand the individual use cases for using clone to see if there is anything we can do to help provide a unified API that coordinates with threads and TLS.

Why does clone(CLONE_NEWPID|CLONE_NEWUSER)+fork leave the process unable to use multiple threads afterwards?

What race condition is there in pthread_join?
 
> As a result, I'm now virtualized as having really strong privileges
> with possibly the wrong PID in the cache and I need to setup my
> sandbox by mounting and unsharing a bunch of stuff. I actually wanted
> to exec (which would also reset the PID cache obviously) at this point
> to a utility program to setup my sandbox but unfortunately
> capabilities get dropped after an exec so I have to keep executing
> after the clone. So now I need to do general purpose init or systemd
> like code after the clone to setup my sandbox. I could hard code
> process 1 in all of my code but I need to call into library code which
> does things like use getpid (there are a few threading things which
> use getpid).

I expect that you are not doing this with threads running. If you clone after having created a thread the userspace thread structure will still holds the old tid, and we use that for various purposes (locking, signaling, threaded forking) which are now wrong in the currently new PID namespace.

> Currently I work around this by forking again and just having PID 1 be
> a simple init and PID 2 being systemd or OpenRC. Unfortunately, this
> is hacky because it opens up the security problem of an attacker
> ptracing PID 1 (which hasn't sandboxed itself because it is just a
> simple init) and escaping the sandbox. I can then work around that by
> setting PID 1 not dumpable but I dislike losing my ability to ptrace
> things. Also it might not be a smart idea to have my monitor process
> (PID 2) able to receive signals by attackers.

Exit from the first task (PID 1) and leave the child (PID 2) running?

> Another possible workaround is to use /proc/self/uid_map and
> /proc/self/gid_map and setuid and setgid to different privileges after
> init which seems like the right approach to me. Unfortunately, my user
> space still hasn't rolled out the changes needed for me to have
> multiple users in my VM (unless I start out as root which is another
> barrel of problems).

Fragile.

> My code works without having a method to reset the PID cache but I
> think with such a method it would be much simpler and more robust.

I don't disagree and this problem has come up once before from the linux containers people who have to do odd things to work around the issue.

I've emailed the lxc to ask them what they did.
Comment 5 Steven Stewart-Gallus 2014-08-01 21:59:16 UTC
> Why does clone(CLONE_NEWPID|CLONE_NEWUSER)+fork leave the process
> unable to use multiple threads afterwards?

CLONE_NEWPID gives a new PID namespace. Cloning off a new thread after
the PID namespace has been unshared would end up with two threads in
the same thread group being in two different PID namespaces. Not only
is that confusing and weird but it is also possibly a security
problem. As a result, unshare(CLONE_NEWPID) isn't allowed to be used
with other threads running and other threads can't be created after
unshare(CLONE_NEWPID).

> What race condition is there in pthread_join?

It's not really a problem in pthread_join or I wouldn't expect GLibc
to put in the difficult work to solve this case but it seems as if
pthread_join sometimes returns before a thread has been fully
destroyed (and only mostly destroyed). The basic problem is that
thread reports that it is destroyed BEFORE __exit_thread_inline is
called. Obviously, it is impossible (or at least would be really
hacky) for a thread to report that it is destroyed AFTER exiting. Of
course, there is always the possibility of asking for kernel
developers to create a system call to atomically report that a thread
is destroyed and destroy it at the same time. But I don't think this
use case is all that important. It might also be possible to use
waitpid with __WALL to solve this problem.
 
> I expect that you are not doing this with threads running. If you
> clone after having created a thread the userspace thread structure
> will still holds the old tid, and we use that for various purposes
> (locking, signaling, threaded forking) which are now wrong in the
> currently new PID namespace.

Yes.

> Exit from the first task (PID 1) and leave the child (PID 2) running?

Actually, I can't. Remember, if (PID 1) exits the whole system goes
down so PID 1 has to wait on PID 2 and report PID 2's exit status.

> > My code works without having a method to reset the PID cache but I
> > think with such a method it would be much simpler and more robust.

> I don't disagree and this problem has come up once before from the
> linux containers people who have to do odd things to work around the
> issue.

> I've emailed the lxc to ask them what they did.

Thank you very much.
Comment 6 Rich Felker 2014-08-26 04:38:28 UTC
The kernel already does report thread exit atomically as part of SYS_exit, via a futex wake; this is what the CLONE_CHILD_CLEARTID flag is for. So I don't think the problem you're concerned about with pthread_join actually exists.
Comment 7 Steven Stewart-Gallus 2014-08-26 18:11:13 UTC
Strange, Rich.

The source in pthread_join says:

    /* Wait for the child.  */
    lll_wait_tid (pd->tid);

However, I'm sure I wasn't mistaken when my code was randomly failing
at unshare(CLONE_NEWPID).

The idea that the pthread_join completed before the kernel recognized
the thread as dead seemed very logical to me but I may be wrong.
Alternatively, there could a tiny bug in the kernel or in GLibc as
this logic is probably infreguently tested.  I'll try to create a
reduced test for the problem.

In any case, unshare(CLONE_NEWPID); then fork is kind of fragile and
weird (for one, the spawned process after fork doesn't get immunity to
signals like an init should).
Comment 8 Steven Stewart-Gallus 2014-08-26 18:31:24 UTC
If I use the GLibc wrapper around clone that resets the PID cache
right?  So if people are loathe to expose a function to reset the PID
cache can they make clone or a variation of clone more convenient for
the fork style cases?  Maybe my problem can be solved by simply making
clone(NULL, NULL, etc..) behave like fork or doing something else
similar?  Because I was calling the system call directly because I
didn't want to create a new stack and function for no reason.
Comment 9 Ricky Zhou 2014-10-30 22:01:11 UTC
For what it's worth, we have a similar use case to what Steven mentioned. We would like to spawn new processes in different pid, mount, fs, etc. namespaces, but the clone wrapper requires specifying a new stack even when CLONE_VM is not specified.

We cannot unshare(CLONE_NEWPID); fork(); for each new process because Linux does not support calling unshare(CLONE_NEWPID) multiple times. As Steven said, I think either of the following would solve our problems:

 - Expose a way to invalidate glibc's PID cache
 - Allow child_stack to be NULL (leading to fork-like behavior) when CLONE_VM is not set.

I'd be happy to write up patches for these if there's interest :-)
Comment 10 Jed Davis 2014-12-18 00:15:27 UTC
(In reply to Ricky Zhou from comment #9)
> We cannot unshare(CLONE_NEWPID); fork(); for each new process because Linux
> does not support calling unshare(CLONE_NEWPID) multiple times. As Steven
> said, I think either of the following would solve our problems:
> 
>  - Expose a way to invalidate glibc's PID cache
>  - Allow child_stack to be NULL (leading to fork-like behavior) when
> CLONE_VM is not set.

Another alternative: expose a variant of fork() which can be passed additional flags to pass to clone; __fork_with_flags(0) would be equivalent to fork(), and it could fail with EINVAL if inappropriate bits are set (CLONE_VM, any of the tid set/clear flags, anything in the signal number field).  There may be a better name than __fork_with_flags, but hopefully the idea is clear enough.

The advantage over __clone with no child_stack is that it would run pthread_atfork handlers; this would allow taking advantage of, for example, malloc implementations that use pthread_atfork to allow allocation on the child side of a multithreaded fork.

Gecko is more or less able to work around this without too much effort because the child process creation already had to be refactored to work in the absence of pthread_atfork (https://bugzilla.mozilla.org/show_bug.cgi?id=772734#c15), but it would be nice to have the possibility of eventually getting away from that requirement.
Comment 11 Steven Stewart-Gallus 2014-12-19 00:13:00 UTC
It occurs to me that a clone that shares stacks would also be useful for vfork as one can't really manually implement a memory sharing fork with syscall unless one dives into mucky assembly.
Comment 12 Rich Felker 2014-12-19 01:17:17 UTC
Steven, a clone that behaved like vfork (returning twice in the same shared VM) would have all the same problems as vfork, i.e. it would be completely incompatible with modern compilers that do any nontrivial optimizations. vfork was eliminated for very good reasons, and it should not be brought back.
Comment 13 Steven Stewart-Gallus 2014-12-19 20:29:16 UTC
I am well aware of many of the problems of vfork. You are right that a huge problem is that it doesn't work well with certain compiler optimizations (such as done by Clang in particular). I suppose I really should just use vfork with a new stack and avoid one problem of it (although I'm not sure that works as vfork sets all memory except the current stack as read only in the process, maybe some hack where the function would need to jump to the new stack, vfork and then jump back in the child would be needed). As well, I would definitely prefer using posix_spawn over vfork but unfortunately I can't for a few use cases (also the current GLibc implementation of posix_spawn doesn't use the pipe trick to report errors). Also, vfork wasn't eliminated at all and is still around for very good reasons.
Comment 14 Rich Felker 2014-12-19 21:01:19 UTC
Your claim that "vfork sets all memory except the current stack as read only in the process" is false and generally impossible to implement for various reasons. At the kernel level vfork is identical to clone with flags=CLONE_VM|CLONE_VFORK|SIGCHLD, and all CLONE_VFORK does is block scheduling of the parent until the child successfully execs or terminates.

By "eliminated", I meant dropped from the standards and deprecated. Of course it still exists in implementations that provide it, but the formalism for what you can do after vfork is wrong with respect to compiler semantics and thus it's unusable. For instance in the code:

if (!(pid = vfork())) {
    execve(...);
    _exit(1);
}

the traditional rules have been followed, but since _exit is a _Noreturn function, the compiler is free to write the arguments for execve over top of storage that was being used for local variables or spilled/saved registers in the caller (assuming their addresses are not visible to execve). This is valid because they can never be accessed again in the child. But since the child and parent share memory, the parent's stack will be trashed when it resumes execution. There are hacks that could be done at the compiler level to recognize vfork as special and avoid this, but it's a game of whack-a-mole. Sharing a stack between processes is just a broken design.
Comment 15 Steven Stewart-Gallus 2015-01-26 22:32:52 UTC
It occurs to me that maybe clone with vfork could be exposed in a safe manner along the lines of the following (although GLibc would probably implement it in assembly directly).

__attribute__((noinline)) __attribute__((noclone))
__attribute__((no_sanitize_address)) static pid_t safe_vfork(
    int (*volatile f)(void *), void *volatile arg)
{
	__atomic_signal_fence(__ATOMIC_SEQ_CST);

	pid_t child = vfork();
	if (0 == child)
		_Exit(f(arg));

	return child;
}

Aside from the weirdness of vfork this should be no less safer than something along the lines of the following:

__attribute__((noinline)) __attribute__((noclone))
__attribute__((no_sanitize_address)) static pid_t safe_vclone(
    int volatile clone_flags, int (*volatile f)(void *), void *volatile arg)
{
	long maybe_page_size = sysconf(_SC_PAGE_SIZE);
	assert(maybe_page_size >= 0);

	long maybe_stack_min_size = sysconf(_SC_THREAD_STACK_MIN);
	assert(maybe_stack_min_size >= 0);

	size_t page_size = maybe_page_size;
	size_t stack_min_size = maybe_stack_min_size;

	/* We need an extra page for signals */
	size_t stack_size = stack_min_size + page_size;

	size_t stack_and_guard_size = page_size + stack_size + page_size;
	void *child_stack = mmap(
	    0, stack_and_guard_size, PROT_READ | PROT_WRITE,
	    MAP_PRIVATE | MAP_ANONYMOUS | MAP_GROWSDOWN | MAP_STACK, -1, 0);
	if (0 == child_stack)
		return -1;

	/* Guard pages are shared between the stacks */
	if (-1 == mprotect((char *)child_stack, page_size, PROT_NONE))
		goto on_err;

	if (-1 == mprotect((char *)child_stack + page_size + stack_size,
	                   page_size, PROT_NONE))
		goto on_err;

	void *stack_start = (char *)child_stack + page_size + stack_size;

	__atomic_signal_fence(__ATOMIC_SEQ_CST);

	pid_t child =
	    clone(f, stack_start, clone_flags | CLONE_VM | CLONE_VFORK, arg);
	if (-1 == child)
		goto on_err;

	munmap(child_stack, stack_and_guard_size);

	return child;

on_err:
	;
	int errnum = errno;
	munmap(child_stack, stack_and_guard_size);
	errno = errnum;
	return -1;
}
Comment 16 Adhemerval Zanella 2017-05-05 21:34:25 UTC
GLIBC 2.24+ now does not cache the pid anymore and posix_spawn was rewritten with do pretty much as the second example in comment #15 exemplifies.  We still require a non-null stack for clone even with CLONE_VM, but current approach potentially solves commend #9 (which was not the intended bug report by the way).

Also, as Rich commented in #12, vfork was deprecated for various reason we I think we shouldn't bring it back as well.

I am inclined to consider this issue OBSOLETE and I will close it down in next week (12th may 2017) if I got not reply.
Comment 17 Adhemerval Zanella 2017-05-16 13:08:43 UTC
Closed as described in comment #16.