Bug 14749

Summary: Dangerous race condition with vfork in posix_spawn
Product: glibc Reporter: Rich Felker <bugdal>
Component: libcAssignee: Not yet assigned to anyone <unassigned>
Status: NEW ---    
Severity: normal CC: adhemerval.zanella, carlos, dan, drepper.fsp, fweimer, izbyshev, sionescu+BugTrackers
Priority: P2    
Version: unspecified   
Target Milestone: ---   
Host: Target:
Build: Last reconfirmed:

Description Rich Felker 2012-10-21 16:58:05 UTC
posix_spawn uses vfork (to avoid momentary doubling commit charge and improve performance) in cases where it seems "safe", or when explictly requested. However, at least one major race condition seems to have been missed:

Consider a program running with elevated privileges (perhaps a daemon or suid program which initially has root) which is multi-threaded, and which will drop privileged and then execute untrusted code (perhaps a user-provided script or module). The scenario looks like:

Thread A is calling posix_spawn to run a fixed external command (call it child C) that will work fine, and which is safe to invoke, with either the initial privileges or the reduced privileges. Think of something stupid like running "date" to get the current date and time.

Thread B is calling setuid() to drop privileges, then executing untrusted code.

And let's suppose events happen in the following order:

A: vfork
C: vfork returns in child
B: setuid
B: untrusted code runs and pokes at memory A is using
C: now running arbitrary code as root
C: ...
A: vfork returns in parent

Fundamentally, the danger of this race is the possibility of it giving rise to two threads/processes sharing an address space, but with different privileges; this kind of situation must never be allowed to arise.

The simplest way to avoid the race is by using fork instead of vfork, unless vfork is specifically requested. However, that brings back the double-commit-charge issue. An alternative fix is to hold a lock that prevents changing uids/gids during the vfork window. This is also easy since NPTL is already doing a global lock for set*id() to synchronize the id changes across all the threads (since Linux requires each thread to make its own set*id() syscall).
Comment 1 Carlos O'Donell 2014-09-20 04:13:08 UTC
I agree this should be fixed. I think posix_spawn and setXid functions should serialize against eachother. Note that vfork should not serialize against setXid functions becuase there I think users need to be clever enough to know what they are doing, but we should still provide better documentation and example code.
Comment 2 Daniel Drake 2018-03-06 22:55:25 UTC
Thanks for the work done on identifying vfork issues like these and greatly improving posix_spawn() in recent years.

Took me a little while to get my head around this, so here is a comment that may help others reading, and may be useful when writing those docs.

> Fundamentally, the danger of this race is the possibility of it giving rise
> to two threads/processes sharing an address space, but with different
> privileges; this kind of situation must never be allowed to arise.

The Linux kernel setuid system call explicitly allows this undesirable situation to happen (credentials can be changed on a per-thread basis), so my first thought was that "this is not a vfork-specific problem, can't unprivileged thread B already directly attack privileged thread A?" (in the above example)

But the dominant glibc threading implementation indeed goes to some lengths to avoid this. The nptl man page explains that a glibc setuid() call will cause the credentials to be changed over all threads in synchronized fashion. So this situation *is* ordinarily avoided when you use the standard threading API.

The water only gets muddy when you throw some other thread/process API calls into the mix - like vfork(), clone(), or posix_spawn() - all of which can create thread-like entities in an inopportune moment that will escape the attempt to drop privileges throughout the whole process.

With that understood, it seems clear that a higher level API like posix_spawn() should have assurances to avoid this race.
Comment 3 Florian Weimer 2018-03-07 09:53:19 UTC
(In reply to Rich Felker from comment #0)
> Fundamentally, the danger of this race is the possibility of it giving rise
> to two threads/processes sharing an address space, but with different
> privileges; this kind of situation must never be allowed to arise.

Why?  The kernel does this, too, and user-space file servers really want this functionality as well.
Comment 4 Alexey Izbyshev 2019-01-18 16:15:14 UTC
Since 802c1c5a6539024af3c51fd11e6f3cda1f850c62 glibc blocks all signals (even the signal used for setxid synchronization) before clone(CLONE_VM|CLONE_VFORK). Those signals can be unblocked in the parent only after the child exits or execs. 

Doesn't it mean that setuid() called by some thread in the parent is effectively serialized with clone(CLONE_VM|CLONE_VFORK)? If so, this issue can be considered fixed. Or is it a problem that the signals are blocked only in the thread that called clone(), so that setuid() syscall can still be performed by other threads while the child is running? (I haven't checked setuid() implementation for whether it's actually possible.)
Comment 5 Adhemerval Zanella 2019-01-28 15:22:34 UTC
(In reply to Alexey Izbyshev from comment #4)
> Since 802c1c5a6539024af3c51fd11e6f3cda1f850c62 glibc blocks all signals
> (even the signal used for setxid synchronization) before
> clone(CLONE_VM|CLONE_VFORK). Those signals can be unblocked in the parent
> only after the child exits or execs. 
> 
> Doesn't it mean that setuid() called by some thread in the parent is
> effectively serialized with clone(CLONE_VM|CLONE_VFORK)? If so, this issue
> can be considered fixed. Or is it a problem that the signals are blocked
> only in the thread that called clone(), so that setuid() syscall can still
> be performed by other threads while the child is running? (I haven't checked
> setuid() implementation for whether it's actually possible.)

My understanding of the issue described by Rich is although __nptl_setxid synchronizes the id changes across all threads, previous posix_spawn implementation allowed the helper process to possible run with a different ID because it could issue the internal SIGSETXID handler (sighandler_setxid).

The current implementation will serialize the setXid execution only for the effective thread that is calling posix_spawn (__nptl_setxid might change the effective ID for other threads). My understanding is this should suffice to fix the described race condition from comment #1. 

However, I am not sure if there is still an issue that another unrelated thread might issue posix_spawn with a different ID than other thread in the case of is signaled by SIGSETXID by __nptl_setxid, while setXid is still running for all other threads. To fix it we will need to add another synchronization on sighandler_setxid to only finish execution *after* the calling __nptl_setxid issues its syscall.
Comment 6 Alexey Izbyshev 2019-02-02 22:44:42 UTC
Thank you for the response, Adhemerval. Now I see that since SIGSETXID signal handler doesn't synchronize with other threads, the following scenario may happen in an initially privileged process:

A: vfork
C: vfork returns in child
B: setuid <blocked until vfork() returns in A and unblocks SIGSETXID>
D: executes SIGSETXID handler and drops privileges
D: untrusted code runs and pokes at memory C is using
C: now running arbitrary code as root
C: ...
A: vfork returns in parent

So, indeed, while the exact scenario from comment #1 is prevented, a variant of that scenario (requiring at least 3 threads) is still possible, and the bug stands.
Comment 7 Adhemerval Zanella 2019-02-12 21:04:41 UTC
I take a look and the main problem on how to correct synchronize setXid and the internal SETXID signal handler to signal *all* the threads without a race condition.

I think we can use the stack_cache_lock since it is used for both cached allocated and user provided stacks. The only issue is user created processes through clone syscall and there isn't an easy solution for this (musl, for instance, iterates over /proc/self/task and it seems to be racy [1]).

We might assume the clone scenario is not really supported (as we stated in the previous discussion), however since it is also primarily a security hardening issue should we really not consider it?

[1] https://www.openwall.com/lists/musl/2019/02/02/2