This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.

Re: Fallout from dlopen() blocking SIGSYS

From: Rich Felker <dalias at libc dot org>
To: Christian Brauner <christian dot brauner at ubuntu dot com>
Cc: Gian-Carlo Pascutto <gpascutto at mozilla dot com>, libc-alpha at sourceware dot org, Adhemerval Zanella <adhemerval dot zanella at linaro dot org>, Emilio Cobos Álvarez <ealvarez at mozilla dot com>, Jed Davis <jld at mozilla dot com>, Florian Weimer <fweimer at redhat dot com>
Date: Thu, 5 Dec 2019 09:39:21 -0500
Subject: Re: Fallout from dlopen() blocking SIGSYS
References: <be38a4dd-f573-6251-57e5-6c118255ce59@mozilla.com> <256bfb1c-47c9-ebd0-db2c-c7720237b4ec@linaro.org> <20191204221319.GZ16318@brightrain.aerifal.cx> <20191205094724.fj43mzwyfhm2lsbc@wittgenstein>

On Thu, Dec 05, 2019 at 10:47:25AM +0100, Christian Brauner wrote:
> On Wed, Dec 04, 2019 at 05:13:19PM -0500, Rich Felker wrote:
> > On Wed, Dec 04, 2019 at 05:46:54PM -0300, Adhemerval Zanella wrote:
> > > 
> > > 
> > > On 03/12/2019 11:31, Gian-Carlo Pascutto wrote:
> > > > (reposting here per request from Florian Weimer)
> > > > 
> > > > This glibc patch:
> > > > 
> > > > Block signals during the initial part of dlopen
> > > > (a2e8aa0d9ea648068d8be52dd7b15f1b6a008e23)
> > > > 
> > > > is going to break every Firefox release of the last few years. We use a
> > > > seccomp-bpf filter to sandbox various processes. In some of these
> > > > processes we don't want to do a dlopen() of untrusted code while we're
> > > > not sandboxed yet, for example in the process we use to isolate Google's
> > > > Widevine DRM modules from any private data on the system.
> > > > 
> > > > seccomp-bpf will intercept various filesystem related syscalls and raise
> > > > SIGSYS, at which point our code will contact a broker in the parent
> > > > process that checks whether the file we want to read is acceptable
> > > > to us, and then passes down the file handle.
> 
> Hey everyone,
> 
> I saw this fly by on the libc-alpha mailing list late last night. I
> think that with work we've recently done in the upstream kernel, the
> SECCOMP_RET_TRAP approach of trapping and signaling is no longer needed,
> at least on newer kernels. That won't help you with legacy kernels, but
> it's still worth considering switching over.
> 
> Apologies in advance if this gets a bit long, and if I have
> misunderstood your use-case.
> 
> Afaict, you use SECCOMP_RET_TRAP to intercept syscalls and emulate them
> in userspace, i.e. the intercepted syscall itself is never performed. An
> alternative to SECCOMP_RET_TRAP is the SECCOMP_RET_USER_NOTIF feature
> that we released with Linux v5.0, which does not rely on signals or
> ptrace.
> 
> Here's the gist:
> SECCOMP_RET_USER_NOTIF enables a process (supervisee) to retrieve an fd
> for its seccomp filter. This fd can then be handed to another (usually
> more privileged) process (supervisor). The supervisor will then be able
> to receive, on that fd, seccomp notifications about the syscalls the
> supervisee attempts.
> 
> We have integrated this feature into userspace and currently make heavy
> use of this to intercept mknod(), mount(), and other syscalls in user
> namespaces aka in containers.
> For example, if the mknod() syscall matches a device in a pre-determined
> whitelist the privileged supervisor will perform the mknod syscall in
> lieu of the unprivileged supervisee and report back to the supervisee on
> the success or failure of its attempt. If the syscall does not match a
> device in a whitelist we simply report an error.
> 
> Here's a quick asciinema demo
> 
> https://asciinema.org/a/285491
> Be a little patient during the mount() interception I'm demonstrating :).
> 
> Here are more technical details:
> A task can register a seccomp filter that returns
> SECCOMP_RET_USER_NOTIF for a syscall. When it loads the filter, the
> caller can specify SECCOMP_FILTER_FLAG_NEW_LISTENER. This flag
> instructs seccomp to return a so-called "notifier fd", which is an
> anonymous-inode-based file descriptor that is close-on-exec by default.
> A dummy code example from our kernel testing is:
> 
> struct sock_filter filter[] = {
> 	BPF_STMT(BPF_LD+BPF_W+BPF_ABS, offsetof(struct seccomp_data, nr)),
> 
> 	/* Let's listen for mknod() syscalls. */
> 	BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_mknod, 0, 1),
> 
> 	/* If this is a mknod(), notify the listener. */
> 	BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_USER_NOTIF),
> 
> 	/* If this wasn't a mknod() syscall just let it through. */
> 	BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW),
> };
> 
> struct sock_fprog prog = {
> 	.len = (unsigned short)ARRAY_SIZE(filter),
> 	.filter = filter,
> };
> 
> int notify_fd = seccomp(SECCOMP_SET_MODE_FILTER, SECCOMP_FILTER_FLAG_NEW_LISTENER, &prog);
> 
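
For readers who want to compile the snippet above: a self-contained
variant might look like the sketch below. It assumes a libc without a
seccomp() wrapper and therefore goes through syscall(2). The helper
names build_notify_filter() and install_notify_filter() are made up for
illustration. A production filter would also check seccomp_data.arch
before looking at the syscall number.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper: write a 4-instruction filter into out[] that
 * notifies the listener for syscall number sysno and allows all others.
 * Returns the number of instructions written.
 */
size_t build_notify_filter(struct sock_filter *out, unsigned int sysno)
{
	assert(out != NULL);

	/* Load the syscall number. */
	out[0] = (struct sock_filter)BPF_STMT(BPF_LD + BPF_W + BPF_ABS,
					      offsetof(struct seccomp_data, nr));
	/* On a match fall through to out[2], otherwise skip to out[3]. */
	out[1] = (struct sock_filter)BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K,
					      sysno, 0, 1);
	out[2] = (struct sock_filter)BPF_STMT(BPF_RET + BPF_K,
					      SECCOMP_RET_USER_NOTIF);
	out[3] = (struct sock_filter)BPF_STMT(BPF_RET + BPF_K,
					      SECCOMP_RET_ALLOW);
	return 4;
}

/*
 * Hypothetical helper: load the filter into the calling task and return
 * the notifier fd, or -1 with errno set.
 */
int install_notify_filter(unsigned int sysno)
{
	struct sock_filter filter[4];
	struct sock_fprog prog = {
		.len = (unsigned short)build_notify_filter(filter, sysno),
		.filter = filter,
	};

	/* Unprivileged tasks must set no_new_privs before loading a filter. */
	if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) < 0)
		return -1;

	return (int)syscall(__NR_seccomp, SECCOMP_SET_MODE_FILTER,
			    SECCOMP_FILTER_FLAG_NEW_LISTENER, &prog);
}
```

A supervisee would call install_notify_filter(__NR_mknod) once, early,
and keep the returned fd only long enough to pass it on.
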
> The task now has a notify_fd for its own seccomp filter. This notify_fd
> can now be handed off to your broker process.
> 
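
The message does not say how the fd gets handed off. One common way,
assuming supervisee and broker already share an AF_UNIX socket, is an
SCM_RIGHTS control message. The sketch below shows that technique;
send_fd() and recv_fd() are illustrative names, not anything from the
thread.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control buffer sized and aligned for exactly one file descriptor. */
union fd_cmsg {
	char buf[CMSG_SPACE(sizeof(int))];
	struct cmsghdr align;
};

/* Send fd over the connected AF_UNIX socket sock. Returns 0 or -1. */
int send_fd(int sock, int fd)
{
	char byte = 0;	/* at least one byte of real data must accompany it */
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union fd_cmsg u;
	struct msghdr msg = {
		.msg_iov = &iov,
		.msg_iovlen = 1,
		.msg_control = u.buf,
		.msg_controllen = sizeof(u.buf),
	};
	struct cmsghdr *cmsg;

	assert(fd >= 0);
	memset(&u, 0, sizeof(u));
	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one fd from sock. Returns the new fd (close-on-exec) or -1. */
int recv_fd(int sock)
{
	char byte;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union fd_cmsg u;
	struct msghdr msg = {
		.msg_iov = &iov,
		.msg_iovlen = 1,
		.msg_control = u.buf,
		.msg_controllen = sizeof(u.buf),
	};
	struct cmsghdr *cmsg;
	int fd;

	memset(&u, 0, sizeof(u));
	if (recvmsg(sock, &msg, MSG_CMSG_CLOEXEC) != 1)
		return -1;

	cmsg = CMSG_FIRSTHDR(&msg);
	if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -1;

	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}
```

After a successful send_fd() the supervisee can close its own copy of
the notify_fd, since the broker's copy keeps the listener alive.
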
> The broker can now add that notify_fd to an epoll() loop. When a
> mknod() syscall (per the above example) is issued, the kernel sets
> EPOLLIN on the notify_fd. The broker is thereby notified that a syscall
> has been attempted, and the task that issued it is blocked until the
> broker responds.
> The broker can then use the following ioctl:
> 
> struct seccomp_notif req = {};
> ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_RECV, &req);
> 
> to receive information about the attempted syscall. The information
> includes the pid of the calling task, a unique cookie identifying the
> request, and the seccomp data.
> 
> struct seccomp_notif {
> 	__u64 id;
> 	__u32 pid;
> 	__u32 flags;
> 	struct seccomp_data data;
> };
> 
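
Putting the wait and the receive together, the broker side could be
sketched roughly as follows. This uses poll() on a single fd instead of
a full epoll() loop purely to keep the example short, and
wait_for_notification() is a made-up name. The errno values chosen for
the timeout and the "not readable" case are arbitrary choices of this
sketch.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <linux/seccomp.h>
#include <poll.h>
#include <string.h>
#include <sys/ioctl.h>

/*
 * Wait up to timeout_ms for a notification on notify_fd and receive it
 * into *req. Returns 0 on success, -1 with errno set otherwise.
 */
int wait_for_notification(int notify_fd, int timeout_ms,
			  struct seccomp_notif *req)
{
	struct pollfd pfd = { .fd = notify_fd, .events = POLLIN };
	int n;

	assert(req != NULL);

	n = poll(&pfd, 1, timeout_ms);
	if (n < 0)
		return -1;
	if (n == 0) {
		errno = ETIMEDOUT;	/* nothing arrived in time */
		return -1;
	}
	if (!(pfd.revents & POLLIN)) {
		errno = EIO;		/* e.g. POLLHUP: nothing to read */
		return -1;
	}

	/* The receive buffer has to be zeroed before the ioctl. */
	memset(req, 0, sizeof(*req));
	return ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_RECV, req);
}
```

On success req->data.nr holds the syscall number and req->data.args[]
the raw argument values of the blocked syscall.
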
> Non-pointer arguments can be inspected by the broker directly via
> seccomp_data without risking a TOCTOU. Pointer-based arguments can
> be safely inspected if the broker reads _all_ the data it is interested
> in via /proc/<pid>/mem before it decides whether or not to emulate the
> syscall, and afterwards checks that the request is still valid to
> guard against pid recycling:
> 
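> char buf[PATH_MAX];
> char mem_path[64];
> int fd;
> 
> /* Open the memory of the task that issued the syscall. */
> snprintf(mem_path, sizeof(mem_path), "/proc/%u/mem", req.pid);
> fd = open(mem_path, O_RDONLY | O_CLOEXEC);
> 
> /* Read the path argument from the mknod() syscall. */
> pread(fd, buf, sizeof(buf), req.data.args[0]);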
> 
> /* 
>  * Verify that the task has not exited in the meantime and been recycled
>  * and we've read the wrong memory.
>  */
> ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_ID_VALID, &req.id);
> 
> If that succeeds, the broker knows that the task is still alive and
> that its pid hasn't been recycled.
> Now the broker can inspect the path and the device number in the mknod()
> syscall and decide whether or not to emulate the syscall.
> 
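
The read-then-validate sequence can be wrapped in two small helpers.
This is only a sketch with error handling kept minimal;
read_task_string() and notification_still_valid() are hypothetical
names. The important part is the ordering: read everything first, then
ask the kernel whether the cookie is still valid, and only then trust
what was read.

```c
#define _GNU_SOURCE
#define _FILE_OFFSET_BITS 64
#include <assert.h>
#include <fcntl.h>
#include <linux/seccomp.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Copy a NUL-terminated string located at address addr in task pid into
 * buf. Returns 0 on success, or -1 if the memory could not be read or
 * no NUL byte was found within the bytes that were read.
 */
int read_task_string(pid_t pid, uint64_t addr, char *buf, size_t size)
{
	char mem_path[64];
	ssize_t n;
	int fd;

	assert(buf != NULL && size > 0);

	snprintf(mem_path, sizeof(mem_path), "/proc/%d/mem", (int)pid);
	fd = open(mem_path, O_RDONLY | O_CLOEXEC);
	if (fd < 0)
		return -1;

	/* The file offset in /proc/<pid>/mem is the virtual address. */
	n = pread(fd, buf, size, (off_t)addr);
	close(fd);

	if (n <= 0 || memchr(buf, '\0', (size_t)n) == NULL)
		return -1;
	return 0;
}

/* Returns 1 if the request identified by id is still pending, else 0. */
int notification_still_valid(int notify_fd, uint64_t id)
{
	__u64 cookie = id;

	return ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_ID_VALID, &cookie) == 0;
}
```

A broker would call read_task_string(req.pid, req.data.args[0], ...)
and then notification_still_valid(notify_fd, req.id), discarding the
string if the second call returns 0.
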
> After the broker has emulated the syscall, it can instruct the kernel to
> report an errno value back to the task it listens to via:
> 
> struct seccomp_notif_resp {
> 	__u64 id;
> 	__s64 val;
> 	__s32 error;
> 	__u32 flags;
> };
> 
> struct seccomp_notif_resp resp = {
> 	.id = req.id,
> 	/*
> 	 * Let's use a really dumb errno value in this example. The kernel
> 	 * takes it negated, like an in-kernel syscall return value.
> 	 */
> 	.error = -ENOANO,
> };
> 
> ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_SEND, &resp);
> 
> At this point the calling task will see a very strange error value
> ENOANO for its mknod() syscall.
> 
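
Besides failing the syscall, the same response struct can report a
successful emulation through its val field. A small sketch of both
cases follows; the fill_*() helper names are invented here, and the
sketch assumes (as the kernel's own seccomp sample code does, to my
knowledge) that the error field carries a negated errno.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <linux/seccomp.h>
#include <string.h>
#include <sys/ioctl.h>

/* Make the blocked syscall fail with errno_value (a positive errno). */
void fill_error_response(struct seccomp_notif_resp *resp, __u64 id,
			 int errno_value)
{
	assert(errno_value > 0);
	memset(resp, 0, sizeof(*resp));
	resp->id = id;
	resp->error = -errno_value;	/* assumption: negated errno */
}

/* Make the blocked syscall return val, as if it had succeeded. */
void fill_success_response(struct seccomp_notif_resp *resp, __u64 id,
			   long long val)
{
	memset(resp, 0, sizeof(*resp));
	resp->id = id;
	resp->val = val;
}

/* Hand the response to the kernel, which unblocks the waiting task. */
int send_response(int notify_fd, struct seccomp_notif_resp *resp)
{
	return ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_SEND, resp);
}
```

For the mknod() example, the broker would use
fill_success_response(&resp, req.id, 0) after creating the device node
itself, and fill_error_response(&resp, req.id, EPERM) for anything not
on its whitelist.
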
> In kernel 5.5 I've extended this feature to also allow continuing
> syscalls by setting the SECCOMP_USER_NOTIF_FLAG_CONTINUE flag when
> sending a response, i.e.
> 
> struct seccomp_notif_resp resp = {
> 	.id = req.id,
> 	.flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE,
> };
> 
> ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_SEND, &resp);
> 
> will cause the syscall to be performed. This is needed for syscalls that
> can't be emulated in userspace. In contrast to mknod(), a lot of the
> other syscalls we intercept (e.g. setxattr()) cannot be easily filtered
> because they have pointer arguments. Additionally, some of them might
> actually succeed in user namespaces (e.g. setxattr() for all "user.*"
> xattrs). Without a way to tell seccomp to continue from a user notifier,
> we were stuck with performing all of these syscalls in lieu of the
> container. This is a huge security liability, since it is extremely
> difficult to correctly assume all of the necessary privileges of the
> calling task such that the syscall can be successfully emulated without
> escaping other security restrictions (think missing CAP_MKNOD for
> mknod(), or MS_NODEV on a filesystem etc.). This can be solved by
> telling seccomp to resume the syscall.
> 
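
One detail worth spelling out for the continue case: as far as I
understand the kernel side, a response with
SECCOMP_USER_NOTIF_FLAG_CONTINUE set is rejected unless val and error
are both zero. The sketch below builds such a response and mirrors that
check in userspace. The helper names are illustrative, and the fallback
#define is only there for pre-5.5 kernel headers.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/seccomp.h>
#include <string.h>

#ifndef SECCOMP_USER_NOTIF_FLAG_CONTINUE
#define SECCOMP_USER_NOTIF_FLAG_CONTINUE (1UL << 0)
#endif

/* Tell the kernel to let the blocked syscall run unmodified. */
void fill_continue_response(struct seccomp_notif_resp *resp, __u64 id)
{
	assert(resp != NULL);
	memset(resp, 0, sizeof(*resp));	/* val and error stay 0 */
	resp->id = id;
	resp->flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE;
}

/*
 * Userspace mirror of the sanity check described above: no unknown
 * flags, and no val/error combined with the continue flag.
 * Returns 1 if the response looks acceptable, 0 otherwise.
 */
int response_is_well_formed(const struct seccomp_notif_resp *resp)
{
	if (resp->flags & ~SECCOMP_USER_NOTIF_FLAG_CONTINUE)
		return 0;
	if ((resp->flags & SECCOMP_USER_NOTIF_FLAG_CONTINUE) &&
	    (resp->error != 0 || resp->val != 0))
		return 0;
	return 1;
}
```

Checking the response before the SECCOMP_IOCTL_NOTIF_SEND ioctl turns a
mixed-up "continue plus error" reply into a local bug report instead of
an EINVAL from the kernel while the supervisee stays blocked.
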
> The continue feature obviously requires _massive amounts of caution_.
> Here's my comment from the kernel header:
> 
> "Note, the SECCOMP_USER_NOTIF_FLAG_CONTINUE flag must be used with caution!
> If set by the process supervising the syscalls of another process the
> syscall will continue. This is problematic because of an inherent TOCTOU.
> An attacker can exploit the time while the supervised process is waiting on
> a response from the supervising process to rewrite syscall arguments which
> are passed as pointers of the intercepted syscall.
> It should be absolutely clear that this means that the seccomp notifier
> _cannot_ be used to implement a security policy! It should only ever be used
> in scenarios where a more privileged process supervises the syscalls of a
> lesser privileged process to get around kernel-enforced security
> restrictions when the privileged process deems this safe. In other words,
> in order to continue a syscall the supervising process should be sure that
> another security mechanism or the kernel itself will sufficiently block
> syscalls if arguments are rewritten to something unsafe.
> 
> Similar precautions should be applied when stacking SECCOMP_RET_USER_NOTIF
> or SECCOMP_RET_TRACE. For SECCOMP_RET_USER_NOTIF filters acting on the
> same syscall, the most recently added filter takes precedence. This means
> that the new SECCOMP_RET_USER_NOTIF filter can override any
> SECCOMP_IOCTL_NOTIF_SEND from earlier filters, essentially allowing all
> such filtered syscalls to be executed by sending the response
> SECCOMP_USER_NOTIF_FLAG_CONTINUE. Note that SECCOMP_RET_TRACE can equally
> be overridden by SECCOMP_USER_NOTIF_FLAG_CONTINUE."
> 
> In any case, maybe this is a viable alternative for you on new kernels
> that allows you to avoid signals and such.
> 
> I'm happy to help review this or advise, since we've had quite some
> experience implementing and making use of this.
> I've spoken on this feature a few times before and will be talking about
> it during FOSDEM too.

This is great news, and sounds pretty much exactly like what I
expected a replacement should look like. Thanks for the detailed
reply!

Rich

References:
- Fallout from dlopen() blocking SIGSYS
  - From: Gian-Carlo Pascutto
- Re: Fallout from dlopen() blocking SIGSYS
  - From: Adhemerval Zanella
- Re: Fallout from dlopen() blocking SIGSYS
  - From: Rich Felker
- Re: Fallout from dlopen() blocking SIGSYS
  - From: Christian Brauner
