This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: Fallout from dlopen() blocking SIGSYS
- From: Christian Brauner <christian dot brauner at ubuntu dot com>
- To: Florian Weimer <fweimer at redhat dot com>
- Cc: Gian-Carlo Pascutto <gpascutto at mozilla dot com>, libc-alpha at sourceware dot org, Emilio Cobos Álvarez <ealvarez at mozilla dot com>, Jed Davis <jld at mozilla dot com>
- Date: Thu, 5 Dec 2019 17:34:35 +0100
- Subject: Re: Fallout from dlopen() blocking SIGSYS
- References: <be38a4dd-f573-6251-57e5-6c118255ce59@mozilla.com> <878snqhia3.fsf@oldenburg2.str.redhat.com>
On Thu, Dec 05, 2019 at 05:03:00PM +0100, Florian Weimer wrote:
> * Gian-Carlo Pascutto:
>
> > Block signals during the initial part of dlopen
> > (a2e8aa0d9ea648068d8be52dd7b15f1b6a008e23)
> >
> > is going to break every Firefox release of the last few years. We use a
> > seccomp-bpf filter to sandbox various processes. In some of these
> > processes we don't want to do a dlopen() of untrusted code while we're
> > not sandboxed yet, for example in the process we use to isolate Google's
> > Widevine DRM modules from any private data on the system.
> >
> > seccomp-bpf will intercept various filesystem related syscalls and raise
> > SIGSYS, at which point our code will contact a broker in the parent
> > process that checks whether the file being requested is acceptable
> > to us, and then passes down the file handle.
>
> I have re-reviewed the referenced patch and posted:
>
> <https://sourceware.org/ml/libc-alpha/2019-12/msg00175.html>
> <https://sourceware.org/ml/libc-alpha/2019-12/msg00176.html>
> <https://sourceware.org/ml/libc-alpha/2019-12/msg00177.html>
>
> Lazy binding is buggy and has races, but with the new patches, the
> NODELETE changes should not make matters worse.
>
> But I think we do need something better for seccomp sandboxing in the
> medium term, so I'm happy to have a larger conversation now.
>
> Is there actually a signal handler for SIGSYS in the monitored process?
> Based on some discussion I've seen, I think the kernel pushes a signal
> context on the thread stack (otherwise there wouldn't be a signal mask
> to patch), handler or not. This alone has compatibility implications.
>
> There are cases where we absolutely have to block all signals for
> correctness purposes. Some reasons are:
>
> (a) Implementing async-signal-safe functions on top of something that is
> not async-signal-safe.
>
> (b) Avoid running user code with the wrong TCB or an uninitialized TCB.
>
> (c) Prevent the kernel from pushing the signal context onto a stack that
> is too small.
>
> (d) Avoid running user code on a stack that is too small.
>
> (e) Enable reuse of the stack pointer register for something else.
>
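For readers following along, the "block all signals" pattern behind cases (a) through (e) above can be sketched roughly as below. This is only an illustration, not glibc's internal code: run_with_signals_blocked and record_sigsys_blocked are hypothetical names, and sigprocmask is used on the assumption of a single-threaded caller (a threaded program would want pthread_sigmask). The second helper shows why this matters for the sandbox case: SIGSYS is part of the blocked set while the critical section runs.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>

/* Hypothetical helper: run fn(arg) with every blockable signal masked,
 * then restore the mask that was in effect before the call.
 * Returns 0 on success, -1 if the mask could not be changed. */
static int run_with_signals_blocked(void (*fn)(void *), void *arg)
{
    sigset_t all, saved;

    if (sigfillset(&all) != 0)
        return -1;
    /* SIGKILL and SIGSTOP cannot be blocked; the kernel ignores them here. */
    if (sigprocmask(SIG_BLOCK, &all, &saved) != 0)
        return -1;

    fn(arg);  /* critical section: no handler can run on this thread */

    return sigprocmask(SIG_SETMASK, &saved, NULL) == 0 ? 0 : -1;
}

/* Hypothetical critical section: records whether SIGSYS is currently
 * blocked (1), not blocked (0), or could not be queried (-1). */
static void record_sigsys_blocked(void *arg)
{
    int *out = arg;
    sigset_t cur;

    *out = -1;
    /* A NULL "set" argument only queries the current mask. */
    if (sigprocmask(SIG_BLOCK, NULL, &cur) == 0)
        *out = sigismember(&cur, SIGSYS);
}
```

Calling run_with_signals_blocked(record_sigsys_blocked, &flag) should leave flag set to 1, which is the situation a seccomp filter returning a SIGSYS trap would run into inside such a section.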
> Particularly for (a), I expect to see more cases in the future. I don't
> know which system calls we would run in such critical sections. The
> usage in dlopen falls into that category, but it's a very incomplete fix
> and not very useful overall.
>
> Unfortunately, (b) is generally necessary around clone system calls.
> It's essential for correct use of vfork-like clone in posix_spawn. We
> don't do it for pthread_create today, but this results in a bug we want
> to fix (see bug 25098).
>
> (c) is relevant to the current use of clone in vfork because we have a
> small stack there. I think this impacts seccomp monitoring even if
> there is no actual signal handler because of the signal context data
> written by the kernel. I want to add a clone_samestack system call
> wrapper that avoids this issue, but I haven't done that yet.
>
> (d) is a more problematic variant of (c). That's a secondary issue with
> vfork as well, in addition to (b). I don't think (d) is something we
> do a lot in glibc, but applications may do it. Perhaps they use
> sigaltstack instead.
>
> glibc currently does not do (e) as far as I know, but there are some
> applications which use %esp on i386 as a general-purpose register. I
> doubt this use case is relevant to Firefox anyway.
>
> Please do not underestimate the stack usage for the signal context. If
> I recall correctly, on current x86, it is more than 5 KiB, and on POWER,
> it's more than 10 KiB. Stack usage grows with newer kernel releases
> which bring support for larger register files. Some of this overhead
> comes from the red zone (the stack region below the stack pointer that
> signal handlers cannot touch), and that's part of the ABI definition.
> But the variable-sized part of the signal context is not exported from
> the kernel, so it's hard for applications to size stacks appropriately.
> (That's why I'm interested in clone_samestack for glibc's internal use.)
>
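As a rough illustration of the sigaltstack approach mentioned under (d), and of the stack sizing concern above, an application could run its SIGSYS handler on a separately allocated stack. This is a minimal sketch under stated assumptions: the 64 KiB size is an arbitrary, deliberately generous guess (the required size is exactly what is hard to determine), and install_alt_stack_handler, on_sigsys and the two globals are hypothetical names. It is not how Firefox or glibc do it.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Assumption: 64 KiB is comfortably larger than the signal context. */
#define ALT_STACK_SIZE (64 * 1024)

static char *alt_stack_base;
static volatile sig_atomic_t handler_ran_on_alt_stack = 0;

/* Handler: checks whether one of its locals lives inside the alternate stack. */
static void on_sigsys(int signo)
{
    char probe;
    uintptr_t p = (uintptr_t)&probe;
    uintptr_t lo = (uintptr_t)alt_stack_base;

    (void)signo;
    handler_ran_on_alt_stack = (p >= lo && p < lo + ALT_STACK_SIZE);
}

/* Registers an alternate signal stack for the calling thread and a SIGSYS
 * handler that runs on it. Returns 0 on success, -1 on failure. */
static int install_alt_stack_handler(void)
{
    stack_t ss;
    struct sigaction sa;

    alt_stack_base = malloc(ALT_STACK_SIZE);
    if (alt_stack_base == NULL)
        return -1;

    memset(&ss, 0, sizeof ss);
    ss.ss_sp = alt_stack_base;
    ss.ss_size = ALT_STACK_SIZE;
    ss.ss_flags = 0;
    if (sigaltstack(&ss, NULL) != 0)
        return -1;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigsys;
    sa.sa_flags = SA_ONSTACK;  /* deliver on the alternate stack */
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGSYS, &sa, NULL);
}
```

After install_alt_stack_handler() succeeds, raise(SIGSYS) can stand in for a seccomp trap when trying this out; the kernel then pushes the signal context onto the alternate stack rather than onto whatever small stack the thread happens to be running on.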
> For (a), we really need a list of system calls which are safe to perform
> in such critical sections. Can we call your interposed malloc, or will
> that try to open files in /proc in some cases?
>
> When we fix bug 25098 and adopt clone3, you might have a bit of a problem
> because of the in-memory flags argument for clone3, and you can't
Fwiw, I have this on my agenda, i.e. making it possible for seccomp to
filter a certain __subset__ of system calls with pointer arguments. I've
started a discussion in August right before Kernel Summit:
https://lists.linuxfoundation.org/pipermail/ksummit-discuss/2019-July/006699.html
and Kees Cook and I gave a session at Kernel Summit in Lisbon:
https://www.youtube.com/watch?v=PnOSPsRzVYM
It's planned; I just need to find time to work on this :/
Christian