This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: Fallout from dlopen() blocking SIGSYS
On Thu, Dec 05, 2019 at 10:47:25AM +0100, Christian Brauner wrote:
> On Wed, Dec 04, 2019 at 05:13:19PM -0500, Rich Felker wrote:
> > On Wed, Dec 04, 2019 at 05:46:54PM -0300, Adhemerval Zanella wrote:
> > >
> > >
> > > On 03/12/2019 11:31, Gian-Carlo Pascutto wrote:
> > > > (reposting here per request from Florian Weimer)
> > > >
> > > > This glibc patch:
> > > >
> > > > Block signals during the initial part of dlopen
> > > > (a2e8aa0d9ea648068d8be52dd7b15f1b6a008e23)
> > > >
> > > > is going to break every Firefox release of the last few years. We use a
> > > > seccomp-bpf filter to sandbox various processes. In some of these
> > > > processes we don't want to do a dlopen() of untrusted code while we're
> > > > not sandboxed yet, for example in the process we use to isolate Google's
> > > > Widevine DRM modules from any private data on the system.
> > > >
> > > > seccomp-bpf will intercept various filesystem-related syscalls and raise
> > > > SIGSYS, at which point our code will contact a broker in the parent
> > > > process that checks whether the file we want to read is acceptable
> > > > to us, and then passes down the file handle.
>
> Hey everyone,
>
> I saw this fly by the libc-alpha mailing list late at night yesterday. I
> think that, with work we've recently done in the upstream kernel, the
> SECCOMP_RET_TRAP approach of trapping and signaling is not needed anymore,
> at least on newer kernels. That won't help you with legacy kernels, but
> it's still worth considering switching over.
>
> Let me say, I'm sorry if this gets a bit long, and sorry if I've
> misunderstood your use-case.
>
> Afaict, you use SECCOMP_RET_TRAP to intercept syscalls and emulate them
> in userspace, i.e. the actual syscall is never performed. An
> alternative to SECCOMP_RET_TRAP is to use the SECCOMP_RET_USER_NOTIF
> feature that we released with Linux v5.0, which does not rely on signals
> or ptrace.
>
> Here's the gist:
> SECCOMP_RET_USER_NOTIF enables a process (supervisee) to retrieve an fd
> for its seccomp filter. This fd can then be handed to another (usually
> more privileged) process (supervisor). The supervisor will then be able
> to receive seccomp messages on the fd about the syscalls issued by the
> supervisee.
>
> We have integrated this feature into userspace and currently make heavy
> use of this to intercept mknod(), mount(), and other syscalls in user
> namespaces aka in containers.
> For example, if the mknod() syscall matches a device in a pre-determined
> whitelist the privileged supervisor will perform the mknod syscall in
> lieu of the unprivileged supervisee and report back to the supervisee on
> the success or failure of its attempt. If the syscall does not match a
> device in a whitelist we simply report an error.
>
> Here's a quick asciinema demo
>
> https://asciinema.org/a/285491
> Be a little patient during the mount() interception I'm demonstrating :).
>
> Here are more technical details:
> A task can register a seccomp filter for a syscall with the
> SECCOMP_RET_USER_NOTIF action set in the filter. When it loads the filter
> the caller can specify SECCOMP_FILTER_FLAG_NEW_LISTENER. This flag will
> instruct seccomp to return a so-called "notifier fd", which is an
> anonymous-inode-based file descriptor that is close-on-exec by default. A
> dummy code example from our kernel testing is:
>
> struct sock_filter filter[] = {
>         BPF_STMT(BPF_LD+BPF_W+BPF_ABS, offsetof(struct seccomp_data, nr)),
>
>         /* Let's listen for mknod() syscalls. */
>         BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_mknod, 0, 1),
>
>         /* If this is a mknod(), notify the listener. */
>         BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_USER_NOTIF),
>
>         /* If this wasn't a mknod() syscall just let it through. */
>         BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW),
> };
>
> struct sock_fprog prog = {
>         .len = (unsigned short)ARRAY_SIZE(filter),
>         .filter = filter,
> };
>
> int notify_fd = seccomp(SECCOMP_SET_MODE_FILTER, SECCOMP_FILTER_FLAG_NEW_LISTENER, &prog);
>
> The task now has a notify_fd for its own seccomp filter. This notify_fd
> can now be handed off to your broker process.
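A slightly fuller sketch of the filter setup described above, for trying it outside the kernel selftests. Several things here are assumptions rather than part of the original example: the filter additionally checks seccomp_data.arch (killing the process on a mismatch is only one possible policy), the call goes through syscall(2) because glibc does not ship a seccomp() wrapper, and the helper names build_notify_filter() and install_notify_filter() are made up for illustration. The fallback #defines are only there for older UAPI headers.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/* Fallbacks for older UAPI headers (values as in linux/seccomp.h). */
#ifndef SECCOMP_RET_KILL_PROCESS
#define SECCOMP_RET_KILL_PROCESS 0x80000000U
#endif
#ifndef SECCOMP_RET_USER_NOTIF
#define SECCOMP_RET_USER_NOTIF 0x7fc00000U
#endif
#ifndef SECCOMP_FILTER_FLAG_NEW_LISTENER
#define SECCOMP_FILTER_FLAG_NEW_LISTENER (1UL << 3)
#endif

#define NOTIFY_FILTER_LEN 7

/* Hypothetical helper: fill out[] with a filter that notifies on one syscall. */
size_t build_notify_filter(struct sock_filter out[NOTIFY_FILTER_LEN],
                           uint32_t audit_arch, uint32_t syscall_nr)
{
        const struct sock_filter insns[NOTIFY_FILTER_LEN] = {
                /* Syscall numbers are per-architecture, so check the arch first. */
                BPF_STMT(BPF_LD + BPF_W + BPF_ABS, offsetof(struct seccomp_data, arch)),
                BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, audit_arch, 1, 0),
                BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_KILL_PROCESS),

                /* Load the syscall number and compare it. */
                BPF_STMT(BPF_LD + BPF_W + BPF_ABS, offsetof(struct seccomp_data, nr)),
                BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, syscall_nr, 0, 1),

                /* Match: notify the listener. No match: let it through. */
                BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_USER_NOTIF),
                BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_ALLOW),
        };

        memcpy(out, insns, sizeof(insns));
        return NOTIFY_FILTER_LEN;
}

/* Hypothetical helper: install the filter; returns the notify fd or -1. */
int install_notify_filter(uint32_t audit_arch, uint32_t syscall_nr)
{
        struct sock_filter filter[NOTIFY_FILTER_LEN];
        struct sock_fprog prog = {
                .len = (unsigned short)build_notify_filter(filter, audit_arch, syscall_nr),
                .filter = filter,
        };

        /* Unprivileged callers must set no_new_privs before loading a filter. */
        if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) < 0)
                return -1;

        return (int)syscall(__NR_seccomp, SECCOMP_SET_MODE_FILTER,
                            SECCOMP_FILTER_FLAG_NEW_LISTENER, &prog);
}
```

On x86-64 a caller would pass AUDIT_ARCH_X86_64 from <linux/audit.h> and __NR_mknod. The syscall number is a parameter here because not every architecture has a plain mknod() syscall.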
>
> The broker can now add that notify_fd into an epoll() loop. When a
> mknod() syscall (per the above example) is issued, the kernel will signal
> EPOLLIN on the notify_fd. The broker is thereby notified that a syscall
> has been attempted, and the task that issued it stays blocked until the
> broker responds.
> The broker can then use the following ioctl:
>
> struct seccomp_notif req = {};
> ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_RECV, &req);
>
> to receive information about the attempted syscall. The information
> includes the pid of the calling task, a unique cookie identifying the
> request, and the seccomp data.
>
> struct seccomp_notif {
>         __u64 id;
>         __u32 pid;
>         __u32 flags;
>         struct seccomp_data data;
> };
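The receive step described above could look roughly like this on the broker side. This is a sketch under assumptions: it uses poll(2) on a single fd instead of a full epoll loop to stay short, and wait_for_notification() is a hypothetical name. The buffer is zeroed before the ioctl because the kernel may reject a request structure that is not.

```c
#include <errno.h>
#include <poll.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/seccomp.h>

/*
 * Hypothetical helper: wait up to timeout_ms for a notification.
 * Returns 1 with *req filled in, 0 on timeout, -1 on error.
 */
int wait_for_notification(int notify_fd, int timeout_ms, struct seccomp_notif *req)
{
        struct pollfd pfd = { .fd = notify_fd, .events = POLLIN };
        int ready;

        /* Retry if a signal interrupts the wait. */
        do {
                ready = poll(&pfd, 1, timeout_ms);
        } while (ready < 0 && errno == EINTR);

        if (ready <= 0)
                return ready;
        if (!(pfd.revents & POLLIN))
                return -1;      /* e.g. POLLHUP, POLLERR or POLLNVAL */

        /* The request buffer should be zeroed before asking for a notification. */
        memset(req, 0, sizeof(*req));
        if (ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_RECV, req) < 0)
                return -1;

        return 1;
}
```

After a return value of 1, req->data.nr holds the syscall number and req->data.args[] the raw arguments, while req->id is the cookie needed for the later ioctls.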
>
> Non-pointer-based arguments can be directly inspected by the broker
> via seccomp_data without risking a TOCTOU. Pointer-based arguments can
> be safely inspected if the broker reads _all_ the data it is interested
> in via /proc/<pid>/mem before it makes a decision whether or not to
> emulate the syscall, and checks afterwards that the request is still
> valid to guard against pid recycling:
>
> char path[64];
> char buf[PATH_MAX];
>
> /* Open the memory of the task that issued the syscall. */
> snprintf(path, sizeof(path), "/proc/%u/mem", req.pid);
> int fd = open(path, O_RDONLY | O_CLOEXEC);
>
> /* Read the path argument from the mknod() syscall. */
> pread(fd, buf, sizeof(buf), req.data.args[0]);
>
> /*
>  * Verify that the task has not exited in the meantime and been recycled
>  * and we've read the wrong memory.
>  */
> ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_ID_VALID, &req.id);
>
> If that succeeds, the broker knows that the task is still alive and
> its pid hasn't been recycled.
> Now the broker can inspect the path and the device number in the mknod()
> syscall and decide whether or not to emulate the syscall.
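A minimal sketch of the read-then-validate sequence described above, with error handling added. The helper names read_remote_string() and notif_id_valid() are hypothetical. The sketch assumes a 64-bit broker (so the address fits in off_t) and that the argument is a NUL-terminated path.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <linux/seccomp.h>

/*
 * Hypothetical helper: copy a NUL-terminated string from address addr in
 * process pid into buf. Returns the string length, or -1 on failure or if
 * no terminator was found within len bytes.
 */
ssize_t read_remote_string(pid_t pid, uint64_t addr, char *buf, size_t len)
{
        char mem_path[64];
        ssize_t n;
        int fd;

        snprintf(mem_path, sizeof(mem_path), "/proc/%d/mem", (int)pid);
        fd = open(mem_path, O_RDONLY | O_CLOEXEC);
        if (fd < 0)
                return -1;

        /* A short read is fine: the string may end close to an unmapped page. */
        n = pread(fd, buf, len, (off_t)addr);
        close(fd);

        if (n <= 0 || memchr(buf, '\0', (size_t)n) == NULL)
                return -1;

        return (ssize_t)strlen(buf);
}

/* Returns 1 if the notification id still refers to a blocked syscall. */
int notif_id_valid(int notify_fd, uint64_t id)
{
        return ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_ID_VALID, &id) == 0;
}
```

The order matters: read everything first, then call notif_id_valid(), and only trust the buffer if the id was still valid after the read.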
>
> After the broker has emulated the syscall it can instruct the kernel to
> report back an errno value to the task it listens to via:
>
> struct seccomp_notif_resp {
>         __u64 id;
>         __s64 val;
>         __s32 error;
>         __u32 flags;
> };
>
> struct seccomp_notif_resp resp = {
>         .id = req.id,
>         /*
>          * Let's use a really dumb errno value in this example. The kernel
>          * expects a negated errno here.
>          */
>         .error = -ENOANO,
> };
>
> ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_SEND, &resp);
>
> At this point the calling task will see a very strange error value,
> ENOANO, for its mknod() syscall.
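The response step above can be wrapped in small helpers. This is a sketch with hypothetical names (fill_error_resp(), fill_success_resp(), send_resp()). It relies on the convention documented in seccomp_unotify(2) that the error field carries a negated errno and that val is the return value used when error is 0.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/seccomp.h>

/* Hypothetical helper: make the blocked syscall fail with errno == err. */
void fill_error_resp(struct seccomp_notif_resp *resp, uint64_t id, int err)
{
        memset(resp, 0, sizeof(*resp));
        resp->id = id;
        resp->error = -err;     /* negated errno, e.g. -EPERM */
}

/* Hypothetical helper: make the blocked syscall return val as if it succeeded. */
void fill_success_resp(struct seccomp_notif_resp *resp, uint64_t id, int64_t val)
{
        memset(resp, 0, sizeof(*resp));
        resp->id = id;
        resp->val = val;
}

/* Send the response; returns 0 on success, -1 with errno set on failure. */
int send_resp(int notify_fd, struct seccomp_notif_resp *resp)
{
        return ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_SEND, resp);
}
```

A failing send_resp() is worth checking for: the target may have been killed while the broker was working, in which case there is nobody left to receive the answer.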
>
> In kernel 5.5 I've extended this feature to also allow continuing
> syscalls by setting the SECCOMP_USER_NOTIF_FLAG_CONTINUE flag when
> sending a response, i.e.
>
> struct seccomp_notif_resp resp = {
>         .id = req.id,
>         .flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE,
> };
>
> ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_SEND, &resp);
>
> will cause the syscall to be performed. This is needed for syscalls that
> can't be emulated in userspace. In contrast to mknod(), a lot of the
> other syscalls we intercept (e.g. setxattr()) cannot be easily filtered
> because they have pointer arguments. Additionally, some of
> them might actually succeed in user namespaces (e.g. setxattr() for all
> "user.*" xattrs). Before seccomp could be told to continue a syscall
> from a user notifier, we were stuck with performing all of the syscalls in
> lieu of the container. This is a huge security liability since it is
> extremely difficult to correctly assume all of the necessary privileges
> of the calling task such that the syscall can be successfully emulated
> without escaping other additional security restrictions (think missing
> CAP_MKNOD for mknod(), or MS_NODEV on a filesystem etc.). This can be
> solved by telling seccomp to resume the syscall.
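To make the setxattr() case above concrete, here is a small sketch of how a supervisor might decide between continuing and emulating. Both helper names are hypothetical, and the "user." prefix check is only an illustration of the example in the text, not a complete policy. As far as I know the kernel rejects a CONTINUE response whose error or val field is non-zero, so the response is zeroed first.

```c
#include <stdint.h>
#include <string.h>
#include <linux/seccomp.h>

/* Fallback for UAPI headers older than Linux 5.5. */
#ifndef SECCOMP_USER_NOTIF_FLAG_CONTINUE
#define SECCOMP_USER_NOTIF_FLAG_CONTINUE (1UL << 0)
#endif

/*
 * Hypothetical policy check: xattrs in the "user." namespace can be left to
 * the kernel, which still applies its own permission checks on continue.
 */
int xattr_may_continue(const char *name)
{
        return strncmp(name, "user.", 5) == 0;
}

/* Hypothetical helper: tell the kernel to run the intercepted syscall itself. */
void fill_continue_resp(struct seccomp_notif_resp *resp, uint64_t id)
{
        /* error and val are left at 0 when CONTINUE is set. */
        memset(resp, 0, sizeof(*resp));
        resp->id = id;
        resp->flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE;
}
```

The name passed to xattr_may_continue() is a copy read from the supervisee's memory, so the supervisee can still change the original before the kernel reads it. That is acceptable only because the kernel enforces its own checks on the continued syscall.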
>
> The continue feature obviously requires _massive amounts of caution_.
> Here's my comment from the kernel header:
>
> "Note, the SECCOMP_USER_NOTIF_FLAG_CONTINUE flag must be used with caution!
> If set by the process supervising the syscalls of another process the
> syscall will continue. This is problematic because of an inherent TOCTOU.
> An attacker can exploit the time while the supervised process is waiting on
> a response from the supervising process to rewrite syscall arguments which
> are passed as pointers of the intercepted syscall.
> It should be absolutely clear that this means that the seccomp notifier
> _cannot_ be used to implement a security policy! It should only ever be used
> in scenarios where a more privileged process supervises the syscalls of a
> lesser privileged process to get around kernel-enforced security
> restrictions when the privileged process deems this safe. In other words,
> in order to continue a syscall the supervising process should be sure that
> another security mechanism or the kernel itself will sufficiently block
> syscalls if arguments are rewritten to something unsafe.
>
> Similar precautions should be applied when stacking SECCOMP_RET_USER_NOTIF
> or SECCOMP_RET_TRACE. For SECCOMP_RET_USER_NOTIF filters acting on the
> same syscall, the most recently added filter takes precedence. This means
> that the new SECCOMP_RET_USER_NOTIF filter can override any
> SECCOMP_IOCTL_NOTIF_SEND from earlier filters, essentially allowing all
> such filtered syscalls to be executed by sending the response
> SECCOMP_USER_NOTIF_FLAG_CONTINUE. Note that SECCOMP_RET_TRACE can equally
> be overriden by SECCOMP_USER_NOTIF_FLAG_CONTINUE."
>
> In any case, maybe this is a viable alternative for you on new kernels
> that allows you to avoid signals and such.
>
> I'm happy to help review this or advise, since we've had quite some
> experience implementing and making use of this.
> I've spoken on this feature a few times before and will be talking about
> it during FOSDEM too.
This is great news, and sounds pretty much exactly like what I
expected a replacement should look like. Thanks for the detailed
reply!
Rich