This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: Futex error handling
- From: Torvald Riegel <triegel at redhat dot com>
- To: Rich Felker <dalias at libc dot org>
- Cc: GLIBC Devel <libc-alpha at sourceware dot org>, Darren Hart <dvhart at infradead dot org>
- Date: Tue, 16 Sep 2014 20:12:38 +0200
- Subject: Re: Futex error handling
- References: <1410881785 dot 4967 dot 292 dot camel at triegel dot csb> <20140916165607 dot GZ23797 at brightrain dot aerifal dot cx>
On Tue, 2014-09-16 at 12:56 -0400, Rich Felker wrote:
> On Tue, Sep 16, 2014 at 05:36:25PM +0200, Torvald Riegel wrote:
> > We got complaints from the kernel side that glibc wouldn't react properly
> > to futex errors being returned.
>
> Are these complaints public anywhere?
I don't think so. Anyway, this point isn't really relevant to the
discussion -- just my personal motivation, kind of :)
> > Thus, I'm looking at what we'd need to
> > actually improve. I'm using this here as documentation for futex
> > error codes: https://lkml.org/lkml/2014/5/15/356
> >
> > Generally, we have three categories of faults (ie, the cause for an
> > error/failure):
> > * Bug in glibc ("BL")
> > * Bug in the client program ("BP")
> > * Failures that are neither a bug in glibc nor the program ("F")
> >
> > Also, there are cases where it's not a "real" failure, but just
> > something that is expected behavior that needs to be handled ("NF").
> >
> > I'm not aware of a general policy about whether glibc should abort or
> > assert (ie, abort only with assertion checks enabled) when the fault is
> > in the BL or BP categories. I'd say we don't, because there's no way to
> > handle it anyway, and other things will likely go wrong; but I don't
> > have a strong opinion. Thoughts?
> >
> > For every futex op, here's a list of how I'd categorize the possible
> > error codes (I'm ignoring ENOSYS, which is NF when feature testing (or
> > BL)):
> >
> > FUTEX_WAIT:
> > * EFAULT is either BL or BP. Nothing we can do. Should have failed
> > earlier when we accessed the futex variable.
> > * EINVAL (alignment and timeout normalization) is BL/BP.
> > * EWOULDBLOCK, ETIMEDOUT are NF.
>
> I would distinguish multiple versions of "BP" for EINVAL here. You seem
> to have mixed "program has invoked undefined behavior" (e.g. invalid
> synchronization object) with "program has provided an erroneous
> argument which the implementation is required to report" (e.g. invalid
> timespec contents). If you don't want to add a new class, the latter
> technically could just be considered NF; it's fully equivalent to NF
> in terms of how it has to be handled.
Good point on timespec.
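For illustration, the FUTEX_WAIT categorization above, including the
split of EINVAL, could be sketched roughly as follows. This is only a
sketch: the enum, the function name, and the timeout_from_caller flag
are hypothetical and not existing glibc code. The flag assumes the
wrapper knows whether the timespec was passed through unchecked from the
program, in which case EINVAL may simply be an invalid timespec that has
to be reported like NF.

```c
#include <errno.h>

/* Fault categories from this thread; the identifiers are illustrative. */
enum fault_cat
{
  CAT_NF,      /* expected behavior that the caller must handle */
  CAT_BUG,     /* BL or BP: bug in glibc or in the client program */
  CAT_OTHER    /* not classified in this discussion */
};

/* Classify the errno value ERR of a failed FUTEX_WAIT.
   TIMEOUT_FROM_CALLER (hypothetical) is nonzero if the timespec came
   unchecked from the program, so EINVAL may be an invalid timespec that
   has to be reported rather than a sign of undefined behavior.  */
enum fault_cat
classify_futex_wait_error (int err, int timeout_from_caller)
{
  switch (err)
    {
    case EWOULDBLOCK:   /* futex word did not have the expected value */
    case ETIMEDOUT:     /* timeout expired */
      return CAT_NF;
    case EINVAL:        /* alignment or timeout normalization */
      return timeout_from_caller ? CAT_NF : CAT_BUG;
    case EFAULT:        /* should have failed earlier on the access */
      return CAT_BUG;
    default:
      return CAT_OTHER;
    }
}
```

A caller would only consider aborting or asserting for CAT_BUG; CAT_NF
results are translated into the return value the interface requires.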
> > FUTEX_WAKE, FUTEX_WAKE_OP:
> > * EFAULT can be BL/BP *or* NF, so we *must not* abort or assert in this
> > case. This is due to how futexes work when combined with certain rules
> > for destruction of the underlying synchronization data structure; see my
> > description of the mutex destruction issue (but this can happen with
> > other data structures such as semaphores or cond vars too):
> > https://sourceware.org/ml/libc-alpha/2014-04/msg00075.html
>
> Note that it's possible to use FUTEX_WAKE_OP in such a way that EFAULT
> is reserved for BL/BP (and not NF). I don't see any point in
> having/using FUTEX_WAKE_OP except for this purpose, but maybe I'm
> missing something.
I agree that I was a bit sloppy in the categorization. You're right
that depending on how it's used, EFAULT can be just BL/BP. This applies
to both FUTEX_WAKE and FUTEX_WAKE_OP, I think; FUTEX_WAKE_OP has just a
finite number of bits, so you can't avoid an ABA issue entirely. You
can use FUTEX_WAKE_OP like FUTEX_UNLOCK_PI though, to try to avoid the
mutex destruction issue.
So, to summarize, my categories kind of assume a "typical" use of those
operations in glibc. What I was trying to point out is that we can't
abort in the generic futex syscall code when we see EFAULT, because
that's wrong for typical uses of FUTEX_WAKE.
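A minimal sketch of what this means for generic error handling,
assuming the "typical" uses discussed above: the decision whether EFAULT
indicates a bug has to depend on the operation, because after a typical
FUTEX_WAKE the synchronization object may already have been destroyed
and its memory unmapped. The enum and function names below are
hypothetical, not glibc internals, and the sketch is deliberately
conservative about every error code other than EFAULT.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical tag for the operation a generic wrapper just issued. */
enum futex_op_kind
{
  FUTEX_KIND_WAIT,
  FUTEX_KIND_WAKE    /* stands for FUTEX_WAKE and FUTEX_WAKE_OP here */
};

/* Return true only if ERR after OP can be attributed to BL/BP under the
   "typical use" assumption, i.e. aborting would at least not be wrong.  */
bool
futex_error_is_definite_bug (enum futex_op_kind op, int err)
{
  if (err == EFAULT)
    {
      /* After a wait, EFAULT is BL/BP.  After a typical wake, it can be
         NF, so it must not trigger an abort or assertion.  */
      return op == FUTEX_KIND_WAIT;
    }

  /* Other error codes are not decided by this sketch; stay conservative
     and never treat them as a reason to abort.  */
  return false;
}
```

Generic syscall code that wants to assert on errors would consult such a
predicate instead of checking for EFAULT unconditionally.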