This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [MTASCsft PATCH WIP5 01/33] Multi Thread, Async Signal and Async Cancel safety documentation: intro
- From: Torvald Riegel <triegel at redhat dot com>
- To: Alexandre Oliva <aoliva at redhat dot com>
- Cc: libc-alpha at sourceware dot org, carlos at redhat dot com, mtk dot manpages at gmail dot com
- Date: Mon, 25 Nov 2013 20:54:48 +0100
- Subject: Re: [MTASCsft PATCH WIP5 01/33] Multi Thread, Async Signal and Async Cancel safety documentation: intro
- Authentication-results: sourceware.org; auth=none
- References: <20131113081059 dot 3464 dot 51385 dot stgit at frit dot home> <20131113081132 dot 3464 dot 30409 dot stgit at frit dot home> <1384859432 dot 32326 dot 364 dot camel at triegel dot csb> <orsiurva0g dot fsf at livre dot home> <1384956325 dot 3152 dot 591 dot camel at triegel dot csb> <orhab6t6m8 dot fsf at livre dot home> <1385051174 dot 3152 dot 1637 dot camel at triegel dot csb> <orvbzksz47 dot fsf at livre dot home>
On Fri, 2013-11-22 at 04:56 -0200, Alexandre Oliva wrote:
> On Nov 21, 2013, Torvald Riegel <triegel@redhat.com> wrote:
>
> > On Thu, 2013-11-21 at 08:02 -0200, Alexandre Oliva wrote:
> >> On Nov 20, 2013, Torvald Riegel <triegel@redhat.com> wrote:
> >>
> >> > What about linking statically and using LTO?
>
> >> That's something I'd considered at some point, before concluding that a
> >> lot of code in glibc just wasn't suited for this sort of use.
>
> > But that would be a limitation of the current glibc implementation, and
> > not an inherent property of any glibc implementation, right?
>
> Yup. My assignment was to audit our implementation and document
> thread-safety properties in it.
That's why I asked whether you had also documented these cases, because
they are potentially or likely unsafe.
> > IOW, it would be preferable to give such cases an "incorrectsync" tag
> > or similar
>
> For this situation, I'd use xguargs, although I wouldn't mark feof as
> MT-Unsafe, because, well, it really isn't; it's combining it with other
> calls in MT-Unsafe ways that would be, and one way to perform such
> combination is to perform this sort of unexpected inlining that LTO on a
> static glibc might end up performing.
Can you elaborate?
> >> A number of very small functions (feof comes to mind) are safe in
> >> spite of avoiding synchronization primitives because we assume they
> >> won't be inlined. What makes them safe currently is their single
> >> access to a single word;
>
> > I wouldn't say that they are safe. First, what non-inlining gives us is
> > that we get a (hopefully) high likelihood that the code that the
> > compiler will generate is similar to the code it would generate for an
> > atomic access with relaxed memory order. Having this high likelihood
> > might be fine for now, but I think it would be better to change this to
> > atomic memory accesses eventually.
>
> I agree. But it doesn't follow from this that they are not safe as they
> are, does it?
Is "high likelihood" sufficient for you? We don't have tests for this,
so we don't really know that they are safe, right?
> Now let's see whether the second argument accomplishes
> that...
>
> > Second, accesses with relaxed memory order are often not sufficient to
> > produce sequential consistency. Thus, on weak-memory-model
> > architectures, we can't anymore guarantee equivalence of a
> > multi-threaded execution of MT-Safe functions to some sequential
> > execution of the same functions; we just don't get a total order that
> > programmers can reason more easily about.
>
> Ok, this gets closer. So, my counter-argument, when it comes
> specifically to feof, is that it's ok if we get outdated information
> because, well, without synchronization, you risk getting outdated
> information, and there's no standard-mandated synchronization in feof.
> If you want to ensure you get up-to-date access, you perform additional
> synchronization, just like you would if you wanted to ensure that two
> operations on a stream are performed without intervening, conflicting
> operations on the same stream.
Whether your argument holds or not depends on your definition of
MT-Safety. I very much believe that a large number of people will
expect that MT-Safety means something along the lines of sequential
consistency (even if they won't call it that way). Meaning that you get
executions that are consistent with "some" sequential execution of the
same functions. Relaxed atomic accesses often don't give you this
guarantee.
(If anyone isn't clear on what the difference is, or disagrees that
there is a difference, please speak up and I will elaborate.)
You seem to have based your MT-Safety properties on a weaker set of
guarantees. But this isn't explained anywhere in the docs.
I'm aware that we postponed the discussion of how the standard should
define MT-Safety. Nonetheless, as I said quite a while ago, as soon as
we document something that we want people to understand properly, we
must give a sufficiently detailed definition of what *we* mean when we
say "MT-Safe".
The two short-term solutions that I see are that we either (1) give a
weak definition of MT-Safety or (2) make the MT-Safe functions that
don't use locks or such (and, in particular, just non-atomic accesses;
or employ order-breaking caching) MT-Unsafe because they don't provide
the stronger ordering guarantees.
Also, how should a user expect that feof doesn't have synchronization,
as you call it (what do you mean precisely?)? It's marked MT-Safe.
Is it only feof that uses nonatomic accesses and no other synchronizing
operations yet is marked MT-Safe, or do you remember other functions as
well?
> So, I still stand by my analysis that it is safe.
But you don't give a complete definition of what it means to be safe;
thus, even if you are right, it's not something people can be sure to
understand it in the way you intended it.
> >> if we were to inline it along with other calls that take
> >> such shortcuts, this would no longer make for consistent executions,
> >> because of the very reordering possibilities you bring up.
>
> > Did you make a list of these cases?
>
> 'fraid not :-(
Hmm :( Do you have a rough recollection of which subsystems this
happened in? Or any other estimate of how widespread this problem might
be?
> >> How about this:
> >>
> >> ... Calling them within such contexts invokes undefined behavior.
> >>
> >> ?
>
> > Maybe that's best, assuming that readers understand undefined behavior.
> > Perhaps we could add that it could have similar effects as a data race?
>
> I think this might give the impression the potential damage is far more
> limited than it could be, because data races aren't quite as dangerous
> as destroying the universe ;-)
No, they are as dangerous (ie, they are undefined behavior).
> E.g., feof's access of stream status bits
> without synchronization is technically a data race, but a Mostly
> Harmless (TM) one :-) Calling an MT-Unsafe function, OTOH, could result
> in crashes, corruption, and a lot more that isn't immediately associated
> with data race!
Data races can lead to exactly the same issues due to what the compiler
might do with the code assuming that it's data-race-free...
In any case, maybe this shows that likening it to a data race would not
help people to understand it ;)
> > That is a useful summary; should it be made part of the documentation
> > (in suitably rephrased form)?
>
> I think it already is! :-( Now, I hope we can figure out what it was
> that set you down a different path, in which you still are (see below),
> and fix that...
>
> >> Sorry, I don't follow. Can you expand on what you have in mind?
>
> > The documentation defines MT-Safe without mentioning that there are
> > constraints.
>
> Because there aren't. Well, other than that of not calling MT-Unsafe
> functions.
>
> Anything else that you seem to have misunderstood as constraints on
> general programs that apply even to MT-Safe functions are the result of
> a misunderstanding: as long as you stick to MT-Safe functions (and
> observe MT-Safety requirements on your own code, and don't otherwise
> invoke undefined behavior), you're MT-Safe.
>
> The constraints are for cases in which you want to step *out* of the
> MT-Safety zone, as in, you can to call MT-Unsafe functions in MT
> programs. In order to do that, you may have to take additional care.
But then they are also, effectively, constraints on MT-Safe functions;
if you need to have these requirements to make a generally MT-Unsafe
function MT-Safe under these requirements, then this is equivalent to a
constraint on an MT-Safe function, right?
> > I was speaking about "global" constraints because I incorrectly
> > remembered "envromt" just being attached to those modifying the
> > environment, not also to those environment-reading functions that are
> > MT-Safe under the no-concurrent-env-modification constraint.
>
> I'm pretty sure I'm responsible for this confusion myself; as the
> definitions of the exceptions evolved, significant details changed. For
> example, right now, if you search for envromt in manual/* in the
> lxoliva/thread-safety-docs branch, you (probably) won't find a
> distinction between readers and writes, except for the fact that setters
> are marked with uunguard.
>
> I'm very close to deciding that these cases ought to be marked with
> uunguard:<what>, and that the various cases of <what> (for uunguard and
> any other notes that may have a counterpart) should be defined in a
> separate itemization in the manual, i.e., not under:
>
> When a function is safe to call only under certain constraints, we
> will add keywords to the safety notes whose meanings are defined as
> follows:
>
> nor
>
> Additional safety issues that cannot be worked around by constraining
> the program (other than by refraining from calling the affected
> functions) are also documented with keywords, whose meanings are
> defined as follows:
>
> but under a different section headed:
>
> Additional keywords may be attached to functions, indicating features
> that do not make a function unsafe to call, but that may need to taken
> into account in certain classes programs:
>
> I'd move (and adjust) glocale, envromt, memleak and fdleak under the
> paragraph above. I'd also move the itemization under the âsafe to call
> only underâ after the âadditional safety issuesâ, and rewrite their
> headers as follows:
>
> Functions that are unsafe to call in certain contexts are annotated
> with keywords that document their features that make them unsafe to
> call:
> @item staticbuf ...
> @item asi18n ...
> ...
>
> and
>
> For some features that make functions unsafe to call in certain
> contexts, there are known ways to avoid the safety problem other than
> refraining from calling the function altogether. The keywords that
> follow refer to such features, and each of their definitions indicate
> how the whole program needs to be constrained in order to remove the
> safety problem indicated by the keyword. Only when all the reasons
> that make a function unsafe are observed and addressed by applying the
> documented constraints does the function become safe to call in a
> context.
> @item uunguard ...
> @item xguargs...
> ...
>
> > This might show that I didn't pay enough attention or that I easily
> > forget things; OTOH, it could also indicate that it would be useful to
> > have a better explanation of how your annotations work :)
>
> How's the above?
Sorry, I can't really follow this with just the changes, and not a full
patch.
Perhaps it would be more useful to explain the structure of your safety
model right at the top, instead of explaining it bitwise as one reads
along? For example, say that we got generally Safe and Unsafe, and that
for some functions we consider generally Unsafe to be actually Safe
under some constraints, and that we'll discuss the constraints in detail
further down. Etc.
> > That sounds alright to me. For every function marked this way, is it
> > documented which the "affected objects" are, and obvious which function
> > can be writer?
>
> No, but it became clear to me in this thread that they should be. So,
> I'm going to extend the definition of uunguard to indicate that, when it
> is followed by a :<what>, the <what> may be documented in the new
> itemization I mentioned above, or just that the <what> identifies the
> group of functions that all operate on the same object, so that they all
> need to be guarded by the same mutex. Come to think of it, staticbuf
> could get this sort of annotation as well, and then it could be promoted
> to the group that can be worked around. Or those cases in which it
> could should become uunguard. Or something like that ;-)
That sounds good, based on this description.
> > But see below; maybe it's just that you listed all of them under in one
> > list about MT-Safe constraints...
>
> Yeah, that was a big part of the problem.
>
> But not that it's Safety constraints, not MT-Safe constraints. Not all
> keywords have to do with MT-safety issues; some are AS- and/or AC-safety
> issues. I'm still trying to figure out what drove you down this wrong
> path of narrowing the keywords in that section to MT-Safety; any ideas?
Just considering MT-Safety without an intended loss of generality.
Sorry if that made it look like confusion on my side...
> >> > What I'm saying is that the definition of MT-Safe at the very top should
> >> > link to the listing of constraints.
>
> >> Hmm, is this a result of the disconnect/misunderstanding we've detected?
> >> I don't think linking the definition of MT-Safe to misfeatures that may
> >> cause functions to be MT-Unsafe makes much sense.
>
> > I think it does, because otherwise readers can read just the definition
> > of MT-Safe and aren't aware that it comes with constraints.
>
> I tink we've now clarified that they don't come with constraints, right?
> If so, is there anything left to be done WRT the suggestion upthread,
> quoted here with â>> >â?
I still think you need to give a high-level overview first about how
your model works from a bird's eye view. E.g., if they aren't
constraints, mention that some of the requirements allow for an upgrade
of MT-Unsafe to MT-Safe, or some such.
> > Well, maybe we should just use a more descriptive name then, so that it
> > matches the definition? :)
>
> How do you refer to a piece of code that fails to release a lock it
> should have released? Other than buggy, I mean ;-) lacksunlock? :-)
"lacks-unlock" might be fine, "may-lack-unlock" or
"may-not-release-locks" too (but longer), etc.