This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: TLS redux [2.19]
- From: Roland McGrath <roland at hack dot frob dot com>
- To: "Carlos O'Donell" <carlos at redhat dot com>
- Cc: "GNU C. Library" <libc-alpha at sourceware dot org>, Konstantin Serebryany <kcc at google dot com>
- Date: Thu, 30 Jan 2014 16:23:01 -0800 (PST)
- Subject: Re: TLS redux [2.19]
- References: <20140115022335 dot EB13174430 at topped-with-meat dot com> <52E13BBE dot 8090709 at redhat dot com>
I'm just going to discuss the immediate issue for 2.19 now. I think
there is consensus on the overall direction for 2.20 and we can
discuss the details more after 2.19 has sailed.
In short, I still disagree with the conclusions people have come to
here. I consider the nature of the rationales applied to be a deep
cognitive failure to be appropriately conservative.
Rich Felker's characterization of either my position or the historical
glibc position as a "DOS/Windows approach" ("bug compatibility") is
fundamentally inaccurate. I won't go into more discussion about an
abstract topic now, since it's a distraction.
The fact that LeakSanitizer was shown to be broken by the changes is
*not* what makes my hesitance correct! It's a specific, concrete
example of why I am correct in general. It is fundamentally
wrong-headed to conclude from this that if we find a way to change
LeakSanitizer so it works then the problem is solved.
1. People using old versions of LeakSanitizer with glibc-2.19 should
not be broken. Breaking them is an ABI regression.
2. There are probably other things that are broken too. Conservatism
means presuming that there might be and accepting that they matter
even if we are not aware of them now and even if we are not aware
of them a year from now.
If you find yourself saying, "Oh, we found the one application that
our ABI-breaking change actually broke, and we changed that
application, so it doesn't count as an ABI-breaking change any
more," then you are Just Plain Wrong.
3. In response to Joseph's question, yes, you can replace malloc and
have the dynamic linker call your malloc. It uses normal PLT calls
for malloc, calloc, realloc, free, and __libc_memalign. So it will
use an application-supplied allocator just the same way libc does.
It's true that the early allocations done at startup time use an
allocator private to the dynamic linker (whose allocations can
never be freed). These are disjoint from allocations made after
startup. The dynamic linker should never attempt to free or
realloc these allocations; if it does so, that's a bug, but there
are no known or reported bugs of this nature (at least in recent
years).
4. In response to Paul's point, yes, replacing malloc and getting
everything right is hard. That's really neither here nor there.
Existing things are doing it right already, and the rules being
arcane but staying the same for years is a very different thing
from the rules shifting under your feet.
5. In response to one of Rich's several trolling mischaracterizations,
there is no example of an "undocumented internal interface any
application developer might ever have discovered and (ab)used"
here. There is no internal interface involved at all. There is
indeed an undocumented subtlety, but many things that are in actual
fact stably well-specified are unfortunately not formally
documented. Certainly we should reduce subtlety and increase
documentation in the future, but that does not relieve us of our
obligations to maintain ABI stability today.
6. The supposed urgency of this issue comes entirely from Google for
Google's uses on production servers. Google does not produce any
glibc binaries distributed outside the company(*). Google does not
distribute any glibc-using binaries that are believed to be
affected by this issue. Google already uses a bespoke modified
glibc on production servers, so having changes upstream in a
particular release is not actually a practical constraint on what
Google can roll out on its servers.
I thus conclude that there is in fact no urgency whatsoever for
this issue. We have rough consensus on a new approach for 2.20
that will address the immediate issue without introducing any
compatibility risks. IMHO that is sufficient for the medium and
long terms, and there is no need for anything at all in the short
term. The status quo ante (2.18) is better than the proposed new
incompatibility for 2.19. The yet-newer scheme proposed for 2.20
(with various details to be ironed out) is better than either, so
getting that done in 2.20 should be enough.
The mere fact that we are still discussing fundamental questions
well into the release freeze period means that these changes are
not sufficiently baked. Since there is in fact no true urgency of
any kind, we should not delay this release further. We should
simply make a release that is safely backward compatible, and
address the whole set of TLS issues for the next release.
(*) Except for ChromeOS, where nothing is believed to be
affected by the issue; and for Native Client, whose binaries I
maintain and can speak for authoritatively. Native Client does
not support signals, so the issue is moot there.
7. On further reflection I am not so convinced that any "middle road"
is actually worth pursuing, although I won't object to one that
meets the criteria for careful backward compatibility.
If any change to the status quo ante (2.18) is warranted, it must
be an "opt in" change. That is, existing binaries and existing
programs recompiled unchanged will get the existing behavior.
Programs need to do something explicit at compile time, link time,
or run time to opt in to using the signal-safe allocator.
Since we don't think the signal-safe allocation approach is what we
really want in the long run, it's hard to imagine any new opt-in
method we'd want to add to the public ABI. Of course we could add
something that becomes a no-op later, but it doesn't seem
worthwhile to add that bloat.
Google's production servers are using Google-private binaries built
against a Google-private modified glibc. So for them it would be
adequate to have some unofficial ABI, such as GLIBC_PRIVATE symbols
or the like. But given that Google is modifying glibc anyway, I
don't see any actual rationale for putting anything like that into
glibc proper. Google can just as easily use a small patch in its
glibc builds, either one that provides the opt-in interface or one
that just changes the behavior. There is nothing wrong with making
sure such a patch is trivial by leaving the code we've added in as
dead code, or at the very least leaving in the changes that make
all the lazy TLS allocations go through a special set of entry
points so they're easy to catch.
Finally, another option is to allow opt-in when building libc.
That is, a configure switch to enable using the new
allocator--which must default to off, preserving compatible
semantics. I think this is a somewhat bad idea, but it is
straightforward and very little work to implement. Any distro that
uses --enable-breaking-abi-compatibility-for-arcane-new-tls-feature
is doing a disservice to its users. But that's their decision to
make, and if they're going to make it, there's no reason we should
force them to use a trivial patch instead of a trivial
configuration change. Given how committed everyone else here has
been to being wrong about the subject, this seems like the path
most likely to achieve consensus. I don't think anybody's case
against just leaving 2.19 unconditionally behaving as the feature
has always behaved will be even slightly convincing, but the
configure-switch cop-out seems the most likely to be acceptable to
all maintainers holding strong opposing views.
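
As an illustration of point 3 (not taken from glibc's sources), here is a minimal sketch of the kind of startup-only allocator described there: memory is handed out from a private pool and can never be freed, so it is disjoint from anything obtained later through the normal, replaceable malloc. All names (early_alloc, early_owns, late_free, the pool size and alignment) are hypothetical and chosen only for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical names and sizes throughout; this is not glibc's code. */
#define EARLY_POOL_SIZE 4096
#define EARLY_ALIGN 16

/* Private pool used only during "startup"; nothing in it is ever freed. */
static _Alignas(EARLY_ALIGN) unsigned char early_pool[EARLY_POOL_SIZE];
static size_t early_used;

/* Hand out the next aligned chunk, or NULL when the pool is exhausted. */
static void *early_alloc(size_t size)
{
    size_t start = (early_used + EARLY_ALIGN - 1) & ~(size_t)(EARLY_ALIGN - 1);
    if (start > EARLY_POOL_SIZE || size > EARLY_POOL_SIZE - start)
        return NULL;
    early_used = start + size;
    return early_pool + start;
}

/* Nonzero if p points into the early pool. */
static int early_owns(const void *p)
{
    uintptr_t a = (uintptr_t)p;
    uintptr_t base = (uintptr_t)early_pool;
    return a >= base && a < base + EARLY_POOL_SIZE;
}

/* Post-startup free: goes to whatever free() the program provides.
   Passing an early allocation here would be a bug, so assert on it. */
static void late_free(void *p)
{
    assert(p == NULL || !early_owns(p));
    free(p);
}
```

In this sketch, pointers returned by early_alloc stay valid for the life of the process, and late_free only ever sees pointers that came from the ordinary malloc family, which is the separation point 3 relies on.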
Thanks,
Roland