TLS redux

Wed Jan 15 02:23:00 GMT 2014

I've finally caught up on the long threads about TLS issues.
(The good news is that this was a sizable fraction of all of my
libc-related backlog, so I'm much less behind than I was before!)

Other people have discussed many of the issues that I would have
raised if I'd participated all along, but not all of them.  I won't
summarize the whole discussion, but just mention the things I think
it's important not to overlook.  I don't really have anything to say
about most of the implementation details.  Only the last point or two
are issues about the changes being considered for 2.19.

* Lazy allocation is an explicit feature of the TLS ABI, not an
  incidental detail.  The wisdom of the feature can be debated, but
  the compatibility requirements are clear.

  It's a regression if this scenario stops working:
  1. Start a thousand threads
  2. dlopen a module containing __thread char buf[100 << 20];
  3. Start another thousand threads
  4. Call into the module on one thread so it uses its buf.
  5. Start a third thousand threads
  Now you should have 3000 threads but not 3000*100M memory use.
  (Here I mean address space reservation, regardless of consumption
  of machine resources, VM overcommit, etc.)

  At least in the case of an existing binary dlopen caller (which
  could actually be either in an executable or in a DSO) and an
  existing binary module loaded by that dlopen, such a regression is
  an ABI break and cannot be tolerated.
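
  To put numbers on the scenario above, here is a minimal sketch of
  the address-space arithmetic.  The names toy_tls_reservation and
  TOY_TLS_SIZE are made up for illustration; nothing here is glibc
  code, and it only multiplies, it does not measure a real process.

```c
#include <stdint.h>

/* Size of the module's TLS block in the scenario: 100 << 20 bytes. */
#define TOY_TLS_SIZE ((uint64_t)100 << 20)

/* Address space reserved for one module's TLS, given how many threads
   actually have a block for it.  With eager allocation that is every
   thread; with lazy allocation it is only the threads that have
   touched the module's TLS. */
uint64_t toy_tls_reservation(uint64_t threads_with_block, uint64_t tls_size)
{
    return threads_with_block * tls_size;
}
```

  With all 3000 threads holding a block the result is 314572800000
  bytes (roughly 293 GiB); with only the one thread from step 4 it is
  104857600 bytes.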

* It's inherently impossible to both allocate lazily and have dynamic
  TLS access that cannot fail.  Either you preallocate the memory
  (eager use of address space, if not necessarily actual storage) or
  attempting to allocate it later might fail.  Hence it must be an
  explicit choice between the two.  That choice might be at the
  granularity of the whole implementation, as in musl, or all the way
  down to the granularity of an individual TLS-containing module or
  individual module-loading call.  Since glibc has a compatibility
  requirement to support lazy allocation, the only possibilities for
  the contrary choice are at smaller granularities.

* Eager allocation could be a new option, and could even be a new
  default.  (What the default should be is a separate debate that does
  not need to begin now.)
** e.g. A new DF_1_* flag and -z option for a DSO to request it.
*** Could be made default for newly-built DSOs.
** New dlopen flag bits to request it.
*** Could be made default for newly-built dlopen callers (i.e. new
    symbol version of dlopen).

* In implementing eager allocation when multiple threads already
  exist, it is theoretically possible to do all or almost all of it
  asynchronously (i.e. all work done inside the dlopen call on the
  thread that called it).  The trickiest, or perhaps impossible, part
  is doing the final step from another thread when the DTV needs to be
  expanded.  But there is not really any good reason to do a lot from
  other threads.  Rich Felker described the most sensible
  implementation strategy: do all the allocation in dlopen, but only
  actually install those new pointers synchronously in each thread,
  inside __tls_get_addr.
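
  As a rough illustration of that split (allocate in dlopen, install
  in the accessing thread), here is a toy model.  It is not glibc or
  musl code: struct toy_module, struct toy_thread, toy_dlopen_eager
  and toy_tls_get_addr are all hypothetical names, there is one DTV
  slot per thread, and locking, DTV resizing, and threads created
  after the load are left out entirely.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define TOY_MAX_THREADS 8

/* Hypothetical model of one dynamically loaded TLS module. */
struct toy_module {
    size_t tls_size;
    void *prealloc[TOY_MAX_THREADS];  /* blocks reserved at load time */
};

/* Hypothetical per-thread state: a single DTV slot for that module. */
struct toy_thread {
    int id;          /* index into prealloc[], assumed < nthreads */
    void *dtv_slot;  /* NULL until the thread first accesses the TLS */
};

/* "dlopen" side: allocate a block for every existing thread, and fail
   the whole load if any allocation fails.  tls_size is assumed to be
   nonzero. */
int toy_dlopen_eager(struct toy_module *m, size_t tls_size, int nthreads)
{
    if (nthreads < 0 || nthreads > TOY_MAX_THREADS)
        return -1;
    memset(m, 0, sizeof *m);
    m->tls_size = tls_size;
    for (int i = 0; i < nthreads; i++) {
        m->prealloc[i] = calloc(1, tls_size);
        if (m->prealloc[i] == NULL) {
            /* Undo the partial work so the failure is clean. */
            while (i-- > 0) {
                free(m->prealloc[i]);
                m->prealloc[i] = NULL;
            }
            return -1;
        }
    }
    return 0;
}

/* "__tls_get_addr" side: never allocates.  On first access the thread
   just installs the pointer that the load already reserved for it. */
void *toy_tls_get_addr(struct toy_module *m, struct toy_thread *t)
{
    if (t->dtv_slot == NULL) {
        t->dtv_slot = m->prealloc[t->id];
        m->prealloc[t->id] = NULL;
    }
    return t->dtv_slot;
}
```

  The point of the model is only that every failure is reported by
  the load call, while the access path does nothing but a pointer
  handoff.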

* The main request for async-signal-safe TLS use is satisfied by "fail
  safe" semantics that preserve lazy allocation semantics: if the
  memory is really not available, then you crash gracefully in
  __tls_get_addr.  (That is, as much grace as abort, as opposed to the
  full range of "undefined behavior" or anything like deadlock.)
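
  A minimal sketch of what "fail safe" means here, using a
  hypothetical helper toy_lazy_tls_get rather than the real
  __tls_get_addr.  It shows only the failure semantics (abort instead
  of undefined behavior); calloc itself is not async-signal-safe, so
  this is not a model of how a signal-safe allocator would be built.

```c
#include <stdlib.h>
#include <unistd.h>

/* Lazily allocate the block behind *slot on first access.  If the
   memory is not available, report it and abort rather than return
   garbage or hang.  size is assumed to be nonzero. */
void *toy_lazy_tls_get(void **slot, size_t size)
{
    if (*slot == NULL) {
        void *p = calloc(1, size);
        if (p == NULL) {
            static const char msg[] = "cannot allocate TLS block\n";
            if (write(STDERR_FILENO, msg, sizeof msg - 1) < 0) {
                /* Nothing more can be done; fall through to abort. */
            }
            abort();
        }
        *slot = p;
    }
    return *slot;
}
```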

* How to find all memory containing direct application data is a de
  facto part of our ABI.  By "direct" I mean objects that the
  application touches itself.  That includes __thread variables just
  as it includes global, static, and auto variables.  It excludes
  library-maintained caches and the like, but includes any user data
  that the public API implies the library holds onto, such as pointers
  stored by <search.h> functions.

  This is a distinct issue from the general subject of "using an
  alternate allocation mechanism for memory" that Carlos mentioned.
  If libc changes how and where it stores its own internal data, that
  does not impinge on anything that is a de facto part of the ABI.  If
  libc changes how and where it stores application TLS data or other
  things in the aforementioned category, that is another thing entirely.

  I mentioned ASan as just one example of the kinds of things that
  might care about these aspects of the de facto ABI.  Things like
  ASan and conservative GC implementations are the obvious examples.
  But the fundamentals of conservatism dictate that we not make a
  priori assumptions about what our users are doing and what matters
  to them.  As with all somewhat fuzzy aspects of the ABI, there will
  be a pragmatic balancing test between "I was using that, you can't
  break it!" and, "You were broken to have been relying on that."  But
  we must consider it explicitly, discuss it pragmatically, and be
  circumspect about changes, especially the subtle ones.  The change
  at issue here is especially subtle in that it could be a silent time
  bomb that does not affect anybody in practice (or that nobody
  realizes explains strange new flakiness they experience) for
  multiple release cycles.  For example, suppose that before the
  change a __thread variable (in a dynamic TLS module) sometimes was
  the only root holding a GC'able pointer and the GC noticed it there,
  but after the change the GC doesn't see that root.  If this bug is
  introduced tomorrow, it could be a long time before the confluence
  of when collections happen, whether other objects hold (or appear to
  hold) the same pointer, and the effects of reclamation, add up to
  make someone experience a failure they notice.

  How to find threads' stacks and static TLS areas is already
  underspecified (improving that situation is a subject for another
  discussion).  But even for that, we would be quite circumspect about
  making a change that could break methods existing programs are using
  to acquire that information.

  Today, dynamic TLS areas are allocated using the public malloc
  interface.  Programs or GC libraries or whatnot can today supply a
  malloc replacement, scan the static data+bss area, scan each
  thread's stack and static TLS area, and reliably discover every byte
  of user data in any TLS area for any thread.  We never documented an
  explicit guarantee that this works, but it does and to the extent
  that anything extant relies on it (whether or not its maintainers
  realize they do!), it is part of the de facto ABI.
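
  To make the failure mode concrete, here is a toy conservative
  scanner.  It is a sketch under stated assumptions, not the code of
  any real GC or of ASan: struct toy_region, toy_region_holds_pointer
  and toy_is_reachable are hypothetical names, pointers are assumed
  to sit at word-aligned offsets, and a pointer is assumed to have
  the same representation as uintptr_t.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One memory range the scanner knows about (a stack, a static TLS
   area, a block handed out by its malloc replacement, ...). */
struct toy_region {
    const void *start;
    size_t len;
};

/* Conservatively scan one range: any word whose value equals the
   target address counts as a reference to it. */
int toy_region_holds_pointer(const void *start, size_t len,
                             const void *target)
{
    const unsigned char *p = start;
    uintptr_t want = (uintptr_t) target;
    for (size_t off = 0; off + sizeof(uintptr_t) <= len;
         off += sizeof(uintptr_t)) {
        uintptr_t word;
        memcpy(&word, p + off, sizeof word);
        if (word == want)
            return 1;
    }
    return 0;
}

/* An object is kept alive only if some known range refers to it. */
int toy_is_reachable(const struct toy_region *regions, size_t n,
                     const void *target)
{
    for (size_t i = 0; i < n; i++)
        if (toy_region_holds_pointer(regions[i].start, regions[i].len,
                                     target))
            return 1;
    return 0;
}
```

  If the only reference to an object lives in a range that is missing
  from the list, toy_is_reachable returns 0 and a collector built
  this way would reclaim an object that is still in use.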

  I don't have a firm conclusion about what guarantees of this sort
  we should or should not be offering or preserving.  But this change
  affects that part of the de facto ABI and as far as I noticed nobody
  has discussed it at all.  That fails the conservatism test instantly.

I have no great quarrel with the thoroughness or conservatism of the
vetting of the implementation details or first-order ABI issues of
what's gone in.  (I am not entirely sanguine about all that, but close
enough that I've decided not to participate in the detailed review.)
But the mere fact that in a few months and >100 messages of discussion,
I'm the first to raise these subtleties (that I really thought would
have been fairly obvious to people here) gives me great pause about
the whole endeavor.

Similarly, Carlos expressed an attitude that I'll summarize as, "So we
break ASan for a release or three and fix it later, no big deal."
That is fundamentally anti-conservative IMHO.  Indeed, ASan is not
part of glibc.  If it were, we'd be able to achieve complete
confidence about all its issues very quickly.  ASan is an example of
the wide variety of things users are doing with glibc, that we have an
obligation never to break silently or inadvertently.

Dynamic TLS access not being async-signal-safe has been the status quo
since the inception of the TLS features.  Leaving that as it is for
another release is just obviously acceptably conservative.
Contrarily, breaking other kinds of subtle interaction with TLS
features that have worked in practice heretofore is not conservative
at all.

As I said, I'm not specifying any conclusions.  I'm fairly confident
we can find a middle road that is appropriately conservative while
offering improvement for the pain point.  But we have yet to even
begin discussing what IMHO should be considered a major obstacle to
making this change while keeping with our conservative principles.

Thanks,
Roland