This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]
Other format:	[Raw text]

Re: [RFC][BZ #19329] high level description of pthread_create vs dlopen races

From: Szabolcs Nagy <szabolcs dot nagy at arm dot com>
To: Torvald Riegel <triegel at redhat dot com>
Cc: <nd at arm dot com>, GNU C Library <libc-alpha at sourceware dot org>
Date: Tue, 6 Dec 2016 17:22:51 +0000
Subject: Re: [RFC][BZ #19329] high level description of pthread_create vs dlopen races
Authentication-results: sourceware.org; auth=none
Authentication-results: spf=none (sender IP is ) smtp.mailfrom=Szabolcs dot Nagy at arm dot com;
Nodisclaimer: True
References: <583C7F76.3000601@arm.com> <1480965875.14990.199.camel@redhat.com>
Spamdiagnosticmetadata: NSPM
Spamdiagnosticoutput: 1:99

On 05/12/16 19:24, Torvald Riegel wrote:
> On Mon, 2016-11-28 at 19:03 +0000, Szabolcs Nagy wrote:
>> In short: relevant globals (dl_tls_max_dtv_idx, dl_tls_generation,
>> dl_tls_dtv_slotinfo_list entries) are written by dlopen and
>> pthread_create reads them when initializing the dtv of a new thread.
>>
>> Current concurrency design:
> 
> Generally, I'd suggest to start one level of abstraction higher: What
> are the abstract operations that we need to perform?  Which of these
> need to appear to be atomic wrt. which other operations?  Which ordering
> relations do need to hold in every execution?
> (In contrast, you explain things below closer to the implementation
> level (eg, this lock is acquired for these operations, etc.))

i know you wanted this the last time too, but i don't plan to
rewrite the dynamic linker from scratch, i'm trying to make a
minimal change that is acceptable without years of work.

the current design does not have a clean set of invariants and
interface contracts, so it's hard to describe at a higher level
abstraction: there is always yet another special case that does
not fit into a simple model and nailing down all those special
cases is a lot more effort than necessary.

ignoring the implementation and only describing the user visible
requirements is not enough: glibc has lot of extensions and
non-standard behaviours users may depend on (so whatever the
current implementation does is the spec), following the standard
would not end up with something compatible with the current code
and it would still take a lot of time.

that's why my approach was to make a small change and trying to
reason about the changed semantics compared to the original
behaviour instead of figuring out all the atomicity and ordering
requirements (which can be specified in many ways).

>> (2) dlopen can reuse modids of dlclosed modules (and dlclose can decrement
>> GL(dl_tls_max_dtv_idx)), so multiple reads of a slotinfo entry may be
>> inconsistent if neither GL(dl_load_lock) nor GL(dl_load_write_lock) are
>> held (e.g. concurrent dlclose and dlopen between reading the .gen and
>> .map members of the slotinfo entry), this affects:
>> 	pthread_create (_dl_allocate_tls_init, reads .gen and .map)
>> 	__tls_get_addr (_dl_update_slotinfo, reads .gen and .map)
> 
> Can we version the modids?  Or do we generally need a mechanism for
> atomic snapshots?  What are the abstract operations and the requirements
> for them?

this is probably not such a big issue as i originally thought,
reuse of .gen and .map is possible to detect, because gen can
only increase (so it can be compared to a previous read of the
global gen counter and ignored if bigger, if smaller then it
can be guaranteed that .map and .gen are consistent).

>> (3) tls access must be as-safe, but currently tls is allocated at first
>> access if the containing module was not loaded at the time the accessing
>> thread was created.  This means __tls_get_addr may malloc or lock
>> GL(dl_load_lock) or modify the thread local dtv. This can cause deadlocks,
>> oom crashes or memory corruption if it is in a signal handler (malloc in
>> signal handler is also problematic for malloc interposition).
> 
> Are signal handlers allowed to access TLS?  I don't know what POSIX
> requires (or doesn't disallow), but C++, for example, is moving to not
> allowing TLS (if I remember the draft wording correctly).

if that's true it is just another brokenness in c++,
i would not worry about it since dynamic loading of
libraries and tls is unfixably broken in c++ anyway.

meanwhile the rest of the world needs a way to detect
reentry into an as-safe library call which currently
requires as-safe tls (signal mask is not an option:
syscalls are too slow) and a way to do soft float
correctly (fenv is tls and arithmetics must be as-safe)
or just access errno (which could be made as-safe even
if tls access is not as-safe otherwise, but specifying
that correctly in the standard is more hassle than
requiring all tls access to be as-safe).

>> (4) allocation failures can oom crash in pthread_create and dlopen
>> (_dl_resize_dtv) and at tls access (various mallocs) or may leave
>> inconsistent state behind in dlopen (_dl_add_to_slotinfo or any
>> later call in dlopen that fails: tls changes are not rolled back).
> 
> Do we need to be able to publish changes atomically?  If so, there are
> known techniques for that.
> 

that may not be necessary.. i think this can be solved
without being atomic if some resource leaks are allowed
(an argument can be made that it's fine, it's one of
those special case decisions).

> Do we perhaps can deal with most of all of this if we had simple
> transactions?  If these aren't arbitrary operations but in some way
> limited, and can be used for lots of the operations we need to
> synchronize, this may perhaps be the easiest way to implement all of
> this.  Once we know about the abstract operations, we can better assess
> whether that would be a helpful approach.

you can think of dlopen (up to ctors) as a transaction..
but there are mallocs, mmaps and even debug messages that
are not transactional operations and some state does not
have to be rolled back (e.g. generation count increment)
so i don't think this helps.

>> (5) GL(dl_tls_generation) is not checked for overflow consistently and
>> slotinfo entries can get assigned an overflowed generation count, its type
>> is size_t and both dlopen and dlclose increment it, so overflow may be a
>> concern on 32bit targets.
> 
> Can the size be extended on 32b targets?  If so, we can build a
> monotonically increasing, larger atomic counter on 32b targets too.

that would need some changes

the generation count up to which a dtv is set up in a
thread is stored in dtv[0] which is 32bit on a 32bit
system.

making the dtv 64bit will break anything that tries to
look at the dtv (asan?), hacking it around by using dtv[-1]
is possible but not elegant (dtv length is already stored
there it just have to be moved to dtv[-2]).

>> (6) dlclose holds GL(dl_load_lock) while running ctors so if those ctors
>> wait for anything that may take the same lock there is a deadlock.
> 
> Do you mean dtors?
> 

sorry, correctly:
dlopen holds the lock while running ctors and
dlclose holds the lock while running dtors.

>> Proposed fix(es):
> 
> Before focusing on taking the lock, I'd suggest to look at the bigger
> picture (ie, the abstract operations and atomicity and ordering
> requirements).
> It can be possible that a clean redesign is actually simpler than trying
> to fix a system that is already complex and which we don't understand
> 100%.

i think that's optimistic but i can write up what i think
the requirements are.

Follow-Ups:
- Re: [RFC][BZ #19329] high level description of pthread_create vs dlopen races
  - From: Joseph Myers
- Re: [RFC][BZ #19329] high level description of pthread_create vs dlopen races
  - From: Torvald Riegel

References:
- Re: [RFC][BZ #19329] high level description of pthread_create vs dlopen races
  - From: Torvald Riegel

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]