This is the mail archive of the
mailing list for the glibc project.
Re: [PATCH][BZ #19329] Fix race between tls allocation at thread creation and dlopen
- From: Szabolcs Nagy <szabolcs dot nagy at arm dot com>
- To: Adhemerval Zanella <adhemerval dot zanella at linaro dot org>, <libc-alpha at sourceware dot org>
- Cc: <nd at arm dot com>
- Date: Mon, 18 Jan 2016 20:09:19 +0000
- Subject: Re: [PATCH][BZ #19329] Fix race between tls allocation at thread creation and dlopen
- Authentication-results: sourceware.org; auth=none
- Nodisclaimer: True
- References: <568D5E11 dot 3010301 at arm dot com> <5693D908 dot 8090104 at arm dot com> <5695B7BF dot 7050000 at redhat dot com> <5697BA1F dot 6010801 at arm dot com> <569D282E dot 8000005 at linaro dot org>
- Spamdiagnosticmetadata: NSPM
- Spamdiagnosticoutput: 1:23
On 18/01/16 18:00, Adhemerval Zanella wrote:
On 14-01-2016 13:09, Szabolcs Nagy wrote:
On 13/01/16 02:34, Carlos O'Donell wrote:
On 01/11/2016 11:32 AM, Szabolcs Nagy wrote:
2016-01-11 Szabolcs Nagy <firstname.lastname@example.org>
* elf/dl-open.c (dl_open_worker): Write GL(dl_tls_generation)
* elf/dl-tls.c (_dl_allocate_tls_init): Read GL(dl_tls_generation),
GL(dl_tls_max_dtv_idx), slotinfo entries and listp->next atomically.
(_dl_add_to_slotinfo): Write the slotinfo entries and listp->next
You are headed in the right direction. I like where this patch is going,
but don't have enough time to review this in detail yet.
will there be time before 2.23?
I intend to send a message about the hard freeze today and the question is
if this fix is suitable to not show regression in the following mount and
if so if we could iron out the bugs. So are you confident about that?
I am asking you this because this synchronization issues are a nest of rats
sometimes and I also noted you found out multiple issues during the patch
sending and reviewing. Objections?
i'm confident that i can fix the races i can
trigger (new threads vs dlopen), without creating
regressions (other than performance regression).
adding a handful of atomics is enough for that
(it should not alter the behaviour when there is
no conflicting memory access), however it might
not be the best solution (too conservative) or i
might miss a few races etc.
At first glance your patch lacks sufficient concurrency documentation to
be acceptable. You need to document which acquires the releases
synchronizes-with. Please grep for "CONCURRENCY NOTES" to see the level
of detail we need to maintain these kinds of changes.
i wanted to avoid documenting all the mess in the dynamic linker,
but i can improve the comments.
i see the following bugs:
1) pthread_create is not synced with dlopen
2) pthread_create is not synced with dlclose
3) tls access is not synced with dlopen
4) tls access is not synced with dlclose
5) tls access can oom crash
6) tls access is not as-safe
7) dlopen holds a global lock during ctors
i can fix 1) by adding some atomics (this is the current patch).
for 2) that's not enough, because dlclose has to wait with the
free(link_map) while pthread_create is initializing the tls.
So which would be a better approach regarding it?
i don't know what's the best approach
one possibility is to not free memory in dlclose
(i guess glibc does not want this).
another is to sync e.g. by using some lock, but it
cannot be GL(dl_load_lock) now because that is held
during ctors are running in dlopen which may call
of course using both atomics and locks is a bit
i guess 3) can be fixed similarly to 1) but i don't have a
test case for that.
the rest is harder to fix.
is it ok to only fix 1) for 2.23?
or should i try to fix 3) as well (races on the same globals)?
I would say we attack one problem at time, even when they are related.
How hard do you consider to fix 3.?
i haven't looked at that in detail, it involves more
code starting from __tls_get_addr, but it is probably
less critical because the code tries to only look at
slotinfo entries up to a point (module's gen number where
the accessed tls is) that is guaranteed to be synced by
a dlopen that returned.
(which means less problems with concurrently loading
modules, however GL(*) should be accessed with atomics).
i can try to cook up something tomorrow with a detailed
concurrency note and then others can decide if it's
too risky change or not.