Bug 18457 - pthread_join deadlock in library destructor
Summary: pthread_join deadlock in library destructor
Status: RESOLVED FIXED
Alias: None
Product: glibc
Classification: Unclassified
Component: dynamic-link (show other bugs)
Version: 2.21
: P2 normal
Target Milestone: 2.22
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks: 17090 17620 17621 17628
  Show dependency treegraph
 
Reported: 2015-05-27 14:26 UTC by Andreas Schwab
Modified: 2020-07-05 11:34 UTC (History)
7 users (show)

See Also:
Host:
Target:
Build:
Last reconfirmed:
fweimer: security-


Attachments
Patch that fixes the regression (2.63 KB, patch)
2015-06-03 05:09 UTC, Alexandre Oliva
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Andreas Schwab 2015-05-27 14:26:44 UTC
A dlopen'd library that is unloaded and calls pthread_join in the destructor causes a deadlock while waiting for the child to finish.

$ cat pthread-join.c 
#include <dlfcn.h>

int
main (void)
{
  void *f = dlopen ("pthread-join-test.so", RTLD_NOW | RTLD_GLOBAL);
  if (f) dlclose (f);
}
$ cat pthread-join-test.c 
#include <stdio.h>
#include <pthread.h>

static pthread_t th;
static int running = 1;

static void *
test_run (void *p)
{
  while (running)
    fprintf (stderr, "XXX test_run\n");
  fprintf (stderr, "XXX test_run FINISHED\n");
  return NULL;
}

static void __attribute__ ((constructor))
do_init (void)
{
  pthread_create (&th, NULL, test_run, NULL);
}

static void __attribute__ ((destructor))
do_end (void)
{
  running = 0;
  fprintf (stderr, "thread_join...\n");
  pthread_join (th, NULL);
  fprintf (stderr, "thread_join DONE\n");
}
$ gcc -fPIC -shared pthread-join-test.c -lpthread -o pthread-join-test.so
$ gcc pthread-join.c -lpthread -ldl -Wl,-path=. -o pthread-join
$ ./pthread-join
thread_join...
XXX test_run FINISHED


This was broken by f8aeae3 ("Fix DTV race, assert, DTV_SURPLUS Static TLS limit, and nptl_db garbage").
Comment 1 Alexandre Oliva 2015-06-03 05:09:00 UTC
Created attachment 8340 [details]
Patch that fixes the regression

Thanks for the report . Here's the patch I'm testing to fix the regression.
Comment 2 Carlos O'Donell 2015-07-02 16:11:38 UTC
(In reply to Alexandre Oliva from comment #1)
> Created attachment 8340 [details]
> Patch that fixes the regression
> 
> Thanks for the report . Here's the patch I'm testing to fix the regression.

We are still looking to fix this in 2.22, but Alex's patch may not be the solution that is decided upon. Initially as a conservative fix we may just not uload the DSO that have destructors used by thread_local. Then in 2.23 we will put in a broader fix that resolves and documents the concurrency requirements for the internal glibc interfaces.
Comment 3 Alexandre Oliva 2015-07-06 21:13:20 UTC
Don't even consider the patch, it's wrong.  We can't assume that, just because l_tls_offset is set, the TLS block has been set up for all threads; a concurrent dlopen may have already set l_tls_offset for a previously-loaded but TLS-undecided module, but not yet completed initializing the TLS blocks for all threads.  Taking the lock guarantees we wait for that initialization to complete.  In order to make this work, we'd have to have dlopen to set l_tls_offset to some value indicating initialization is underway and must be waited for, overwriting it with the final value that obviates the lock only after the initialization is done.

I think we're better off leaving this as is, and focusing on ways to run module init and fini without holding the rtld lock.
Comment 4 Sourceware Commits 2015-07-24 13:44:49 UTC
This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "GNU C Library master sources".

The branch, master has been updated
       via  e400f3ccd36fe91d432cc7d45b4ccc799dece763 (commit)
      from  48f5f7a63c5971a698ac2b97e4c383e04cbbe110 (commit)

Those revisions listed above that are new to this repository have
not appeared on any other notification email; so we list those
revisions in full, below.

- Log -----------------------------------------------------------------
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=e400f3ccd36fe91d432cc7d45b4ccc799dece763

commit e400f3ccd36fe91d432cc7d45b4ccc799dece763
Author: Siddhesh Poyarekar <siddhesh@redhat.com>
Date:   Fri Jul 24 19:13:38 2015 +0530

    Use IE model for static variables in libc.so, libpthread.so and rtld
    
    The recently introduced TLS variables in the thread-local destructor
    implementation (__cxa_thread_atexit_impl) used the default GD access
    model, resulting in a call to __tls_get_addr.  This causes a deadlock
    with recent changes to the way TLS is initialized because DTV
    allocations are delayed and hence despite knowing the offset to the
    variable inside its TLS block, the thread has to take the global rtld
    lock to safely update the TLS offset.
    
    This causes deadlocks when a thread is instantiated and joined inside
    a destructor of a dlopen'd DSO.  The correct long term fix is to
    somehow not take the lock, but that will need a lot deeper change set
    to alter the way in which the big rtld lock is used.
    
    Instead, this patch just eliminates the call to __tls_get_addr for the
    thread-local variables inside libc.so, libpthread.so and rtld by
    building all of their units with -mtls-model=initial-exec.
    
    There were concerns that the static storage for TLS is limited and
    hence we should not be using it.  Additionally, dynamically loaded
    modules may result in libc.so looking for this static storage pretty
    late in static binaries.  Both concerns are valid when using TLSDESC
    since that is where one may attempt to allocate a TLS block from
    static storage for even those variables that are not IE.  They're not
    very strong arguments for the traditional TLS model though, since it
    assumes that the static storage would be used sparingly and definitely
    not by default.  Hence, for now this would only theoretically affect
    ARM architectures.
    
    The impact is hence limited to statically linked binaries that dlopen
    modules that in turn load libc.so, all that on arm hardware.  It seems
    like a small enough impact to justify fixing the larger problem that
    currently affects everything everywhere.
    
    This still does not solve the original problem completely.  That is,
    it is still possible to deadlock on the big rtld lock with a small
    tweak to the test case attached to this patch.  That problem is
    however not a regression in 2.22 and hence could be tackled as a
    separate project.  The test case is picked up as is from Alex's patch.
    
    This change has been tested to verify that it does not cause any
    issues on x86_64.
    
    ChangeLog:
    
    	[BZ #18457]
    	* nptl/Makefile (tests): New test case tst-join7.
    	(modules-names): New test case module tst-join7mod.
    	* nptl/tst-join7.c: New file.
    	* nptl/tst-join7mod.c: New file.
    	* Makeconfig (tls-model): Pass -ftls-model=initial-exec for
    	all translation units in libc.so, libpthread.so and rtld.

-----------------------------------------------------------------------

Summary of changes:
 ChangeLog                              |   10 +++++++
 Makeconfig                             |    6 +++-
 NEWS                                   |   12 ++++----
 nptl/Makefile                          |   10 +++++-
 nptl/tst-join7.c                       |   46 ++++++++++++++++++++++++++++++++
 nptl/{tst-detach1.c => tst-join7mod.c} |   46 ++++++++++++++++++--------------
 6 files changed, 101 insertions(+), 29 deletions(-)
 create mode 100644 nptl/tst-join7.c
 copy nptl/{tst-detach1.c => tst-join7mod.c} (52%)
Comment 5 Siddhesh Poyarekar 2015-07-24 13:45:57 UTC
Fixed in master.
Comment 6 jack 2020-07-05 11:34:14 UTC Comment hidden (spam)