28357 – deadlock between pthread_create and ctors

Bug 28357 - deadlock between pthread_create and ctors

Summary: deadlock between pthread_create and ctors

Status:	RESOLVED FIXED

Alias:	None

Product:	glibc
Classification:	Unclassified
Component:	dynamic-link (show other bugs)
Version:	2.34

Importance:	P2 normal
Target Milestone:	2.35
Assignee:	Szabolcs Nagy

URL:
Keywords:

Depends on:
Blocks:

Reported:	2021-09-20 16:43 UTC by Szabolcs Nagy
Modified:	2024-05-16 07:27 UTC (History)
CC List:	1 user (show)

See Also:	19329
Host:
Target:
Build:
Last reconfirmed:

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Szabolcs Nagy 2021-09-20 16:43:34 UTC

glibc 2.34 introduced GL(dl_load_lock) in pthread_create to fix
bug 19329.

ctors in dlopen run while GL(dl_load_lock) which can cause
deadlocks as explained in bug 15686. that bug is hard to fix so
the deadlock at thread creation is dealt with separately here.

copying the deadlock report from bug 19329:

If I have a c++ dynamic library(named libA.so) that contains a global object, the global object will call the post-constructor at initialization and hold it's own lock(named A_lock) when dlopen loads libA.so. Assume that two threads execute the following process：
    Thread1:dlopen(libA.so) => hold dl_load_lock => load libA.so => init global 
            object from libA.so => wait for hold A_lock
    Thread2:my own code hold A_lock => pthread_create => _dl_allocate_tls_init 
            => wait for hold dl_load_lock
In this case, an ABBA deadlock occurs. Is this a bug?

My stack looks like this:
Thread 1 (LWP 136013):
#0  0x00007f57a108510d in ?? () from /usr/lib64/libpthread.so.0
#1  0x00007f57a107e4d1 in pthread_mutex_lock () from /usr/lib64/libpthread.so.0
#1  stack waiting for holding A_lock
...
#6  0x00007f5781c1bb8b in LogProcess::Init (strProcName=..., nProcHandle=nProcHandle@entry=0) at ./service/biz_frame/code/server/src/logging/logprocess.cpp:107
...
#20 0x00007f57a0fef21f in _dl_catch_exception () from /usr/lib64/libc.so.6
#21 0x00007f57a786442b in ?? () from /lib64/ld-linux-x86-64.so.2
#22 0x00007f57a3de2296 in ?? () from /usr/lib64/libdl.so.2
#23 0x00007f57a0fef21f in _dl_catch_exception () from /usr/lib64/libc.so.6
#24 0x00007f57a0fef2af in _dl_catch_error () from /usr/lib64/libc.so.6
#25 0x00007f57a3de2985 in ?? () from /usr/lib64/libdl.so.2
#26 0x00007f57a3de2351 in dlopen () from /usr/lib64/libdl.so.2
...
...
#38 0x00007f57a0fb3520 in clone () from /usr/lib64/libc.so.6

Thread 2 (LWP 134627):
#0  0x00007f57a108510d in ?? () from /usr/lib64/libpthread.so.0
#1  0x00007f57a107e580 in pthread_mutex_lock () from /usr/lib64/libpthread.so.0
#2  0x00007f57a7863835 in _dl_allocate_tls_init () from /lib64/ld-linux-x86-64.so.2
#3  0x00007f57a107cb7c in pthread_create () from /usr/lib64/libpthread.so.0
...
#10 Stack holding A_lock
...
#14 0x0000561689e0d579 in main ()

Comment 1 Sourceware Commits 2021-10-04 14:12:03 UTC

The master branch has been updated by Szabolcs Nagy <nsz@sourceware.org>:

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=83b5323261bb72313bffcf37476c1b8f0847c736

commit 83b5323261bb72313bffcf37476c1b8f0847c736
Author: Szabolcs Nagy <szabolcs.nagy@arm.com>
Date:   Wed Sep 15 15:16:19 2021 +0100

    elf: Avoid deadlock between pthread_create and ctors [BZ #28357]
    
    The fix for bug 19329 caused a regression such that pthread_create can
    deadlock when concurrent ctors from dlopen are waiting for it to finish.
    Use a new GL(dl_load_tls_lock) in pthread_create that is not taken
    around ctors in dlopen.
    
    The new lock is also used in __tls_get_addr instead of GL(dl_load_lock).
    
    The new lock is held in _dl_open_worker and _dl_close_worker around
    most of the logic before/after the init/fini routines.  When init/fini
    routines are running then TLS is in a consistent, usable state.
    In _dl_open_worker the new lock requires catching and reraising dlopen
    failures that happen in the critical section.
    
    The new lock is reinitialized in a fork child, to keep the existing
    behaviour and it is kept recursive in case malloc interposition or TLS
    access from signal handlers can retake it.  It is not obvious if this
    is necessary or helps, but avoids changing the preexisting behaviour.
    
    The new lock may be more appropriate for dl_iterate_phdr too than
    GL(dl_load_write_lock), since TLS state of an incompletely loaded
    module may be accessed.  If the new lock can replace the old one,
    that can be a separate change.
    
    Fixes bug 28357.
    
    Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>

Comment 2 Szabolcs Nagy 2021-10-04 14:19:12 UTC

fixed for glibc 2.35, will backport it later.

Comment 3 Sourceware Commits 2021-10-19 12:23:00 UTC

The release/2.34/master branch has been updated by Florian Weimer <fw@sourceware.org>:

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=024a7640ab9ecea80e527f4e4d7f7a1868e952c5

commit 024a7640ab9ecea80e527f4e4d7f7a1868e952c5
Author: Szabolcs Nagy <szabolcs.nagy@arm.com>
Date:   Wed Sep 15 15:16:19 2021 +0100

    elf: Avoid deadlock between pthread_create and ctors [BZ #28357]
    
    The fix for bug 19329 caused a regression such that pthread_create can
    deadlock when concurrent ctors from dlopen are waiting for it to finish.
    Use a new GL(dl_load_tls_lock) in pthread_create that is not taken
    around ctors in dlopen.
    
    The new lock is also used in __tls_get_addr instead of GL(dl_load_lock).
    
    The new lock is held in _dl_open_worker and _dl_close_worker around
    most of the logic before/after the init/fini routines.  When init/fini
    routines are running then TLS is in a consistent, usable state.
    In _dl_open_worker the new lock requires catching and reraising dlopen
    failures that happen in the critical section.
    
    The new lock is reinitialized in a fork child, to keep the existing
    behaviour and it is kept recursive in case malloc interposition or TLS
    access from signal handlers can retake it.  It is not obvious if this
    is necessary or helps, but avoids changing the preexisting behaviour.
    
    The new lock may be more appropriate for dl_iterate_phdr too than
    GL(dl_load_write_lock), since TLS state of an incompletely loaded
    module may be accessed.  If the new lock can replace the old one,
    that can be a separate change.
    
    Fixes bug 28357.
    
    Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
    (cherry picked from commit 83b5323261bb72313bffcf37476c1b8f0847c736)