654 – Cancelling nptl thread on dlclose() leads to application hangup

Summary: Cancelling nptl thread on dlclose() leads to application hangup

Status:	RESOLVED FIXED

Alias:	None

Product:	glibc
Classification:	Unclassified
Component:	nptl
Version:	2.3.4

Importance:	P2 normal
Target Milestone:	---
Assignee:	Ulrich Drepper

URL:
Keywords:

Depends on:
Blocks:

Reported:	2005-01-12 10:47 UTC by Alexei Khlebnikov
Modified:	2019-04-10 09:23 UTC
CC List:	4 users

See Also:
Host:	i686-pc-linux-gnu
Target:	i686-pc-linux-gnu
Build:	i686-pc-linux-gnu
Last reconfirmed:

Flags:	fweimer: security-

Attachments
Testcase for the bug. (1.41 KB, application/octet-stream) 2005-01-12 10:49 UTC, Alexei Khlebnikov
Proposed patch, the first try. (453 bytes, patch) 2005-01-17 12:47 UTC, Alexei Khlebnikov

Description Alexei Khlebnikov 2005-01-12 10:47:19 UTC

Overview Description:
The program loads a module (a .so library) using dlopen(). During loading, a
global C++ object is created. It does not matter how it is created, whether as a
global variable or as an object allocated with new from an
__attribute__ ((constructor)) function: the bug is triggered in either case. The
constructor of this object spawns a thread. The program then unloads the
dynamically loaded module. The destructor of the object is called, and it calls
a function that tries to cancel the spawned thread. The thread uses cancel type
PTHREAD_CANCEL_DEFERRED and periodically checks for cancellation with
pthread_testcancel(), so it receives the cancellation request. The main
thread calls pthread_join() to join the second thread, and the whole program
hangs. If the function that cancels the second thread is called explicitly
(not from the destructor) before the module is unloaded, the second thread is
cancelled and joined without problems.
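The pattern described above can be sketched roughly as follows. This is a hypothetical reconstruction, not the code from the attached testcase: the class name Worker and its members are made up for illustration, and the only calls taken from the report are pthread_cancel(), pthread_testcancel() and pthread_join().

```cpp
#include <pthread.h>
#include <unistd.h>

// Hypothetical stand-in for the module's global object (not the attached code).
class Worker {
public:
    // The constructor spawns the background thread.
    Worker() : running_(false) {
        running_ = pthread_create(&tid_, nullptr, &Worker::loop, nullptr) == 0;
    }

    // The destructor cancels and joins the thread, as in the report.
    ~Worker() { shutdown(); }

    // Cancels and joins the thread once. Returns true if the thread
    // ended because of the cancellation request.
    bool shutdown() {
        if (!running_) return false;
        running_ = false;
        pthread_cancel(tid_);
        void *result = nullptr;
        pthread_join(tid_, &result);
        return result == PTHREAD_CANCELED;
    }

private:
    static void *loop(void *) {
        int old_type = 0;
        pthread_setcanceltype(PTHREAD_CANCEL_DEFERRED, &old_type);
        for (;;) {
            usleep(10 * 1000);     // usleep() may itself act as a cancellation point
            pthread_testcancel();  // explicit check for a pending request
        }
    }

    pthread_t tid_;
    bool running_;
};
```

In the module, an object like this would presumably be a global (for example `static Worker g_worker;`), so that its destructor runs inside dlclose(). Used as a local object in an ordinary program, the same code cancels and joins normally.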


Steps to Reproduce:
1) Unpack the attached tarball. It is a trimmed-down testcase
taken from the actual, much larger application.
2) Run "./compile" to compile the test program and the module.
3) Run "./run" to see the messages and the program hang.
4) Press Ctrl-C to reclaim the command prompt.
5) Run "./test ./libtestmod.so foo" to see the normal program behaviour
when the thread is cancelled explicitly.
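For readers without the attachment, the host side of the testcase might look roughly like the sketch below. It is an assumption based on the program output in this report: the function name run_module is hypothetical, and it is only a guess that the module exports modShutdown with C linkage and that a second command-line argument requests the explicit call.

```cpp
#include <dlfcn.h>
#include <cstdio>

// Loads the module, optionally calls its shutdown function explicitly,
// then unloads it. Returns 0 on success, 1 if the module cannot be loaded.
int run_module(const char *path, bool explicit_shutdown) {
    std::printf("loading %s now\n", path);
    void *handle = dlopen(path, RTLD_NOW);  // module constructors run here
    if (handle == nullptr) {
        std::fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return 1;
    }

    if (explicit_shutdown) {
        // Assumption: the symbol name and its C linkage are not confirmed
        // by the report, only the function name appears in the output.
        using shutdown_fn = void (*)();
        auto fn = reinterpret_cast<shutdown_fn>(dlsym(handle, "modShutdown"));
        if (fn != nullptr) fn();
    }

    std::printf("unloading %s now\n", path);
    dlclose(handle);  // module destructors run here; the reported hang occurs in this call
    return 0;
}
```

Calling `run_module("./libtestmod.so", false)` would correspond to "./run", and passing `true` to the variant with the extra "foo" argument.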


Actual Results:
1) Output of running "./run" or "./test ./libtestmod.so".
---
$ ./run
loading ./libtestmod.so now
Constructor called
hi there, new thread is up and running, thread id is -1210377296
Constructor finished
pureShutdown::func(void*) called
= thread -1210377296 is still running...
= thread -1210377296 is still running...
= thread -1210377296 is still running...
= thread -1210377296 is still running...
unloading ./libtestmod.so now
Destructor called
modShutdown() called
bye, cancelling down thread -1210377296
running pthread_join(g_tid, &result) ...
---
(the program hangs here)

2) Output of running "./test ./libtestmod.so foo".
---
$ ./test ./libtestmod.so foo
loading ./libtestmod.so now
Constructor called
hi there, new thread is up and running, thread id is -1210377296
Constructor finished
pureShutdown::func(void*) called
= thread -1210377296 is still running...
= thread -1210377296 is still running...
= thread -1210377296 is still running...
= thread -1210377296 is still running...
modShutdown() called
bye, cancelling down thread -1210377296
running pthread_join(g_tid, &result) ...
returned from pthread_join(g_tid, &result) !
all's well that end's well
modShutdown() finished
unloading ./libtestmod.so now
Destructor called
modShutdown() called
Destructor finished
---
(the program exits with code 0 here)

3) GDB session of the first case (running "./run" or "./test ./libtestmod.so").
---
$ gdb
GNU gdb 6.1.1
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "i686-pc-linux-gnu".
(gdb) file ./test
Reading symbols from ./test...done.
Using host libthread_db library "/lib/libthread_db.so.1".
(gdb) run ./libtestmod.so
Starting program: /home/ses/test/test ./libtestmod.so
[Thread debugging using libthread_db enabled]
[New Thread -1210374480 (LWP 2594)]
loading ./libtestmod.so now
Constructor called
[New Thread -1210377296 (LWP 2597)]
hi there, new thread is up and running, thread id is -1210377296
Constructor finished
pureShutdown::func(void*) called
= thread -1210377296 is still running...
= thread -1210377296 is still running...
= thread -1210377296 is still running...
= thread -1210377296 is still running...
unloading ./libtestmod.so now
Destructor called
modShutdown() called
bye, cancelling down thread -1210377296
running pthread_join(g_tid, &result) ...

Program received signal SIG32, Real-time event 32.
[Switching to Thread -1210377296 (LWP 2597)]
0xffffe410 in ?? ()
(gdb) bt
#0  0xffffe410 in ?? ()
#1  0xb7db1468 in ?? ()
#2  0xb7fd6ff8 in ?? () from /lib/libpthread.so.0
#3  0x00000000 in ?? ()
#4  0xb7fd2cf6 in __nanosleep_nocancel () from /lib/libpthread.so.0
#5  0xb7fe3ddd in pureShutdown::func () at module.cpp:71
#6  0xb7fcd3c0 in start_thread () from /lib/libpthread.so.0
#7  0xb7e6c24e in clone () from /lib/libc.so.6
(gdb) kill
Kill the program being debugged? (y or n) y
(gdb) quit
$
---
kill -l did not print what SIG32 is. Google suggested that it is SIGTRAP, but
GDB reports it as real-time event 32 (SIGTRAP is signal 5).


Expected Results: in the first case the program should not hang. The second
thread should terminate correctly, the module should be unloaded correctly, and
the whole program should exit with code 0.


Build Date: 2005-01-12


System information:
Processor: Pentium III (Coppermine) 667.080 MHz
Distribution: Linux From Scratch 6.0 with RPM and some packages updated
Kernel version: 2.6.9, unpatched AFAIK

Glibc version: 
Snapshot of 2005-01-10 from
ftp://sources.redhat.com/pub/glibc/snapshots/glibc-20050110.tar.bz2:
---
GNU C Library development release version 2.3.90, by Roland McGrath et al.
Copyright (C) 2004 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.
Compiled by GNU CC version 3.4.3.
Compiled on a Linux 2.6.9 system on 2005-01-11.
Available extensions:
        GNU libio by Per Bothner
        crypt add-on version 2.1 by Michael Glad and others
        Native POSIX Threads Library by Ulrich Drepper et al
        BIND-8.2.3-T5B
        NIS(YP)/NIS+ NSS modules 0.19 by Thorsten Kukuk
Thread-local storage support included.
For bug reporting instructions, please see:
<http://www.gnu.org/software/libc/bugs.html>.
---
Sorry for not trying the latest CVS. I have no access to outside CVS servers
from my corporate network. Judging from [glibc]/libc/nptl/ChangeLog on CvsWeb,
nothing has changed in nptl during the last 2 days.

Glibc "./configure" switches (excluding "--*dir=" switches):
---
    --disable-profile \
    --enable-add-ons=nptl \
    --with-tls \
    --with-__thread \
    --enable-kernel=2.6.9 \
    --without-cvs \
    --with-headers=/usr/src/linux-2.6.9/include
---
Glibc was built into the rpm packages with the aid of rpm.

GCC version:
---
$ gcc -v
Reading specs from /usr/lib/gcc/i686-pc-linux-gnu/3.4.3/specs
Configured with: ../gcc-3.4.3/configure --host=i686-pc-linux-gnu
--build=i686-pc-linux-gnu --target=i686-pc-linux-gnu --prefix=/usr
--exec-prefix=/usr --bindir=/usr/bin --sbindir=/usr/sbin --sysconfdir=/etc
--datadir=/usr/share --includedir=/usr/include --libdir=/usr/lib
--libexecdir=/usr/lib --localstatedir=/var --sharedstatedir=/usr/com
--mandir=/usr/share/man --infodir=/usr/share/info --enable-shared
--enable-threads=posix --enable-__cxa_atexit --enable-clocale=gnu
--enable-languages=c,c++
Thread model: posix
gcc version 3.4.3
---

Ld/Binutils version:
---
$ ld -v
GNU ld version 2.15.91.0.1 20040527
---

I hope the provided information helps. If you need more info, please feel free
to ask, and feel free to request additional testing or investigation. Advice on
how to write the patch myself would also be helpful.

Comment 1 Alexei Khlebnikov 2005-01-12 10:49:25 UTC

Created attachment 350
Testcase for the bug.

Comment 2 Alexei Khlebnikov 2005-01-13 12:30:57 UTC

I've tested the same testcase on another system with kernel 2.4.20 and glibc
2.3.2 with linuxthreads. The program ran just fine. The test was conducted
today, 2005-01-13.

The output:
---
$ ./run
loading ./libtestmod.so now
Constructor called
pureShutdown::func(void*) called
hi there, new thread is up and running, thread id is 16386
Constructor finished
= thread 16386 is still running...
= thread 16386 is still running...
= thread 16386 is still running...
= thread 16386 is still running...
unloading ./libtestmod.so now
Destructor called
modShutdown() called
bye, cancelling down thread 16386
running pthread_join(g_tid, &result) ...
returned from pthread_join(g_tid, &result) !
all's well that end's well
modShutdown() finished
Destructor finished
$
---

System information:
CPU: Intel(R) Xeon(TM) CPU 2.80GHz
Distribution: SuSE Linux 8.2
Kernel: 2.4.20-64GB-SMP, from the SuSE distribution

Glibc version:
---
$ /lib/libc.so.6
GNU C Library stable release version 2.3.2, by Roland McGrath et al.
Copyright (C) 2003 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.
Compiled by GNU CC version 3.3 20030226 (prerelease) (SuSE Linux).
Compiled on a Linux 2.4.20 system on 2003-03-13.
Available extensions:
        GNU libio by Per Bothner
        crypt add-on version 2.1 by Michael Glad and others
        linuxthreads-0.10 by Xavier Leroy
        NoVersion patch for broken glibc 2.0 binaries
        BIND-8.2.3-T5B
        libthread_db work sponsored by Alpha Processor Inc
        NIS(YP)/NIS+ NSS modules 0.19 by Thorsten Kukuk
Report bugs using the `glibcbug' script to <bugs@gnu.org>.
---

GCC version:
---
$ gcc -v
Reading specs from /usr/lib/gcc-lib/i486-suse-linux/3.3/specs
Configured with: ../configure --enable-threads=posix --prefix=/usr
--with-local-prefix=/usr/local --infodir=/usr/share/info --mandir=/usr/share/man
--libdir=/usr/lib --enable-languages=c,c++,f77,objc,java,ada --disable-checking
--enable-libgcj --with-gxx-include-dir=/usr/include/g++ --with-slibdir=/lib
--with-system-zlib --enable-shared --enable-__cxa_atexit i486-suse-linux
Thread model: posix
gcc version 3.3 20030226 (prerelease) (SuSE Linux)
---

Ld/Binutils version:
---
$ ld -v
GNU ld version 2.13.90.0.18 20030121 (SuSE Linux)
---

Comment 3 Alexei Khlebnikov 2005-01-13 12:52:21 UTC

I've investigated the problem further.
I've found the approximate place in libc where the hang takes place.
It is in the file nptl/pthread_join.c, line 86, which looks like this:
---
  /* Wait for the child.  */
  lll_wait_tid (pd->tid);
---

lll_wait_tid is a macro with assembler code which I don't understand so far:
---
/* The kernel notifies a process with uses CLONE_CLEARTID via futex
   wakeup when the clone terminates.  The memory location contains the
   thread ID while the clone is running and is reset to zero
   afterwards.

   The macro parameter must not have any side effect.  */
#define lll_wait_tid(tid) \
  do {									      \
    int __ignore;							      \
    register __typeof (tid) _tid asm ("edx") = (tid);			      \
    if (_tid != 0)							      \
      __asm __volatile (LLL_EBX_LOAD					      \
			"1:\tmovl %1, %%eax\n\t"			      \
			LLL_ENTER_KERNEL				      \
			"cmpl $0, (%%ebx)\n\t"				      \
			"jne,pn 1b\n\t"					      \
			LLL_EBX_LOAD					      \
			: "=&a" (__ignore)				      \
			: "i" (SYS_futex), LLL_EBX_REG (&tid), "S" (0),	      \
			  "c" (FUTEX_WAIT), "d" (_tid),			      \
			  "i" (offsetof (tcbhead_t, sysinfo)));		      \
  } while (0)
---
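As a reading aid, the macro above can be paraphrased without inline assembly roughly as follows. This is a sketch under assumptions, not glibc code: the function name wait_tid is made up, and it uses the generic syscall() wrapper instead of the LLL_ENTER_KERNEL sequence. It loads the thread ID word once, returns immediately if it is already zero, and otherwise repeats a FUTEX_WAIT on that word until the kernel has reset it to zero.

```cpp
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

// Paraphrase of lll_wait_tid: block until the word at tid_addr becomes 0.
// The kernel clears this word and issues a futex wakeup when the thread exits.
void wait_tid(int *tid_addr) {
    // Load the thread ID once, like the "d" (_tid) operand of the macro.
    const int tid = __atomic_load_n(tid_addr, __ATOMIC_ACQUIRE);
    if (tid == 0) return;  // thread already gone, nothing to wait for

    do {
        // Sleep only while the word still holds the expected thread ID;
        // a NULL timeout means wait indefinitely.
        syscall(SYS_futex, tid_addr, FUTEX_WAIT, tid, nullptr);
    } while (__atomic_load_n(tid_addr, __ATOMIC_ACQUIRE) != 0);
}
```

Read this way, pthread_join() hanging in lll_wait_tid simply means that the joined thread has not exited yet, so the kernel never clears the word.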

Comment 4 Jakub Jelinek 2005-01-13 13:15:08 UTC

This is the same deadlock as has been fixed by:
2004-07-07  Ulrich Drepper  <drepper@redhat.com>

        * elf/dl-fini.c (_dl_fini): Move the unlock of the ld.so lock
        before the loop running the destructors.
for destructors that are run at exit time.
ATM ld.so holds dl_load_lock when running shared library destructors and the
same lock is used indirectly by libgcc_s.so when unwinding.  If you call
pthread_cancel in a shared library destructor that is run during dlclose,
dl_load_lock is held in the thread calling pthread_cancel, but the cancelled
thread needs to be unwound.  As you also call pthread_join in the same destructor
that waits for the cancelled thread and the cancelled thread is waiting until
dl_load_lock is released (this would happen when dlclose is about to return),
they are deadlocking.

The fix is to avoid running shared library destructors with dl_load_lock held,
but that's certainly not trivial.
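The lock ordering described in this comment can be modelled in a few lines. The sketch below is only an illustration under assumptions: a plain std::mutex stands in for dl_load_lock, the worker thread taking the mutex stands in for the unwinder of the cancelled thread, and a timed wait replaces the real pthread_join() so that the model reports the deadlock instead of hanging. The function name worker_finishes is hypothetical.

```cpp
#include <chrono>
#include <future>
#include <mutex>
#include <thread>

// Returns true if the "cancelled" worker manages to finish while the caller
// is waiting for it, false if it is stuck behind the lock.
bool worker_finishes(bool release_lock_first) {
    std::mutex load_lock;                          // stand-in for dl_load_lock
    std::unique_lock<std::mutex> held(load_lock);  // "dlclose" holds the lock

    std::promise<void> unwound;
    std::future<void> done = unwound.get_future();
    std::thread worker([&] {
        // Unwinding the cancelled thread needs the same lock.
        std::lock_guard<std::mutex> unwinding(load_lock);
        unwound.set_value();
    });

    // Model of the fix: drop the lock before waiting for the thread.
    if (release_lock_first) held.unlock();

    // Model of pthread_join() in the destructor, with a timeout so that the
    // deadlock case is reported rather than reproduced.
    const bool finished =
        done.wait_for(std::chrono::milliseconds(300)) == std::future_status::ready;

    if (held.owns_lock()) held.unlock();  // let the worker finish for cleanup
    worker.join();
    return finished;
}
```

With the lock held during the wait, the worker never gets past its lock acquisition, which corresponds to the hang in the report. Releasing the lock first lets it finish.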

Comment 5 Alexei Khlebnikov 2005-01-17 12:43:06 UTC

I've constructed a patch that unlocks dl_load_lock just before running the
destructors and locks it again just after that. My testcase now runs properly,
but I don't know whether my patch has any side effects. So, dear glibc
developers, please review it and either confirm that the patch is correct or
point out where I am wrong. Thanks.

The patch:
---
--- glibc/elf/dl-close.c.orig   2005-01-09 09:27:52.000000000 +0100
+++ glibc/elf/dl-close.c        2005-01-17 15:04:52.000000000 +0100
@@ -265,6 +265,10 @@
       }
   assert (new_opencount[0] == 0);

+  /* Release dl_load_lock during running destructors,
+     like in dl-fini.c. */
+  __rtld_lock_unlock_recursive (GL(dl_load_lock));
+
   /* Call all termination functions at once.  */
 #ifdef SHARED
   bool do_audit = GLRO(dl_naudit) > 0 && !GL(dl_ns)[ns]._ns_loaded->l_auditing;
@@ -389,6 +393,9 @@
       assert (imap->l_type == lt_loaded || imap->l_opencount > 0);
     }

+  /* Destructors finished, acquire dl_load_lock again. */
+  __rtld_lock_lock_recursive (GL(dl_load_lock));
+
 #ifdef SHARED
   /* Auditing checkpoint: we will start deleting objects.  */
   if (__builtin_expect (do_audit, 0))
---

Comment 6 Alexei Khlebnikov 2005-01-17 12:47:22 UTC

Created attachment 369
Proposed patch, the first try.

This is the same patch as listed in comment #5.

Comment 7 Ulrich Drepper 2006-05-02 22:03:38 UTC

Nothing related to C++, exceptions, and dlopen can be critical.

Comment 8 ZHANG, Le 2010-12-10 02:56:50 UTC

Hey, guys, I found that this testcase can't trigger the bug in recent glibc.
How was this bug finally resolved?
I have searched the glibc git commit messages for dl_load_lock, but nothing showed up.

Comment 9 ZHANG, Le 2010-12-10 03:43:08 UTC

(In reply to comment #4)
> This is the same deadlock as has been fixed by:
> 2004-07-07  Ulrich Drepper  <drepper@redhat.com>
> 
>         * elf/dl-fini.c (_dl_fini): Move the unlock of the ld.so lock
>         before the loop running the destructors.
...
[snip]
...

Ah, sorry, overlooked this comment.

Comment 10 Andreas Jaeger 2012-05-06 09:03:13 UTC

I checked glibc 2.15 and the testcase works for me. I assume this is fixed with:

2006-10-27  Ulrich Drepper  <drepper@redhat.com>

	* elf/dl-close.c (_dl_close_worker): Renamed from _dl_close and
	split out locking and parameter checking.
	(_dl_close): Call _dl_close_worker after locking and checking.
	* elf/dl-open.c (_dl_open): Call _dl_close_worker instead of
	_dl_close.
	* elf/Makefile: Add rules to build and run tst-thrlock.
	* elf/tst-thrlock.c:  New file.


If you still have the problem with glibc 2.15, please reopen and tell us a better way to reproduce.