Overview Description:

The program loads a module (.so library) using dlopen(). During this step a global C++ object is created. It does not matter how it is created, as a plain global variable or as an object allocated with new from an __attribute__ ((constructor)) function: the bug is triggered in either case. The constructor of this object spawns a thread.

Then the program unloads the dynamically loaded module. The destructor of the object is called; it calls a function which tries to cancel the spawned thread. The thread has cancel type PTHREAD_CANCEL_DEFERRED and periodically checks for cancellation with pthread_testcancel(), so it receives the cancellation request. The main thread calls pthread_join() to join the second thread, and the whole program hangs.

If the function which cancels the second thread is called explicitly (not from the destructor) before the module is unloaded, the second thread is cancelled and joined fine.

Steps to Reproduce:
1) Unpack the attached tarball. It is a trimmed-down testcase taken from the actual big application.
2) Run "./compile" to compile the test program and the module.
3) Run "./run" to see the messages and the program hang.
4) Press Ctrl-C to reclaim the command prompt.
5) Run "./test ./libmodule.so foo" to see normal program behaviour when the thread is cancelled explicitly.

Actual Results:

1) Output of running "./run" or "./test ./libmodule.so".
---
$ ./run
loading ./libtestmod.so now
Constructor called
hi there, new thread is up and running, thread id is -1210377296
Constructor finished
pureShutdown::func(void*) called
= thread -1210377296 is still running...
= thread -1210377296 is still running...
= thread -1210377296 is still running...
= thread -1210377296 is still running...
unloading ./libtestmod.so now
Destructor called
modShutdown() called
bye, cancelling down thread -1210377296
running pthread_join(g_tid, &result) ...
---
(the program hangs here)

2) Output of running "./test ./libmodule.so foo".
---
$ ./test ./libtestmod.so foo
loading ./libtestmod.so now
Constructor called
hi there, new thread is up and running, thread id is -1210377296
Constructor finished
pureShutdown::func(void*) called
= thread -1210377296 is still running...
= thread -1210377296 is still running...
= thread -1210377296 is still running...
= thread -1210377296 is still running...
modShutdown() called
bye, cancelling down thread -1210377296
running pthread_join(g_tid, &result) ...
returned from pthread_join(g_tid, &result) !
all's well that end's well
modShutdown() finished
unloading ./libtestmod.so now
Destructor called
modShutdown() called
Destructor finished
---
(the program exits with code 0 here)

3) GDB session of the first case (running "./run" or "./test ./libmodule.so").
---
$ gdb
GNU gdb 6.1.1
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "i686-pc-linux-gnu".
(gdb) file ./test
Reading symbols from ./test...done.
Using host libthread_db library "/lib/libthread_db.so.1".
(gdb) run ./libtestmod.so
Starting program: /home/ses/test/test ./libtestmod.so
[Thread debugging using libthread_db enabled]
[New Thread -1210374480 (LWP 2594)]
loading ./libtestmod.so now
Constructor called
[New Thread -1210377296 (LWP 2597)]
hi there, new thread is up and running, thread id is -1210377296
Constructor finished
pureShutdown::func(void*) called
= thread -1210377296 is still running...
= thread -1210377296 is still running...
= thread -1210377296 is still running...
= thread -1210377296 is still running...
unloading ./libtestmod.so now
Destructor called
modShutdown() called
bye, cancelling down thread -1210377296
running pthread_join(g_tid, &result) ...
Program received signal SIG32, Real-time event 32.
[Switching to Thread -1210377296 (LWP 2597)]
0xffffe410 in ?? ()
(gdb) bt
#0  0xffffe410 in ?? ()
#1  0xb7db1468 in ?? ()
#2  0xb7fd6ff8 in ?? () from /lib/libpthread.so.0
#3  0x00000000 in ?? ()
#4  0xb7fd2cf6 in __nanosleep_nocancel () from /lib/libpthread.so.0
#5  0xb7fe3ddd in pureShutdown::func () at module.cpp:71
#6  0xb7fcd3c0 in start_thread () from /lib/libpthread.so.0
#7  0xb7e6c24e in clone () from /lib/libc.so.6
(gdb) kill
Kill the program being debugged? (y or n) y
(gdb) quit
$
---

kill -l did not print what SIG32 is. Google said that it is SIGTRAP.

Expected Results:
The program in the first case should not hang; the second thread should terminate correctly, the module should be unloaded correctly, and the whole program should exit with code 0.

Build Date: 2004-01-12

System information:
Processor: Pentium III (Coppermine) 667.080 MHz
Distribution: Linux From Scratch 6.0 with RPM and some packages updated
Kernel version: 2.6.9, unpatched AFAIK
Glibc version: Snapshot of 2005-01-10 from ftp://sources.redhat.com/pub/glibc/snapshots/glibc-20050110.tar.bz2:
---
GNU C Library development release version 2.3.90, by Roland McGrath et al.
Copyright (C) 2004 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.
Compiled by GNU CC version 3.4.3.
Compiled on a Linux 2.6.9 system on 2005-01-11.
Available extensions:
    GNU libio by Per Bothner
    crypt add-on version 2.1 by Michael Glad and others
    Native POSIX Threads Library by Ulrich Drepper et al
    BIND-8.2.3-T5B
    NIS(YP)/NIS+ NSS modules 0.19 by Thorsten Kukuk
Thread-local storage support included.
For bug reporting instructions, please see:
<http://www.gnu.org/software/libc/bugs.html>.
---
Sorry for not trying the latest CVS. I have no access to outside CVS servers from my corporate network.
And judging from [glibc]/libc/nptl/ChangeLog on CvsWeb, nothing has changed in nptl during the last 2 days.

Glibc "./configure" switches (excluding "--*dir=" switches):
---
--disable-profile \
--enable-add-ons=nptl \
--with-tls \
--with-__thread \
--enable-kernel=2.6.9 \
--without-cvs \
--with-headers=/usr/src/linux-2.6.9/include
---
Glibc was built into rpm packages with the aid of rpm.

GCC version:
---
$ gcc -v
Reading specs from /usr/lib/gcc/i686-pc-linux-gnu/3.4.3/specs
Configured with: ../gcc-3.4.3/configure --host=i686-pc-linux-gnu --build=i686-pc-linux-gnu --target=i686-pc-linux-gnu --prefix=/usr --exec-prefix=/usr --bindir=/usr/bin --sbindir=/usr/sbin --sysconfdir=/etc --datadir=/usr/share --includedir=/usr/include --libdir=/usr/lib --libexecdir=/usr/lib --localstatedir=/var --sharedstatedir=/usr/com --mandir=/usr/share/man --infodir=/usr/share/info --enable-shared --enable-threads=posix --enable-__cxa_atexit --enable-clocale=gnu --enable-languages=c,c++
Thread model: posix
gcc version 3.4.3
---

Ld/Binutils version:
---
$ ld -v
GNU ld version 2.15.91.0.1 20040527
---

I hope all the provided information helps. If you need more info, please feel free to ask. Also feel free to request additional testing or investigation. Advice on how to write the patch myself would also be helpful.
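The thread handling described in the overview can be sketched in a few lines. This is a minimal illustration and not the attached testcase: the names Module, worker and shutdown() are hypothetical (the report's own code uses pureShutdown::func, modShutdown() and g_tid), and the sketch leaves out dlopen()/dlclose() entirely, so it shows only the cancel-and-join sequence that works when called explicitly.

```cpp
#include <cassert>
#include <pthread.h>
#include <time.h>

// Hypothetical stand-in for the module's worker thread: it loops forever
// and polls for a pending cancellation request.
static void* worker(void*) {
    int old_type;
    // Deferred is the default cancel type; set explicitly for clarity.
    pthread_setcanceltype(PTHREAD_CANCEL_DEFERRED, &old_type);
    for (;;) {
        pthread_testcancel();                    // explicit cancellation point
        struct timespec ts = {0, 10 * 1000 * 1000};
        nanosleep(&ts, nullptr);                 // also a cancellation point
    }
    return nullptr;                              // never reached
}

// Hypothetical stand-in for the global object living in the module.
struct Module {
    pthread_t tid;
    bool joined = false;
    bool cancelled = false;

    Module() {
        int rc = pthread_create(&tid, nullptr, worker, nullptr);
        assert(rc == 0);
        (void)rc;
    }

    // Cancel the worker and wait for it. Returns true if the thread ended
    // through cancellation. Calling it a second time is a no-op.
    bool shutdown() {
        if (!joined) {
            void* result = nullptr;
            pthread_cancel(tid);
            pthread_join(tid, &result);
            cancelled = (result == PTHREAD_CANCELED);
            joined = true;
        }
        return cancelled;
    }

    // In the report, this is the call that hangs when it runs during dlclose().
    ~Module() { shutdown(); }
};
```

Constructing a Module and calling shutdown() on it corresponds to the working "./test ./libtestmod.so foo" case; the hang in the report only appears when the same sequence runs from the destructor while the module is being unloaded.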
Created attachment 350
Testcase for the bug.
I've tested the same testcase on another system with kernel 2.4.20 and glibc 2.3.2 with linuxthreads. The program ran just fine. The test was conducted today, 2004-01-13.

The output:
---
$ ./run
loading ./libtestmod.so now
Constructor called
pureShutdown::func(void*) called
hi there, new thread is up and running, thread id is 16386
Constructor finished
= thread 16386 is still running...
= thread 16386 is still running...
= thread 16386 is still running...
= thread 16386 is still running...
unloading ./libtestmod.so now
Destructor called
modShutdown() called
bye, cancelling down thread 16386
running pthread_join(g_tid, &result) ...
returned from pthread_join(g_tid, &result) !
all's well that end's well
modShutdown() finished
Destructor finished
$
---

System information:
CPU: Intel(R) Xeon(TM) CPU 2.80GHz
Distribution: SuSE Linux 8.2
Kernel: 2.4.20-64GB-SMP, from the SuSE distribution
Glibc version:
---
$ /lib/libc.so.6
GNU C Library stable release version 2.3.2, by Roland McGrath et al.
Copyright (C) 2003 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.
Compiled by GNU CC version 3.3 20030226 (prerelease) (SuSE Linux).
Compiled on a Linux 2.4.20 system on 2003-03-13.
Available extensions:
    GNU libio by Per Bothner
    crypt add-on version 2.1 by Michael Glad and others
    linuxthreads-0.10 by Xavier Leroy
    NoVersion patch for broken glibc 2.0 binaries
    BIND-8.2.3-T5B
    libthread_db work sponsored by Alpha Processor Inc
    NIS(YP)/NIS+ NSS modules 0.19 by Thorsten Kukuk
Report bugs using the `glibcbug' script to <bugs@gnu.org>.
---

GCC version:
---
$ gcc -v
Reading specs from /usr/lib/gcc-lib/i486-suse-linux/3.3/specs
Configured with: ../configure --enable-threads=posix --prefix=/usr --with-local-prefix=/usr/local --infodir=/usr/share/info --mandir=/usr/share/man --libdir=/usr/lib --enable-languages=c,c++,f77,objc,java,ada --disable-checking --enable-libgcj --with-gxx-include-dir=/usr/include/g++ --with-slibdir=/lib --with-system-zlib --enable-shared --enable-__cxa_atexit i486-suse-linux
Thread model: posix
gcc version 3.3 20030226 (prerelease) (SuSE Linux)
---

Ld/Binutils version:
---
$ ld -v
GNU ld version 2.13.90.0.18 20030121 (SuSE Linux)
---
I've investigated the problem further and found the (approximate) place in libc where the hang occurs. It is in the file nptl/pthread_join.c, line 86, which looks like
---
      /* Wait for the child.  */
      lll_wait_tid (pd->tid);
---
lll_wait_tid is a macro containing assembler code which I don't understand so far:
---
/* The kernel notifies a process with uses CLONE_CLEARTID via futex
   wakeup when the clone terminates.  The memory location contains the
   thread ID while the clone is running and is reset to zero
   afterwards.  The macro parameter must not have any side effect.  */
#define lll_wait_tid(tid) \
  do { \
    int __ignore; \
    register __typeof (tid) _tid asm ("edx") = (tid); \
    if (_tid != 0) \
      __asm __volatile (LLL_EBX_LOAD \
                        "1:\tmovl %1, %%eax\n\t" \
                        LLL_ENTER_KERNEL \
                        "cmpl $0, (%%ebx)\n\t" \
                        "jne,pn 1b\n\t" \
                        LLL_EBX_LOAD \
                        : "=&a" (__ignore) \
                        : "i" (SYS_futex), LLL_EBX_REG (&tid), "S" (0), \
                          "c" (FUTEX_WAIT), "d" (_tid), \
                          "i" (offsetof (tcbhead_t, sysinfo))); \
  } while (0)
---
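For readers who find the inline assembler hard to follow, the loop in the macro can be approximated in portable C++ on Linux. This is a rough sketch of what the comment above describes, not glibc's code: futex(), wait_tid(), clear_and_wake() and demo_wait_tid() are hypothetical helpers, and an ordinary thread plays the part the kernel plays when a clone exits.

```cpp
#include <cassert>
#include <chrono>
#include <climits>
#include <thread>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

// Thin wrapper around the raw futex system call.
static long futex(int* addr, int op, int val) {
    return syscall(SYS_futex, addr, op, val, nullptr, nullptr, 0);
}

// Rough equivalent of lll_wait_tid: sleep until *tid_word reads as zero.
static void wait_tid(int* tid_word, int tid) {
    if (tid == 0)
        return;  // thread already gone, nothing to wait for
    do {
        // FUTEX_WAIT sleeps only while *tid_word still equals `tid`;
        // it returns at once if the word has already changed.
        futex(tid_word, FUTEX_WAIT, tid);
    } while (__atomic_load_n(tid_word, __ATOMIC_SEQ_CST) != 0);
}

// Plays the kernel's role at thread exit: clear the word, wake the waiters.
static void clear_and_wake(int* tid_word) {
    __atomic_store_n(tid_word, 0, __ATOMIC_SEQ_CST);
    futex(tid_word, FUTEX_WAKE, INT_MAX);
}

// Small demonstration: returns the final value of the word.
static int demo_wait_tid() {
    int word = 1234;  // arbitrary nonzero "thread ID", for illustration only
    std::thread exiting([&word] {
        std::this_thread::sleep_for(std::chrono::milliseconds(50));
        clear_and_wake(&word);
    });
    wait_tid(&word, 1234);  // blocks until the word has been cleared
    exiting.join();
    return __atomic_load_n(&word, __ATOMIC_SEQ_CST);
}
```

Seen this way, pthread_join() blocking at this line simply means that the thread ID word was never cleared, that is, the cancelled thread never got as far as exiting.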
This is the same deadlock as has been fixed by:

2004-07-07  Ulrich Drepper  <drepper@redhat.com>

        * elf/dl-fini.c (_dl_fini): Move the unlock of the ld.so lock
        before the loop running the destructors.

for destructors that are run at exit time.

ATM ld.so holds dl_load_lock when running shared library destructors, and the same lock is used indirectly by libgcc_s.so when unwinding. If you call pthread_cancel in a shared library destructor that is run during dlclose, dl_load_lock is held in the thread calling pthread_cancel, but the cancelled thread needs to be unwound. Since you also call pthread_join in the same destructor, it waits for the cancelled thread, while the cancelled thread waits until dl_load_lock is released (which would happen when dlclose is about to return). The two threads deadlock.

The fix is to avoid running shared library destructors with dl_load_lock held, but that's certainly not trivial.
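The lock ordering described above can be modelled without ld.so at all. The following is a simplified sketch under stated assumptions: fake_load_lock, destructor_waits_for_worker() and close_module() are hypothetical names, a recursive timed mutex stands in for dl_load_lock, and a 200 ms timeout stands in for the real deadlock so that the sketch terminates.

```cpp
#include <cassert>
#include <chrono>
#include <mutex>
#include <thread>

// Stand-in for ld.so's dl_load_lock (not the real glibc object).
static std::recursive_timed_mutex fake_load_lock;

// Models a destructor that joins a worker which itself needs the lock,
// as the cancelled thread does while it is being unwound.
// Returns true if the worker got the lock, false if it timed out
// (in the real program that timeout would be a permanent hang).
static bool destructor_waits_for_worker() {
    bool worker_got_lock = false;
    std::thread worker([&worker_got_lock] {
        if (fake_load_lock.try_lock_for(std::chrono::milliseconds(200))) {
            worker_got_lock = true;
            fake_load_lock.unlock();
        }
    });
    worker.join();  // corresponds to pthread_join() in the destructor
    return worker_got_lock;
}

// Models dlclose(): take the lock, run the destructor, release the lock.
// With unlock_around_destructors set, the lock is dropped while the
// destructor runs and re-taken afterwards.
static bool close_module(bool unlock_around_destructors) {
    fake_load_lock.lock();
    if (unlock_around_destructors)
        fake_load_lock.unlock();
    bool worker_progressed = destructor_waits_for_worker();
    if (unlock_around_destructors)
        fake_load_lock.lock();
    fake_load_lock.unlock();
    return worker_progressed;
}
```

In this model close_module(false) reports that the worker could not make progress, while close_module(true) lets it through. Whether dropping the lock at that point is safe for the real loader's data structures is a separate question that the sketch does not address.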
I've constructed a patch that unlocks dl_load_lock just before running the destructors and locks it again right afterwards. My testcase now runs properly, but I don't know whether my patch has any side effects. So, dear glibc developers, please look at it and either confirm that the patch is correct or point out where I am wrong. Thanks.

The patch:
---
--- glibc/elf/dl-close.c.orig 2005-01-09 09:27:52.000000000 +0100
+++ glibc/elf/dl-close.c 2005-01-17 15:04:52.000000000 +0100
@@ -265,6 +265,10 @@
     }
   assert (new_opencount[0] == 0);
 
+  /* Release dl_load_lock during running destructors,
+     like in dl-fini.c.  */
+  __rtld_lock_unlock_recursive (GL(dl_load_lock));
+
   /* Call all termination functions at once.  */
 #ifdef SHARED
   bool do_audit = GLRO(dl_naudit) > 0 && !GL(dl_ns)[ns]._ns_loaded->l_auditing;
@@ -389,6 +393,9 @@
       assert (imap->l_type == lt_loaded || imap->l_opencount > 0);
     }
 
+  /* Destructors finished, acquire dl_load_lock again.  */
+  __rtld_lock_lock_recursive (GL(dl_load_lock));
+
 #ifdef SHARED
   /* Auditing checkpoint: we will start deleting objects.  */
   if (__builtin_expect (do_audit, 0))
---
Created attachment 369
Proposed patch, the first try. This is the same patch as listed in comment #5.
Nothing related to C++, exceptions, and dlopen can be critical.
Hey, guys, I found that this testcase can't trigger the bug in recent glibc. How was this bug finally resolved? I have searched the glibc git commit messages for dl_load_lock, but nothing showed up.
(In reply to comment #4)
> This is the same deadlock as has been fixed by:
> 2004-07-07  Ulrich Drepper  <drepper@redhat.com>
>
>         * elf/dl-fini.c (_dl_fini): Move the unlock of the ld.so lock
>         before the loop running the destructors.
... [snip] ...

Ah, sorry, overlooked this comment.
I checked glibc 2.15 and the testcase works for me. I assume this was fixed with:

2006-10-27  Ulrich Drepper  <drepper@redhat.com>

        * elf/dl-close.c (_dl_close_worker): Renamed from _dl_close and
        split out locking and parameter checking.
        (_dl_close): Call _dl_close_worker after locking and checking.
        * elf/dl-open.c (_dl_open): Call _dl_close_worker instead of
        _dl_close.
        * elf/Makefile: Add rules to build and run tst-thrlock.
        * elf/tst-thrlock.c: New file.

If you still have the problem with glibc 2.15, please reopen and tell us a better way to reproduce.