Bug 19402

Summary: Deadlock with robust shared mutex and asynchronous termination
Product: glibc Reporter: raoulgough
Component: nptlAssignee: Not yet assigned to anyone <unassigned>
Severity: normal CC: carlos, drepper.fsp, fweimer, raoulgough, triegel
Priority: P2 Flags: fweimer: security-
Version: unspecified   
Target Milestone: 2.25   
See Also: https://sourceware.org/bugzilla/show_bug.cgi?id=19004
Host: Target:
Build: Last reconfirmed:
Attachments: Test program for shared robust mutex with asynchronous termination

Description raoulgough 2015-12-24 09:15:17 UTC
Created attachment 8863 [details]
Test program for shared robust mutex with asynchronous termination

I have a test program that sets up a process-shared robust mutex, and then forks a child that repeatedly locks and unlocks the mutex in a loop while the parent concurrently sends a signal to kill the child. It repeats this, forking a new child each time, until the required number of iterations complete and the program exists or the program stops making progress for some reason.

Steps to reproduce
Compile and run the attached test program on a Linux system with at least two CPU cores.

$ gcc -Wall -g  test_robust.c  -lpthread -o test_robust
$ ./test_robust
iteration 0
  C-c C-c

Actual results
The program stalls, on my system usually after less than 200 iterations.

Expected results
I expect the program always to run to completion in a reasonable amount of time. The stalled state shouldn't exist because the robust mutex was either locked or unlocked at the time the child terminated, and so pthread_mutex_lock in the parent should return either EOWNERDEAD or no error, respectively.

Test platform
Dual-core Intel(R) Core(TM)2 Duo CPU     P8600  @ 2.40GHz
Linux debian8 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt20-1+deb8u1 (2015-12-14) x86_64 GNU/Linux
Hosted under VMWare Fusion 4.1.4 on Mac OS X 10.6.8

Results are the same both with the stock debian jessie glibc (2.19-18+deb8) and with a glibc built on this system from git glibc sources as of commit 2cf3e1aa7477b813ca3ebb901003a7d44f970218 (2015-12-22 13:39:19 UTC).

I have seen a similar test program exhibit the same problem on bare metal Red Hat Linux systems, so I don't think it is VMWare or Debian specific.

Further details

I have done a few things to investigate this further. Firstly, in the stalled state the child process is definitely terminated but the parent process is stuck in a futex system call - see example session below:

$ make test_robust
cc -Wall -g -I/home/raoul/install/include -I /usr/lib/gcc/x86_64-linux-gnu/4.9.2/include -I /usr/include -I /usr/include/x86_64-linux-gnu -L/home/raoul/install/lib -Wl,-rpath=/home/raoul/install/lib -Wl,--dynamic-linker=/home/raoul/install/lib/ld-2.22.90.so  test_robust.c  -lpthread -o test_robust
$ ./test_robust &
[1] 25446
$ iteration 0
iteration 100

$ ps
  PID TTY          TIME CMD
25144 pts/2    00:00:00 bash
25446 pts/2    00:00:00 test_robust
25549 pts/2    00:00:00 ps
$ cat /proc/$!/stack
[<ffffffff810d18cc>] futex_wait_queue_me+0xcc/0x130
[<ffffffff810d1ba2>] futex_wait+0x162/0x260
[<ffffffff8108958a>] posix_cpu_timers_exit+0x1a/0x30
[<ffffffff810d3d22>] do_futex+0xf2/0xb60
[<ffffffff810a76b2>] remove_wait_queue+0x12/0x50
[<ffffffff810694fc>] do_wait+0x11c/0x230
[<ffffffff810d47fe>] SyS_futex+0x6e/0x150
[<ffffffff81068410>] child_wait_callback+0x0/0x60
[<ffffffff81513ccd>] system_call_fast_compare_end+0x10/0x15
[<ffffffffffffffff>] 0xffffffffffffffff

Secondly, it is possible to attach gdb to the parent and inspect the state of the mutex in the stalled state, which (I think) shows the mutex is "locked" by the terminated child process:

$ gdb -p $!
GNU gdb (Debian 7.7.1+dfsg-5) 7.7.1
Copyright (C) 2014 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
Find the GDB manual and other documentation resources online at:
For help, type "help".
Type "apropos word" to search for commands related to "word".
Attaching to process 25446
Reading symbols from /home/raoul/c/test_robust...done.
Reading symbols from /home/raoul/install/lib/libpthread.so.0...done.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/home/raoul/install/lib/libthread_db.so.1".
Loaded symbols for /home/raoul/install/lib/libpthread.so.0
Reading symbols from /home/raoul/install/lib/libc.so.6...done.
Loaded symbols for /home/raoul/install/lib/libc.so.6
Reading symbols from /home/raoul/install/lib/ld-2.22.90.so...done.
Loaded symbols for /home/raoul/install/lib/ld-2.22.90.so
__lll_robust_lock_wait ()
    at ../sysdeps/unix/sysv/linux/x86_64/lowlevelrobustlock.S:85
85		movl	(%rdi), %eax
(gdb) frame 3
#3  0x0000000000401190 in main (argc=1, argv=0x7ffde4bc39d8)
    at test_robust.c:137
137	      do_lock(shared);
(gdb) set print pretty
(gdb) print *shared
$1 = {
  mutex = {
    __data = {
      __lock = -2147458100, 
      __count = 1, 
      __owner = 0, 
      __nusers = 0, 
      __kind = 656, 
      __spins = 0, 
      __elision = 0, 
      __list = {
        __prev = 0x0, 
        __next = 0x0
    __size = "\314c\000\200\001", '\000' <repeats 11 times>, "\220\002", '\000' <repeats 21 times>, 
    __align = 6442476492
  child_state = 2, 
  child_pid = 25548, 
  reclaims = 95
(gdb) print/x shared->mutex.__data.__lock
$2 = 0x800063cc
(gdb) print shared->mutex.__data.__lock & 0xffff
$3 = 25548
(gdb) print shared->child_pid
$4 = 25548
(gdb) quit
A debugging session is active.

	Inferior 1 [process 25446] will be detached.

Quit anyway? (y or n) y
Detaching from program: /home/raoul/c/test_robust, process 25446
$ kill %%; wait %%
[1]+  Terminated              ./test_robust

Finally, it is possible to parameterize the test program such that it
sends the child a SIGABRT instead of a SIGTERM, which causes it to
dump core. This allows us to see where the child process was executing
when it terminated. In my case, I have no corefile pattern on the
system so the same corefile gets repeatedly overwritten until the
parent stalls:

$ ulimit -c unlimited
$ ./test_robust 200 6 &
[1] 26200
$ iteration 0
iteration 100

$ gdb test_robust core
Program terminated with signal SIGABRT, Aborted.
#0  0x00007f52d0a23735 in __pthread_mutex_lock_full (mutex=0x7f52d0e55000)
    at ../nptl/pthread_mutex_lock.c:256
256		  oldval = LLL_ROBUST_MUTEX_LOCK (mutex, id);
(gdb) disas
Dump of assembler code for function __pthread_mutex_lock_full:
   0x00007f52d0a23710 <+768>:	and    $0x40000000,%eax
   0x00007f52d0a23715 <+773>:	jne    0x7f52d0a237b8 <__pthread_mutex_lock_full+936>
   0x00007f52d0a2371b <+779>:	and    $0x3fffffff,%edx
   0x00007f52d0a23721 <+785>:	cmp    %edx,%ebp
   0x00007f52d0a23723 <+787>:	je     0x7f52d0a23815 <__pthread_mutex_lock_full+1029>
   0x00007f52d0a23729 <+793>:	mov    %ebp,%edi
   0x00007f52d0a2372b <+795>:	mov    %r10d,%eax
   0x00007f52d0a2372e <+798>:	mov    %r9d,%esi
   0x00007f52d0a23731 <+801>:	lock cmpxchg %edi,(%rbx)
=> 0x00007f52d0a23735 <+805>:	je     0x7f52d0a2374d <__pthread_mutex_lock_full+829>
   0x00007f52d0a23737 <+807>:	lea    (%rbx),%rdi
   0x00007f52d0a2373a <+810>:	sub    $0x80,%rsp
   0x00007f52d0a23741 <+817>:	callq  0x7f52d0a29e50 <__lll_robust_lock_wait>
End of assembler dump.
(gdb) info reg eflags
eflags         0x246	[ PF ZF IF ]
(gdb) where
#0  0x00007f52d0a23735 in __pthread_mutex_lock_full (mutex=0x7f52d0e55000)
    at ../nptl/pthread_mutex_lock.c:256
#1  0x0000000000400e04 in run_child (shared=0x7f52d0e55000)
    at test_robust.c:62
#2  0x0000000000400f0b in do_fork (shared=0x7f52d0e55000) at test_robust.c:81
#3  0x000000000040119c in main (argc=3, argv=0x7fff630d1cd8)
    at test_robust.c:138
(gdb) frame 3
#3  0x000000000040119c in main (argc=3, argv=0x7fff630d1cd8)
    at test_robust.c:138
138	      do_fork(shared);
(gdb) set print pretty
(gdb) print *shared
$1 = {
  mutex = {
    __data = {
      __lock = 26301, 
      __count = 1, 
      __owner = 0, 
      __nusers = 0, 
      __kind = 656, 
      __spins = 0, 
      __elision = 0, 
      __list = {
        __prev = 0x0, 
        __next = 0x0
    __size = "\275f\000\000\001", '\000' <repeats 11 times>, "\220\002", '\000' <repeats 21 times>, 
    __align = 4294993597
  child_state = 2, 
  child_pid = 26301, 
  reclaims = 89
(gdb) print/x shared->mutex
$2 = {
  __data = {
    __lock = 0x66bd, 
    __count = 0x1, 
    __owner = 0x0, 
    __nusers = 0x0, 
    __kind = 0x290, 
    __spins = 0x0, 
    __elision = 0x0, 
    __list = {
      __prev = 0x0, 
      __next = 0x0
  __size = {0xbd, 0x66, 0x0, 0x0, 0x1, 0x0 <repeats 11 times>, 0x90, 0x2, 
    0x0 <repeats 22 times>}, 
  __align = 0x1000066bd

Here is another slightly different case where the child terminated after the LLL_ROBUST_MUTEX_LOCK, but before (or during) the ENQUEUE_MUTEX:

Program terminated with signal SIGABRT, Aborted.
#0  0x00007f892a4f0777 in __pthread_mutex_lock_full (mutex=0x7f892a922000)
    at ../nptl/pthread_mutex_lock.c:272
272	      ENQUEUE_MUTEX (mutex);
(gdb) disas
   0x00007f892a4f075f <+847>:	movl   $0x1,0x4(%rbx)
   0x00007f892a4f0766 <+854>:	mov    %fs:0x2e0,%rax
   0x00007f892a4f076f <+863>:	and    $0xfffffffffffffffe,%rax
   0x00007f892a4f0773 <+867>:	mov    %r8,-0x8(%rax)
=> 0x00007f892a4f0777 <+871>:	mov    %fs:0x2e0,%rax
   0x00007f892a4f0780 <+880>:	mov    %rax,0x20(%rbx)

My theory is that there are at least some points in the execution of pthread_mutex_lock and/or pthread_mutex_unlock where an asynchronous termination signal can leave the mutex in a locked state which the kernel does not resolve during termination. In the example termination points above, the LLL_ROBUST_MUTEX_LOCK has completed but the mutex has not yet been added to the thread-specific robust mutex list. This kind of termination is not likely, but is possible via an external process sending a signal (as in this test) or another thread in the same program encountering a SEGV or such like.
Comment 1 Carlos O'Donell 2016-12-16 01:09:18 UTC
I can reproduce this issue with a cleaned up testcase that uses C11 atomics to avoid the data races in the orignal test case.

Torvald Riegel, Florian Weimer and I discussed this and the final analysis is that this is clearly a flaw in the glibc handling of the robust list.

The sequence of events isn't even that complicated.

-> set_robust_list
-> setup shared mapping, setup shared mutex
-> lock mutex (eunqueues it onto the robust list)
-> fork child
-> unlock mutex (dequeues it from the list)

Note: The deuque here removes the mutex from the robust list copy in the parent, but he child's copy of the robust list head is unchanged and pointing to the mutex. However, that mutex now has it's own ->next pointer set to null so it is no longer a valid list according to the kernel.

-> locks mutex (owner not dead, doesn't own it, so locks it)

-> Kills P2

-> Inspects P2's robust list, find it contains a head that points to a mutex that points to NULL. This is a dangling list and the cleanup is ignored.

-> locks mutex (and deadlocks).

At this point no recovery is possible.

The key issue is that the mutex itself, in shared memory, is part of the circular linked list itself.

The robust list head itself is stored in the process/thread-local storage, but the rest of the list is shared, made up of the mutexes that are involved in the list.

Therefore P1 and P2 share parts of their own robust lists via the shared mutex's afer the fork. This is a flaw really because the mutexes can only be owned by one thread at a time.

When P2 is forked we need to do some kind of cleanup of the head of the list, probably clearing head (linking it back to itself to make an empty list) just after forking.
Comment 2 Sourceware Commits 2017-01-13 16:19:41 UTC
This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "GNU C Library master sources".

The branch, master has been updated
       via  8e31cafb268938729a1314806a924d73fb1991c5 (commit)
       via  65810f0ef05e8c9e333f17a44e77808b163ca298 (commit)
      from  f32941d80c7f532031061f8dd4704fab9c275cfe (commit)

Those revisions listed above that are new to this repository have
not appeared on any other notification email; so we list those
revisions in full, below.

- Log -----------------------------------------------------------------

commit 8e31cafb268938729a1314806a924d73fb1991c5
Author: Torvald Riegel <triegel@redhat.com>
Date:   Wed Dec 21 13:37:19 2016 +0100

    Clear list of acquired robust mutexes in the child process after forking.
    Robust mutexes acquired at the time of a call to fork() do not remain
    acquired by the forked child process.  We have to clear the list of
    acquired robust mutexes before registering this list with the kernel;
    otherwise, if some of the robust mutexes are process-shared, the parent
    process can alter the child's robust mutex list, which can lead to
    deadlocks or even modification of memory that may not be occupied by a
    mutex anymore.
    	[BZ #19402]
    	* sysdeps/nptl/fork.c (__libc_fork): Clear list of acquired robust


commit 65810f0ef05e8c9e333f17a44e77808b163ca298
Author: Torvald Riegel <triegel@redhat.com>
Date:   Thu Dec 22 10:20:43 2016 +0100

    robust mutexes: Fix broken x86 assembly by removing it
    lll_robust_unlock on i386 and x86_64 first sets the futex word to
    FUTEX_WAITERS|0 before calling __lll_unlock_wake, which will set the
    futex word to 0.  If the thread is killed between these steps, then the
    futex word will be FUTEX_WAITERS|0, and the kernel (at least current
    upstream) will not set it to FUTEX_OWNER_DIED|FUTEX_WAITERS because 0 is
    not equal to the TID of the crashed thread.
    The lll_robust_lock assembly code on i386 and x86_64 is not prepared to
    deal with this case because the fastpath tries to only CAS 0 to TID and
    not FUTEX_WAITERS|0 to TID; the slowpath simply waits until it can CAS 0
    to TID or the futex_word has the FUTEX_OWNER_DIED bit set.
    This issue is fixed by removing the custom x86 assembly code and using
    the generic C code instead.  However, instead of adding more duplicate
    code to the custom x86 lowlevellock.h, the code of the lll_robust* functions
    is inlined into the single call sites that exist for each of these functions
    in the pthread_mutex_* functions.  The robust mutex paths in the latter
    have been slightly reorganized to make them simpler.
    This patch is meant to be easy to backport, so C11-style atomics are not
    	[BZ #20985]
    	* nptl/Makefile: Adapt.
    	* nptl/pthread_mutex_cond_lock.c (LLL_ROBUST_MUTEX_LOCK): Remove.
    	* nptl/pthread_mutex_lock.c (LLL_ROBUST_MUTEX_LOCK): Remove.
    	(__pthread_mutex_lock_full): Inline lll_robust* functions and adapt.
    	* nptl/pthread_mutex_timedlock.c (pthread_mutex_timedlock): Inline
    	lll_robust* functions and adapt.
    	* nptl/pthread_mutex_unlock.c (__pthread_mutex_unlock_full): Likewise.
    	* sysdeps/nptl/lowlevellock.h (__lll_robust_lock_wait,
    	__lll_robust_lock, lll_robust_cond_lock, __lll_robust_timedlock_wait,
    	__lll_robust_timedlock, __lll_robust_unlock): Remove.
    	* sysdeps/unix/sysv/linux/i386/lowlevellock.h (lll_robust_lock,
    	lll_robust_cond_lock, lll_robust_timedlock, lll_robust_unlock): Remove.
    	* sysdeps/unix/sysv/linux/x86_64/lowlevellock.h (lll_robust_lock,
    	lll_robust_cond_lock, lll_robust_timedlock, lll_robust_unlock): Remove.
    	* sysdeps/unix/sysv/linux/sparc/lowlevellock.h (__lll_robust_lock_wait,
    	__lll_robust_lock, lll_robust_cond_lock, __lll_robust_timedlock_wait,
    	__lll_robust_timedlock, __lll_robust_unlock): Remove.
    	* nptl/lowlevelrobustlock.c: Remove file.
    	* nptl/lowlevelrobustlock.sym: Likewise.
    	* sysdeps/unix/sysv/linux/i386/lowlevelrobustlock.S: Likewise.
    	* sysdeps/unix/sysv/linux/x86_64/lowlevelrobustlock.S: Likewise.


Summary of changes:
 ChangeLog                                          |   33 ++
 nptl/Makefile                                      |    4 +-
 nptl/lowlevelrobustlock.c                          |  136 ---------
 nptl/lowlevelrobustlock.sym                        |    6 -
 nptl/pthread_mutex_cond_lock.c                     |    6 +-
 nptl/pthread_mutex_lock.c                          |   79 ++++--
 nptl/pthread_mutex_timedlock.c                     |  106 ++++++--
 nptl/pthread_mutex_unlock.c                        |   18 +-
 sysdeps/nptl/fork.c                                |   20 +-
 sysdeps/nptl/lowlevellock.h                        |   68 -----
 sysdeps/unix/sysv/linux/i386/lowlevellock.h        |   60 ----
 sysdeps/unix/sysv/linux/i386/lowlevelrobustlock.S  |  232 ---------------
 sysdeps/unix/sysv/linux/sparc/lowlevellock.h       |   40 ---
 sysdeps/unix/sysv/linux/x86_64/lowlevellock.h      |   74 -----
 .../unix/sysv/linux/x86_64/lowlevelrobustlock.S    |  306 --------------------
 15 files changed, 205 insertions(+), 983 deletions(-)
 delete mode 100644 nptl/lowlevelrobustlock.c
 delete mode 100644 nptl/lowlevelrobustlock.sym
 delete mode 100644 sysdeps/unix/sysv/linux/i386/lowlevelrobustlock.S
 delete mode 100644 sysdeps/unix/sysv/linux/x86_64/lowlevelrobustlock.S
Comment 3 Torvald Riegel 2017-01-13 17:14:32 UTC
Fixed in master.