Unexplainable deadlock detection assert in libpthread

Szabolcs Nagy szabolcs.nagy@arm.com
Fri Nov 24 12:03:00 GMT 2017


On 24/11/17 11:12, Schmitz, Arne wrote:
> Hi everyone,
> 
> In very rare cases our software gets a signal ABORT due to an assertion in pthread_mutex_lock. Unfortunately it is not easily reproducible and it occurs infrequently. We seek some support in understanding what happens, because the common explanations fail in this case. We are using libc 2.21 and libpthread 2.21 on an aarch64-gnu-linux platform.
> 
> We see the following information in our logging:
> 
> natp.exe[4300]: Informational: natp.exe: pthread_mutex_lock.c:352: __pthread_mutex_lock_full: Assertion `(-(e)) != 35 || (kind != PTHREAD_MUTEX_ERRORCHECK_NP && kind != PTHREAD_MUTEX_RECURSIVE_NP)' failed.
> 
> Reannotated call stacks all share the function Logger::hasConsumer (#8):
> 
> [Switching to thread 1 (LWP 4864)]
> #0  0x0000007f84548a5c in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:55
> 55           ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
> #0  0x0000007f84548a5c in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:55
> #1  0x0000007f8454bdb0 in __GI_abort () at abort.c:89
> #2  0x0000007f845424b0 in __assert_fail_base (fmt=0x7f8462dc38 <error: Cannot access memory at address 0x7f8462dc38>,
>     assertion=assertion@entry=0x7f844c6f80 <error: Cannot access memory at address 0x7f844c6f80>, file=file@entry=0x7f844c7228 <error: Cannot access memory at address 0x7f844c7228>,
>     line=line@entry=352, function=function@entry=0x7f844c7080 <error: Cannot access memory at address 0x7f844c7080>) at assert.c:92
> #3  0x0000007f84542558 in __GI___assert_fail (assertion=assertion@entry=0x7f844c6f80 <error: Cannot access memory at address 0x7f844c6f80>,
>     file=file@entry=0x7f844c7228 <error: Cannot access memory at address 0x7f844c7228>, line=line@entry=352,
>     function=function@entry=0x7f844c7080 <error: Cannot access memory at address 0x7f844c7080>) at assert.c:101
> #4  0x0000007f844bd3c0 in __pthread_mutex_lock_full (mutex=0x7f842cc740) at pthread_mutex_lock.c:350
> #5  0x0000007f844bd4c8 in __GI___pthread_mutex_lock (mutex=0x7f842cc740) at pthread_mutex_lock.c:73
> #6  0x0000007f846c9c5c in StarRec_PAL_MutexWait ()
> #7  0x0000007f82c50f04 in nuance::common::Mutex::lock() ()
> #8  0x0000007f82be8e1c in nuance::common::logging::Logger::hasConsumer(nuance::common::ILogModule*, nuance::common::LogZone) ()
> #9  0x0000007f82be87cc in nuance::common::logging::Logger::logText(nuance::common::ILogModule const*, nuance::common::LogZone, char const*, char const*, int) ()
> #10 0x0000007f810c4868 in DragonDriveCommon_monitorCallback ()
> #11 0x0000007f847b9d78 in sGCFRouter_RouterThread ()
> #12 0x0000007f846bd47c in fpThreadRoutine_setThreadName ()
> #13 0x0000007f844bae50 in start_thread (arg=0x7f8480b000) at pthread_create.c:338
> #14 0x0000007f845df3c0 in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:89
> (gdb) quit
> 
> Source of the assertion is this block (from pthread_mutex_lock.c, function __pthread_mutex_lock_full):
> 
>         oldval = atomic_compare_and_exchange_val_acq (&mutex->__data.__lock,
>                                                       newval, 0);
> 
>         if (oldval != 0)
>           {
>             /* The mutex is locked.  The kernel will now take care of
>                everything.  */
>             int private = (robust
>                            ? PTHREAD_ROBUST_MUTEX_PSHARED (mutex)
>                            : PTHREAD_MUTEX_PSHARED (mutex));
>             INTERNAL_SYSCALL_DECL (__err);
>             int e = INTERNAL_SYSCALL (futex, __err, 4, &mutex->__data.__lock,
>                                       __lll_private_flag (FUTEX_LOCK_PI,
>                                                           private), 1, 0);
> 
>             if (INTERNAL_SYSCALL_ERROR_P (e, __err)
>                 && (INTERNAL_SYSCALL_ERRNO (e, __err) == ESRCH
>                     || INTERNAL_SYSCALL_ERRNO (e, __err) == EDEADLK))
>               {
> ===>            assert (INTERNAL_SYSCALL_ERRNO (e, __err) != EDEADLK
>                         || (kind != PTHREAD_MUTEX_ERRORCHECK_NP
>                             && kind != PTHREAD_MUTEX_RECURSIVE_NP));
> 
> 
> So the assertion means that the syscall futex returned with EDEADLK. This is also confirmed by the state of the stack variables in the core file.
> 
> From the pthread source before that block (see below) we can be sure that it cannot be a "simple" recursive mutex call, because that would have been detected upfront by the pthread library based on the thread ID. And would not have resulted in an assertion (see marker). That has been confirmed by simple test programs that simulate a "simple" recursive mutex lock, these do not get the "abort", but pthreadmutexlock returns an error instead.
> 
>         /* Check whether we already hold the mutex.  */
>        if (__glibc_unlikely ((oldval & FUTEX_TID_MASK) == id))
>           {
>             if (kind == PTHREAD_MUTEX_ERRORCHECK_NP)
>               {
>                 THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending, NULL);
> ==>             return EDEADLK;
>               }
> 
>             if (kind == PTHREAD_MUTEX_RECURSIVE_NP)
>               {
>                 THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending, NULL);
> 
>                 /* Just bump the counter.  */
>                 if (__glibc_unlikely (mutex->__data.__count + 1 == 0))
>                   /* Overflow of the counter.  */
>                   return EAGAIN;
> 
>                 ++mutex->__data.__count;
> 
>                 return 0;
>               }
>           }
> 
> 
> So we are looking for a deadlock involving the thread that gets aborted and other threads. To have such a deadlock it would require on top one more ressource to be acquired in addition to the mutex. But inspection of the code in Logger::hasConsumer() showed that no other ressource is used in the locked block. Also there is no way to leave that block without unlocking the mutex (it uses a ScopedLock Pattern).
> 
> Due to these non-findings we of course still do consider a memory corruption. But for this we have no other signs. Dynamic code analysis with valgrind also did not reveal obvious problems. So now we also start to consider much less common causes (listed in order of probablity):
> 
> Compiler Bug
> Bug in pthread
> Bug in Kernel futex deadlock detection.
> Bug in Processor or Memory
> We are pointed into that direction by another striking fact: In all core files (about 13 at the moment) the thread ID of the aborted thread always has a zero in the lower byte. While we are pretty sure that this is not "by accident" we do not know if this is the cause or a symptom.
> 
> The lack of a reproduction case to improve on the empirical side and the lack of findings in the code analysis leed us to the following questions:
> 
> Did anybody else ever see similar symptoms? We are using linux 3.18.40, pthread 2.21, arm aarch64 gcc-4.9.4

i haven't seen this issue, but that kernel+libc is very old
now in the aarch64 world.

if you think the thread id (0x1300) somehow matters, you can
try to look at the tid in the tcb, in oldval and in the
recursive mutex to see if their value is expected.

> Are there any known occurences of unexplainable deadlocks?
> Are there any known pthread library bugs?
> 
> Best regards,
> 
> Arne
> 



More information about the Libc-help mailing list