Bug 21083

Summary: robust timed lock may leave other waiters no chance to wake up
Product: glibc Reporter: ma.jiang
Component: nptlAssignee: Not yet assigned to anyone <unassigned>
Status: UNCONFIRMED ---    
Severity: normal CC: carlos, drepper.fsp
Priority: P2 Flags: fweimer: security-
Version: unspecified   
Target Milestone: ---   
See Also: https://sourceware.org/bugzilla/show_bug.cgi?id=14485
Host: Target:
Build: Last reconfirmed:
Attachments: patch to fix the bug

Description ma.jiang 2017-01-25 10:16:02 UTC
Created attachment 9772 [details]
patch to fix the bug

Hi all,
  I found the bug in robust timed mutex still existed in the latest glibc source, when porting some local patches to the newest glibc release.
  Provided there were 3 threads A,B and C, all of which were trying to manipulate the same robust mutex. See the following illustration.

A:                       B:                            C:
pthread_mutex_lock(m);
                         pthread_mutex_timedlock(m);
                                                       pthread_mutex_lock(m);                                       
pthread_mutex_unlock(m);
pthread_mutex_lock(m);

pthread_mutex_unlock(m);

First, A locked m. Then B and C try to lock m, and both get traped(wait in kernel). 
A unlock m, wake up B and clear the FUTEX_WAITER flag.
After B wake up(and have not do anything yet), A preempt B and lock m again. At this time, A will not add FUTEX_WAITER flag to m as it does not know whether there are other waiters.
B continue to run, it call gettimeofday and find the time is out. So, it just return ETIMEOUT without trying to set  FUTEX_WAITER flag.
A unlock m , and do not do the wake syscall as there is no FUTEX_WAITER.
C get stuck in kernel forever.

The core logic of this problem is that pthread_mutex_timedlock should set the FUTEX_WAITER flag before check timeout. So, the fix to the bug is very simple. We just need to move the code block which set the FUTEX_WAITER flag forward. I have created a patch to fix(see attachment) , is it ok for trunk?
Comment 1 Carlos O'Donell 2017-10-24 21:14:13 UTC
(In reply to ma.jiang from comment #0)
> The core logic of this problem is that pthread_mutex_timedlock should set
> the FUTEX_WAITER flag before check timeout. So, the fix to the bug is very
> simple. We just need to move the code block which set the FUTEX_WAITER flag
> forward. I have created a patch to fix(see attachment) , is it ok for trunk?

When you say "latest glibc sources" do you mean that you tested on master? How recently?
Comment 2 ma.jiang 2017-10-25 09:54:53 UTC
(In reply to Carlos O'Donell from comment #1)

> When you say "latest glibc sources" do you mean that you tested on master?
> How recently?
  The patch is for glibc-2.24. I checked the trunk code just now, the bug still exist. I have not write a testcase to reproduce the bug, it's quite hard(Although the bug logic is clear)...
Comment 3 Carlos O'Donell 2017-10-25 15:58:42 UTC
(In reply to ma.jiang from comment #2)
> (In reply to Carlos O'Donell from comment #1)
> 
> > When you say "latest glibc sources" do you mean that you tested on master?
> > How recently?
>   The patch is for glibc-2.24. I checked the trunk code just now, the bug
> still exist. I have not write a testcase to reproduce the bug, it's quite
> hard(Although the bug logic is clear)...

Thank you for the verification. I've been collecting a number of robust mutex fixes, and I think in the next development cycle we'll try to address the last remaining issues, including this one.
Comment 4 Torvald Riegel 2017-11-01 16:24:26 UTC
(In reply to ma.jiang from comment #2)
> (In reply to Carlos O'Donell from comment #1)
> 
> > When you say "latest glibc sources" do you mean that you tested on master?
> > How recently?
>   The patch is for glibc-2.24. I checked the trunk code just now, the bug
> still exist. I have not write a testcase to reproduce the bug, it's quite
> hard(Although the bug logic is clear)...

I believe this is fixed in master(/trunk).  Please have a look at what we do with assume_other_futex_waiters: If we ever potentially share FUTEX_WAITERS with another thread (C in your example), then after the futex operation we make sure to set this flag in any case.  This happens either when acquiring the mutex (in which case we'll notice this during unlock and wake up C), or when not being able to acquire it, in which case there must be another owner that interfered and we delegate wake-up responsibilities to that other owner by setting FUTEX_WAITERS.
Comment 5 ma.jiang 2017-11-06 11:17:36 UTC
(In reply to Torvald Riegel from comment #4)
> (In reply to ma.jiang from comment #2)
> > (In reply to Carlos O'Donell from comment #1)
> > 
> > > When you say "latest glibc sources" do you mean that you tested on master?
> > > How recently?
> >   The patch is for glibc-2.24. I checked the trunk code just now, the bug
> > still exist. I have not write a testcase to reproduce the bug, it's quite
> > hard(Although the bug logic is clear)...
> 
> I believe this is fixed in master(/trunk).  Please have a look at what we do
> with assume_other_futex_waiters: If we ever potentially share FUTEX_WAITERS
> with another thread (C in your example), then after the futex operation we
> make sure to set this flag in any case.  This happens either when acquiring
> the mutex (in which case we'll notice this during unlock and wake up C), or
> when not being able to acquire it, in which case there must be another owner
> that interfered and we delegate wake-up responsibilities to that other owner
> by setting FUTEX_WAITERS.
Yes, assume_other_futex_waiters can have FUTEX_WAITERS(after B get waked up), But the there are still no FUTEX_WAITERS in mutex->__data.__lock. The core logic is that pthread_mutex_timedlock could return ETIMEDOUT just before adding  FUTEX_WAITERS to mutex->__data.__lock. See attached code chunk. 

#if (!defined __ASSUME_FUTEX_CLOCK_REALTIME \
     || !defined lll_futex_timed_wait_bitset)
	  struct timeval tv;
	  struct timespec rt;

	  /* Get the current time.  */
	  (void) __gettimeofday (&tv, NULL);

	  /* Compute relative timeout.  */
	  rt.tv_sec = abstime->tv_sec - tv.tv_sec;
	  rt.tv_nsec = abstime->tv_nsec - tv.tv_usec * 1000;
	  if (rt.tv_nsec < 0)
	    {
	      rt.tv_nsec += 1000000000;
	      --rt.tv_sec;
	    }

	  /* Already timed out?  */
	  if (rt.tv_sec < 0)
	    return ETIMEDOUT;  //->>> We have returned here, with FUTEX_WAITERS in mutex->__data.__lock.
#endif

	  /* We cannot acquire the mutex nor has its owner died.  Thus, try
	     to block using futexes.  Set FUTEX_WAITERS if necessary so that
	     other threads are aware that there are potentially threads
	     blocked on the futex.  Restart if oldval changed in the
	     meantime.  */
	  if ((oldval & FUTEX_WAITERS) == 0)
	    {
	      if (atomic_compare_and_exchange_bool_acq (&mutex->__data.__lock,
							oldval | FUTEX_WAITERS,
							oldval)
		  != 0)
		{
		  oldval = mutex->__data.__lock;
		  continue;
		}
	      oldval |= FUTEX_WAITERS;
	    }
Comment 6 ma.jiang 2017-11-06 11:19:48 UTC
sorry, made a stupid mistake...
	  /* Already timed out?  */
	  if (rt.tv_sec < 0)
	    return ETIMEDOUT;  //->>> We have returned here, W_I_T_H_O_U_T FUTEX_WAITERS in mutex->__data.__lock.