[PATCH] nptl: Fix pthread_cond_signal missing a sleeper (bug 25847)

Torvald Riegel <triegel@redhat.com>
Fri Dec 25 17:18:26 GMT 2020


Thank you for the investigation, the patch, and the write-up on the
blog.  Comments below.

On Tue, 2020-12-08 at 22:15 -0500, Malte Skarupke wrote:
> There was a subtle, rare bug in the code that tried to handle the
> case when a thread took a long time to leave pthread_cond_wait,
> and accidentally stole a signal from a later group. As a result
> you could get a situation where pthread_cond_signal doesn't wake
> a thread even though there is a sleeping thread on the condition
> variable.

Yes, the code that tries to mitigate potential stealing didn't fix up
the size of the group as well, so the signal count and the group size
could diverge, leading to other signalers sinking a signal into a
group that didn't have any waiters anymore.
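
For reference, the mitigation in question in __pthread_cond_wait_common
looks roughly like this (an abbreviated paraphrase, not the verbatim
code): after consuming a signal, the waiter checks whether it consumed
from a more recent group than its own and, if so, conservatively posts
a replacement signal through __g_signals; __g_size is never adjusted,
which is how the two can diverge:

  /* Abbreviated paraphrase of the potential-stealing mitigation.  */
  uint64_t g1_start = __condvar_load_g1_start_relaxed (cond);
  if (seq < (g1_start >> 1))
    {
      /* We may have stolen a signal from the current G1; post a
         replacement signal.  */
      unsigned int s = atomic_load_relaxed (cond->__data.__g_signals + g);
      while (__condvar_load_g1_start_relaxed (cond) == g1_start)
        {
          if (((s & 1) != 0)
              || atomic_compare_exchange_weak_relaxed
                   (cond->__data.__g_signals + g, &s, s + 2))
            {
              /* Note that __g_size is not adjusted here, so
                 __g_signals and __g_size can diverge.  */
              futex_wake (cond->__data.__g_signals + g, 1, private);
              break;
            }
        }
    }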

> The fix is to make it impossible for a thread to ever get into
> that state, allowing us to remove the code that handles potential
> stealing of signals.
> 
> pthread_cond_signal would already wait for all sleeping threads
> to wake up before closing a group, but pthread_cond_wait had a
> shortcut code-path that didn't increment/decrement g_refs, which
> meant that pthread_cond_signal wouldn't wait for the thread in
> pthread_cond_wait to leave.

Limiting the scope of waiting to those threads that are trying to block
through futexes was intentional.  To be efficient in high-contention
scenarios where threads can synchronize using just shared memory, we
try to make blocking through futexes an "optional" last resort for
cases where wait times may actually be longer (assuming that programs
don't oversubscribe their CPU cores).  Thus, __g_refs was intended to
prevent ABA problems on the futex words.
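
To illustrate the pattern with a stand-alone sketch (all names are made
up; this is a simplified model, not the glibc code -- in particular, the
real __g_refs reserves its lowest bit as a wake-request flag instead of
waking unconditionally):

#include <stdatomic.h>
#include <limits.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/futex.h>

struct group
{
  atomic_uint signals;	/* Futex word; its slot is reused when the
			   groups switch roles, hence the ABA potential.  */
  atomic_uint refs;	/* Waiters that might still block on SIGNALS.  */
};

static void
futex_op (atomic_uint *addr, int op, unsigned int val)
{
  syscall (SYS_futex, addr, op, val, NULL, NULL, 0);
}

/* Waiter: blocking through the futex is the last resort, so the
   reference is held only around the potential futex wait.  */
static void
waiter_block (struct group *g, unsigned int seen)
{
  atomic_fetch_add (&g->refs, 1);
  if (atomic_load (&g->signals) == seen)
    futex_op (&g->signals, FUTEX_WAIT_PRIVATE, seen);
  /* The last waiter out wakes a switcher that is quiescing the group.  */
  if (atomic_fetch_sub (&g->refs, 1) == 1)
    futex_op (&g->refs, FUTEX_WAKE_PRIVATE, INT_MAX);
}

/* Group switcher: before reusing the futex word for a new generation
   of waiters, wait until no old waiter can still access it.  */
static void
switcher_quiesce (struct group *g)
{
  unsigned int r;
  while ((r = atomic_load (&g->refs)) != 0)
    futex_op (&g->refs, FUTEX_WAIT_PRIVATE, r);
  /* The caller can now safely reuse G->SIGNALS.  */
}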

> If pthread_cond_signal then closed the
> group too early, and the thread in pthread_cond_wait took a while
> longer to leave the function, it could end up stealing a wake-up
> signal from a later group, which led to a state where a sleeping
> thread on the condition variable would not wake up. There was
> already code in place to try to fix this, but that code had a
> subtle bug.
> 
> In this change I ensure that each thread always increments and
> decrements g_refs, so that pthread_cond_signal can't accidentally
> close a group while there is still a thread in pthread_cond_wait.
> As a result, stealing of signals can't happen, and the code with
> the subtle bug could be removed.

I agree that, if done right, we could have all waiters add a reference
to their group, so that signalers that switch groups wait for all
waiters to quiesce.

However, that may have performance implications, because it now applies
to all waiters, even those that didn't use futexes to block for a
longer time.

That said, the alternative fix of adding a signal the proper way (i.e.,
including updating the group size) will also be less efficient than the
current (buggy) code.  IIRC, you said that you tried this but it
deadlocked; did you just put a call to pthread_cond_signal in, or did
you adapt it so that it hits the right group slot?  Any idea why it
deadlocked?
This alternative may be a simpler fix than changing the purpose of
__g_refs, unless I'm missing something.
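
To make "the proper way" concrete: I mean posting the replacement like
pthread_cond_signal does, under the condvar-internal lock and with
__g_size kept in sync with __g_signals.  An untested sketch using the
internal helpers (adapting this to the stolen-signal case -- picking
the right group slot and the right __g_size adjustment -- is exactly
the tricky part):

  __condvar_acquire_lock (cond, private);
  /* The low bit of __wseq is the index of G2, so G1 is the other slot.  */
  unsigned int g1 = (__condvar_load_wseq_relaxed (cond) & 1) ^ 1;
  if (cond->__data.__g_size[g1] != 0)
    {
      /* Post a signal and account for it in the group size, as
         pthread_cond_signal does.  */
      atomic_fetch_add_relaxed (cond->__data.__g_signals + g1, 2);
      cond->__data.__g_size[g1]--;
      futex_wake (cond->__data.__g_signals + g1, 1, private);
    }
  __condvar_release_lock (cond, private);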

> This also meant that the increment/decrement of wrefs was no
> longer necessary, since g_refs now covers the same scope.

This isn't exactly the same scope, as you also indicate below.  __wrefs
is larger in scope because it was used just for condvar destruction
safety, whereas group switching waits on __g_refs.  Releasing one's
reference in __g_refs can't be the last action on the condvar instance
in case of cancellation, which is why you have to add the work-around.
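
For context, the destruction side currently looks roughly like this in
__pthread_cond_destroy (paraphrased from the current sources): it sets
the wake-request flag and then uses __wrefs itself as the futex word,
which is why releasing the last reference must be the waiter's final
access to the condvar:

  unsigned int wrefs = atomic_fetch_or_acquire (&cond->__data.__wrefs, 4);
  int private = __condvar_get_private (wrefs);
  while (wrefs >> 3 != 0)
    {
      futex_wait_simple (&cond->__data.__wrefs, wrefs, private);
      wrefs = atomic_load_acquire (&cond->__data.__wrefs);
    }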

> I
> removed the increment and decrement of wrefs, which means that
> this change should not affect performance.

I think it would be better to start with a patch that just changes the
scope and use of __g_refs, and keeps __wrefs as it is.  This way, you
avoid having to re-introduce __wrefs but just for cancellation, and you
can still use the fast-path check on __wrefs (see below).

I'd be less concerned about the performance overhead of an additional
atomic read-modify-write, given that it would be followed by a __g_refs
RMW right after it, and both may be on the same cache line.
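
(For reference, the relevant members of struct __pthread_cond_s as
currently laid out in bits/thread-shared-types.h, elided to what
matters here -- __g_refs and __wrefs are only a few words apart, so
they typically do share a cache line:)

  struct __pthread_cond_s
  {
    ...                        /* __wseq and __g1_start (64 bits each).  */
    unsigned int __g_refs[2];
    unsigned int __g_size[2];
    unsigned int __g1_orig_size;
    unsigned int __wrefs;
    unsigned int __g_signals[2];
  };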

> Because wrefs now changed meaning, I also renamed the variable
> from __wrefs to __flags.

I don't think the name change is an improvement.  In your patch it
still holds a reference count and flags, but the count is now used just
for cancellation.

> diff --git a/nptl/pthread_cond_broadcast.c b/nptl/pthread_cond_broadcast.c
> index 8d887aab93..e1da07bfe0 100644
> --- a/nptl/pthread_cond_broadcast.c
> +++ b/nptl/pthread_cond_broadcast.c
> @@ -40,10 +40,13 @@ __pthread_cond_broadcast (pthread_cond_t *cond)
>  {
>    LIBC_PROBE (cond_broadcast, 1, cond);
> 
> -  unsigned int wrefs = atomic_load_relaxed (&cond->__data.__wrefs);
> -  if (wrefs >> 3 == 0)
> +  unsigned int grefs0 = atomic_load_relaxed(cond->__data.__g_refs);
> +  unsigned int grefs1 = atomic_load_relaxed(cond->__data.__g_refs + 1);
> +  if ((grefs0 >> 1) == 0 && (grefs1 >> 1) == 0)
>      return 0;

I don't think this works because you now have two separate loads.  The
prior code picks one position in happens-before at which the broadcast
could have happened and loads __wrefs there, making this consistent.
The new code could be unlucky and observe an empty G2 in both loads:
the groups can switch roles between the two loads, so each load may
happen to hit a group while it is the (empty) G2 even though the other
group had waiters at that time.  Thus there is no single point at which
both groups were known to be empty, and the fast path can't be sure
that there were no waiters in any group at the time of either load.

If you'd keep __wrefs as it is, this problem would go away.

> diff --git a/nptl/pthread_cond_signal.c b/nptl/pthread_cond_signal.c
> index 4281ad4d3b..a59151ce72 100644
> --- a/nptl/pthread_cond_signal.c
> +++ b/nptl/pthread_cond_signal.c
> @@ -39,10 +39,13 @@ __pthread_cond_signal (pthread_cond_t *cond)
>    /* First check whether there are waiters.  Relaxed MO is fine for that for
>       the same reasons that relaxed MO is fine when observing __wseq (see
>       below).  */
> -  unsigned int wrefs = atomic_load_relaxed (&cond->__data.__wrefs);
> -  if (wrefs >> 3 == 0)
> +  unsigned int grefs0 = atomic_load_relaxed(cond->__data.__g_refs);
> +  unsigned int grefs1 = atomic_load_relaxed(cond->__data.__g_refs + 1);
> +  if ((grefs0 >> 1) == 0 && (grefs1 >> 1) == 0)
>      return 0;

See above.

> diff --git a/nptl/pthread_cond_wait.c b/nptl/pthread_cond_wait.c
> index 02d11c61db..64ce8ea8ee 100644
> --- a/nptl/pthread_cond_wait.c
> +++ b/nptl/pthread_cond_wait.c
> @@ -43,18 +43,6 @@ struct _condvar_cleanup_buffer
>  };
> 
> 
> -/* Decrease the waiter reference count.  */
> -static void
> -__condvar_confirm_wakeup (pthread_cond_t *cond, int private)
> -{
> -  /* If destruction is pending (i.e., the wake-request flag is nonzero) and we
> -     are the last waiter (prior value of __wrefs was 1 << 3), then wake any
> -     threads waiting in pthread_cond_destroy.  Release MO to synchronize with
> -     these threads.  Don't bother clearing the wake-up request flag.  */
> -  if ((atomic_fetch_add_release (&cond->__data.__wrefs, -8) >> 2) == 3)
> -    futex_wake (&cond->__data.__wrefs, INT_MAX, private);
> -}
> -
> 
>  /* Cancel waiting after having registered as a waiter previously.  SEQ is our
>     position and G is our group index.
> @@ -81,7 +69,7 @@ __condvar_cancel_waiting (pthread_cond_t *cond, uint64_t seq, unsigned int g,
>  {
>    bool consumed_signal = false;
> 
> -  /* No deadlock with group switching is possible here because we have do
> +  /* No deadlock with group switching is possible here because we do
>       not hold a reference on the group.  */
>    __condvar_acquire_lock (cond, private);
> 
> @@ -146,7 +134,7 @@ __condvar_cancel_waiting (pthread_cond_t *cond, uint64_t seq, unsigned int g,
> 
>  /* Wake up any signalers that might be waiting.  */
>  static void
> -__condvar_dec_grefs (pthread_cond_t *cond, unsigned int g, int private)
> +__condvar_confirm_wakeup (pthread_cond_t *cond, unsigned int g, int private)
>  {
>    /* Release MO to synchronize-with the acquire load in
>       __condvar_quiesce_and_switch_g1.  */
> @@ -172,7 +160,16 @@ __condvar_cleanup_waiting (void *arg)
>    pthread_cond_t *cond = cbuffer->cond;
>    unsigned g = cbuffer->wseq & 1;
> 
> -  __condvar_dec_grefs (cond, g, cbuffer->private);
> +  /* Normally we are not allowed to touch cond any more after calling
> +     confirm_wakeup, because pthread_cond_destroy looks at __g_refs to
> +     determine when all waiters have woken. Since we will do more work in this
> +     function, we are using an extra channel to communicate to
> +     pthread_cond_destroy that it is not allowed to finish yet: We increment
> +     the fourth bit on __flags. Relaxed MO is enough. The synchronization
> +     happens because confirm_wakeup uses release MO. */
> +  atomic_fetch_add_relaxed (&cond->__data.__flags, 8);
> +
> +  __condvar_confirm_wakeup (cond, g, cbuffer->private);
> 
>    __condvar_cancel_waiting (cond, cbuffer->wseq >> 1, g, cbuffer->private);
>    /* FIXME With the current cancellation implementation, it is possible that
> @@ -183,7 +180,12 @@ __condvar_cleanup_waiting (void *arg)
>       conservatively.  */
>    futex_wake (cond->__data.__g_signals + g, 1, cbuffer->private);
> 
> -  __condvar_confirm_wakeup (cond, cbuffer->private);
> +  /* If destruction is pending (i.e., the wake-request flag is nonzero) and we
> +     are the last waiter (prior value of __flags was 1 << 3), then wake any
> +     threads waiting in pthread_cond_destroy.  Release MO to synchronize with
> +     these threads.  Don't bother clearing the wake-up request flag.  */
> +  if ((atomic_fetch_add_release (&cond->__data.__flags, -8) >> 2) == 3)
> +    futex_wake (&cond->__data.__flags, INT_MAX, cbuffer->private);
> 
>    /* XXX If locking the mutex fails, should we just stop execution?  This
>       might be better than silently ignoring the error.  */
> @@ -287,13 +289,14 @@ __condvar_cleanup_waiting (void *arg)
>     __g1_orig_size: Initial size of G1
>       * The two least-significant bits represent the condvar-internal lock.
>       * Only accessed while having acquired the condvar-internal lock.
> -   __wrefs: Waiter reference counter.
> +   __flags: Flags and reference count to handle pthread_cleanup_push/pop.
> +     * Bit 3 and above is the count of threads that had pthread_cancel()
> +       called on them. pthread_cond_destroy waits for this to reach 0.

Waiters can cancel waiting because of pthread_cancel or because of
normal timeouts.

>       * Bit 2 is true if waiters should run futex_wake when they remove the
>         last reference.  pthread_cond_destroy uses this as futex word.
>       * Bit 1 is the clock ID (0 == CLOCK_REALTIME, 1 == CLOCK_MONOTONIC).
>       * Bit 0 is true iff this is a process-shared condvar.
> -     * Simple reference count used by both waiters and pthread_cond_destroy.
> -     (If the format of __wrefs is changed, update nptl_lock_constants.pysym
> +     (If the format of __flags is changed, update nptl_lock_constants.pysym
>        and the pretty printers.)
>     For each of the two groups, we have:
>     __g_refs: Futex waiter reference count.

This now isn't just the ref count for futex waiters, but for all
waiters.

I think it would improve the patch if the comments were updated to
match the code, and ideally explained the intent behind the new code in
more detail.

> @@ -301,6 +304,8 @@ __condvar_cleanup_waiting (void *arg)
>         last reference.
>       * Reference count used by waiters concurrently with signalers that have
>         acquired the condvar-internal lock.
> +     (If the format of __g_refs is changed, update nptl_lock_constants.pysym
> +      and the pretty printers.)
>     __g_signals: The number of signals that can still be consumed.
>       * Used as a futex word by waiters.  Used concurrently by waiters and
>         signalers.
> @@ -407,7 +412,8 @@ __pthread_cond_wait_common (pthread_cond_t *cond, pthread_mutex_t *mutex,
> 
>    /* Increase the waiter reference count.  Relaxed MO is sufficient because
>       we only need to synchronize when decrementing the reference count.  */
> -  unsigned int flags = atomic_fetch_add_relaxed (&cond->__data.__wrefs, 8);
> +  atomic_fetch_add_acquire (cond->__data.__g_refs + g, 2);

Please explain in the comments why a particular MO is used.

> +  unsigned int flags = atomic_load_relaxed (&cond->__data.__flags);
>    int private = __condvar_get_private (flags);
> 
>    /* Now that we are registered as a waiter, we can release the mutex.
> @@ -421,7 +427,7 @@ __pthread_cond_wait_common (pthread_cond_t *cond, pthread_mutex_t *mutex,
>    if (__glibc_unlikely (err != 0))
>      {
>        __condvar_cancel_waiting (cond, seq, g, private);
> -      __condvar_confirm_wakeup (cond, private);
> +      __condvar_confirm_wakeup (cond, g, private);
>        return err;
>      }
> 
> @@ -482,13 +488,9 @@ __pthread_cond_wait_common (pthread_cond_t *cond, pthread_mutex_t *mutex,
>  	     release MO when decrementing the reference count because we use
>  	     an atomic read-modify-write operation and thus extend the release
>  	     sequence.  */
> -	  atomic_fetch_add_acquire (cond->__data.__g_refs + g, 2);

See above.  Please check that the memory orders are correct and still
work, and update the comments accordingly.  I haven't checked them
myself yet, so I don't yet have an opinion either way.

>  	  if (((atomic_load_acquire (cond->__data.__g_signals + g) & 1) != 0)
>  	      || (seq < (__condvar_load_g1_start_relaxed (cond) >> 1)))
>  	    {
> -	      /* Our group is closed.  Wake up any signalers that might be
> -		 waiting.  */

The group is still closed, so I'd retain that part of the comment.

> -	      __condvar_dec_grefs (cond, g, private);
>  	      goto done;
>  	    }
> 
> @@ -508,7 +510,6 @@ __pthread_cond_wait_common (pthread_cond_t *cond, pthread_mutex_t *mutex,
> 
>  	  if (__glibc_unlikely (err == ETIMEDOUT || err == EOVERFLOW))
>  	    {
> -	      __condvar_dec_grefs (cond, g, private);

This is cancelling waiting too, so you need to add similar code as you
have done for cancellation because of pthread_cancel.

IIRC, you don't model cancellation in your TLA+ version yet, right?  If
so, this is an example of why I said in my earlier response that any
verification is only as good as the model that one is using.

I believe you also need to handle your change to __g_refs on line 423
of pthread_cond_wait.c.


