[PATCH] nptl: Fix pthread_cond_signal missing a sleeper (bug 25847)

Lucas A. M. Magalhaes lamm@linux.ibm.com
Wed Dec 9 13:30:34 GMT 2020


Hi Malte,
Thanks for working on this bug. I will make some suggestions.

Usually when a patch fixes a bug the commit subject have the
bugzilla number specified in the format "(#BZ XXXX)". So your patch
should be like:

nptl: Fix pthread_cond_signal missing a sleeper (#BZ 25847)

> There was a subtle, rare bug in the code that tried to handle the
> case when a thread took a long time to leave pthread_cond_wait,
> and accidentally stole a signal from a later group. As a result
> you could get a situation where pthread_cond_signal doesn't wake
> a thread even though there is a sleeping thread on the condition
> variable.
> 
> The fix is to make it impossible for a thread to ever get into
> that state, allowing us to remove the code that handles potential
> stealing of signals.
> 
> pthread_cond_signal would already wait for all sleeping threads
> to wake up before closing a group, but pthread_cond_wait had a
> shortcut code-path that didn't increment/decrement g_refs, which
> meant that pthread_cond_signal wouldn't wait for the thread in
> pthread_cond_wait to leave. If pthread_cond_signal then closed the
> group too early, and the thread in pthread_cond_wait took a while
> longer to leave the function, it could end up stealing a wake-up
> signal from a later group, which led to a state where a sleeping
> thread on the condition variable would not wake up. There was
> already code in place to try to fix this, but that code had a
> subtle bug.
> 
> In this change I ensure that each thread always increments and
> decrements g_refs, so that pthread_cond_signal can't accidentally
> close a group while there is still a thread in pthread_cond_wait.
> As a result, stealing of signals can't happen, and the code with
> the subtle bug could be removed.
> 
> This also meant that the increment/decrement of wrefs was no
> longer necessary, since g_refs now covers the same scope. I
> removed the increment and decrement of wrefs, which means that
> this change should not affect performance.
> 
> Because wrefs now changed meaning, I also renamed the variable
> from __wrefs to __flags.
> 

I found difficult to separate the fix for the renaming parts.
IMHO this patch could be splitted to make it easier to review.
It also improve git log readability and organization.

> Unfortunately I couldn't entirely get rid of all logic related to
> wrefs: There is one edge case related to pthread_cancel where I
> couldn't broaden the scope of g_refs as much as I would have liked
> to. In that case I kept the scope of g_refs narrow, and needed
> some other logic to synchronize with pthread_cond_destroy. I chose
> to keep the wrefs-related logic in pthread_cond_destroy around for
> that case. I still think the renaming of the variable is correct,
> because it only serves its old purpose when coordinating between
> pthread_cancel and pthread_cond_destroy, which is more of a corner
> case instead of the core function of the variable.
> 
> I had reproduced the bug in the formal specification language TLA+
> to better understand what is happening. I wrote a blog post about
> the experience here:
> 
> https://probablydance.com/2020/10/31/using-tla-in-the-real-world-to-understand-a-glibc-bug/
> 
Nice work. Looking forward to read it.

> This fix is also verified in TLA+. The TLA+ model is not included
> in this patch because I don't know what the format for that should
> be. Let me know if there is a good place for it.

IDK if its possible, but we could have a testcase for that.

---
Lucas A. M. Magalhães


More information about the Libc-alpha mailing list