Bug 13165 - pthread_cond_wait() can consume a signal that was sent before it started waiting
Summary: pthread_cond_wait() can consume a signal that was sent before it started waiting
Status: RESOLVED FIXED
Alias: None
Product: glibc
Classification: Unclassified
Component: nptl
Version: 2.14
Importance: P2 normal
Target Milestone: 2.25
Assignee: Torvald Riegel
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2011-09-07 19:14 UTC by Mihail Mihaylov
Modified: 2017-01-01 21:32 UTC
CC: 8 users

See Also:
Host:
Target:
Build:
Last reconfirmed:
fweimer: security-


Attachments
Test to observe the race (1.58 KB, text/x-c++src)
2011-09-25 21:32 UTC, Mihail Mihaylov
Simpler test to observe the race (1.52 KB, text/x-c++src)
2011-09-27 10:10 UTC, Mihail Mihaylov
Simpler test, converted to pure C (1.57 KB, text/plain)
2011-09-28 02:08 UTC, Rich Felker

Description Mihail Mihaylov 2011-09-07 19:14:55 UTC
I was implementing something like a monitor on top of pthread condition variables and I observed some strange behaviour. I was always holding the mutex when calling pthread_cond_signal(). My code relied on only two assumptions about the way pthread_cond_signal() works:

1) A call to pthread_cond_signal() will wake at least one thread which is blocked on the condition, and the woken threads will start waiting on the mutex.

2) If the signaling thread holds the mutex when it calls pthread_cond_signal(), only threads which are already waiting on the condition variable may be woken. In particular, if the signaling thread releases the mutex and then another thread acquires the mutex and calls pthread_cond_wait(), the waiting thread cannot be woken by this signal, no matter what other waiters are present before or after the signal.
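
For illustration only, here is a minimal sketch of the usage pattern these two assumptions are meant to cover (the names are placeholders, not my actual code):

   #include <pthread.h>

   static pthread_mutex_t m    = PTHREAD_MUTEX_INITIALIZER;
   static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
   static int data_ready;                 /* placeholder for the monitor's state */

   static void waiter(void)
   {
       pthread_mutex_lock(&m);
       while (!data_ready)                /* predicate over mutex-protected state */
           pthread_cond_wait(&cond, &m);  /* assumption 2: while the signaler holds
                                             the mutex, only threads already blocked
                                             here may be woken by its signal */
       data_ready = 0;
       pthread_mutex_unlock(&m);
   }

   static void signaler(void)
   {
       pthread_mutex_lock(&m);            /* the mutex is always held while signaling */
       data_ready = 1;
       pthread_cond_signal(&cond);        /* assumption 1: wakes at least one thread
                                             blocked on cond, which then starts
                                             waiting on the mutex */
       pthread_mutex_unlock(&m);
   }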

The only explanation that I could find for the observed behaviour was that my second assumption was wrong. It seemed that I was hitting the following scenario:

1) We have several threads which are blocked on the condvar in pthread_cond_wait(). I'll call these threads "group A".

2) We then send N signals from another thread while holding the mutex. We are releasing the mutex and acquiring it again between the signals.

3) Next we have several more threads (at least two) that acquire the mutex and enter pthread_cond_wait(). I'll call these threads "group B".

4) Then we acquire the mutex in the signaling thread again and call pthread_cond_signal() just once, then we release the mutex.

5) Two threads from group B wake up, and N-1 threads from group A wake up. In effect, one of the threads from group B has stolen, from a thread in group A, a signal that was sent before it started waiting.

My expectation in this scenario is that at least N threads from group A should wake up. I don't claim that no thread from group B may wake up at all, because spurious wakeups are possible. But this is not a spurious wakeup - there are as many woken threads as signals; it's just that the wrong threads were woken.

I ran some experiments and they seemed to confirm my theory, so I looked at the condvar implementation in NPTL. I'm new to POSIX and Linux programming, but I think I see how this can happen:

1) When we send the first N signals, N threads from group A that are waiting on the cond->__data.__futex are woken and start waiting on cond->__data.__lock.

2) Then while the threads from group B enter pthread_cond_wait, some of the woken threads from group A may remain waiting on the lock.

3) When we send the last signal, one thread from group B will wake and consume this signal.

4) But suppose one more thread from group B wakes spuriously from lll_futex_wait. At this moment it is possible that some of the woken threads from group A are still waiting on cond->__data.__lock. In that case the spuriously woken thread from group B will see that cond->__data.__wakeup_seq has changed (because of the last signal) and that cond->__data.__woken_seq has not reached cond->__data.__wakeup_seq (because some of the woken threads in group A are still waiting to acquire cond->__data.__lock), so it will exit the retry loop and increment cond->__data.__woken_seq. The result is that this thread steals the signal.

Is this scenario really possible? And if it is, is this on purpose or is it a bug?
Comment 1 Mihail Mihaylov 2011-09-21 09:12:26 UTC
It's been two weeks with no update on the bug. There is no way to tell if someone has noticed it at all.

Is it possible to have at least some statement as to whether this is considered a valid bug that might be addressed at some point in the future?
Comment 2 Rich Felker 2011-09-21 18:18:43 UTC
I've waited months for a response to some of my bug reports for NPTL, all race conditions, some of which make it impossible to use major features like cancellation in a robust program. Nothing. Good luck, but don't hold your breath. The best thing I can recommend is cross-posting the bug report to distros' bug trackers, and getting other people interested in these issues to follow up on the bug tracker with confirmations that they can reproduce it.
Comment 3 Rich Felker 2011-09-21 22:29:21 UTC
Can you explain how you know (2) is completed before (3) occurs, in your scenario? If there's no synchronization to order these steps, then isn't it possible that one or more of the signals happens after a thread from group B is waiting?

If you have a minimal self-contained test case for the issue, I'd be interested in seeing it.
Comment 4 Mihail Mihaylov 2011-09-22 22:21:10 UTC
Thank you for taking an interest in this issue.

(In reply to comment #3)
> Can you explain how you know (2) is completed before (3) occurs, in your
> scenario? If there's no synchronization to order these steps, then isn't it
> possible that one or more of the signals happens after a thread from group B is
> waiting?

Basically, because we are holding the mutex when signaling, we can tell exactly which threads started waiting after we finished sending the first N signals. These are the threads that I call "group B", so by this definition they cannot start waiting before all signals from step 2 have been sent.

What I'm trying to say is that the scenario is not a test case, but rather a hypothetical sequence of events that can happen and can be observed. It doesn't specify why exactly no new threads started waiting during step 2; it just says what happened. This left some ambiguity in my description.

One way to resolve this ambiguity is to say that if during step 2 some threads acquired the mutex and called pthread_cond_wait(), they should be counted towards group A.

Another way is to change step 2 and say that the signaling thread acquired the mutex, sent N signals and only then released the mutex, without releasing it between the signals.

The second way seems simpler and will probably make the race more likely, but the first is closer to what I actually observed.
 
> If you have a minimal self-contained test case for the issue, I'd be interested
> in seeing it.

I don't have such a test case, but I'll try to find time in the next few days to write one and attach it to the bug.
Comment 5 Mihail Mihaylov 2011-09-25 21:32:35 UTC
Created attachment 5945
Test to observe the race

Attaching a self-contained test. What the test does:

We have a mutex and a condition variable. We also have several auxiliary condition variables and counters.

The main thread locks the mutex and creates as many waiter threads as possible. The waiter threads start by waiting on the mutex. Then both the main thread and the waiter threads start looping to perform iterations of the test until the race condition in NPTL is hit.

The loops of the main thread and the waiter threads are synchronized and go like this:

1) The main thread starts by releasing the mutex and blocking on an auxiliary condvar. This unblocks the waiter threads which start by entering the first wait on the condition variable. Each waiter thread increments a waiters counter before waiting and the last one also signals the auxiliary condvar to notify the main thread that all waiters are blocked on the first wait.

2) When all waiters are blocked on the first wait, the main thread is unblocked and starts sending signals. It sends as many signals as there are waiters, so all waiters should move (eventually) beyond the first wait. The main thread holds the mutex while sending the signals. The 'releaseMutexBetweenSignals' constant controls whether it will release and reacquire the mutex between signals.

3) Each unblocked waiter decrements the waiters counter and moves to the second wait. To simplify the test, the waiters don't enter the second wait until all signals from step 2 have been sent. This is controlled through a sent signals counter and another auxiliary condvar.

4) After the main thread has sent all signals, it starts waiting for at least two waiters to block on the second wait. This is facilitated by a counter of the threads that have reached the second wait and one more auxiliary condvar.

5) When at least two threads have blocked on the second wait, the main thread sends one more signal. Threads that get unblocked from the second wait may start a third wait to allow the test iteration to complete before they loop back to the first wait (of course this actually happens when the main thread releases the mutex in step 6).

6) The main thread starts waiting for all waiters to exit the first wait. Each waiter that exits the first wait decrements the waiters count and the last one signals the last auxiliary condvar that the main thread waits on. If the wait times out, the test has failed, otherwise it has passed.

7) If the test has passed, all waiters are waiting on the condition variable in the second wait or the third wait, so the main thread sends a broadcast to unblock them and all waiters move back to the first wait. With this the test iteration is complete and a new iteration begins.


The main point about this test is that at the point where the main thread sends the single signal, all waiters should be:

1) either waiting on the mutex in the first wait,
2) or waiting on the condition variable in the second wait,
3) or waiting on the mutex in the wait on the auxiliary condvar from step 3

which means that if the mutex gets released for long enough, all threads should acquire the mutex in the first wait and the waiters count should eventually reach zero. Step 6 is meant to provide this time. At step 6, the main thread releases the mutex and starts waiting, and all waiters that acquire the mutex release it almost immediately and start waiting themselves, so there is nothing to prevent the threads from group (1) above from acquiring the mutex one by one and bringing the waiters counter back to 0. The only thing that can get in the way is a waiter which is still blocked on the condition variable in the first wait, which is what the test aims to trigger and detect.
Comment 6 Mihail Mihaylov 2011-09-25 21:43:17 UTC
(In reply to comment #3)
> If you have a minimal self-contained test case for the issue, I'd be interested
> in seeing it.

I wrote a self-contained test case. It is not very simple, and may seem contrived. My real code is actually much simpler (just one condition variable), but it is not suitable as a test.

I ran the test on my dual-core laptop and had consistent results of one stolen signal. I guess that with more cores it will be possible for more signals to be stolen. Typically it took from 3 to 40 iterations to hit the race.

I would appreciate your feedback.
Comment 7 Mihail Mihaylov 2011-09-26 09:13:37 UTC
I ran the test on my workstation too. It took much longer to hit the race. The difference is that on my workstation a process can start over 30000 threads as opposed to about 400 on my laptop. Limiting the number of created threads to several hundred made the race much easier to hit.
Comment 8 Rich Felker 2011-09-26 16:19:50 UTC
Could you simplify it? I'm not sure why the test has to be so much more complicated than your real code. If you think there's a race/bug in cond vars, then it's probably not a good idea to be using cond vars in the logic for the test loop, since it will be hard to tell if the bug is occurring in the tested code or the control code. Perhaps you could use barriers or semaphores to synchronize the test control logic, if needed. Personally barriers are my favorite synchronization primitive for that sort of thing.

Also, yes, get rid of the "make as many threads as possible" logic. That will not make it any easier to find the race, and the limit to how many threads you can make is usually dependent on available memory/virtual address space, not kernel thread resources. Just pick a "sane" number (probably below 100) and go with it.
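For instance (a minimal sketch, not code from the test), a barrier lines all the test threads up at the start of each iteration:

   #include <pthread.h>

   #define NTHREADS 8

   static pthread_barrier_t start_barrier;

   static void *worker(void *arg)
   {
       (void)arg;
       /* All NTHREADS workers block here until the last one arrives,
          then they are all released together. */
       pthread_barrier_wait(&start_barrier);
       /* ... run one iteration of the real test logic ... */
       return NULL;
   }

   int main(void)
   {
       pthread_t t[NTHREADS];
       pthread_barrier_init(&start_barrier, NULL, NTHREADS);
       for (int i = 0; i < NTHREADS; i++)
           pthread_create(&t[i], NULL, worker, NULL);
       for (int i = 0; i < NTHREADS; i++)
           pthread_join(t[i], NULL);
       pthread_barrier_destroy(&start_barrier);
       return 0;
   }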
Comment 9 Mihail Mihaylov 2011-09-27 10:10:23 UTC
Created attachment 5946
Simpler test to observe the race

This is a much simpler test case that demonstrates the race. The logic is pretty much the same as with the previous test but it doesn't use any auxiliary condition variables.
Comment 10 Mihail Mihaylov 2011-09-27 10:13:25 UTC
(In reply to comment #8)
> Could you simplify it? I'm not sure why the test has to be so much more
> complicated than your real code. If you think there's a race/bug in cond vars,
> then it's probably not a good idea to be using cond vars in the logic for the
> test loop, since it will be hard to tell if the bug is occurring in the tested
> code or the control code. Perhaps you could use barriers or semaphores to
> synchronize the test control logic, if needed. Personally barriers are my
> favorite synchronization primitive for that sort of thing.
> 
> Also, yes, get rid of the "make as many threads as possible" logic. That will
> not make it any easier to find the race, and the limit to how many threads you
> can make is usually dependent on available memory/virtual address space, not
> kernel thread resources. Just pick a "sane" number (probably below 100) and go
> with it.

I simplified the test significantly as you suggested. Now it's somewhat simpler than my actual code.
Comment 11 Rich Felker 2011-09-28 02:06:54 UTC
I've confirmed that the issue occurs on my Debian system with their libc6 package (eglibc 2.13-10, albeit slightly different from glibc). I've also confirmed that the issue does not occur with my implementation of condition variables in musl libc(*). I suspect it's a real bug, but I need to read the code more closely to understand what's going on...

(*) I've converted the test to C (just replaced cout with printf) so I can run it without C++ support. I'm attaching the modified version.
Comment 12 Rich Felker 2011-09-28 02:08:00 UTC
Created attachment 5947
Simpler test, converted to pure C
Comment 13 Mihail Mihaylov 2011-09-28 09:02:51 UTC
(In reply to comment #11)
> I've confirmed that the issue occurs on my Debian system with their libc6
> package (eglibc 2.13-10, albeit slightly different from glibc).

I originally observed the problem on Debian stable. I've run my test case on my laptop, which is running Mint, and on my office workstation, which is running Kubuntu.

I looked at the eglibc source code before posting the bug and saw that the code which causes the race is identical to the one in glibc, so the bug is in both implementations.

> I've also confirmed that the issue does not occur with my implementation of
> condition variables in musl libc(*).

I took a look at your code. As far as I can tell, you are not trying to avoid spurious wakeups as hard as glibc does, which is why you don't have the same race.

> I suspect it's a real bug, but I need to read the code more closely to
> understand what's going on...

Here is my understanding of the root cause - an attempt to prevent spurious wakeups that has gone too far and destroys ordering - waking future waiters instead of present ones.

There are two checks that NPTL uses to prevent spurious wakeups:

1) It only allows a thread to wake if a signal has been sent after it started waiting. This is achieved by checking if cond->__data.__wakeup_seq has remained unchanged.

2) It only allows as many threads to wake up as there were signals. This is achieved by checking whether cond->__data.__woken_seq equals cond->__data.__wakeup_seq.

If either of these checks indicates a spurious wakeup, the thread retries the wait.

The problem is in check 2, because the guard is triggered if any thread has woken spuriously - not just the current thread. Worse - it is triggered only after the spuriously woken thread consumed a signal. So in many cases the spuriously woken thread consumes the signal, and a validly woken thread is forced to retry. The result is that a spurious wakeup may steal signals that were sent before it started waiting.

Now, I'm confident that the race is real. But maybe some people would disagree that it is a bug. That's why I asked in my original message if this behaviour is intentional or a bug.

It is a bug if pthread condition variables should support the following usage: 

   ...

   pthread_mutex_lock(&m);

   SomeType localState = f(sharedState);

   while ( predicate(sharedState, localState) ) {
      pthread_cond_wait(&c, &m);
   }

   ...

In this case it actually matters which thread will wake up, because if the wrong thread wakes up, it will retry the wait and the signal will be lost (this is what happened to me). Unfortunately the spec is not very clear on the issue. But this is the pattern that the pthread_cond_wait implementation in glibc itself uses to detect spurious wakeups on the futex.
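
As a concrete (hypothetical) instance of this pattern, consider waiting for a shared counter to move past the value the thread observed while holding the mutex; the names below are placeholders:

   #include <pthread.h>

   static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
   static pthread_cond_t  c = PTHREAD_COND_INITIALIZER;
   static unsigned shared_counter;

   static void wait_for_next_event(void)
   {
       pthread_mutex_lock(&m);
       unsigned my_snapshot = shared_counter;   /* localState */
       while (shared_counter == my_snapshot)    /* "nothing new since I looked" */
           pthread_cond_wait(&c, &m);
       pthread_mutex_unlock(&m);
   }

   static void post_event(void)
   {
       pthread_mutex_lock(&m);
       shared_counter++;                        /* every thread already blocked above
                                                   now has a false predicate */
       pthread_cond_signal(&c);                 /* meant to release one of them */
       pthread_mutex_unlock(&m);
   }

A thread that takes its snapshot after the increment blocks with a predicate that is still true; if such a thread consumes the wakeup, it simply re-waits, none of the threads that were already blocked ever wake, and the event is lost.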
Comment 14 Rich Felker 2011-09-28 16:06:21 UTC
I think it's clear that the usage you describe needs to be supported. Nowhere does the standard state that the predicate must be a pure function of the shared (mutex-protected) state, or that it even needs to depend on the shared state whatsoever. For example, this is a valid albeit potentially stupid use of cond variables:

Thread 1:

Lock mutex.
Create thread 2.
Wait on cond var.
Unlock mutex.
Continue with other work.

Thread 2:

Lock mutex.
Signal cond var.
Unlock mutex.
Continue with other work.

There is no guarantee that thread 1 will actually block until thread 2 starts and signals (it could have a spurious wakeup), but there is a guarantee that the signal will cause a wakeup if one has not already occurred. Naturally this example is not analogous to your test (there's nobody else using the cond var), but the point is that it's valid to use cond vars even without predicates. If you read the language of the standard, which is in terms of threads currently blocked rather than predicates, it's clear.

By the way, assuming spurious wakeups are rare, my above "stupid" use of cond vars may actually be an efficient way to roughly synchronize typical start times of parallel calculations, perhaps useful in benchmarks or in tweaking cache utilization for a particular machine. Barriers would probably be more appropriate, however.
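
Spelled out in C (a literal sketch of the example above, error checking omitted):

   #include <pthread.h>

   static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
   static pthread_cond_t  c = PTHREAD_COND_INITIALIZER;

   static void *thread2(void *arg)
   {
       (void)arg;
       pthread_mutex_lock(&m);
       pthread_cond_signal(&c);    /* shall unblock thread 1 if it is still blocked */
       pthread_mutex_unlock(&m);
       /* ... continue with other work ... */
       return NULL;
   }

   int main(void)                  /* thread 1 */
   {
       pthread_t t2;

       pthread_mutex_lock(&m);
       pthread_create(&t2, NULL, thread2, NULL);
       pthread_cond_wait(&c, &m);  /* may also return on a spurious wakeup */
       pthread_mutex_unlock(&m);
       /* ... continue with other work ... */
       pthread_join(t2, NULL);
       return 0;
   }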
Comment 15 Mihail Mihaylov 2011-09-28 20:59:15 UTC
Well, you don't need to convince me. I think this is a bug too. But based on how Ulrich Drepper responded to Bug 12875, he might claim that this is not a bug either. Bug 12875 is almost certainly a manifestation of the same race.
Comment 16 Torvald Riegel 2012-09-19 15:15:02 UTC
I'm not aware of any requirement that pthread_cond_signal should block until a waiter has actually woken up. (Your test case relies on it to not block, so that it can send out multiple signals while holding the mutex, right?)  I'm also not aware of any ordering requirement wrt. waiters (i.e., fairness).  If you combine both, you will see that the behavior you observe is a valid execution.

(In reply to comment #5)
> 4) After the main thread has sent all signals, it starts waiting for at least
> two waiters to block on the second wait. This is facilitated by a counter of
> the threads that have reached the second wait and one more auxiliary cond var.

And here you do block for waiters to have consumed a signal (i.e., for a call to pthread_cond_signal to have finished its work and delivered a signal), but you do this just for two of the signals / calls.

If you do not want to wait for all signals to be delivered yet still need a fair cond var implementation, I suggest either (1) building your own (e.g., you could build something like a simple queue lock based on pthread mutexes or cond vars) or (2) proposing the addition of a fair cond var to glibc or to the respective standards bodies.
Comment 17 Torvald Riegel 2012-09-19 15:20:45 UTC
(In reply to comment #13)
> It is a bug if pthread condition variables should support the following usage: 
> 
>    ...
> 
>    pthread_mutex_lock(&m);
> 
>    SomeType localState = f(sharedState);
> 
>    while ( predicate(sharedState, localState) ) {
>       pthread_cond_wait(&c, &m);
>    }
> 
>    ...
> 
> In this case it actually matters which thread will wake up, because if the
> wrong thread wakes up, it will retry the wait and the signal will be lost (this
> is what happened to me). Unfortunately the spec is not very clear on the issue.

If the spec doesn't guarantee something, it's usually best to not depend on this.

> But this is the pattern that the pthread_cond_wait implementation in glibc
> itself uses to detect spurious wakeups on the futex.

Not quite.  For one, pthread_cond_wait doesn't try to establish the ordering that you seem to depend on in the piece of code above.
Comment 18 Rich Felker 2012-09-19 17:23:20 UTC
> I'm not aware of any requirement that pthread_cond_signal should block until a
> waiter has actually woken up. (Your test case relies on it to not block, so

Do you mean "should block" or "should not block"? POSIX's definition of threads has general language that requires that forward progress be possible, and blocking in pthread_cond_signal until a waiter actually wakes up (including acquiring the mutex) would be non-conformant; in fact it would be a guaranteed deadlock.

> that it can send out multiple signals while holding the mutex, right?)  I'm

It is definitely required that pthread_cond_signal can be called more than once.

> also not aware of any ordering requirement wrt. waiters (i.e., fairness).  If
> you combine both, you will see that the behavior you observe is a valid
> execution.

Nobody has asked for glibc to satisfy any fairness constraint. The standard says it shall unblock at least one thread that "has blocked" on the condition variable, not that it can randomly fail to do so or instead send the unblocking event into the future to be consumed by another thread that has not yet blocked but is about to block (after the signaling thread unlocks the mutex). The claim in this bug report is that glibc is randomly failing (race condition) to unblock ANY of the threads that have blocked. A viable mechanism of how this failure is occurring has also been proposed.

You are free to test this hypothesis and claim that this is not what's happening, or provide a rigorous proof that it couldn't happen. But what you've been doing so far is mischaracterizing the bug report and my comments as a complaint about scheduling fairness, which they are not, and this is getting really frustrating...
Comment 19 Mihail Mihaylov 2012-09-20 10:21:39 UTC
(In reply to comment #16)
Sorry for the long reply. Please bear with me, because this issue is very subtle and I don't know how to explain it more succinctly.

First of all, let me clarify that this is a test that exposes the race, and not the usage scenario that I claim should be supported. The usage scenario is described in the bug description. Well, actually, I do claim that the scenario in the test should be supported too, but the scenario in the description makes more sense.

> I'm not aware of any requirement that pthread_cond_signal should block until a
> waiter has actually woken up. (Your test case relies on it to not block, so
> that it can send out multiple signals while holding the mutex, right?)  I'm
> also not aware of any ordering requirement wrt. waiters (i.e., fairness).  If
> you combine both, you will see that the behavior you observe is a valid
> execution.

I'm not making any assumptions about the state of the waiters when pthread_cond_signal returns. All I'm assuming is that, no matter if the signaling thread releases and reacquires the mutex after each sent signal or sends all signals without releasing the mutex, at least as many waiters as the number of signals will wake (eventually).

But even if this assumption is wrong (and it's not), if you set releaseMutexBetweenSignals to true, the test will release the mutex after each sent signal. In this case the test doesn't send multiple signals while holding the mutex, and the problem still occurs.

As for fairness, this is not about fairness. It is also not about ordering between the waiters. It's about ordering between waiters and signalers.

I'm getting tired of people jumping to fairness at the first mention of ordering. You could say that I'm requesting fairness if I wanted the first single signal to wake the waiter that blocked first. But all I'm requesting is for the signal to wake at least one of the waiters that started waiting before the signal was sent. I don't care which one of them.

This is guaranteed by the standard (from the documentation of pthread_cond_wait and pthread_cond_signal on the opengroup site):

"The pthread_cond_signal() function shall unblock at least one of the threads that are blocked on the specified condition variable cond (if any threads are blocked on cond)."

And I think the next quote makes it very clear what threads are considered to be blocked on the condvar at the time of the call to pthread_cond_signal():

"That is, if another thread is able to acquire the mutex after the about-to-block thread has released it, then a subsequent call to pthread_cond_broadcast() or pthread_cond_signal() in that thread shall behave as if it were issued after the about-to-block thread has blocked."

In effect this means that each call to pthread_cond_signal() defines a point in time and all waiters (or calls to pthread_cond_wait() if you prefer) are either before this call, or after it. And only the ones that are before it are allowed to consume the signal sent by this call.

Now, of course in a multiprocessor system it is hard to order events in time, but that's where the mutex comes in. And if the signaling thread sends multiple signals while holding the mutex, we can consider all these signals to be simultaneous. But that doesn't change the validity of the test.

On the other hand, the standard doesn't guarantee that there won't be spurious wakeups. However, glibc tries to prevent them. But the logic for this prevention is flawed and causes the race that this bug is about.

So the net result is that glibc chose to provide a feature that is not required, but dropped a much more important feature which is actually required. Hence, this bug is not a fairness feature request, it is a correctness defect report.
Comment 20 Torvald Riegel 2012-09-20 10:43:00 UTC
(In reply to comment #18)
> > I'm not aware of any requirement that pthread_cond_signal should block until a
> > waiter has actually woken up. (Your test case relies on it to not block, so
> 
> Do you mean "should block" or "should not block"?

"Block", as I wrote.

> POSIX's definition of threads
> has general language that requires that forward process be possible, and
> blocking in pthread_cond_signal until a waiter actually wakes up (including
> acquiring the mutex) would be non-conformant; in fact it would be a guaranteed
> deadlock.

That's what I'm pointing out.  Glad you agree on this point.

> 
> > that it can send out multiple signals while holding the mutex, right?)  I'm
> 
> It is definitely required that pthread_cond_signal can be called more than
> once.
> 
> > also not aware of any ordering requirement wrt. waiters (i.e., fairness).  If
> > you combine both, you will see that the behavior you observe is a valid
> > execution.
> 
> Nobody has asked for glibc to satisfy any fairness constraint.

Glad you agree on that too. Now please just think about the combination of both for a minute.

> The standard
> says it shall unblock at least one thread that "has blocked" on the condition
> variable, not that it can randomly fail to do so or instead send the unblocking
> event into the future to be consumed by another thread that has not yet blocked
> but is about to block (after the signaling thread unlocks the mutex).

You assume that only some threads can count as blocked, which is not guaranteed by the standard. Signalers are not required to have finished delivering the signal when they return.  Thus, the window in which a thread can count as blocked can be longer than you assume.  Combined with the lack of a fairness guarantee, this allows exactly the behavior that we're talking about here.

The standard indeed doesn't talk about the "future".  It doesn't make a sort of lower-bound requirement on which threads have to be considered blocked, but no upper bound.  If you think there's an upper bound, please point the requirement in the standard.  If there is no required upper bound, it's up to the implementation how to deal with that.

> The claim
> in this bug report is that glibc is randomly failing (race condition) to
> unblock ANY of the threads that have blocked.

No.  The claim is that the "wrong" threads have been unblocked, where "wrong" is based on an assumption on ordering or an upper-bound-on-blocked-threads guarantee that is not required by the standard.

> A viable mechanism of how this
> failure is occurring has also been proposed.
> 
> You are free to test this hypothesis and claim that this is not what's
> happening, or provide a rigorous proof that it couldn't happen.

I have argued that this is allowed behavior, because there is no requirement that conflicts with it, and pointed out why.  You haven't done that; just because you think the standard should have a certain requirement doesn't mean it actually has.

> But what you've
> been doing so far is mischaracterizing the bug report and my comments as a
> complaint about scheduling fairness, which they are not, and this is getting
> really frustrating...

I've never talked about "scheduling fairness" (assuming you mean the OS scheduler here), just about fairness regarding which thread gets wakened. So much for mischaracterization, eh?

Sorry to hear that you feel frustrated, but I'd like to point out that I don't think that I'm the cause for this.

Bottom line: In my opinion, this is not a bug.  However, it might be good to explain why this behavior is allowed (e.g., somewhere in the docs or on the wiki), so that this doesn't surprise other users.
Comment 21 Mihail Mihaylov 2012-09-20 11:05:21 UTC
(In reply to comment #20)
> The standard indeed doesn't talk about the "future".  It doesn't make a sort of
> lower-bound requirement on which threads have to be considered blocked, but no
> upper bound.  If you think there's an upper bound, please point the requirement
> in the standard.  If there is no required upper bound, it's up to the
> implementation how to deal with that.

"The pthread_cond_broadcast() and pthread_cond_signal() functions shall have no effect if there are no threads currently blocked on cond."

How about this as an upper bound? If implementations are allowed to determine the set of blocked threads at any point in time they see fit, there would be no way to define "currently blocked" at all and this sentence couldn't make any sense.

And also:

".... however, if predictable scheduling behavior is required, then that mutex shall be locked by the thread calling pthread_cond_broadcast() or pthread_cond_signal()."

If I accept your argument, there will be no way to determine at least a set of threads from which the woken thread will be chosen, so why does the standard talk about predictability?
Comment 22 Torvald Riegel 2012-09-20 11:22:45 UTC
(In reply to comment #19)
> (In reply to comment #16)
> Sorry for the long reply. Please, bare with me, because this issue is very
> subtle and I don't know how to explain it more succinctly.

No worries.  If it were straightforward, we wouldn't be talking about it here :)

> 
> First of all, let me clarify that this is a test that exposes the race, and not
> the usage scenario that I claim should be supported. The usage scenario is
> described in the bug description. Well, actually, I do claim that the scenario
> in the test should be supported too, but the scenario in the description makes
> more sense.

I think your test is good and helps to point out the issue (including, actually, why it's not a bug).  In your comment #1, I agree on assumption 1), but not on assumption 2). The latter (2) is not required by the standard.

> > I'm not aware of any requirement that pthread_cond_signal should block until a
> > waiter has actually woken up. (Your test case relies on it to not block, so
> > that it can send out multiple signals while holding the mutex, right?)  I'm
> > also not aware of any ordering requirement wrt. waiters (i.e., fairness).  If
> > you combine both, you will see that the behavior you observe is a valid
> > execution.
> 
> I'm not making any assumptions about the state of the waiters when
> pthread_cond_signal returns.

You do assume that only those will be associated with this particular signal call.  We can argue about whether this is part of the state of the waiters or not, but I hope you'll agree that it's at least a property that the waiters are part of.

> All I'm assuming is that, no matter if the
> signaling thread releases and reacquires the mutex after each sent signal or
> sends all signals without releasing the mutex, at least as many waiters as the
> number of signals will wake (eventually).

This assumption is correct, but you assume more:  That only the threads before cond_signal returns will be considered as blocked.

So, if you have waiters W and signal S, and an execution of (W1->W2->S->W3), then the standard requires that W1 and W2 need to be considered as blocked threads by S.  It does not require that W3 is _not_ considered as a blocked thread.

Informally, S is required to be ordered after W1 and W2, but it is unspecified whether it is ordered before W3. S does not block; it is like starting an asynchronous operation that will eventually deliver a signal.  S can return earlier, and the earlier return can happen before W3, but that doesn't mean anything for the delivery of the signal.

> But even if this assumption is wrong (and it's not), if you set
> releaseMutexBetweenSignals to true, the test will release the mutex after each
> sent signal. In this case the test doesn't send multiple signals while holding
> the mutex, and the problem still occurs.
> 
> As for fairness, this is not about fairness. It is also not about ordering
> between the waiters. It's about ordering between waiters and signalers.

The ordering between waiters and signalers is the first point, which I illustrated with "signalers don't block for waiters".  After that, we have the problem of which of the blocked threads the signal chooses to wake up, so this becomes an ordering problem (i.e., a selection problem if applied continuously). On an abstract level, this is a typical fairness problem.

As I pointed out, it's the combination of both these issues.

> I'm getting tired of people jumping to fairness at the first mention of
> ordering.

If you're tired, get some sleep before suggesting that the participants of this discussion don't understand ordering.

> You could say that I'm requesting fairness if I wanted the first
> single signal to wake the waiter that blocked first. But all I'm requesting is
> for the signal to wake at least one of the waiters that started waiting before
> the signal was sent. I don't care which one of them.

That's the first wrong assumption ("before the signal" [call returned]).  If you don't make that incorrect assumption, then it becomes a fairness issue among classes of waiters, where the classes are defined by whether they happened before a certain signal call.

> This is guaranteed by the standard (from the documentation of pthread_cond_wait
> and pthread_cond_signal on the opengroup site):
> 
> "The pthread_cond_signal() function shall unblock at least one of the threads
> that are blocked on the specified condition variable cond (if any threads are
> blocked on cond)."
>
> And I think the next quote makes it very clear what threads are considered to
> be blocked on the condvar at the time of the call to pthread_cond_signal():

It says that a thread must be considered blocked if the cond_wait call happened before the signal call.  It does NOT say that ONLY those threads need to be considered blocked.

> "That is, if another thread is able to acquire the mutex after the
> about-to-block thread has released it, then a subsequent call to
> pthread_cond_broadcast() or pthread_cond_signal() in that thread shall behave
> as if it were issued after the about-to-block thread has blocked."
> 
> In effect this means that each call to pthread_cond_signal() defines a point in
> time and all waiters (or calls to pthread_cond_wait() if you prefer) are either
> before this call, or after it. And only the ones that are before it are allowed
> to consume the signal sent by this call.

No it does not.  It only talks about what happens before (=> meaning it implies):

waiter happens-before signaler => waiter is considered blocked

It does not say (<=> meaning being equivalent to):

waiter happens-before signaler <=> waiter is considered blocked

(In fact, if you spin the second (<=>) further, you could only expect / build an ordering based on the mutex, but not for other kinds of enforced happens-before order. That's not what we'd want.)

> On the other hand, the standard doesn't guarantee that there won't be spurious
> wakeups. However, glibc tries to prevent them. But the logic for this
> prevention is flawed and causes the race that this bug is about.

The term "race" here is misleading.  If there's a race, it's in how you use the cond var and expect it to behave.  The logic would be flawed if it allowed incorrect behavior, which it doesn't.

> So the net result is that glibc chose to provide a feature that is not
> required, but dropped a much more important feature which is actually required.
> Hence, this bug is not a fairness feature request, it is a correctness defect
> report.

No.  You assume a guarantee that isn't required by the standard.

The comment that you are asking for a fairness feature is an attempt at an explanation for you that should point out the difference to what the standard guarantees.  This was meant to help in this discussion.  If we want to further discuss this, I believe this needs to focus around whether the standard requires that certain threads must not be considered blocked.
Comment 23 Torvald Riegel 2012-09-20 11:58:26 UTC
(In reply to comment #21)
> (In reply to comment #20)
> > The standard indeed doesn't talk about the "future".  It doesn't make a sort of
> > lower-bound requirement on which threads have to be considered blocked, but no
> > upper bound.  If you think there's an upper bound, please point the requirement
> > in the standard.  If there is no required upper bound, it's up to the
> > implementation how to deal with that.
> 
> "The pthread_cond_broadcast() and pthread_cond_signal() functions shall have no
> effect if there are no threads currently blocked on cond."
> 
> How about this as an upper bound?

This states something in relation to those threads that are considered to be blocked.  It does not state anything about which threads can or have to be considered to be blocked.  So, it can't be an upper bound.

> If implementations are allowed to determine
> the set of blocked threads at any point in time they see fit, there would be no
> way to define "currently blocked" at all and this sentence couldn't make any
> sense.

There is a lower bound (or minimum requirement) based on the happens-before via the mutex (hence "currently").  The sentence allows the implementation to let the signal have no effect if there is no thread that has to be considered blocked with the assumption of the lower bound.  Assuming more threads to be blocked is the same as allowing spurious wake-ups.

> And also:
> 
> ".... however, if predictable scheduling behavior is required, then that mutex
> shall be locked by the thread calling pthread_cond_broadcast() or
> pthread_cond_signal()."
> 
> If I accept your argument, there will be no way to determine at least a set of
> threads from which the woken thread will be chosen, so why does the standard
> talk about predictability?

There is the lower bound, which does determine properties of this set.
Comment 24 Mihail Mihaylov 2012-09-20 12:46:24 UTC
(In reply to comment #23)
> This states something in relation to those threads that are considered to be
> blocked.  It does not state anything about which threads can or have to be
> considered to be blocked.  So, it can't be an upper bound.

It doesn't say "blocked at the time the signal is delivered". It says "currently blocked". And it doesn't talk about the effect of the delivery of the signal, it talks about the effect of the function. So "current" and "function" should be related somehow. I'm taking the relation to be that "current" means "at the time of the call to the function".

But the language of the standard is too vague to make it explicit what "currently" means here, so I guess you are free to interpret it otherwise.

> There is a lower bound (or minimum requirement) based on the happens-before via
> the mutex (hence "currently").  The sentence allows the implementation to let
> the signal have no effect if there is no thread that has to be considered
> blocked with the assumption of the lower bound.  Assuming more threads to be
> blocked is the same as allowing spurious wake-ups.

I don't really buy your argument here, but I just realized that this sentence only talks about the situation when there are no blocked threads, so I cannot use it to reason about the case when there are blocked waiters before the call.

Again, the vague language of the standard allows your interpretation.

> > If I accept your argument, there will be no way to determine at least a set of
> > threads from which the woken thread will be chosen, so why does the standard
> > talk about predictability?
> 
> There is the lower bound, which does determine properties of this set.

What I meant is that if we accept your interpretation, there are no threads that can be excluded from the set of blocked threads - except the ones that never called, nor will ever call pthread_cond_wait() and the ones that have already returned from all the calls to pthread_cond_wait() they will ever make.

To illustrate this, let me take what you're saying to its logical conclusion.

Suppose for simplicity that we have a single call to pthread_cond_signal() and many calls to pthread_cond_wait(), both before and after the call to pthread_cond_signal(). What you are actually saying is that it is correct for the call to pthread_cond_signal() to consider all of these threads to be blocked, and is allowed to wake any of them, even threads that are not created yet.

Even more, if the signaling thread calls pthread_cond_wait() some time (even hours) after it called pthread_cond_signal(), it is correct (although undesirable) for it to consume its own signal and for all the rest of the waiters to become blocked.

Although I agree that the spec allows this interpretation, I think it is highly impractical and unintuitive, that's why I believe it is not what the authors had in mind. But of course, the only resolution of this argument at this point is to ask them for clarification. So I don't see any point in arguing any more without involving them.

Meanwhile, just consider this: you have code in the implementation which tries to prevent spurious wakeups and it basically aims to establish a timeline of the calls to pthread_cond_signal() and pthread_cond_wait() and assumes that a wakeup is spurious if it occurred without a signal being sent after the respective thread blocked. I dare say that the code suggests that whoever wrote this code had the same assumptions about ordering as me. And it actually contradicts your interpretation, because with your interpretation a single counter of outstanding signals would be enough.

BTW, the reason why this code fails to work correctly is very simple - you can't detect spurious wakeups reliably using constant memory without giving up all ordering guarantees.
Comment 25 Mihail Mihaylov 2012-09-20 12:49:05 UTC
(In reply to comment #24)
> Even more, if the signaling thread calls pthread_cond_wait() some time (even
> hours) after it called pthread_cond_signal(). it is correct (although
> undesirable) for it to consume its own signal and all the rest waiters to
> become blocked.

I meant "to remain blocked"...
Comment 26 Torvald Riegel 2012-09-20 16:21:01 UTC
(In reply to comment #24)
> Although I agree that the spec allows this interpretation, I think it is highly
> impractical and unintuitive, that's why I believe it is not what the authors
> had in mind. But of course, the only resolution of this argument at this point
> is to ask them for clarification. So I don't see any point in arguing any more
> without involving them.

If the requirements in the specification should change in the future, we can look at this again. I personally don't think that this makes cond vars significantly impractical or unintuitive, but that is a trade-off that the standards body will have to make (or probably did when it produced the current spec).

Until then, I suppose that you now agree that this is correct behavior according to the current spec.

> Meanwhile, just consider this: you have code in the implementation which tries
> to prevent spurious wakeups and it basically aims to establish a timeline of
> the calls to pthread_cond_signal() and pthread_cond_wait()

Not a timeline, but just overall number of signals or threads that started waiting.

> and assumes that a
> wakeup is spurious if it occurred without a signal being sent after the
> respective thread blocked.

It is spurious if there were not enough signals being sent, or other threads consumed the signals that were sent.

> I dare say that the code suggests that whoever wrote
> this code had the same assumptions about ordering as me.

Looking at the git log for the respective files, you should take that question to Ulrich or Jakub.

> And it actually
> contradicts your interpretation, because with your interpretation a single
> counter of outstanding signals would be enough.

I doubt that (if you want to do this efficiently).

> BTW, the reason why this code fails to work correctly is very simple - you
> can't detect spurious wakeups reliably using constant memory without giving up
> all ordering guarantees.

I think you can (just combine something like a ticket lock with the futex) -- but this won't be efficient.
Comment 27 Rich Felker 2012-09-20 18:39:10 UTC
I disagree strongly that the spec even allows Torvald's interpretation. Torvald's claim is essentially that an implementation can consider an unspecified set of threads beyond those which "have blocked" per the specification of pthread_cond_wait to also "have blocked" on the condition. Not only is there no language in the standard to support this (the only definition of "blocked" on a condition variable is the one we've cited); it would also make the specification of pthread_cond_signal completely useless, as the "at least one" could always be taken to mean "the one invisible thread your application can't see that's always blocked on every possible condition variable". This is the pinnacle of absurdity in an attempt to language-lawyer out of fixing a bug.

The fact of the matter is that POSIX, along with common sense and the entire intended usage of condition variables, requires pthread_cond_signal to unblock at least one thread that "has blocked", in the sense of the blocking that happens atomically with unlocking the mutex as described in the specification of pthread_cond_wait.

The situation we're looking at here is that the authors of NPTL came up with a clever hack to reduce spurious wakeups in pthread_cond_wait, but failed to realize the hack was broken in some corner cases, and now Torvald is unwilling to admit that the hack is broken and take it out.

It should also be noted that I have been unable to demonstrate any case where NPTL's hack to prevent spurious wakeups actually has any positive effect. A while back I wrote a test program to hammer condition variables and look for spurious wakeups, and I could not generate any with either NPTL's buggy implementation OR with musl's implementation which does not use any similar hack and does not suffer from this bug. Thus, in addition to being wrong and broken, it's my conclusion, unless anybody else can produce evidence to the contrary, that the hack is also useless (does not have the intended effect of improving performance).
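
Roughly, that kind of stress test looks like the sketch below (this is not Rich's actual program; note that a wakeup observed with nothing pending may be either truly spurious or a wakeup whose signal was consumed by another thread, so the count is only approximate):

   #include <pthread.h>
   #include <stdio.h>

   #define WAITERS 16
   #define ROUNDS  100000

   static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
   static pthread_cond_t  c = PTHREAD_COND_INITIALIZER;
   static int pending;              /* signals sent but not yet consumed */
   static long suspect_wakeups;     /* wakeups seen with nothing pending */
   static int done;

   static void *waiter(void *arg)
   {
       (void)arg;
       pthread_mutex_lock(&m);
       while (!done) {
           while (pending == 0 && !done) {
               pthread_cond_wait(&c, &m);
               if (pending == 0 && !done)
                   suspect_wakeups++;
           }
           if (pending > 0)
               pending--;
       }
       pthread_mutex_unlock(&m);
       return NULL;
   }

   int main(void)
   {
       pthread_t t[WAITERS];
       for (int i = 0; i < WAITERS; i++)
           pthread_create(&t[i], NULL, waiter, NULL);
       for (int i = 0; i < ROUNDS; i++) {
           pthread_mutex_lock(&m);
           pending++;
           pthread_cond_signal(&c);
           pthread_mutex_unlock(&m);
       }
       pthread_mutex_lock(&m);
       done = 1;
       pthread_cond_broadcast(&c);
       pthread_mutex_unlock(&m);
       for (int i = 0; i < WAITERS; i++)
           pthread_join(t[i], NULL);
       printf("wakeups with nothing pending: %ld\n", suspect_wakeups);
       return 0;
   }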
Comment 28 Mihail Mihaylov 2012-09-20 19:48:25 UTC
(In reply to comment #27)
> I disagree strongly that the spec even allows Torvald's interpretation.
> Torvald's claim is essentially that an implementation can consider an
> unspecified set of threads beyond those which "have blocked" per the
> specification of pthread_cond_wait to also "have blocked" on the condition.

Yes, that's what he claims.

> Not
> only is there no language in the standard to support this (the only definition
> of "blocked" on a condition variable is the one we've cited);

Yes, there is no language to support it, but I must admit that there is also no language to explicitly prevent it, even though I too consider this interpretation completely unreasonable as I tried to explain several times.

Anyway, this whole dispute has been reduced to the question of which threads are eligible for wakeup, so I've taken the liberty to post a clarification request to the Austin Group, asking them to add explicit text explaining which threads should be considered blocked with respect to a pthread_cond_signal() call. The clarification request is at http://austingroupbugs.net/view.php?id=609. Torvald, please correct me if I have inadvertently misrepresented your position.
Comment 29 Rich Felker 2012-09-20 20:31:26 UTC
Language explicitly preventing something is not necessary. Do you see any language that explicitly prevents an implementation from writing the text "42" to stdout when you call strlen? I would grant that Torvald has an argument if the application had called any glibc functions not specified by POSIX (which could be defined by the implementation to do all sorts of crazy things) or if the application had passed constants other than those defined by POSIX to standard functions (e.g. a special attribute for the condition variable). But in the absence of that, no interface in the standard library can have further side effects on other interfaces/objects than what it's specified to do.
Comment 30 Torvald Riegel 2012-09-21 08:03:55 UTC
(In reply to comment #27)
> I disagree strongly that the spec even allows Torvald's interpretation.
> Torvald's claim is essentially that an implementation can consider an
> unspecified set of threads beyond those which "have blocked" per the
> specification of pthread_cond_wait to also "have blocked" on the condition.

No it's not unspecified. The specification requires a lower bound on this set. If you think that this is not just a lower bound, argue against the reasons I gave for this, don't just repeat your opinion here, please.

> Not
> only is there no language in the standard to support this (the only definition
> of "blocked" on a condition variable is the one we've cited);

Which is the lower bound. And this is a meaningful requirement, and provides cond vars that are useful.

> it would also
> make the specification of pthread_cond_signal completely useless, as the "at
> least one" could always be taken to mean "the one invisible thread your
> application can't see that's always blocked on every possible condition
> variable".

You do see that there is a difference between this and requiring "blocked" but not requiring an upper bound based on happens before?

> This is the pinnacle of absurdity in an attempt to language-lawyer
> out of fixing a bug.

Frankly, I don't care what you think that my motivations are. And I also don't speculate here about your motivations, nor about your abilities to distinguish an implication from equivalence. So please stop making such statements.

What I care about is whether this is indeed a bug or not.  Taking this question to the standards body for clarification is the right way to go if people think that the standard should require something else, or should be more precise about what is allowed or not.

> The fact of the matter is that POSIX, along with common sense and the entire
> intended usage of condition variables, requires pthread_cond_signal to unblock
> at least one thread that "has blocked", in the sense of the blocking that
> happens atomically with unlocking the mutex as described in the specification
> of pthread_cond_wait.

That doesn't conflict with what I said. You mention an entirely different point here.

> The situation we're looking at here is that the authors of NPTL came up with a
> clever hack to reduce spurious wakeups in pthread_cond_wait, but failed to
> realize the hack was broken in some corner cases, and now Torvald is unwilling
> to admit that the hack is broken and take it out.

See above. Aside from this being impolite, speculation like this does not belong in a bug report.
Comment 31 Rich Felker 2012-09-21 08:52:51 UTC
Such "speculation" does belong in a bug report so that there is public record of this behavior of trying to weasel out of conformance requirements. Public opinion counts, and there should be a cost in public opinion to software maintainers who refuse to fix bugs and try to argue that their bugs are not bugs.

Certainly I cannot read your mind to prove claims about your motivation. It really just boils down to a case of Occam's razor. When fixing the issue would basically amount to an all-minuses patch that greatly simplifies the code, yet the maintainers refuse to do it, the simplest explanation is that somebody has an attachment to the code and the work that went into writing it. If your motivation is something else, like perhaps concerns about making a mistake and breaking something in the process of fixing it, why not say that outright rather than leaving everybody stuck having to speculate about your motivations? Then the pros and cons of fixing it can be argued rationally. Or do you claim you have no further motivation here than "I believe the standard allows low-quality implementations with bad properties like this, so I want to exercise my right to do a bad job and still claim conformance"?

With the motivation topic out of the way, let's get back to the requirements of the standard. My understanding is that you claim "the threads that are blocked on the specified condition variable cond" is a set consisting of at least those threads which provably (by virtue of the mutex providing ordering) called pthread_cond_[timed]wait before the caller of pthread_cond_signal obtained the mutex, but which also may contain other threads. Not, as I originally accused you of, a completely arbitrary set of other threads, but rather threads which are "about to" wait on the condition variable in the immediate future after the signaling thread unlocks the mutex. Is this a correct assessment of your position?

If so, can you clarify what condition would qualify such threads for membership in this set? Certainly it can't be any thread that could ever conceivably wait on the condition variable at any later point in the program flow; if nothing else, that set is self-referential and thus not even a set (because which threads are candidates for membership may depend on which thread the signal unblocked).

My claim that the set of candidate threads pthread_cond_signal can unblock is a fixed set is based on the following:

1. The status of being blocked on a condition variable is defined only in the specification of pthread_cond_[timed]wait. I think we both agree on this. There is no way a thread is ever blocked on a condition variable except by calling pthread_cond_[timed]wait on it.

2. Sequencing of events between threads is not defined in general, but the use of mutexes with condition variables imposes a sequencing. In particular, from the point of view of any other thread, a given thread's status as being blocked on a condition variable is acquired after it obtains the mutex and before pthread_cond_[timed]wait releases the mutex. The former sequencing requirement comes from the fact that you can only call pthread_cond_[timed]wait with the mutex held, and the fact that pthread_cond_[timed]wait is the function specified to block; the latter sequencing requirement is the atomicity of blocking and unlocking the mutex.

3. The pthread_cond_signal function "shall unblock at least one of the threads that are blocked on the specified condition variable". There is no language about queuing an unblock request; the language of the standard is "shall unblock". This means that even if the mechanism is some sort of queued request, in the sense of the formal, abstract machine, one of the threads in the set of blocked threads goes from blocked status to unblocked status during the call to pthread_cond_signal (and, from the standpoint of observable sequencing, between the calling thread's calls to pthread_mutex_lock and pthread_mutex_unlock). Since the mutex is locked during this time, there is no way additional threads can become blocked on the condition variable. On the other hand, some could become unblocked due to spurious wakes, so the set could shrink, but it could not grow.

If you still claim my reasoning is wrong, please cite the specific steps you believe are hand-waving or misinterpretation of the standard.
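
As a concrete illustration of points 2 and 3, consider the following minimal sketch (all names are invented for this example; it is not one of the attached test cases). The waiter sets a flag and then calls pthread_cond_wait(), which releases the mutex atomically with blocking; so once the signaler acquires the mutex and observes the flag, the waiter is provably already blocked on the condition variable, and a signal sent at that point must be eligible to wake it rather than a thread that only starts waiting after the signaler unlocks:

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  c = PTHREAD_COND_INITIALIZER;
static int registered = 0;  /* set by the waiter before it blocks on c */
static int signaled = 0;    /* set by the signaler; guards against spurious wakeups */

static void *waiter(void *arg)
{
    (void) arg;
    pthread_mutex_lock(&m);
    registered = 1;
    /* pthread_cond_wait() releases m atomically with blocking on c, so any
       thread that later acquires m and sees registered == 1 knows this
       thread is already "blocked on the condition variable".  */
    while (!signaled)
        pthread_cond_wait(&c, &m);
    pthread_mutex_unlock(&m);
    return NULL;
}

static void *signaler(void *arg)
{
    (void) arg;
    for (;;) {
        pthread_mutex_lock(&m);
        if (registered) {
            signaled = 1;
            /* The waiter is provably blocked on c here, so "shall unblock
               at least one of the threads that are blocked" must apply to
               it, not to a thread that only calls pthread_cond_wait()
               after we release m.  */
            pthread_cond_signal(&c);
            pthread_mutex_unlock(&m);
            return NULL;
        }
        pthread_mutex_unlock(&m);
    }
}

int main(void)
{
    pthread_t w, s;
    pthread_create(&w, NULL, waiter, NULL);
    pthread_create(&s, NULL, signaler, NULL);
    pthread_join(w, NULL);
    pthread_join(s, NULL);
    puts("signal consumed by a thread that was blocked before it was sent");
    return 0;
}

On a conforming implementation this program always terminates; its only purpose is to show where the happens-before edges that define the set of blocked threads come from.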
Comment 32 Torvald Riegel 2012-09-21 15:44:30 UTC
(In reply to comment #28)
> (In reply to comment #27)
> > I disagree strongly that the spec even allows Torvald's interpretation.
> > Torvald's claim is essentially that an implementation can consider an
> > unspecified set of threads beyond those which "have blocked" per the
> > specification of pthread_cond_wait to also "have blocked" on the condition.
> 
> Yes, that's what he claims.

Not quite, as I point out in comment #30.

> Anyway, this whole dispute has been reduced to the question of which threads
> are eligible for wakeup, so I've taken the liberty to post a clarification
> request to the Austin Group, asking them to add explicit text explaining which
> threads should be considered blocked with respect to a pthread_cond_signal()
> call. The clarification request is at
> http://austingroupbugs.net/view.php?id=609. Torvald, please correct me if I have
> inadvertently misrepresented your position.

Thanks.  I have added a note there that summarizes the interpretation that the current implementation is based on.  Let's see what they think about this issue.
Comment 33 Mihail Mihaylov 2012-10-18 06:26:22 UTC
(In reply to comment #32)
> (In reply to comment #28)
> ....
> ....
> > Anyway, this whole dispute has been reduced to the question of which threads
> > are eligible for wakeup, so I've taken the liberty to post a clarification
> > request to the Austin Group, asking them to add explicit text explaining which
> > threads should be considered blocked with respect to a pthread_cond_signal()
> > call. The clarification request is at
> > http://austingroupbugs.net/view.php?id=609. Torvald, please correct me if I have
> > inadvertently misrepresented your position.
> 
> Thanks.  I have added a note there that summarizes the interpretation that the
> current implementation is based on.  Let's see what they think about this
> issue.

The Austin Group have reached an official position. They have decided to make changes to some of the texts related to condition variables. I believe that the changes as they announced them yesterday invalidate glibc's interpretation of the spec.

Let me point out that these changes do not add new requirements to the spec. They just make explicit the requirements that were already suggested by the spec.

In my opinion, at this point it is already clear that NPTL's implementation of condition variables does not conform to the POSIX spec, and therefore this really is a bug. I hope that the NPTL team will acknowledge it as such.
Comment 34 Rich Felker 2012-10-18 12:25:16 UTC
It should be noted that no changes were made to the requirements the standard places on implementations; the changes made are only clarifications, since apparently the original language was not clear enough.
Comment 35 Torvald Riegel 2012-10-24 20:25:32 UTC
(In reply to comment #33)
> The Austin Group have reached an official position. They have decided to make
> changes to some of the texts related to condition variables. I believe that the
> changes as they announced them yesterday invalidate glibc's interpretation of
> the spec.

I agree.  Those changes disallow glibc's current behavior.

> Let me point out that these changes do not add new requirements to the spec.
> They just make explicit the requirements that were already suggested by the
> spec.

I disagree.  It's a specification -- it has to be explicit.

(In reply to comment #34)
> It should be noted that no changes were made to the requirements the standard
> places on implementations; the changes made are only clarifications, since
> apparently the original language was not clear enough.

How is that not a change in the spec: http://austingroupbugs.net/view.php?id=609#c1403 ?  Are we talking about the same thing here?

Apparently, the original language was not clear enough.  Otherwise, why change the spec?  If it were clear enough, an interpretation explanation or something like that would have been sufficient, right?  They even say that they see the need to "produce some interpretation text to precede these changes".
Comment 36 Rich Felker 2012-10-25 04:07:39 UTC
The note #0001403 you cited begins:

"In the Oct 17th conference call we agreed to make the following changes. Further work is needed to produce some interpretation text to precede these changes."

Said "interpretation text to precede these changes", as I understand it, is an explanation of how the current text is meant to be interpreted. The "following changes" then are clarifications to appear in the TC or next version of the standard, but the interpretation is (or will be) that the current standard already requires exactly the same thing, at least from an observable-behavior standpoint.

As I have stated before, I do not see any way one could interpret the current glibc behavior as conformant. Under the current glibc behavior, events that should be synchronized by a mutex observably/provably occur in an order different from the order the synchronization imposes.
Comment 37 Jackie Rosen 2014-02-16 17:45:38 UTC Comment hidden (spam)
Comment 38 Rich Felker 2014-08-16 22:19:20 UTC
I just read the source for pthread_cond_wait and the cause of the bug is obvious: the mutex is being unlocked before the current thread acquires the internal lock in the cond var and adds itself as a waiter. Simply moving the mutex unlock to somewhere after this point (e.g. right before the futex wait loop) should fix the problem. This is necessary to meet the requirement that formally becoming a waiter is atomic with unlocking of the mutex, rather than happening at some time after unlocking the mutex with a race window in between.

If someone else could test this fix, I would appreciate it, since I don't have an environment set up for building and testing new glibc builds.
Comment 39 Torvald Riegel 2014-08-18 21:22:47 UTC
That's not sufficient.  Or maybe I'm not seeing the obvious thing here -- but if it's obvious, could you give a proof of why what you described is sufficient?  (And we don't want to introduce ABA issues either...)
Comment 40 Rich Felker 2014-08-18 21:42:51 UTC
Indeed, I think I was wrong on this, and probably just misread the code. Sorry for the noise.
Comment 41 Sourceware Commits 2016-12-31 13:58:05 UTC
This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "GNU C Library master sources".

The branch, master has been updated
       via  ed19993b5b0d05d62cc883571519a67dae481a14 (commit)
      from  c0ff3befa9861171498dd29666d32899e9d8145b (commit)

Those revisions listed above that are new to this repository have
not appeared on any other notification email; so we list those
revisions in full, below.

- Log -----------------------------------------------------------------
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=ed19993b5b0d05d62cc883571519a67dae481a14

commit ed19993b5b0d05d62cc883571519a67dae481a14
Author: Torvald Riegel <triegel@redhat.com>
Date:   Wed May 25 23:43:36 2016 +0200

    New condvar implementation that provides stronger ordering guarantees.
    
    This is a new implementation for condition variables, required
    after http://austingroupbugs.net/view.php?id=609 to fix bug 13165.  In
    essence, we need to be stricter in which waiters a signal or broadcast
    is required to wake up; this couldn't be solved using the old algorithm.
    ISO C++ made a similar clarification, so this also fixes a bug in
    current libstdc++, for example.
    
    We can't use the old algorithm anymore because futexes do not guarantee
    to wake in FIFO order.  Thus, when we wake, we can't simply let any
    waiter grab a signal, but we need to ensure that one of the waiters
    happening before the signal is woken up.  This is something the previous
    algorithm violated (see bug 13165).
    
    There's another issue specific to condvars: ABA issues on the underlying
    futexes.  Unlike mutexes that have just three states, or semaphores that
    have no tokens or a limited number of them, the state of a condvar is
    the *order* of the waiters.  A waiter on a semaphore can grab a token
    whenever one is available; a condvar waiter must only consume a signal
    if it is eligible to do so as determined by the relative order of the
    waiter and the signal.
    Therefore, this new algorithm maintains two groups of waiters: Those
    eligible to consume signals (G1), and those that have to wait until
    previous waiters have consumed signals (G2).  Once G1 is empty, G2
    becomes the new G1.  64b counters are used to avoid ABA issues.
    
    This condvar doesn't yet use a requeue optimization (ie, on a broadcast,
    waking just one thread and requeueing all others on the futex of the
    mutex supplied by the program).  I don't think doing the requeue is
    necessarily the right approach (but I haven't done real measurements
    yet):
    * If a program expects to wake many threads at the same time and make
    that scalable, a condvar isn't great anyway because of how it requires
    waiters to operate mutually exclusively (due to the mutex usage).  Thus, a
    thundering herd problem is a scalability problem with or without the
    optimization.  Using something like a semaphore might be more
    appropriate in such a case.
    * The scalability problem is actually at the mutex side; the condvar
    could help (and it tries to with the requeue optimization), but it
    should be the mutex that decides how that is done, and whether it is done
    at all.
    * Forcing all but one waiter into the kernel-side wait queue of the
    mutex prevents/avoids the use of lock elision on the mutex.  Thus, it
    prevents the only cure against the underlying scalability problem
    inherent to condvars.
    * If condvars use short critical sections (ie, hold the mutex just to
    check a binary flag or such), which they should do ideally, then forcing
    all those waiters to proceed serially with kernel-based hand-off (ie,
    futex ops in the mutex' contended state, via the futex wait queues) will
    be less efficient than just letting a scalable mutex implementation take
    care of it.  Our current mutex impl doesn't employ spinning at all, but
    if critical sections are short, spinning can be much better.
    * Doing the requeue stuff requires all waiters to always drive the mutex
    into the contended state.  This leads to each waiter having to call
    futex_wake after lock release, even if this wouldn't be necessary.
    
    	[BZ #13165]
    	* nptl/pthread_cond_broadcast.c (__pthread_cond_broadcast): Rewrite to
    	use new algorithm.
    	* nptl/pthread_cond_destroy.c (__pthread_cond_destroy): Likewise.
    	* nptl/pthread_cond_init.c (__pthread_cond_init): Likewise.
    	* nptl/pthread_cond_signal.c (__pthread_cond_signal): Likewise.
    	* nptl/pthread_cond_wait.c (__pthread_cond_wait): Likewise.
    	(__pthread_cond_timedwait): Move here from pthread_cond_timedwait.c.
    	(__condvar_confirm_wakeup, __condvar_cancel_waiting,
    	__condvar_cleanup_waiting, __condvar_dec_grefs,
    	__pthread_cond_wait_common): New.
    	(__condvar_cleanup): Remove.
    	* npt/pthread_condattr_getclock.c (pthread_condattr_getclock): Adapt.
    	* npt/pthread_condattr_setclock.c (pthread_condattr_setclock):
    	Likewise.
    	* npt/pthread_condattr_getpshared.c (pthread_condattr_getpshared):
    	Likewise.
    	* npt/pthread_condattr_init.c (pthread_condattr_init): Likewise.
    	* nptl/tst-cond1.c: Add comment.
    	* nptl/tst-cond20.c (do_test): Adapt.
    	* nptl/tst-cond22.c (do_test): Likewise.
    	* sysdeps/aarch64/nptl/bits/pthreadtypes.h (pthread_cond_t): Adapt
    	structure.
    	* sysdeps/arm/nptl/bits/pthreadtypes.h (pthread_cond_t): Likewise.
    	* sysdeps/ia64/nptl/bits/pthreadtypes.h (pthread_cond_t): Likewise.
    	* sysdeps/m68k/nptl/bits/pthreadtypes.h (pthread_cond_t): Likewise.
    	* sysdeps/microblaze/nptl/bits/pthreadtypes.h (pthread_cond_t):
    	Likewise.
    	* sysdeps/mips/nptl/bits/pthreadtypes.h (pthread_cond_t): Likewise.
    	* sysdeps/nios2/nptl/bits/pthreadtypes.h (pthread_cond_t): Likewise.
    	* sysdeps/s390/nptl/bits/pthreadtypes.h (pthread_cond_t): Likewise.
    	* sysdeps/sh/nptl/bits/pthreadtypes.h (pthread_cond_t): Likewise.
    	* sysdeps/tile/nptl/bits/pthreadtypes.h (pthread_cond_t): Likewise.
    	* sysdeps/unix/sysv/linux/alpha/bits/pthreadtypes.h (pthread_cond_t):
    	Likewise.
    	* sysdeps/unix/sysv/linux/powerpc/bits/pthreadtypes.h (pthread_cond_t):
    	Likewise.
    	* sysdeps/x86/bits/pthreadtypes.h (pthread_cond_t): Likewise.
    	* sysdeps/nptl/internaltypes.h (COND_NWAITERS_SHIFT): Remove.
    	(COND_CLOCK_BITS): Adapt.
    	* sysdeps/nptl/pthread.h (PTHREAD_COND_INITIALIZER): Adapt.
    	* nptl/pthreadP.h (__PTHREAD_COND_CLOCK_MONOTONIC_MASK,
    	__PTHREAD_COND_SHARED_MASK): New.
    	* nptl/nptl-printers.py (CLOCK_IDS): Remove.
    	(ConditionVariablePrinter, ConditionVariableAttributesPrinter): Adapt.
    	* nptl/nptl_lock_constants.pysym: Adapt.
    	* nptl/test-cond-printers.py: Adapt.
    	* sysdeps/unix/sysv/linux/hppa/internaltypes.h (cond_compat_clear,
    	cond_compat_check_and_clear): Adapt.
    	* sysdeps/unix/sysv/linux/hppa/pthread_cond_timedwait.c: Remove file ...
    	* sysdeps/unix/sysv/linux/hppa/pthread_cond_wait.c
    	(__pthread_cond_timedwait): ... and move here.
    	* nptl/DESIGN-condvar.txt: Remove file.
    	* nptl/lowlevelcond.sym: Likewise.
    	* nptl/pthread_cond_timedwait.c: Likewise.
    	* sysdeps/unix/sysv/linux/i386/i486/pthread_cond_broadcast.S: Likewise.
    	* sysdeps/unix/sysv/linux/i386/i486/pthread_cond_signal.S: Likewise.
    	* sysdeps/unix/sysv/linux/i386/i486/pthread_cond_timedwait.S: Likewise.
    	* sysdeps/unix/sysv/linux/i386/i486/pthread_cond_wait.S: Likewise.
    	* sysdeps/unix/sysv/linux/i386/i586/pthread_cond_broadcast.S: Likewise.
    	* sysdeps/unix/sysv/linux/i386/i586/pthread_cond_signal.S: Likewise.
    	* sysdeps/unix/sysv/linux/i386/i586/pthread_cond_timedwait.S: Likewise.
    	* sysdeps/unix/sysv/linux/i386/i586/pthread_cond_wait.S: Likewise.
    	* sysdeps/unix/sysv/linux/i386/i686/pthread_cond_broadcast.S: Likewise.
    	* sysdeps/unix/sysv/linux/i386/i686/pthread_cond_signal.S: Likewise.
    	* sysdeps/unix/sysv/linux/i386/i686/pthread_cond_timedwait.S: Likewise.
    	* sysdeps/unix/sysv/linux/i386/i686/pthread_cond_wait.S: Likewise.
    	* sysdeps/unix/sysv/linux/x86_64/pthread_cond_broadcast.S: Likewise.
    	* sysdeps/unix/sysv/linux/x86_64/pthread_cond_signal.S: Likewise.
    	* sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S: Likewise.
    	* sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S: Likewise.

-----------------------------------------------------------------------

Summary of changes:
 ChangeLog                                          |   74 ++
 nptl/DESIGN-condvar.txt                            |  134 ---
 nptl/Makefile                                      |    6 +-
 nptl/lowlevelcond.sym                              |   16 -
 nptl/nptl-printers.py                              |   70 +--
 nptl/nptl_lock_constants.pysym                     |   27 +-
 nptl/pthreadP.h                                    |    7 +
 nptl/pthread_cond_broadcast.c                      |   99 +-
 nptl/pthread_cond_common.c                         |  466 ++++++++++
 nptl/pthread_cond_destroy.c                        |   82 +--
 nptl/pthread_cond_init.c                           |   28 +-
 nptl/pthread_cond_signal.c                         |   99 ++-
 nptl/pthread_cond_timedwait.c                      |  268 ------
 nptl/pthread_cond_wait.c                           |  754 ++++++++++++----
 nptl/pthread_condattr_getclock.c                   |    2 +-
 nptl/pthread_condattr_getpshared.c                 |    3 +-
 nptl/pthread_condattr_init.c                       |    4 +-
 nptl/pthread_condattr_setclock.c                   |   11 +-
 nptl/test-cond-printers.py                         |    2 +-
 nptl/tst-cond1.c                                   |    3 +
 nptl/tst-cond20.c                                  |    5 +-
 nptl/tst-cond22.c                                  |   18 +-
 sysdeps/aarch64/nptl/bits/pthreadtypes.h           |   31 +-
 sysdeps/arm/nptl/bits/pthreadtypes.h               |   29 +-
 sysdeps/ia64/nptl/bits/pthreadtypes.h              |   31 +-
 sysdeps/m68k/nptl/bits/pthreadtypes.h              |   32 +-
 sysdeps/microblaze/nptl/bits/pthreadtypes.h        |   29 +-
 sysdeps/mips/nptl/bits/pthreadtypes.h              |   31 +-
 sysdeps/nios2/nptl/bits/pthreadtypes.h             |   31 +-
 sysdeps/nptl/internaltypes.h                       |   17 +-
 sysdeps/nptl/pthread.h                             |    2 +-
 sysdeps/s390/nptl/bits/pthreadtypes.h              |   29 +-
 sysdeps/sh/nptl/bits/pthreadtypes.h                |   29 +-
 sysdeps/tile/nptl/bits/pthreadtypes.h              |   29 +-
 sysdeps/unix/sysv/linux/alpha/bits/pthreadtypes.h  |   31 +-
 sysdeps/unix/sysv/linux/hppa/internaltypes.h       |   40 +-
 .../unix/sysv/linux/hppa/pthread_cond_timedwait.c  |   41 -
 sysdeps/unix/sysv/linux/hppa/pthread_cond_wait.c   |   13 +
 .../sysv/linux/i386/i686/pthread_cond_timedwait.S  |   20 -
 .../unix/sysv/linux/i386/pthread_cond_broadcast.S  |  241 -----
 sysdeps/unix/sysv/linux/i386/pthread_cond_signal.S |  216 -----
 .../unix/sysv/linux/i386/pthread_cond_timedwait.S  |  974 --------------------
 sysdeps/unix/sysv/linux/i386/pthread_cond_wait.S   |  642 -------------
 .../unix/sysv/linux/powerpc/bits/pthreadtypes.h    |   31 +-
 .../sysv/linux/x86_64/pthread_cond_broadcast.S     |  177 ----
 .../unix/sysv/linux/x86_64/pthread_cond_signal.S   |  161 ----
 .../sysv/linux/x86_64/pthread_cond_timedwait.S     |  623 -------------
 sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S |  555 -----------
 sysdeps/x86/bits/pthreadtypes.h                    |   29 +-
 49 files changed, 1671 insertions(+), 4621 deletions(-)
 delete mode 100644 nptl/DESIGN-condvar.txt
 delete mode 100644 nptl/lowlevelcond.sym
 create mode 100644 nptl/pthread_cond_common.c
 delete mode 100644 nptl/pthread_cond_timedwait.c
 delete mode 100644 sysdeps/unix/sysv/linux/hppa/pthread_cond_timedwait.c
 delete mode 100644 sysdeps/unix/sysv/linux/i386/i686/pthread_cond_timedwait.S
 delete mode 100644 sysdeps/unix/sysv/linux/i386/pthread_cond_broadcast.S
 delete mode 100644 sysdeps/unix/sysv/linux/i386/pthread_cond_signal.S
 delete mode 100644 sysdeps/unix/sysv/linux/i386/pthread_cond_timedwait.S
 delete mode 100644 sysdeps/unix/sysv/linux/i386/pthread_cond_wait.S
 delete mode 100644 sysdeps/unix/sysv/linux/x86_64/pthread_cond_broadcast.S
 delete mode 100644 sysdeps/unix/sysv/linux/x86_64/pthread_cond_signal.S
 delete mode 100644 sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S
 delete mode 100644 sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S
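
To illustrate the ordering guarantee the new implementation has to provide -- a signal may only be consumed by a waiter whose wait demonstrably began before the signal -- here is a deliberately simplified "ticket" model. It is not the glibc algorithm (the commit above uses the G1/G2 group scheme with futexes to get this property efficiently and without ABA problems), it spin-waits instead of blocking in the kernel, and it ignores timeouts and cancellation; all names are invented for this sketch:

#include <pthread.h>
#include <sched.h>
#include <stdint.h>

/* Toy condition variable: each waiter takes an increasing ticket; a signal
   releases exactly the oldest waiter whose ticket was issued before the
   signal.  A waiter that arrives later (higher ticket) can never consume
   that signal. */
struct toy_cond {
    pthread_mutex_t ilock;   /* internal lock protecting the counters     */
    uint64_t next_ticket;    /* ticket handed to the next waiter          */
    uint64_t eligible;       /* waiters with ticket < eligible may return */
};

#define TOY_COND_INIT { PTHREAD_MUTEX_INITIALIZER, 0, 0 }

void toy_cond_wait(struct toy_cond *c, pthread_mutex_t *user_mutex)
{
    pthread_mutex_lock(&c->ilock);
    uint64_t t = c->next_ticket++;     /* become a waiter...                 */
    pthread_mutex_unlock(&c->ilock);
    pthread_mutex_unlock(user_mutex);  /* ...before releasing the user mutex */

    for (;;) {                         /* spin; a real condvar would futex-wait */
        pthread_mutex_lock(&c->ilock);
        int released = (t < c->eligible);
        pthread_mutex_unlock(&c->ilock);
        if (released)
            break;
        sched_yield();
    }
    pthread_mutex_lock(user_mutex);
}

void toy_cond_signal(struct toy_cond *c)
{
    pthread_mutex_lock(&c->ilock);
    /* Only release a waiter that already holds a ticket, i.e. one whose
       wait began before this signal; never "pre-credit" future waiters. */
    if (c->eligible < c->next_ticket)
        c->eligible++;
    pthread_mutex_unlock(&c->ilock);
}

void toy_cond_broadcast(struct toy_cond *c)
{
    pthread_mutex_lock(&c->ilock);
    c->eligible = c->next_ticket;      /* release every current waiter */
    pthread_mutex_unlock(&c->ilock);
}

Usage mirrors pthread_cond_wait: hold the user mutex and loop on the predicate around toy_cond_wait(). The point mirrored from the commit message is that a waiter's position in the order (its ticket) is fixed before the user mutex is released, so a signal sent while holding that mutex can only release a waiter ordered before it -- the property whose violation is the subject of this bug.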
Comment 42 jsm-csl@polyomino.org.uk 2016-12-31 17:43:41 UTC
If the commit fixes the bug, please mark it as FIXED with milestone set to 
2.25.
Comment 43 Torvald Riegel 2017-01-01 21:32:09 UTC
Fixed in master.