Bug 13690

Summary: pthread_mutex_unlock potentially cause invalid access
Product: glibc
Component: nptl
Reporter: Atsushi Nemoto <anemo>
Assignee: Not yet assigned to anyone <unassigned>
Status: RESOLVED FIXED
Severity: normal
Priority: P2
Version: 2.15
Target Milestone: 2.23
Flags: carlos: review? (jakub)
CC: bugdal, carlos, coolhair24, dancol, drepper.fsp, fweimer, jakub, kevin.dempsey, lopresti, mtk.manpages, neleai, ppluzhnikov, triegel, tudorb
Attachments: a patch to fix only lll_unlock,lll_robust_unlock call in pthread_mutex_unlock
             demonstration of spurious futex wakes from kernel

Description Atsushi Nemoto 2012-02-14 14:27:49 UTC
It seems pthread_mutex_unlock() can potentially cause an invalid access on
most platforms (all except i386 and x86_64).

In nptl/pthread_mutex_unlock.c, lll_unlock() is called like this:
      lll_unlock (mutex->__data.__lock, PTHREAD_MUTEX_PSHARED (mutex));

And PTHREAD_MUTEX_PSHARED() is defined like this:
# define PTHREAD_MUTEX_PSHARED(m) \
  ((m)->__data.__kind & 128)

On most platforms, lll_unlock() is defined as a macro like this:
#define lll_unlock(lock, private) \
  ((void) ({						      \
    int *__futex = &(lock);				      \
    int __val = atomic_exchange_rel (__futex, 0);	      \
    if (__builtin_expect (__val > 1, 0))		      \
      lll_futex_wake (__futex, 1, private);		      \
  }))

Thus, the lll_unlock() call in pthread_mutex_unlock.c will be expanded as:
    int *__futex = &(mutex->__data.__lock);
    int __val = atomic_exchange_rel (__futex, 0);
    if (__builtin_expect (__val > 1, 0))		/* A */
      lll_futex_wake (__futex, 1, ((mutex)->__data.__kind & 128)); /* B */

On point "A", the mutex is actually unlocked, so other threads can
lock the mutex, unlock, destroy and free.  If the mutex was destroyed
and freed by other thread, reading '__kind' on point "B" is not valid.

This can happen with the example given in the pthread_mutex_destroy manual:

http://pubs.opengroup.org/onlinepubs/007904875/functions/pthread_mutex_destroy.html
------------------------------------------------------------------------
    Destroying Mutexes

    A mutex can be destroyed immediately after it is unlocked. For
    example, consider the following code:

    struct obj {
        pthread_mutex_t om;
        int refcnt;
        ...
    };

    obj_done(struct obj *op)
    {
        pthread_mutex_lock(&op->om);
        if (--op->refcnt == 0) {
            pthread_mutex_unlock(&op->om);
    (A)     pthread_mutex_destroy(&op->om);
    (B)     free(op);
        } else
    (C)     pthread_mutex_unlock(&op->om);
    }

    In this case obj is reference counted and obj_done() is called
    whenever a reference to the object is dropped. Implementations are
    required to allow an object to be destroyed and freed and potentially
    unmapped (for example, lines A and B) immediately after the object is
    unlocked (line C).
------------------------------------------------------------------------

In this example, (A) and (B) can be executed in the middle of (C)'s execution.

It can happen in this way (explanation by KOSAKI-san):
 1) Thread-1: atomic_exchange_rel(0)
 2) preempt
 3) Thread-2: calls mutex_lock() (it succeeds)
 4) Thread-2: calls mutex_unlock()
 5) Thread-2: calls mutex_destroy()
 6) Thread-2: free(mutex)
 7) preempt
 8) Thread-3: reuses the memory of the mutex
 9) preempt
10) Thread-1: dereferences (mutex)->__data.__kind

Copying __kind to a local variable before atomic_exchange_rel
will fix this.
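
A minimal sketch of that idea, written with the internal macros shown above
(illustration only; the actual attached patch may differ in detail):

    /* Read everything we need from the mutex *before* the releasing
       exchange; afterwards another thread may lock, unlock, destroy and
       free the mutex, so it must not be dereferenced again.  */
    int private = PTHREAD_MUTEX_PSHARED (mutex);
    int *__futex = &mutex->__data.__lock;

    int __val = atomic_exchange_rel (__futex, 0);  /* mutex is now unlocked */
    if (__builtin_expect (__val > 1, 0))
      lll_futex_wake (__futex, 1, private);        /* no load from *mutex */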
Comment 1 Carlos O'Donell 2012-02-14 15:41:13 UTC
Nemoto-san,

Do you think you could come up with a patch to fix this?

I think we need to adjust nptl/pthread_mutex_unlock.c to copy the private flag into a local variable before the unlock and pass that copy down.
Comment 2 Atsushi Nemoto 2012-02-15 13:17:17 UTC
(In reply to comment #1)
> Do you think you could come up with a patch to fix this?
> 
> I think we need to adjust nptl/pthread_mutex_unlock.c to copy the private flag
> into a local variable before the unlock and pass that copy down.

Though fixing just the one lll_unlock call in pthread_mutex_unlock.c seems easy enough,
I suspect there may be other places that need fixing, but I am not sure.

For example:
* lll_futex_wake call in __pthread_mutex_unlock_full (PP mutex case)
* lll_unlock call in __pthread_rwlock_unlock

So I hope nptl experts fix this properly.
Thank you.
Comment 3 Carlos O'Donell 2012-02-15 14:33:47 UTC
(In reply to comment #2)
> For example:
> * lll_futex_wake call in __pthread_mutex_unlock_full (PP mutex case)
> * lll_unlock call in __pthread_rwlock_unlock
> 
> So I hope nptl experts fix this properly.
> Thank you.

Start slowly. Fix the first known problem. Then move on to the next.

Do you think you could come up with a patch to *only* fix the lll_unlock usage problem?
Comment 4 Rich Felker 2012-02-16 05:07:32 UTC
Analogous bugs are ENDEMIC in glibc/NPTL and so far there's been a complete unwillingness to fix them or even acknowledge that they exist. See http://sourceware.org/bugzilla/show_bug.cgi?id=12674

Fixing the issue to ensure that a synchronization object's memory is not touched whatsoever after it's unlocked/posted is non-trivial, but once you figure out the solution, it's rather general-purpose. A while back I audited all my synchronization primitives in musl libc for similar bugs and fixed them, so it might be a useful source for ideas to fix glibc/NPTL.
Comment 5 Atsushi Nemoto 2012-02-16 14:39:24 UTC
Created attachment 6222 [details]
a patch to fix only lll_unlock,lll_robust_unlock call in pthread_mutex_unlock
Comment 6 Atsushi Nemoto 2012-02-16 14:45:15 UTC
(In reply to comment #3)
> Start slowly. Fix the first known problem. Then move on to the next.
> 
> Do you think you could come up with a patch to *only* fix the lll_unlock usage
> problem?

OK, I just have uploaded a patch with obvious fixes only.
Comment 7 Carlos O'Donell 2012-02-16 15:26:24 UTC
(In reply to comment #4)
> Analogous bugs are ENDEMIC in glibc/NPTL and so far there's been a complete
> unwillingness to fix them or even acknowledge that they exist. See
> http://sourceware.org/bugzilla/show_bug.cgi?id=12674
> 
> Fixing the issue to ensure that a synchronization object's memory is not
> touched whatsoever after it's unlocked/posted is non-trivial, but once you
> figure out the solution, it's rather general-purpose. A while back I audited
> all my synchronization primitives in musl libc for similar bugs and fixed them,
> so it might be a useful source for ideas to fix glibc/NPTL.

I've assigned myself to BZ#12674 and I'll review the issue. Thank you for bringing it up.
Comment 8 Carlos O'Donell 2012-02-16 15:32:46 UTC
(In reply to comment #6)
> (In reply to comment #3)
> > Start slowly. Fix the first known problem. Then move on to the next.
> > 
> > Do you think you could come up with a patch to *only* fix the lll_unlock usage
> > problem?
> 
> OK, I just have uploaded a patch with obvious fixes only.

Nemoto-san,

Thank you very much for posting the patch!

What kind of testing have you done with the patch?

Do you have a small test case that can trigger the failure even sporadically?

It would be nice to get a test case added that documents the class of failure we tried to fix.
Comment 9 Rich Felker 2012-02-16 16:21:08 UTC
For all bugs like this, I suspect hitting the race condition will require running the test case for months or years. That's part of why bugs like this are so frustrating: imagine your mission-critical system crashing just a couple times a year and the crash not being reproducible. It might be easier to hit the race with some extreme usage of scheduling priorities to prevent the unlocking thread from executing for a long time.
Comment 10 Carlos O'Donell 2012-02-16 16:32:43 UTC
(In reply to comment #9)
> For all bugs like this, I suspect hitting the race condition will require
> running the test case for months or years. That's part of why bugs like this
> are so frustrating: imagine your mission-critical system crashing just a couple
> times a year and the crash not being reproducible. It might be easier to hit
> the race with some extreme usage of scheduling priorities to prevent the
> unlocking thread from executing for a long time.

I agree, but a test case need not be an exact representation of your application under test.

The trick I normally use is to preload an auditing library and use the plt_enter and plt_exit stubs to slow down a thread, widening the race window or in some cases making it a 100% reliable reproducer.

Such a trick is a perfectly acceptable way to make a test case, but might be hard to use in this situation.
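
For illustration, a bare-bones audit library of that kind could look like the sketch below (x86-64 hook shown; the 10 ms delay and the choice to slow down pthread_mutex_unlock are arbitrary assumptions for the example). Note that it only delays threads at PLT call boundaries, which is part of why it may be hard to use for a race that sits inside a single libc function.

/* delay_audit.c: build with "gcc -shared -fPIC -o delay_audit.so delay_audit.c"
   and run the test program with LD_AUDIT=./delay_audit.so.  */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <link.h>
#include <string.h>
#include <time.h>

unsigned int la_version (unsigned int version)
{
  return version;
}

unsigned int la_objopen (struct link_map *map, Lmid_t lmid, uintptr_t *cookie)
{
  /* Audit PLT calls from and to every loaded object.  */
  return LA_FLG_BINDTO | LA_FLG_BINDFROM;
}

Elf64_Addr la_x86_64_gnu_pltenter (Elf64_Sym *sym, unsigned int ndx,
                                   uintptr_t *refcook, uintptr_t *defcook,
                                   La_x86_64_regs *regs, unsigned int *flags,
                                   const char *symname, long int *framesizep)
{
  /* Stall callers of the interesting function to widen the race window.  */
  if (strcmp (symname, "pthread_mutex_unlock") == 0)
    nanosleep (&(struct timespec) { .tv_nsec = 10 * 1000 * 1000 }, NULL);
  return sym->st_value;
}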
Comment 11 Rich Felker 2012-02-17 05:10:16 UTC
If that's acceptable, you could just make the test case either a gdb script or a dedicated ptrace-using parent process that puts a breakpoint at the right location to hit the race...
Comment 12 Atsushi Nemoto 2012-02-17 13:26:59 UTC
(In reply to comment #8)
> What kind of testing have you done with the patch?
> 
> Do you have a small test case that can trigger the failure even sporadically?
> 
> It would be nice to get a test case added that documents the class of failure
> we tried to fix.

Unfortunately I could not reproduce the problem and do not have any test case.

Actually, this problem was discovered through code analysis, starting from a report
indicating that the futex_wake syscall was called with the wrong private flag.

So my patch is just a theoretical fix.
Comment 13 Carlos O'Donell 2012-02-17 16:17:25 UTC
(In reply to comment #11)
> If that's acceptable, you could just make the test case either a gdb script or
> a dedicated ptrace-using parent process that puts a breakpoint at the right
> location to hit the race...

I like the idea of a ptrace-using parent process to trigger the race condition; it may be fragile, but I like your suggestion.
Comment 14 Carlos O'Donell 2012-02-17 16:36:54 UTC
Nemoto-san,

The patch looks good to me.

My only worry is that the performance of the robust unlock fast path is impacted.

I notice that PTHREAD_ROBUST_MUTEX_PSHARED always returns LLL_SHARED.

Therefore it would be optimal if instead of a temporary we just passed down LLL_SHARED (a constant).

I don't know why glibc has the PTHREAD_ROBUST_MUTEX_PSHARED macro.

The code comment says:
~~~
/* The kernel when waking robust mutexes on exit never uses
   FUTEX_PRIVATE_FLAG FUTEX_WAKE.  */
#define PTHREAD_ROBUST_MUTEX_PSHARED(m) LLL_SHARED
~~~

Which appears to imply that at some point in the future it might not always return LLL_SHARED.

Therefore your patch is correct.

Can you verify that the instruction sequences generated are identical given the optimization level -O2 used to compile glibc?
Comment 15 Atsushi Nemoto 2012-02-20 11:38:58 UTC
(In reply to comment #14)
> Can you verify that the instruction sequences generated are identical given the
> optimization level -O2 used to compile glibc?

Yes, I verified __pthread_mutex_unlock_full code sequences are identical.
Comment 16 Carlos O'Donell 2012-02-22 14:54:30 UTC
(In reply to comment #15)
> (In reply to comment #14)
> > Can you verify that the instruction sequences generated are identical given the
> > optimization level -O2 used to compile glibc?
> 
> Yes, I verified __pthread_mutex_unlock_full code sequences are identical.

Nemoto-san,

Thank you for checking.

I've gone through the patch and I think it's good, but I'd like someone else to also review the change.

Given that Jakub commented on the patch on libc-alpha I'll ask him to review the patch attached to this issue.

Jakub,

Could you please review this patch?
Comment 17 Ulrich Drepper 2012-03-07 10:29:21 UTC
(In reply to comment #0)
> Thus, the lll_unlock() call in pthread_mutex_unlock.c will be expanded as:
>     int *__futex = &(mutex->__data.__lock);
>     int __val = atomic_exchange_rel (__futex, 0);
>     if (__builtin_expect (__val > 1, 0))        /* A */
>       lll_futex_wake (__futex, 1, ((mutex)->__data.__kind & 128)); /* B */
> 
> On point "A", the mutex is actually unlocked, so other threads can
> lock the mutex, unlock, destroy and free.  If the mutex was destroyed
> and freed by other thread, reading '__kind' on point "B" is not valid.

You read the code incorrectly.

If B is reached there must be another thread using the mutex.  It is currently waiting.  In that case it is invalid to destroy the mutex.  In any case there would be another memory access, from the thread which is woken by the lll_futex_wake call.

The same applies to whatever you try to change with your patch.

Again, as long as a thread is waiting on a mutex you cannot destroy it legally.  Show me a place where the code is accessing the futex after the unlock when there is no locker.
Comment 18 Rich Felker 2012-03-07 17:52:03 UTC
You misunderstand the race.

Suppose thread A is unlocking the mutex and gets descheduled after the atomic_exchange_rel but before the lll_futex_wake, and thread B is waiting to lock the mutex. At this point, as far as thread B can observe, A is no longer a user of the mutex. Thread B obtains the mutex, performs some operations, unlocks the mutex, and assuming (correctly) that it's now the last user of the mutex, destroys it and frees the memory it occupied.

Now at some later point, thread A gets scheduled again and crashes accessing freed memory.

If you're wondering how thread B could wake up without thread A calling lll_futex_wake, here are several reasons:

1. Never going to sleep due to value mismatch on the original futex wait call.
2. Receipt of a signal, and value mismatch when the signal handler returns and futex wait is called again.
3. Spurious wakes that look like successful returns from wait. These do exist in Linux, and I have not been able to determine the reason, but I have a test program which can successfully produce them in one case (unrelated to mutexes).
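
As a concrete illustration of case 1, the usual futex-based lock slow path (sketched here with GCC builtins and a 0/1/2 state encoding; this is not glibc's exact code) already has to loop, because FUTEX_WAIT returns immediately with EWOULDBLOCK when the futex word changed before the thread could sleep, so the "waiter" then takes the lock without anyone ever waking it:

#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Lock word states: 0 = unlocked, 1 = locked, 2 = locked with waiters.  */
static void lock_slow_path (int *futexp)
{
  /* Mark the lock contended and sleep until it is released.  */
  while (__atomic_exchange_n (futexp, 2, __ATOMIC_ACQUIRE) != 0)
    /* If *futexp is no longer 2 when the kernel checks, this returns
       -1/EWOULDBLOCK immediately and we retry without sleeping.  */
    syscall (SYS_futex, futexp, FUTEX_WAIT | FUTEX_PRIVATE_FLAG,
             2, NULL, NULL, 0);
}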
Comment 19 Carlos O'Donell 2012-03-08 03:21:39 UTC
Given Ulrich's comments I'm going back to review the code.

I'd like this fixed before the 2.16 release, and therefore I'm marking this with a target milestone of 2.16.
Comment 20 Rich Felker 2012-03-08 05:12:48 UTC
Please also examine the corresponding bug for sem_post, #12674, which is exactly the same type of bug.
Comment 21 Joseph Myers 2012-06-27 22:32:24 UTC
Removing glibc_2.15 keyword as backport suitability can't be judged until there is a fix on master.  After this is fixed on master, anyone wanting a backport should feel free to attach a tested backport patch to this bug (or a new bug, if this one is closed as fixed).
Comment 22 Andreas Jaeger 2012-12-01 16:43:34 UTC
Jakub, could you review this one, please?
Comment 23 Ondrej Bilka 2013-10-09 20:14:38 UTC
Ping; this has been waiting almost a year for a second review.
Comment 24 Torvald Riegel 2013-12-18 20:13:40 UTC
I agree that after release of a mutex (i.e., atomic_exchange_rel (__futex, 0)), we have both (1) a pending load from mutex->__data.__kind and (2) a pending futex_wake call.

However, I think it is an open question whether POSIX actually allows destruction of the mutex just based on having obtained ownership of the lock.  The example given in the standard and reproduced in the description above is in an informative section, and it conflicts with a statement in the normative section: "Attempting to destroy a locked mutex or a mutex that is referenced [...] by another thread results in undefined behavior."

Arguably, a mutex could still be considered "referenced" as long as a call to mutex_unlock has not yet returned.  This would make the example in the informative text incorrect.  OTOH, the intended semantics could also be that if a program ensures that (1) a thread is the last one to lock a mutex, and (2) this thread is able to lock and unlock a mutex, then this thread is also allowed to destroy the mutex; IOW, being able to do the last lock and unlock of the mutex could be the defining constraint for when destruction is allowed.

(That is what C++11 seems to require based on a quick read.  C11 isn't very verbose but requires just that no thread is blocked on the mutex when it is destructed; nonetheless, it also mentions that all resources of the mutex are claimed, which could be understood to mean the same as the "referenced" constraint in POSIX.)

I've asked the Austin Group for clarification: http://austingroupbugs.net/view.php?id=811
Depending on how they decide, this is either not a bug, or we'll have to avoid the pending load and futex_wake call, or make them harmless.  The proposed patch should be right for the pending load, but the futex_wake needs more investigation: A futex_wake to a futex without waiters (or even to a futex not mapped anymore) should be harmless, but it could be different with PI futexes.
Comment 25 Rich Felker 2013-12-18 20:33:37 UTC
Torvald, while I agree there is *some* room for arguing about the semantics intended by the standard, the fact of the matter is that it's really hard to use mutexes that reside in dynamically-allocated memory if you can't rely on the self-synchronized destruction semantics. Some common cases are easy, for example, if you're freeing an object that's part of a larger data structure, acquiring the lock on the larger data structure before locking the individual object assures that another thread cannot still be in the tail part of the pthread_mutex_unlock call for the individual object when you obtain the mutex. But in general it's not so easy; at the very least, it requires non-trivial, non-intuitive, error-prone reasoning to determine if any particular usage is safe. And this is not good.

Fixing the bug, on the other hand, is not hard, so I think it should just be fixed, even if the Austin Group chooses to relax this requirement. It's a quality of implementation issue because, however it's resolved, many apps will continue to have this kind of race condition on implementations where self-synchronized destruction does not work, resulting in random near-impossible-to-reproduce crashes.
Comment 26 Rich Felker 2013-12-18 20:49:15 UTC
BTW I posted on the Austin Group ticket what's perhaps the most important usage case for self-synchronized destruction: refcounted objects.
Comment 27 Pat 2013-12-20 19:08:19 UTC
If you have to wait for all calls to mutex_unlock to return before you can destroy the mutex, how are you supposed to guarantee that, exactly?

You can only do so by synchronizing the threads in some other way. So every mutex has to be guarded by another mutex which has to be guarded by another mutex...

...except for the last mutex, which is global or static or whatever and never gets destroyed. Problem solved!

Seriously, think about the kind of code you would have to write to deal with these semantics. Is that what you think POSIX wanted to put people through (despite the actual example they give)? Is that what glibc wants to put people through?

You do not need "clarification" from anybody to recognize that these are serious bugs. The only interesting question is how many years it's going to take.
Comment 28 Carlos O'Donell 2013-12-20 19:38:23 UTC
(In reply to Pat from comment #27)
> If you have to wait for all calls to mutex_unlock to return before you can
> destroy the mutex, how are you supposed to guarantee that, exactly?
> 
> You can only do so by synchronizing the threads in some other way. So every
> mutex has to be guarded by another mutex which has to be guarded by another
> mutex...
> 
> ...except for the last mutex, which is global or static or whatever and
> never gets destroyed. Problem solved!
> 
> Seriously, think about the kind of code you would have to write to deal with
> these semantics. Is that what you think POSIX wanted to put people through
> (despite the actual example they give)? Is that what glibc wants to put
> people through?
> 
> You do not need "clarification" from anybody to recognize that these are
> serious bugs. The only interesting question is how many years it's going to
> take.

That may be the case for normal software projects, but this is glibc. We are a conservative project and we work through a standards process and collaborate with the Austin Group and the ISO group on POSIX and ISO C. I understand that this is sometimes frustratingly slow, but it ensures a clarity and quality that we desire to achieve with the project.

I don't disagree that it seems ridiculous to require such complexity, but we want to gather consensus from the Austin Group to ensure that we understand all the implications before we make a change.
Comment 29 Torvald Riegel 2013-12-20 20:25:14 UTC
(In reply to Pat from comment #27)
> If you have to wait for all calls to mutex_unlock to return before you can
> destroy the mutex, how are you supposed to guarantee that, exactly?
> 
> You can only do so by synchronizing the threads in some other way. So every
> mutex has to be guarded by another mutex which has to be guarded by another
> mutex...
> 
> ...except for the last mutex, which is global or static or whatever and
> never gets destroyed. Problem solved!

There are other ways to do garbage collection / end-of-lifetime detection than via mutex-protected reference counts (and I don't mean you should use garbage collection for memory management...).

If your mutexes are part of other data structures that you have to destroy, you can destroy the mutexes when destructing the data structure; unless every access to the data structure is protected by the mutex, you'll likely need another mechanism anyway to decide when it's safe to destruct.

> Seriously, think about the kind of code you would have to write to deal with
> these semantics.

I have.  This is a trade-off between implementation constraints and guarantees given to users, as I pointed out in the POSIX clarification request.  If you have anything to say about that, then please comment on the POSIX issue.

> Is that what you think POSIX wanted to put people through
> (despite the actual example they give)?

That's what the clarification request is for.

> Is that what glibc wants to put
> people through?

As Carlos said, we're approaching issues like this with the care that is needed.  This involves making sure that the standards and implementations are in sync.

> You do not need "clarification" from anybody to recognize that these are
> serious bugs. The only interesting question is how many years it's going to
> take.

It would be a bug if it is not adhering to the specification of the function.  Arguably, the normative text in the standard can be understood to not allow the use you seem to want to have.  The POSIX reply will clarify this.

If POSIX clarifies that it wants the strong guarantees re destruction, this is a bug compared to a clarified standard.  Otherwise, this bug is an enhancement request, which we would then evaluate as well.
Comment 30 Rich Felker 2013-12-20 22:51:53 UTC
Carlos, there's nothing "conservative" about giving applications broken semantics on the basis that you're not sure the standard _requires_ the correct semantics. And there's no reason to wait for a response from the Austin Group to fix this. Even if, by some fluke, they removed the requirement that ending the "reference" to a mutex is atomic with unlocking it, you would still want to have the correct, safe semantics just for the sake of applications using it. THIS is the conservative behavior.

Torvald, in regards that there are "other ways" to do end-of-lifetime detection, the only way to do so in a strictly conforming application, if you don't have the self-synchronized destruction property for at least one of mutexes, semaphores, or spinlocks, is with a _global_ lock ensuring that no two threads can be attempting to release a reference to the same type of reference-counted object at the same time. This is obviously not practical from a performance standpoint, and it's also hideous from a "global state considered harmful" standpoint. Obviously with other tools that will be available in future editions of POSIX (e.g. C11 atomics) and that are available now as extensions, you can work around the problem by refraining from using mutexes, but that's not a good solution.
Comment 31 Torvald Riegel 2014-01-06 16:58:47 UTC
(In reply to Rich Felker from comment #30)
> Carlos, there's nothing "conservative" about giving applications broken
> semantics on the basis that you're not sure the standard _requires_ the
> correct semantics. And there's no reason to wait for a response from the
> Austin Group to fix this. Even if, by some fluke, they removed the
> requirement that ending the "reference" to a mutex is atomic with unlocking
> it, you would still want to have the correct, safe semantics just for the
> sake of applications using it. THIS is the conservative behavior.

You can't claim that just one of the semantics is "correct".  They, obviously, can both be used in meaningful ways.  We can certainly argue about the utility of both of the semantics, but that's different from correctness.

Also, because you mentioned conforming applications: those won't be helped by glibc implementing something (incl. something stronger) that's not guaranteed by POSIX.

> Torvald, in regards that there are "other ways" to do end-of-lifetime
> detection, the only way to do so in a strictly conforming application, if
> you don't have the self-synchronized destruction property for at least one
> of mutexes, semaphores, or spinlocks, is with a _global_ lock ensuring that
> no two threads can be attempting to release a reference to the same type of
> reference-counted object at the same time.

No, it does not need to be a global lock.  You just make sure that all threads that use the resource you want to destruct have quiesced.  For example, if you've spawned a set of threads to do work concurrently on a resource, and join them after they've done the job, then pthread_join does exactly this for you; afterwards, whichever thread spawned those threads can initiate destruction.  If you've used a task queue or similar on a thread pool, the task queue can do the same.

> This is obviously not practical
> from a performance standpoint,

I don't see how the thread pool, for example, is bad in terms of performance.

> and it's also hideous from a "global state
> considered harmful" standpoint.

This is not about global vs. non-global state, but instead about how to ensure quiescence of concurrent threads: you initiate concurrent execution, and once that's done and there's no concurrency anymore, you destruct.  The main thread might be the thread that's doing that, and it might use global state for that, but that's not necessarily so.
Comment 32 Pat 2014-01-06 17:46:11 UTC
(In reply to Torvald Riegel from comment #31)
> 
> You can't claim that just one of the semantics is "correct".

Um, actually, you can... Based on (a) the only sane interpretation and (b) THE ACTUAL EXAMPLE USAGE GIVEN IN THE POSIX SPEC.

> Also, because you mentioned conforming applications: those won't be helped
> by glibc implementing something (incl. something stronger) that's not
> guaranteed by POSIX.

"Hey, POSIX authors, did you actually mean the example code you gave in the, you know, spec?"

"Yes, we actually meant it."

Is that what you are waiting to hear before you fix this bug? Seriously?

Most people writing "conforming applications" are going to expect the examples in the spec to... let's see... I dunno... work?

> No, it does not need to be a global lock.  You just make sure that all
> threads that use the resource you want to destruct have quiesced.

To "make sure that all threads ... have quiesced", you must do one of two things: (a) Rely on some synchronization mechanism, all of which are currently broken due to this bug; or (b) wait for all threads to exit.

You are arguing for (b): To destroy a mutex -- any mutex -- you must first wait for every thread that ever touched that mutex to exit.

Is it possible to code against these semantics? Of course. It is also possible to code without using threads. Or function calls. Or multiplication.

But the whole point of a primitive is to provide useful semantics across a variety of applications. And by "a variety", I mean more than one (1).

There is one nice thing about this bug, though. It provides a quintessential (and hilarious) example of why people laugh and roll their eyes when they hear the phrase "glibc maintainer".

I wonder, what's the over/under on whether this bug gets fixed before 2017?
Comment 33 Torvald Riegel 2014-01-06 20:38:37 UTC
(In reply to Pat from comment #32)
> (In reply to Torvald Riegel from comment #31)
> > 
> > You can't claim that just one of the semantics is "correct".
> 
> Um, actually, you can... Based on (a) the only sane interpretation and (b)
> THE ACTUAL EXAMPLE USAGE GIVEN IN THE POSIX SPEC.

You're still saying that you like one semantics better than the other.  That doesn't make another *semantics* incorrect.

> > Also, because you mentioned conforming applications: those won't be helped
> > by glibc implementing something (incl. something stronger) that's not
> > guaranteed by POSIX.
> 
> "Hey, POSIX authors, did you actually mean the example code you gave in the,
> you know, spec?"
> 
> "Yes, we actually meant it."
> 
> Is that what you are waiting to hear before you fix this bug? Seriously?

If you want to know what I'd like to get clarified by the Austin Group, please read the Austin group issue.  It should be easy to understand.

> Most people writing "conforming applications" are going to expect the
> examples in the spec to... let's see... I dunno... work?

You can say the same thing about the normative text.  Which brings us right back to the clarification request...

> > No, it does not need to be a global lock.  You just make sure that all
> > threads that use the resource you want to destruct have quiesced.
> 
> To "make sure that all threads ... have quiesced", you must do one of two
> things: (a) Rely on some synchronization mechanism, all of which are
> currently broken due to this bug;

No, they aren't "broken".  See the examples I gave.

> or (b) wait for all threads to exit.

Precisely, wait for all threads that use the particular resource to not use it anymore.  That's different from "wait[ing] for all threads to exit".

> You are arguing for (b): To destroy a mutex -- any mutex -- you must first
> wait for every thread that ever touched that mutex to exit.

This could be a reasonable semantics.

> Is it possible to code against these semantics? Of course.

And often, programs will do just that.  All that don't do reference counting or similar, for example.

> It is also
> possible to code without using threads. Or function calls. Or multiplication.
> 
> But the whole point of a primitive is to provide useful semantics across a
> variety of applications. And by "a variety", I mean more than one (1).
> 
> There is one nice thing about this bug, though. It provides a quintessential
> (and hilarious) example of why people laugh and roll their eyes when they
> hear the phrase "glibc maintainer".
> 
> I wonder, what's the over/under on whether this bug gets fixed before 2017?

This bug is about a technical issue.  I'm not going to respond to off-topic statements like this.
Comment 34 Rich Felker 2014-01-06 20:47:25 UTC
> > or (b) wait for all threads to exit.
> 
> Precisely, wait for all threads that use the particular resource to not use it anymore.  That's different from "wait[ing] for all threads to exit".

That requires a mutex. So you just moved the mutex issue to a different mutex; you didn't solve it.

> > You are arguing for (b): To destroy a mutex -- any mutex -- you must first
> > wait for every thread that ever touched that mutex to exit.
> 
> This could be a reasonable semantics.

No, you stopped being reasonable several posts back in this thread. I'm not sure what your emotional attachment to glibc's current brokenness with respect to this issue is, but it's completely clouding your judgement and making you look like a fool. It's really sad to see this kind of response to bugs again when glibc was just recovering from the madness of the former maintainer and his attitude towards bug reports...
Comment 35 Torvald Riegel 2014-01-06 21:20:03 UTC
(In reply to Rich Felker from comment #34)
> > > or (b) wait for all threads to exit.
> > 
> > Precisely, wait for all threads that use the particular resource to not use it anymore.  That's different from "wait[ing] for all threads to exit".
> 
> That requires a mutex. So you just moved the mutex issue to a different
> mutex; you didn't solve it.

Consider a thread pool, or something else that manages a set of threads.  The lifetime of this thing will be larger than the lifetime of the threads themselves.  You will want to do pthread_join on the threads eventually, if you're interested in safe destruction.  Once you do that, you can safely destruct the thread pool at that time, including any mutexes in it.  If you want to keep the threads around for longer (e.g., so that they can work on more than one task), you can easily let them signal the thread pool once they've finished the task.  For that, you can use a mutex in the thread pool for example.

Thus, there is a straightforward way to do it without reference counting.  Having a mutex or similar on the thread pool is not something that's bad.  You will have the thread pool (or a pthread_t at the very least anyway).

If we didn't have pthread_join, or if one had to implement its functionality with a pthread_mutex_t, then we would have a problem.  But that's not the case; we do have pthread_join() to eventually break the chain you seem to be concerned about.
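
A small, self-contained sketch of that pattern (the worker count and struct names are made up for illustration):

#include <pthread.h>
#include <stdlib.h>

/* Hypothetical shared resource protected by a mutex.  */
struct work { pthread_mutex_t lock; long counter; };

static void *worker (void *arg)
{
  struct work *w = arg;
  for (int i = 0; i < 100000; i++) {
    pthread_mutex_lock (&w->lock);
    w->counter++;
    pthread_mutex_unlock (&w->lock);
  }
  return NULL;
}

int main (void)
{
  struct work *w = malloc (sizeof *w);
  pthread_mutex_init (&w->lock, NULL);
  w->counter = 0;

  pthread_t t[4];
  for (int i = 0; i < 4; i++)
    pthread_create (&t[i], NULL, worker, w);

  /* pthread_join guarantees the workers are completely done with the mutex,
     so destruction does not rely on unlock itself ending the last "reference".  */
  for (int i = 0; i < 4; i++)
    pthread_join (t[i], NULL);

  pthread_mutex_destroy (&w->lock);
  free (w);
  return 0;
}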

> > > You are arguing for (b): To destroy a mutex -- any mutex -- you must first
> > > wait for every thread that ever touched that mutex to exit.
> > 
> > This could be a reasonable semantics.
> 
> No, you stopped being reasonable several posts back in this thread. I'm not
> sure what your emotional attachment to glibc's current brokenness with
> respect to this issue is, but it's completely clouding your judgement and
> making you look like a fool. It's really sad to see this kind of response to
> bugs again when glibc was just recovering from the madness of the former
> maintainer and his attitude towards bug reports...

I have no emotional attachment to anything here.  That includes the stronger semantics you want to have, your assumptions about my judgement, etc.

You haven't made a convincing argument why the semantics as targeted by the current glibc implementation would be incorrect (if the issue above is what worries you, let's keep discussing that one).  I understood that you'd want something stronger, and I appreciate that you have an opinion on this, but ultimately I think glibc should implement what POSIX wants, thus the clarification request.

Also, if you (and Pat) want the glibc community to be a place where technical issues are solved in a constructive manner, then you should probably remind yourselves that you and your actions are very much a part of this.
Comment 36 Rich Felker 2014-01-06 21:24:27 UTC
My comment about the previous maintainership had nothing to do with "constructive manner" (whatever that is supposed to mean), but with the attitude of making excuses and arguments against fixing bugs instead of just fixing them.
Comment 37 Kevin Dempsey 2014-06-20 12:23:32 UTC
Now that the Austin Group has clarified the expected behaviour (http://austingroupbugs.net/view.php?id=811), can progress be made on fixing this?
Comment 38 Torvald Riegel 2014-06-20 18:28:59 UTC
We have already started looking at the implications and into possible fixes.  The pending accesses to the mutex's memory are the easy part of the problem.  The fact that the futex_wake call can now hit reused memory (i.e., after destruction of the mutex) is the trickier issue IMO.  More details on the latter can be found in an email I sent a while ago to libc-alpha.
Comment 39 Rich Felker 2014-06-20 19:02:27 UTC
On Fri, Jun 20, 2014 at 06:28:59PM +0000, triegel at redhat dot com wrote:
> The fact that the futex_wake call can now hit reused memory (i.e., after
> destruction of the mutex) is the trickier issue IMO.  More details on the
> latter can be found in an email I sent a while ago to libc-alpha.

I'm not convinced that the resulting spurious futex wakes are a
serious problem. As I've mentioned before, I have observed spurious
futex wakes coming from the kernel, so applications need to be
prepared to deal with them anyway. I'll see if I can dig up and post
my test case demonstrating them.
Comment 40 Rich Felker 2014-06-20 19:10:05 UTC
Created attachment 7653 [details]
demonstration of spurious futex wakes from kernel

The attached file demonstrates spurious futex wakes coming from the kernel; it was designed to mimic the situation I was experiencing where they were breaking my implementation of pthread_join when it failed to retry the wait. I'm not sure if they happen or not in situations not connected to CLONE_CHILD_CLEARTID. I experienced the issue on 3.2 and 3.5 kernels (the latter re-checked just now) but have not been able to reproduce it on 3.15.
Comment 41 Rich Felker 2014-06-23 03:06:39 UTC
By the way I just ran across FUTEX_WAKE_OP, which might be able to perform the desired atomic-unlock-and-wake if that's deemed necessary. I haven't worked out the details for how it would work, but it seems promising.
Comment 42 Torvald Riegel 2014-06-25 14:34:13 UTC
Rich, thanks for the test case, but the code doesn't check what the futex syscall returns.  The issue I'm concerned about is not whether there are spurious wake-ups, but that a spurious wake-up can incorrectly appear to be a non-spurious one due to the return value of the futex syscall reporting success.
Comment 43 Rich Felker 2014-06-25 16:00:52 UTC
Indeed, it's not testing that, but the syscall does return success. Just change the futex syscall line to if (!syscall(...)) and remove the final semicolon and it still eventually ends with Killed.
Comment 44 Torvald Riegel 2014-06-25 17:39:42 UTC
Looking closely, the wakeup seems to come from the CLONE_CHILD_CLEARTID that you used, which is specified as:
       CLONE_CHILD_CLEARTID (since Linux 2.5.49)
              Erase  child thread ID at location ctid in child memory when the
              child exits, and do a wakeup on the futex at that address.

Without this flag, I don't see non-spurious wakeups anymore.
Comment 45 Rich Felker 2014-06-25 18:03:33 UTC
But the child provably hasn't exited at the point where the wake occurs, since the subsequent tgkill syscall succeeds.

However if the issue is just that the kernel is doing these operations in the wrong order, it probably doesn't cause other spurious wakes to be observed, and it's probably not relevant to this PR.
Comment 46 Rich Felker 2014-08-08 20:47:41 UTC
I think I have a solution to the spurious-wakes problem using FUTEX_WAKE_OP. The principle is simple: use an atomic compare-and-swap to release the lock, so that we know before releasing (rather than immediately-after, as would be the case with a plain swap) whether there are potentially waiters. Then there are two cases:

1. Waiter count of zero and successful atomic CAS from owned/uncontended state to zero: nothing else to do.

2. Waiter count nonzero or failed atomic CAS (meaning the state changed from owned/uncontended to owned/contended): make a futex syscall with FUTEX_WAKE_OP to perform the atomic unlock AFTER having locked the futex hash buckets, so that a new waiter cannot arrive and thereby get a spurious wake.
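
A rough sketch of such an unlock path, using GCC builtins and a raw futex syscall for a process-private lock with the usual 0/1/2 state encoding (the encoding and names are assumptions for illustration, not the proposed glibc code):

#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Lock word: 0 = unlocked, 1 = locked, 2 = locked with (possible) waiters.  */
static void unlock_with_wake_op (int *futexp)
{
  int expected = 1;

  /* Case 1: CAS 1 -> 0 succeeds, so there were no waiters when the lock was
     released; nothing further touches the lock word.  */
  if (__atomic_compare_exchange_n (futexp, &expected, 0, 0,
                                   __ATOMIC_RELEASE, __ATOMIC_RELAXED))
    return;

  /* Case 2: the word was 2, so let the kernel store the 0 and do the wake
     after it has locked the futex hash bucket; a new waiter cannot slip in
     between the release and the wake.  uaddr1 == uaddr2, as noted above.  */
  syscall (SYS_futex, futexp, FUTEX_WAKE_OP | FUTEX_PRIVATE_FLAG,
           0,                                 /* wake 0 waiters on uaddr1 */
           1,                                 /* wake up to 1 on uaddr2   */
           futexp,                            /* uaddr2 (same futex)      */
           FUTEX_OP (FUTEX_OP_SET, 0,         /* *uaddr2 = 0, then...     */
                     FUTEX_OP_CMP_GT, 1));    /* ...wake if old value > 1 */
}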

This design is presently under discussion on the musl libc mailing list, in this thread:

http://www.openwall.com/lists/musl/2014/08/08/12

One issue is that the FUTEX_WAKE_OP command takes two addresses, uaddr1 and uaddr2, and performs a potentially expensive futex hash lookup on both. For our purposes we do not need two different addresses, so I think if we pursue this option, we should push the kernel folks to add a special case for uaddr1==uaddr2 that optimizes out the second hash.

Thoughts?
Comment 47 Torvald Riegel 2014-08-09 20:38:27 UTC
This would increase the unlock latency whenever there is any waiter (because we let the kernel do it, and after it has found and acquired the futex lock).  I don't have numbers for this increase, but if there's a non-negligible increase in latency, then I wouldn't want to see this in glibc.

I still think that the only thing we need to fix is to make sure that no program can interpret a spurious wake-up (by a pending futex_wake) as a real wake-up.
Comment 48 Rich Felker 2014-08-12 02:29:14 UTC
> This would increase the unlock latency whenever there is any waiter (because
> we let the kernel do it, and after it has found and acquired the futex lock). 
> I don't have numbers for this increase, but if there's a non-negligible increase
> in latency, then I wouldn't want to see this in glibc.

Torvald, I agree you have a legitimate concern (unlock latency), but while I don't have evidence to back this up (just high-level reasoning), I think the difference in the time at which the atomic store happens actually works in favor of performance with FUTEX_WAKE_OP. I'll try to explain:

In the case where there is no waiter at the time of unlock, no wake occurs, neither by FUTEX_WAKE nor FUTEX_WAKE_OP. There's only an atomic operation (CAS, if we want to fix the bug this whole issue tracker thread is about). So for the sake of comparing performance, we need to consider the case where there is at least one waiter.

Right now (with FUTEX_WAKE), there's a great deal of latency between the atomic operation that releases the lock and the FUTEX_WAKE being dispatched, due to kernel entry overhead, futex hash overhead, etc. During that window, a thread which is not a waiter can race for the lock and acquire it first, despite there being waiters. This acquisition inherently happens with very low latency, but I think it's actually likely to be bad for performance:

If the thread which "stole" the lock has not released it by the time the thread woken by FUTEX_WAKE gets scheduled, the latter thread will uselessly contend for the lock again, imposing additional cache synchronization overhead and an additional syscall to wait on the futex again. It will also wrongly get moved to the end of the wait queue.

If on the other hand, the thread which "stole" the lock immediately releases it, before the woken thread gets scheduled, my understanding is that it will see that there are waiters and issue an additional FUTEX_WAKE at unlock time. At the very least this is a wasted syscall. If there actually are two or more waiters, it's a lot more expensive, since an extra thread wakes up only to contend the lock and re-wait.

As both of these situations seem undesirable to me, I think the optimal behavior should be to minimize the latency between the atomic-release operation that makes the lock available to other threads and the futex wake. And the only way to make this latency small is to perform the atomic release in kernel space.

> I still think that the only thing we need to fix is to make sure that no
> program can interpret a spurious wake-up (by a pending futex_wake) as a real
> wake-up.

As I understand it, all of the current code treats futex wakes much like POSIX condition variable waits: as an indication to re-check an external predicate rather than as the bearer of notification about state. If not, things would already be a lot more broken than they are now in regards to this issue.

On the other hand, if you eliminate all sources of spurious wakes, I think it's possible to achieve better behavior; in particular I think it may be possible to prevent "stealing" of locks entirely and ensure that the next futex waiter always gets the lock on unlock. Whether this behavior is desirable for glibc or not, I'm not sure. I'm going to do research on it as a future possibility for musl.
Comment 49 Daniel Colascione 2015-05-30 18:25:36 UTC
What's blocking the application of the fix? I doubt spurious futex wakeups cause problems in practice.
Comment 50 Carlos O'Donell 2015-06-03 04:08:49 UTC
(In reply to Daniel Colascione from comment #49)
> What's blocking the application of the fix? I doubt spurious futex wakeups
> cause problems in practice.

Senior developers need to review it, reach a final consensus, test the patch, and commit it. It's best to ping on libc-alpha and try to summarize the status for people like Rich and Torvald to help review.
Comment 51 Torvald Riegel 2015-07-14 20:23:40 UTC
The spurious wake-up issue has been discussed with the kernel community, and the conclusion was essentially to document this behavior.  A current draft of an updated futex manpage has the following wording:

       FUTEX_WAIT
              Returns 0 if the caller was woken up.  Note that a wake-up can
              also be caused by common futex usage patterns in unrelated code
              that happened to have previously used the futex word's memory
              location (e.g., typical futex-based implementations of Pthreads
              mutexes can cause this under some conditions).  Therefore,
              callers should always conservatively assume that a return value
              of 0 can mean a spurious wake-up, and use the futex word's value
              (i.e., the user space synchronization scheme) to decide whether
              to continue to block or not.
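
In code, that guidance just means every FUTEX_WAIT wrapper re-checks the futex word in a loop; a generic sketch (not glibc's implementation):

#include <errno.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Block until *futexp no longer equals val; a 0 return from FUTEX_WAIT is
   treated as possibly spurious, so the word is always re-checked.  */
static void futex_wait_until_changed (int *futexp, int val)
{
  while (__atomic_load_n (futexp, __ATOMIC_ACQUIRE) == val)
    {
      long r = syscall (SYS_futex, futexp, FUTEX_WAIT | FUTEX_PRIVATE_FLAG,
                        val, NULL, NULL, 0);
      if (r == -1 && errno != EAGAIN && errno != EINTR)
        break;  /* unexpected error; real code would report it */
    }
}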

I've sent an updated patch for review:
https://sourceware.org/ml/libc-alpha/2015-07/msg00411.html
Compared to the patch posted here, it fixes the problem in __lll_unlock and __lll_robust_unlock, and fixes a similar problem in __pthread_mutex_unlock_full.
Comment 52 Torvald Riegel 2015-12-23 19:28:19 UTC
This is now fixed on master.