Bug 5240 - Pthread hang where there are still waiters when mutex is in "unlocked" state.
Summary: Pthread hang where there are still waiters when mutex is in "unlocked" state.
Status: RESOLVED FIXED
Alias: None
Product: glibc
Classification: Unclassified
Component: nptl (show other bugs)
Version: unspecified
: P2 critical
Target Milestone: ---
Assignee: Ulrich Drepper
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2007-11-01 17:03 UTC by Ryan S. Arnold
Modified: 2007-11-27 23:20 UTC (History)
2 users (show)

See Also:
Host: powerpc-linux
Target: powerpc-linux
Build: powerpc-linux
Last reconfirmed:


Attachments
pthread hang (1.58 KB, text/plain)
2007-11-01 17:05 UTC, Ryan S. Arnold
Details
patch to avoid the hang by awakening waiters before returning TIMEOUT. (409 bytes, patch)
2007-11-01 17:55 UTC, Ryan S. Arnold
Details | Diff
Simplified testcase with cleaner termination path. (1.53 KB, text/plain)
2007-11-27 23:20 UTC, Ryan S. Arnold
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Ryan S. Arnold 2007-11-01 17:03:11 UTC
A customer identified a potential race condition in
nptl/sysdeps/unix/sysv/linux/lowlevellock.c (__lll_timedlock_wait) which causes
waiting threads not to be woken up under certain circumstances.

Reproduced on:
SMP PowerPC 970
SMP POWER6 (with SMT)
SMP POWER5 (with SMT)
Uni PowerPC 440

Not reproduced on:
Uni Intel Pentium M.
SMP Intel Core 2 Duo.

Consider three threads, "A" holding a lock, "B" blocked in a timed
wait on the same lock, and "C" also blocked on that lock. The value of
the futex is 2.  Then:

- "A" releases the lock, setting the futex value to 0 and waking up
  "B".
- Before "B" performs any further action, "A" continues to execute and
  acquires the lock again, setting the futex value to 1.
- "B" checks the while condition in __lll_timedlock_wait:
  while (atomic_compare_and_exchange_bool_acq (futex, 2, 0) != 0);
  The condition is true, so "B" iterates the do-while loop.
- "B" hits the timeout and returns ETIMEDOUT.
- "A" releases the lock, setting the futex value from 1 to 0 (without
  wakeup).

At the end, "C" is left waiting, and the futex value is 0.

Testcase forthcoming...
Comment 1 Ryan S. Arnold 2007-11-01 17:05:03 UTC
Created attachment 2070 [details]
pthread hang

Associated test-case to reproduce on PowerPC hardware.
Comment 2 Ryan S. Arnold 2007-11-01 17:55:30 UTC
Created attachment 2071 [details]
patch to avoid the hang by awakening waiters before returning TIMEOUT.

The following patch ensures that waiters will be awoken before returning the
timeout.  This patch avoids an unnecessary system call in the usual timeout
case.  

A simpler solution if we don't care about the system call cost would be to
unconditionally invoke lll_futex_wake before returning.

I've verified that this patch does indeed prevent the hang scenario described.
Comment 3 Ulrich Drepper 2007-11-09 09:10:46 UTC
The analysis is correct but the patch is less than optimal.  I've checked in
something different and also fixed x86 and x86-64.
Comment 5 Ryan S. Arnold 2007-11-09 23:05:18 UTC
I tested the patch on a Power5 machine and I'm still encountering the hang. 
Others indicate that they're getting the hang as well on different classes of
PowerPC hardware.  Is there any information you'd like me to gather to determine
why it's still happening?
Comment 6 Ulrich Drepper 2007-11-24 04:46:41 UTC
I've made some more changes (and some optimizations).  The current code should work.
Comment 7 Ryan S. Arnold 2007-11-27 23:20:26 UTC
Created attachment 2112 [details]
Simplified testcase with cleaner termination path.

The fixed worked perfectly on POWER6.  On POWER5 I kept running into a
segmentation fault in the exit() path of the test-case.

The test-case is problematic since the exit() in the child thread's
thread_exit() function causes process termination which ends up sending two
threads down the glibc exit() pipeline at the same time and the linked list of
exit handlers and ends up dereferencing a pointer which has already been
zeroed.

I've modified the test case to demonstrate a more appropriate exit strategy
(which also ends up simplifying the testcase).

I think this bug is resolved.

Thanks for the fix Ulrich.