Bug 14417 - EAGAIN handling in pthread_cond_wait make pulseaudio hang
Summary: EAGAIN handling in pthread_cond_wait make pulseaudio hang
Status: RESOLVED FIXED
Alias: None
Product: glibc
Classification: Unclassified
Component: nptl (show other bugs)
Version: 2.16
: P2 normal
Target Milestone: 2.17
Assignee: Siddhesh Poyarekar
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2012-07-29 14:33 UTC by Olivier Blin
Modified: 2014-06-25 10:50 UTC (History)
8 users (show)

See Also:
Host:
Target:
Build:
Last reconfirmed:
fweimer: security-


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Olivier Blin 2012-07-29 14:33:34 UTC
Since commit c5a0802a682dba23f92d47f0f99775aebfbe2539 (Handle EAGAIN from FUTEX_WAIT_REQUEUE_PI), pulseaudio apps often hang.

Example backtrace from gnome-shell and libcanberra:

Thread 1 (Thread 0x7f975cfe8900 (LWP 2367)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:144
#1  0x00007f97558a4aa0 in pa_threaded_mainloop_wait (m=0x21bdf00) at pulse/thread-mainloop.c:206
#2 0x00007f97460aaa0b in pulse_driver_play (c=0x21b5280, id=<optimized
out>, proplist=0x704c9c0, cb=<optimized out>, userdata=<optimized
out>) at pulse.c:1085
#3 0x00007f975955624e in ca_context_play_full (c=c@entry=0x21b5280,
id=id@entry=1, p=0x704c9c0, cb=cb@entry=0x0,
userdata=userdata@entry=0x0) at common.c:522
#4  0x00007f97595565cf in ca_context_play (c=0x21b5280, id=1) at common.c:462
...
Comment 1 Olivier Blin 2012-07-29 14:40:13 UTC
Most distributions (Fedora, Debian, openSUSE) revert this patch.

Quoting an email from Jeff below
http://sourceware.org/ml/libc-alpha/2012-01/msg00002.html

An FYI, this patch:


commit c5a0802a682dba23f92d47f0f99775aebfbe2539
Author: Andreas Schwab <schwab@redhat.com>
Date:   Mon Nov 28 13:38:19 2011 +0100

    Handle EAGAIN from FUTEX_WAIT_REQUEUE_PI


Has been reported as causing numerous problems in Fedora & Debian.  I
don't think anyone has done any serious analysis of the issue, but the
patch has been pulled from both distributions because of the
instability it's introduced.

https://bugzilla.redhat.com/show_bug.cgi?id=769421
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=651899

It might be worth your time to dig further into the change or pull it
from 2.15 pending a deeper investigation.
Comment 2 Carlos O'Donell 2012-08-03 03:43:48 UTC
This is going to be a tricky issue to resolve.

We don't have a smaller test case than "stuff stops working well", but clearly the range of failures means that the changes in libc are broken.

Have you found a smaller test case that uses just pthreads?

In the meantime I do think the right solution is to revert the patch. We'll see if we can do that for 2.17.
Comment 3 law 2012-08-03 17:52:01 UTC
Olivier,
Thanks for opening the report.  It's been on my TODO list for a while.

Everyone else,
If someone has a good understanding of the core pthread_condattr_t structure, particularly the relationships between the _seq fields, documenting them might be a significant help.

One of the things that was incredibly frustrating trying to debug this issue was the inadequate documentation of key data structures.
Comment 4 Olivier Blin 2012-08-03 21:55:44 UTC
Carlos: sorry, I don't have a more specific test case. I just noticed the pulsaudio issue after we upgraded from glibc 2.14 to 2.16 in Mageia, and that all other mainstream distributions revert his commit
Comment 5 law 2012-08-03 22:00:22 UTC
pulseaudio is the current best known way to reproduce the problem. 


AUDIODRIVER=pulseaudio play -n -c1  synth whitenoise band -n 100 20 \
        band -n 50 20 gain +25 fade h 1 864000 1


Fails once or twice every ten attempts within the first few seconds.  I was able to make it fail regularly with a strategic breakpoint in the low level pthread code after releasing on the locks (details fade, but clearly it depends on arranging the the kernel to return EAGAIN to the futex call).
Comment 6 Rich Felker 2012-08-04 20:49:17 UTC
I would hesitate to try fixing this without also looking at and possibly fixing bug #12875. The whole sequence number wakeup thing for condition variables is fundamentally broken in NPTL and needs to be fixed. Basically the issue is that the current code is over-engineered to avoid spurious wakeups, but in the process it suppresses some non-spurious wakeups...
Comment 7 Torvald Riegel 2012-09-18 14:37:24 UTC
(In reply to comment #6)
> I would hesitate to try fixing this without also looking at and possibly fixing
> bug #12875. The whole sequence number wakeup thing for condition variables is
> fundamentally broken in NPTL and needs to be fixed.

How so?  You seem to see issues that go beyond bug #12875 (which I don't see as being a bug currently).  If so, please link to them here.

> Basically the issue is that
> the current code is over-engineered to avoid spurious wakeups, but in the
> process it suppresses some non-spurious wakeups...

I'm not aware of any lost wake-ups for non-PI cond vars.  If you have provided an alternative implementation proposal in the past, could you link to it when making such comments please?  If you haven't but have a proposal now, please link to it here or post to glibc-alpha.