Since commit c5a0802a682dba23f92d47f0f99775aebfbe2539 (Handle EAGAIN from FUTEX_WAIT_REQUEUE_PI), pulseaudio apps often hang.
Example backtrace from gnome-shell and libcanberra:
Thread 1 (Thread 0x7f975cfe8900 (LWP 2367)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:144
#1 0x00007f97558a4aa0 in pa_threaded_mainloop_wait (m=0x21bdf00) at pulse/thread-mainloop.c:206
#2 0x00007f97460aaa0b in pulse_driver_play (c=0x21b5280, id=<optimized
out>, proplist=0x704c9c0, cb=<optimized out>, userdata=<optimized
out>) at pulse.c:1085
#3 0x00007f975955624e in ca_context_play_full (c=c@entry=0x21b5280,
id=id@entry=1, p=0x704c9c0, cb=cb@entry=0x0,
userdata=userdata@entry=0x0) at common.c:522
#4 0x00007f97595565cf in ca_context_play (c=0x21b5280, id=1) at common.c:462
Most distributions (Fedora, Debian, openSUSE) revert this patch.
Quoting an email from Jeff below
An FYI, this patch:
Author: Andreas Schwab <firstname.lastname@example.org>
Date: Mon Nov 28 13:38:19 2011 +0100
Handle EAGAIN from FUTEX_WAIT_REQUEUE_PI
Has been reported as causing numerous problems in Fedora & Debian. I
don't think anyone has done any serious analysis of the issue, but the
patch has been pulled from both distributions because of the
instability it's introduced.
It might be worth your time to dig further into the change or pull it
from 2.15 pending a deeper investigation.
This is going to be a tricky issue to resolve.
We don't have a smaller test case than "stuff stops working well", but clearly the range of failures means that the changes in libc are broken.
Have you found a smaller test case that uses just pthreads?
In the meantime I do think the right solution is to revert the patch. We'll see if we can do that for 2.17.
Thanks for opening the report. It's been on my TODO list for a while.
If someone has a good understanding of the core pthread_condattr_t structure, particularly the relationships between the _seq fields, documenting them might be a significant help.
One of the things that was incredibly frustrating trying to debug this issue was the inadequate documentation of key data structures.
Carlos: sorry, I don't have a more specific test case. I just noticed the pulsaudio issue after we upgraded from glibc 2.14 to 2.16 in Mageia, and that all other mainstream distributions revert his commit
pulseaudio is the current best known way to reproduce the problem.
AUDIODRIVER=pulseaudio play -n -c1 synth whitenoise band -n 100 20 \
band -n 50 20 gain +25 fade h 1 864000 1
Fails once or twice every ten attempts within the first few seconds. I was able to make it fail regularly with a strategic breakpoint in the low level pthread code after releasing on the locks (details fade, but clearly it depends on arranging the the kernel to return EAGAIN to the futex call).
I would hesitate to try fixing this without also looking at and possibly fixing bug #12875. The whole sequence number wakeup thing for condition variables is fundamentally broken in NPTL and needs to be fixed. Basically the issue is that the current code is over-engineered to avoid spurious wakeups, but in the process it suppresses some non-spurious wakeups...
(In reply to comment #6)
> I would hesitate to try fixing this without also looking at and possibly fixing
> bug #12875. The whole sequence number wakeup thing for condition variables is
> fundamentally broken in NPTL and needs to be fixed.
How so? You seem to see issues that go beyond bug #12875 (which I don't see as being a bug currently). If so, please link to them here.
> Basically the issue is that
> the current code is over-engineered to avoid spurious wakeups, but in the
> process it suppresses some non-spurious wakeups...
I'm not aware of any lost wake-ups for non-PI cond vars. If you have provided an alternative implementation proposal in the past, could you link to it when making such comments please? If you haven't but have a proposal now, please link to it here or post to glibc-alpha.
Fixed in master: