Created attachment 5950
Test program which exhibits the performance difference between pshared and non-pshared condition variables.
The attached program shows NPTL's non-process-shared condition variables (which utilize futex requeue) performing significantly worse than process-shared ones (which simply use a broadcast futex wake). On my machine (Atom N280 dual core) it takes ~11.7 seconds with a non-pshared cond var and ~5.3 seconds with a pshared cond var (comment/uncomment the pthread_cond_init line to change which is used).
Of course requeue-based broadcast should scale better to huge numbers of waiters, and this test program only has 5 waiters. Still, the performance should not be this bad. With musl libc, I get comparable performance with pshared and non-pshared cond vars (and both outperform NPTL, with run times around 2.5-3 seconds).
If you're unwilling to properly fix whatever's making it slow, perhaps just using a broadcast futex wake rather than the requeue code whenever the number of waiters is less than ~10 would be an easy "fix"...
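To make the suggestion concrete, here is a rough sketch of such a threshold at the raw futex level. It is not glibc code: broadcast_hybrid, SMALL_WAITER_LIMIT, and the idea that the caller already knows the waiter count and the expected value of the cond var's futex word are all assumptions for illustration. It is Linux-only and leaves out everything else a real condvar has to do.

```c
/* Illustrative sketch only; names and structure are hypothetical, not NPTL's. */
#define _GNU_SOURCE
#include <limits.h>
#include <linux/futex.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

#define SMALL_WAITER_LIMIT 10   /* the "~10" cutoff suggested above */

/*
 * Wake everyone with a plain broadcast when there are few waiters;
 * otherwise wake one and requeue the rest onto the mutex's futex word.
 * Returns the futex syscall result (-1 with errno set on failure).
 */
long broadcast_hybrid(uint32_t *cond_word, uint32_t expected,
                      uint32_t *mutex_word, unsigned nwaiters)
{
    if (nwaiters < SMALL_WAITER_LIMIT)
        return syscall(SYS_futex, cond_word, FUTEX_WAKE_PRIVATE,
                       INT_MAX, NULL, NULL, 0);

    /* For requeue, the 4th argument carries the maximum number to requeue. */
    return syscall(SYS_futex, cond_word, FUTEX_CMP_REQUEUE_PRIVATE,
                   1, (unsigned long)INT_MAX, mutex_word, expected);
}
```

With no threads blocked on cond_word, both paths should simply return 0; the requeue path fails with EAGAIN if *cond_word no longer equals expected.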
BTW, I suspect the overly-complex sequencing code aimed at minimizing spurious wakes, which also seems responsible for bugs 12875 and 13165, is probably part of the problem...
I can't reproduce this on x86_64 RHEL7 (old condvar algorithm). The new condvar algorithm doesn't use requeue, so it should not be affected either. Therefore, I'll close this bug.