This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: FAIL nptl/tst-robustpi4 [BZ 23183]


On 06/29/2018 08:54 AM, Stefan Liebler wrote:
On 01/26/2017 05:22 PM, Torvald Riegel wrote:
On Thu, 2017-01-26 at 11:12 -0500, Carlos O'Donell wrote:
On 01/26/2017 10:29 AM, Stefan Liebler wrote:
It seems that a race between the futex and exit syscalls causes the
futex syscall to return ESRCH.

I'll have a closer look at this.

I see those failures with Linux 4.8 / 4.9 running in a z/VM guest as
well as with 4.6 on an LPAR (but less often).

I've seen tst-robustpi7 and tst-robustpi8 failures on all hardware
across a wide number of kernels, but never tst-robustpi4.

https://sourceware.org/bugzilla/show_bug.cgi?id=19004

The robustpi support is certainly not very robust as Torvald's
recent fixes show, and there still remains at least one design
flaw that can't be fixed.

e.g.
https://sourceware.org/bugzilla/show_bug.cgi?id=14485

The underlying problem for that bug does not affect PI+robust, just
robust, I think.  Unless I forgot about something, PI+robust should
always use the kernel to unlock.

In the meantime, Florian Weimer was also able to reproduce this issue and opened Bug 23183 - tst-robustpi4 test failure (https://sourceware.org/bugzilla/show_bug.cgi?id=23183).

I've also dug a bit deeper (see the details in Bugzilla) and was able to reproduce it on Intel as well.

If the thread holding the locked mutex is executing the exit syscall
while the main thread is executing the futex syscall,
the futex syscall can return ESRCH, which triggers the assertion.

In this situation, the futex-syscall has already added the FUTEX_WAITERS bit to the lock-value and is then calling attach_to_pi_owner().

The exit-syscall is now setting the lock-value to FUTEX_WAITERS | FUTEX_OWNER_DIED and is proceeding.

attach_to_pi_owner() then tries, for example, to look up the owner task and/or tests whether the owner is currently exiting. In those cases, ESRCH is returned!
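
To make the sequence above easier to follow, here is a minimal sketch in C that only models how the lock value changes; it makes no syscalls and is not the kernel's code. The SKETCH_ constants copy the values of FUTEX_WAITERS, FUTEX_OWNER_DIED and FUTEX_TID_MASK from <linux/futex.h>. All function names are hypothetical, and the assumption that the exiting owner keeps the waiters bit while clearing the TID is taken from the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Same values as FUTEX_WAITERS, FUTEX_OWNER_DIED and FUTEX_TID_MASK in
   <linux/futex.h>; redefined with a SKETCH_ prefix so this builds anywhere. */
#define SKETCH_FUTEX_WAITERS    0x80000000u
#define SKETCH_FUTEX_OWNER_DIED 0x40000000u
#define SKETCH_FUTEX_TID_MASK   0x3fffffffu

/* Hypothetical helpers that decode the 32-bit lock value. */
static inline uint32_t sketch_owner_tid(uint32_t v)
{
    return v & SKETCH_FUTEX_TID_MASK;
}

static inline bool sketch_has_waiters(uint32_t v)
{
    return (v & SKETCH_FUTEX_WAITERS) != 0;
}

static inline bool sketch_owner_died(uint32_t v)
{
    return (v & SKETCH_FUTEX_OWNER_DIED) != 0;
}

/* Step 1: the waiter's futex syscall sets the waiters bit; the owner
   TID is still present in the lock value at this point. */
static inline uint32_t sketch_waiter_marks_lock(uint32_t v)
{
    return v | SKETCH_FUTEX_WAITERS;
}

/* Step 2: the exiting owner replaces the lock value. Assumption: the
   waiters bit is kept, the TID is dropped, and OWNER_DIED is set. */
static inline uint32_t sketch_owner_exit_marks_lock(uint32_t v)
{
    return (v & SKETCH_FUTEX_WAITERS) | SKETCH_FUTEX_OWNER_DIED;
}

/* Replays both steps in the order described and returns the final value. */
static inline uint32_t sketch_replay_race(uint32_t owner_tid)
{
    uint32_t v = owner_tid & SKETCH_FUTEX_TID_MASK;  /* locked, no waiters */

    v = sketch_waiter_marks_lock(v);
    /* The waiter now goes looking for the task named by this TID. */
    assert(sketch_has_waiters(v) && !sketch_owner_died(v));

    v = sketch_owner_exit_marks_lock(v);
    /* The TID the waiter read in step 1 no longer appears in the word. */
    assert(sketch_owner_tid(v) == 0);
    return v;
}
```

The point of the sketch is the window between the two steps: the waiter has already read a lock value that still names the owner, so a lookup based on that TID can fail even though the word itself is later marked FUTEX_OWNER_DIED.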

Does the kernel look at the TID and determine that it no longer exists, or does it use the FUTEX_OWNER_DIED bit to detect this situation?

I'm worried that using the TID introduces a TID race here.

Thanks,
Florian

