Bug 12683 - Race conditions in pthread cancellation
Status: NEW
Product: glibc
Classification: Unclassified
Component: nptl
Version: unspecified
Importance: P2 critical
Target Milestone: ---
Assigned To: Not yet assigned to anyone
Reported: 2011-04-18 22:28 UTC by Rich Felker
Modified: 2012-09-22 23:13 UTC (History)

Attachments
Demonstration of file descriptor leak due to problem 1 (894 bytes, text/x-csrc)
2011-04-18 22:28 UTC, Rich Felker
Demonstration of problem 2 (930 bytes, text/plain)
2011-04-18 22:34 UTC, Rich Felker

Description Rich Felker 2011-04-18 22:28:16 UTC
Created attachment 5676
Demonstration of file descriptor leak due to problem 1

The current approach to implementing pthread cancellation points is to enable
asynchronous cancellation prior to making the syscall, and restore the previous
cancellation type once the syscall returns. I've asked around and heard
conflicting answers as to whether this violates the requirements in POSIX (I
believe it does), but either way, from a quality of implementation standpoint
this approach is very undesirable due to at least 2 problems, the latter of
which is very serious:

1. Cancellation can act after the syscall has returned from kernelspace, but
before userspace saves the return value. This results in a resource leak if the
syscall allocated a resource, and there is no way to patch over it with
cancellation handlers. Even if the syscall did not allocate a resource, it may
have had an effect (like consuming data from a socket/pipe/terminal buffer)
which the application will never see.

2. If a signal is handled while the thread is blocked at a cancellable syscall,
the entire signal handler runs with asynchronous cancellation enabled. This
could be extremely dangerous, since the signal handler may call functions which
are async-signal-safe but not async-cancel-safe. Even worse, the signal handler
may call functions which are not even async-signal-safe (like stdio) if it
knows the interrupted code could only be using async-signal-safe functions, and
having a thread asynchronously terminated while modifying such functions'
internal data structures could lead to serious program malfunction.

I am attaching simple programs which demonstrate both issues.
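
For orientation (this is separate from the attachments), the wrapper pattern
described above can be approximated with the public pthread API. This is only
a sketch: read_cp_sketch is a hypothetical name, the raw syscall() call stands
in for whatever glibc does internally, and the comments mark where each
problem's window would fall under these assumptions.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical stand-in for a libc cancellation-point wrapper around read.
   It mirrors the pattern described in the report, not glibc's actual code. */
ssize_t read_cp_sketch(int fd, void *buf, size_t len)
{
    int old_type, ignored;

    /* Switch to asynchronous cancellation for the duration of the syscall. */
    pthread_setcanceltype(PTHREAD_CANCEL_ASYNCHRONOUS, &old_type);

    /* Problem 2: a signal handler that runs while the thread is blocked in
       this call executes with asynchronous cancellation still enabled. */
    ssize_t ret = syscall(SYS_read, fd, buf, len);

    /* Problem 1: the kernel has already returned (and may have consumed data
       or allocated a resource), but a cancellation request can still act
       here, before the caller ever sees ret. */
    pthread_setcanceltype(old_type, &ignored);
    return ret;
}
```

Called on a readable descriptor with no cancellation pending, this behaves
like an ordinary read and leaves the caller's cancellation type unchanged.
The two commented windows only matter when a cancellation request or signal
arrives at exactly those points.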

The solution to problem 2 is making the thread's current execution context
(e.g. stack pointer) at syscall time part of the cancellability state, so that
cancellation requests received while the cancellation point is interrupted by a
signal handler can identify that the thread is not presently in the cancellable
context.
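
A minimal sketch of that check follows. It assumes the stack pointer is saved
just before the syscall and compared against the stack pointer of the context
the cancellation signal interrupted. The struct, its field names, and the
function name are hypothetical and chosen for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-thread record; the field names are illustrative. */
struct cp_state {
    int cancel_pending;      /* a cancellation request has been received */
    uintptr_t sp_at_syscall; /* stack pointer saved just before the syscall,
                                0 when not at a cancellation point */
};

/* Conceptually called from the cancellation signal handler with the stack
   pointer of the interrupted context. Returns nonzero only if the thread was
   interrupted directly at the cancellation point. */
int in_cancellable_context(const struct cp_state *st, uintptr_t interrupted_sp)
{
    assert(st != NULL);
    if (!st->cancel_pending || st->sp_at_syscall == 0)
        return 0;
    /* Frames of a signal handler (or an alternate signal stack) sit at a
       different stack pointer than the one recorded at syscall time, so a
       mismatch means the thread is currently inside a handler. */
    return interrupted_sp == st->sp_at_syscall;
}
```

With this check, a request that arrives while a signal handler is running on
top of the blocked syscall is not acted on immediately. It stays pending until
the thread is back in the cancellable context.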

The solution to problem 1 is making successful return from kernelspace and
exiting the cancellable state an atomic operation. While at first this seems
impossible without kernel support, I have a working implementation in musl
(http://www.etalabs.net/musl) which solves both problems.
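
The report does not spell out the mechanism. One possible way to get this
atomicity, offered here as an assumption rather than as a description of any
particular implementation, is to have the cancellation signal handler inspect
the interrupted program counter: act only if it lies inside the short
instruction range that ends at the syscall instruction, and otherwise leave
the request pending. The names cp_begin, cp_end, and decide_cancel are
hypothetical.

```c
#include <assert.h>
#include <stdint.h>

enum cancel_action {
    CANCEL_LEAVE_PENDING, /* syscall not in progress, or already returned */
    CANCEL_ACT            /* still before or at the syscall instruction */
};

/* cp_begin: address of the first instruction of the cancellable syscall
             sequence (assumed to start after the pending-cancel check).
   cp_end:   address immediately after the syscall instruction.
   pc:       program counter of the context the signal interrupted. */
enum cancel_action decide_cancel(uintptr_t pc, uintptr_t cp_begin,
                                 uintptr_t cp_end)
{
    assert(cp_begin < cp_end);
    /* Inside the range the kernel has not yet delivered a result to
       userspace, so acting on cancellation cannot lose a return value. */
    if (pc >= cp_begin && pc < cp_end)
        return CANCEL_ACT;
    /* Past cp_end the syscall has returned; the result must reach the
       caller, and the request waits for the next cancellation point. */
    return CANCEL_LEAVE_PENDING;
}
```

Under this scheme, "returned from the kernel" and "no longer cancellable" are
the same event as far as the handler can observe, because both are defined by
the program counter having moved past the syscall instruction.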
Comment 1 Rich Felker 2011-04-18 22:34:44 UTC
Created attachment 5677
Demonstration of problem 2

This program should hang, or possibly print x=0 if scheduling is really wacky.
If it exits printing a nonzero value of the volatile variable x, this means the
signal handler wrongly executed under asynchronous cancellation.
Comment 2 Rich Felker 2011-09-21 18:30:01 UTC
It's been 5 months since I filed this bug and there's been no response. I
believe this issue is important enough to at least deserve a response. From my
perspective, it makes NPTL's pthread_cancel essentially unusable. I've even
included a proposed solution (albeit not a patch). Getting a confirmation that
you acknowledge the issue exists and are open to a solution would open the door
for somebody to start the work to integrate the solution with glibc/NPTL and
eventually get it fixed.
Comment 3 Rich Felker 2012-04-29 02:55:59 UTC
Ping. Now that there's some will to revisit bugs that have been long-ignored,
is anyone willing to look into and confirm the problem I've reported here? I
believe problem 2 is extremely serious and could lead to timing-based attacks
that corrupt memory and result in deadlocks or worse. Problem 1 is also serious
for long-lived processes with high reliability requirements that use thread
cancellation, as rare but undetectable resource leaks are nearly inevitable and
will accumulate over time.
Comment 4 Rich Felker 2012-09-22 23:13:30 UTC
I just added a detailed analysis of this bug on my blog at
http://ewontfix.com/2/