Sourceware Bugzilla – Bug 12683
Race conditions in pthread cancellation
Last modified: 2012-09-22 23:13:30 UTC
Created attachment 5676: Demonstration of file descriptor leak due to problem 1

The current approach to implementing pthread cancellation points is to enable asynchronous cancellation prior to making the syscall, and restore the previous cancellation type once the syscall returns. I've asked around and heard conflicting answers as to whether this violates the requirements in POSIX (I believe it does), but either way, from a quality of implementation standpoint this approach is very undesirable due to at least 2 problems, the latter of which is very serious:

1. Cancellation can act after the syscall has returned from kernelspace, but before userspace saves the return value. This results in a resource leak if the syscall allocated a resource, and there is no way to patch over it with cancellation handlers. Even if the syscall did not allocate a resource, it may have had an effect (like consuming data from a socket/pipe/terminal buffer) which the application will never see.

2. If a signal is handled while the thread is blocked at a cancellable syscall, the entire signal handler runs with asynchronous cancellation enabled. This could be extremely dangerous, since the signal handler may call functions which are async-signal-safe but not async-cancel-safe. Even worse, the signal handler may call functions which are not even async-signal-safe (like stdio) if it knows the interrupted code could only be using async-signal-safe functions, and having a thread asynchronously terminated while modifying such functions' internal data structures could lead to serious program malfunction.

I am attaching simple programs which demonstrate both issues.

The solution to problem 2 is making the thread's current execution context (e.g. stack pointer) at syscall time part of the cancellability state, so that cancellation requests received while the cancellation point is interrupted by a signal handler can identify that the thread is not presently in the cancellable context.
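The pattern described above (enable asynchronous cancellation, make the syscall, restore the previous type) can be sketched roughly as follows. This is an illustration only, not glibc's actual code: the name `cancellable_read_sketch` is hypothetical, and `read` is just a stand-in for any cancellable syscall.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical wrapper illustrating the pattern; not glibc's real code. */
ssize_t cancellable_read_sketch(int fd, void *buf, size_t count)
{
    int oldtype, ignored;
    ssize_t ret;

    /* Step 1: switch to asynchronous cancellation before the syscall. */
    pthread_setcanceltype(PTHREAD_CANCEL_ASYNCHRONOUS, &oldtype);

    /* Step 2: the syscall itself.
     * Problem 1: a cancellation request can act after the kernel has
     * finished the read but before `ret` is stored, so the consumed
     * data (or an allocated descriptor, for calls like open) is lost.
     * Problem 2: a signal handler that interrupts this call runs with
     * asynchronous cancellation still enabled. */
    ret = read(fd, buf, count);

    /* Step 3: restore the caller's cancellation type. */
    pthread_setcanceltype(oldtype, &ignored);
    return ret;
}
```

When no cancellation request is pending, the wrapper behaves like a plain `read`; the race only shows up when `pthread_cancel` lands inside the window between steps 1 and 3.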
The solution to problem 1 is making successful return from kernelspace and exiting the cancellable state an atomic operation. While at first this seems impossible without kernel support, I have a working implementation in musl (http://www.etalabs.net/musl) which solves both problems.
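The two proposed fixes can be modelled as a single decision made when a cancellation request arrives. The sketch below is an assumption about how such a check might look, not a description of musl's code: `struct cancel_point`, its fields, and `cancel_may_act_now` are all hypothetical names, and the addresses are plain integers so the logic can be exercised without any assembly.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-thread record of the cancellable syscall in progress. */
struct cancel_point {
    uintptr_t code_begin;  /* first instruction of the cancellable region */
    uintptr_t code_end;    /* address just past the syscall instruction */
    uintptr_t sp_at_entry; /* stack pointer saved when the region was entered */
};

/* Decide whether a cancellation request that interrupted the thread at
 * (pc, sp) may act immediately. Returning false means the request stays
 * pending until the next cancellation point. */
bool cancel_may_act_now(const struct cancel_point *cp,
                        uintptr_t pc, uintptr_t sp)
{
    /* Problem 2: a signal handler running on top of the blocked syscall
     * has a different stack pointer, so the thread is not in the
     * cancellable context right now. */
    if (sp != cp->sp_at_entry)
        return false;

    /* Problem 1: once pc has moved past the syscall instruction, the
     * kernel has returned and its result must reach the caller. */
    return pc >= cp->code_begin && pc < cp->code_end;
}
```

Under these assumptions, a request that arrives at `code_end` or later is never acted on immediately, which is what makes "syscall returned successfully" and "left the cancellable state" appear atomic to the canceller.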
Created attachment 5677: Demonstration of problem 2

This program should hang, or possibly print x=0 if scheduling is really wacky. If it exits printing a nonzero value of the volatile variable x, this means the signal handler wrongly executed under asynchronous cancellation.
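The mechanism behind problem 2 can be shown deterministically with a much smaller sketch than the attachment (this is not the attached program). It assumes the enable-async/restore pattern is written out by hand, uses `raise` as a stand-in for a signal arriving during the blocked syscall, and calls `pthread_setcanceltype` inside the handler purely as a probe, even though POSIX does not list that function as async-signal-safe. All names are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <signal.h>
#include <string.h>

/* Cancel type observed inside the handler; -1 until the handler runs. */
static volatile sig_atomic_t type_seen_in_handler = -1;

static void probe_handler(int sig)
{
    int prev, ignored;
    (void)sig;
    /* Probe the current cancel type by setting it and restoring it.
     * Assumption: tolerable for a demo; not formally async-signal-safe. */
    pthread_setcanceltype(PTHREAD_CANCEL_DEFERRED, &prev);
    pthread_setcanceltype(prev, &ignored);
    type_seen_in_handler = prev;
}

/* Returns the cancel type that was in effect while the handler ran. */
int cancel_type_seen_by_handler(void)
{
    struct sigaction sa, old_sa;
    int oldtype, ignored;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = probe_handler;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGUSR1, &sa, &old_sa);

    /* Same pattern as a cancellation point built on async cancellation. */
    pthread_setcanceltype(PTHREAD_CANCEL_ASYNCHRONOUS, &oldtype);
    raise(SIGUSR1); /* stands in for a signal during the blocked syscall */
    pthread_setcanceltype(oldtype, &ignored);

    sigaction(SIGUSR1, &old_sa, NULL);
    return type_seen_in_handler;
}
```

The handler sees `PTHREAD_CANCEL_ASYNCHRONOUS`, so any function it calls could be torn down mid-update by a concurrent `pthread_cancel`.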
It's been 5 months since I filed this bug and there's been no response. I believe this issue is important enough to at least deserve a response. From my perspective, it makes NPTL's pthread_cancel essentially unusable. I've even included a proposed solution (albeit not a patch). Getting a confirmation that you acknowledge the issue exists and are open to a solution would open the door for somebody to start the work to integrate the solution with glibc/NPTL and eventually get it fixed.
Ping. Now that there's some will to revisit bugs that have been long-ignored, is anyone willing to look into and confirm the problem I've reported here? I believe problem 2 is extremely serious and could lead to timing-based attacks that corrupt memory and result in deadlocks or worse. Problem 1 is also serious for long-lived processes with high reliability requirements that use thread cancellation, as rare but undetectable resource leaks are nearly inevitable and will accumulate over time.
I just added a detailed analysis of this bug on my blog at http://ewontfix.com/2/