Downstream bug: https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=210693

The frysk TestProcStopped testMultiThreadedStoppedAckDaemon case hangs during teardown. Attempts are made to kill each of two ptrace-attached threads:

  kill -KILL 2741
  kill -KILL 2741
  kill -CONT 2741
  kill -CONT 2741
  detach -KILL 2741
  detach -KILL 2741

but a subsequent waitpid(-1,...) blocks indefinitely, suggesting that the kill signals were never delivered.

Kernel 2.6.18-1.2725.el5

How reproducible: 100%

Steps to Reproduce:
1. Install a kernel with the latest utrace patch.
2. Install and build frysk.
3. cd to the frysk build directory/frysk-core.
4. Run ./TestRunner -c FINE frysk.proc.TestProcStopped

Actual results: The test hangs after the testMultiThreadedStoppedAckDaemon case.

Expected results: The test runs to completion.

Additional info: May be related to bug 207674 (PTRACE_DETACH doesn't deliver signals under utrace), https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=207674. The difference is that 207674 dealt with runnable procs, while this bug deals with stopped procs. This is entirely my conjecture at this point, based on very little investigation.
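For readers who want to poke at the teardown sequence outside frysk, here is a minimal C sketch of the same three steps (SIGKILL, SIGCONT, detach with SIGKILL) applied to one ptrace-attached, stopped child. It is not frysk code: the function name teardown_attached_child and its timeout parameter are made up for illustration, it uses a single-threaded child and kill() rather than per-thread signalling, and it polls with a timeout instead of blocking in waitpid so that an affected kernel produces a failure result rather than a hang.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of frysk): attach to a child, run the
 * KILL / CONT / detach-with-KILL sequence from the report, then poll
 * for the child's death.
 * Returns 1 if the child was reaped as killed by SIGKILL,
 *         0 if it was not reaped in time (or died some other way),
 *        -1 if the setup (fork/attach) failed.
 */
int teardown_attached_child(int timeout_ms)
{
    assert(timeout_ms > 0);

    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        for (;;)
            pause();            /* child: sleep until signalled */
    }

    /* Attach and wait for the attach-induced stop. */
    int status = 0;
    if (ptrace(PTRACE_ATTACH, pid, NULL, NULL) < 0
        || waitpid(pid, &status, 0) != pid
        || !WIFSTOPPED(status)) {
        kill(pid, SIGKILL);     /* best-effort cleanup */
        waitpid(pid, NULL, 0);
        return -1;
    }

    /* The teardown sequence; errors are ignored on purpose, since the
       child may already be gone by the time of the later calls. */
    kill(pid, SIGKILL);
    kill(pid, SIGCONT);
    ptrace(PTRACE_DETACH, pid, NULL, (void *)(long)SIGKILL);

    /* Poll in 10 ms steps instead of blocking, so a kernel that never
       delivers the signal yields 0 rather than a hang. In that case
       the stopped child is left behind. */
    struct timespec nap = { 0, 10 * 1000 * 1000 };
    for (int waited = 0; waited < timeout_ms; waited += 10) {
        pid_t r = waitpid(pid, &status, WNOHANG);
        if (r == pid && WIFSIGNALED(status))
            return WTERMSIG(status) == SIGKILL ? 1 : 0;
        if (r < 0)
            return -1;
        nanosleep(&nap, NULL);
    }
    return 0;
}
```

Calling teardown_attached_child(5000) from a small main and treating 0 as "signals not delivered" would approximate the hang described above, under the assumptions just listed.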
Created attachment 1387: C testcase for this bug

Anyway, what the test does is start a few procs, reparent them to init, ptrace-attach them, and then go through the frysk tearDown sequence to detach and kill them. This works, and init seems to get notified that the procs have been killed, but frysk doesn't.
Chris, I'm seeing this test also fail on FC-5? FAIL: frysk3381/reparent
I saw notes on what fail/pass behavior this test was looking for; can they be added here? Also, on FC-5, this test appears to largely pass (or at least follow the expected behavior). It dies just at the end, which is probably a nit; I will create a separate bug for that fix.
Created attachment 1421: Testcase for this bug

What appears to have been happening is that buggy kernels either don't deliver kill(pid, SIGKILL) signals to attached processes, or prevent the process from acting on that signal. This testcase creates and attaches a few child procs, waitpid()s to make sure the attach succeeds, then kill(pid, SIGKILL)s the procs. It then spins on a waitpid(-1, NULL, WNOHANG) until a non-positive pid is returned. If /no/ positive pids are returned, it is assumed that the kill()s did not succeed and the test fails; otherwise it passes.

This test works as expected by passing on FC5 machines and an FC6 machine with a 2.6.18-1.2849.fc6 kernel, and failing otherwise.
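The attachment itself is not reproduced in this thread, so what follows is only a rough sketch of the logic described above, not the attached testcase. The names count_killed_attached and MAX_PROCS are invented here. One deliberate deviation: rather than stopping at the first non-positive waitpid result, the sketch keeps polling for a bounded time when waitpid returns 0, since the children may not have died yet at the moment of the first call.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

#define MAX_PROCS 16   /* arbitrary limit for this sketch */

/*
 * Hypothetical helper: fork nprocs children, ptrace-attach each one,
 * SIGKILL them all, then count how many waitpid(-1, ...) reaps.
 * Returns the number of children reaped, or -1 if fork/attach failed.
 * A result of 0 corresponds to the "test fails" case in the comment.
 */
int count_killed_attached(int nprocs)
{
    pid_t pids[MAX_PROCS];
    int started = 0, reaped = 0, ok = 1;

    assert(nprocs > 0 && nprocs <= MAX_PROCS);

    for (int i = 0; i < nprocs && ok; i++) {
        pid_t pid = fork();
        if (pid < 0) {
            ok = 0;
            break;
        }
        if (pid == 0) {
            for (;;)
                pause();        /* child: sleep until signalled */
        }
        pids[started++] = pid;

        /* Attach, then wait to make sure the attach stop arrived. */
        int status = 0;
        if (ptrace(PTRACE_ATTACH, pid, NULL, NULL) < 0
            || waitpid(pid, &status, 0) != pid
            || !WIFSTOPPED(status))
            ok = 0;
    }

    /* SIGKILL every child that was started, attached or not. */
    for (int i = 0; i < started; i++)
        kill(pids[i], SIGKILL);

    /* Poll waitpid(-1, NULL, WNOHANG) for up to about 5 seconds:
       a positive pid is a reaped child, a negative result means no
       children are left, and 0 means "nothing yet, try again". */
    struct timespec nap = { 0, 10 * 1000 * 1000 };
    for (int tries = 0; tries < 500; tries++) {
        pid_t r = waitpid(-1, NULL, WNOHANG);
        if (r > 0) {
            reaped++;
            continue;
        }
        if (r < 0)
            break;
        nanosleep(&nap, NULL);
    }

    return ok ? reaped : -1;
}
```

A caller would treat count_killed_attached(3) == 0 as the failure described in the comment (no SIGKILL took effect) and any positive count as a pass; -1 only says that the sketch could not set up its children, for example where ptrace is not permitted.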
Test f3381 is passing on broken FC 5 and FC 6 systems!
All f3381 does is check that a kill(pid, SIGKILL) to an attached stopped process actually succeeds in killing the process. Ptrace and some older versions of utrace didn't do that, and it appears that Roland did something to change the behaviour, presumably because SIGKILLs should always work. All that passing the test means is that SIGKILLs work under the circumstances described--it doesn't imply a thing about otherwise "broken FC 5 and FC 6 systems." The test that demonstrated this failure mode (frysk.proc.TestProcStopped.testMultiThreadedStoppedAckDaemon) is still failing, but it fails only intermittently now and appears to be failing by a different mechanism. I'm trying to isolate the mechanism now--when I figure it out, I'll try to come up with another C testcase that demonstrates it reliably.
This looks very similar to http://sourceware.org/bugzilla/show_bug.cgi?id=3595, for which I created a test. It gets these results:

fail: 2.6.18-1.2239.fc5 (my machine)
fail: 2.6.18-1.2849.fc6 (towns)
pass: 2.6.17-1.2174_FC5 (toadstool)
*** This bug has been marked as a duplicate of 3595 ***