Bug 3381

Summary: Multiple stopped threads aren't terminating
Product: frysk
Reporter: Andrew Cagney <cagney>
Component: general
Assignee: Chris Moller <cmoller>
Status: RESOLVED DUPLICATE
Severity: normal
CC: cmoller
Priority: P2
Version: unspecified
Target Milestone: ---
Host:
Target:
Build:
Last reconfirmed:
Bug Depends on: 3502
Bug Blocks: 3595
Attachments: C testcase for this bug
             Testcase for this bug

Description Andrew Cagney 2006-10-18 14:44:05 UTC
Downstream bug: https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=210693

The frysk TestProcStopped testMultiThreadedStoppedAckDaemon case hangs during
teardown.  Attempts are made to kill each of two ptrace-attached threads:

kill -KILL 2741
kill -KILL 2741
kill -CONT 2741
kill -CONT 2741
detach -KILL 2741
detach -KILL 2741

but a subsequent waitpid(-1,...) blocks indefinitely, suggesting that the kill
signals were never delivered.
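
The teardown sequence logged above (KILL, CONT, then detach with KILL) can be
sketched in C roughly as follows.  This is only an illustration under stated
assumptions, not frysk's actual tearDown code: spawn_attached_child() and
teardown_tracee() are hypothetical names, the tracee is a single-threaded
forked child rather than a thread of a daemon, and the blocking
waitpid(-1,...) is replaced by a bounded poll on the one pid (about 2 seconds)
so that a kernel with this bug makes the sketch report failure instead of
hanging.

```c
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: fork an idle child, ptrace-attach to it and wait
 * for the attach-induced stop.  Returns the child's pid, or -1 on failure. */
pid_t spawn_attached_child(void)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        for (;;)
            pause();            /* child idles until a signal ends it */
    }
    if (ptrace(PTRACE_ATTACH, pid, NULL, NULL) < 0 ||
        waitpid(pid, &status, 0) != pid || !WIFSTOPPED(status)) {
        kill(pid, SIGKILL);     /* best-effort cleanup */
        waitpid(pid, NULL, 0);
        return -1;
    }
    return pid;
}

/* Hypothetical helper mirroring the logged sequence for one tracee.
 * Returns 1 if the tracee was reaped within the poll budget, else 0. */
int teardown_tracee(pid_t pid)
{
    struct timespec delay = { 0, 10 * 1000 * 1000 };   /* 10 ms */

    assert(pid > 0);            /* kill(0,...) or kill(-1,...) would hit a group */

    /* Errors are ignored on purpose: once the SIGKILL takes effect, the
     * later calls may fail simply because the tracee is already gone. */
    kill(pid, SIGKILL);
    kill(pid, SIGCONT);
    ptrace(PTRACE_DETACH, pid, NULL, (void *)(long)SIGKILL);

    /* Assumption: poll about 200 times (roughly 2 s) rather than block. */
    for (int attempt = 0; attempt < 200; attempt++) {
        int status;
        pid_t r = waitpid(pid, &status, WNOHANG);

        if (r == pid && (WIFEXITED(status) || WIFSIGNALED(status)))
            return 1;           /* the kill was delivered and acted on */
        if (r < 0)
            return 0;           /* nothing left to wait for */
        nanosleep(&delay, NULL);
    }
    return 0;                   /* still not dead: the behavior reported here */
}
```

Calling teardown_tracee(spawn_attached_child()) after checking the pid is
positive would be expected to return 1 on a kernel that delivers the signals;
a return of 0 corresponds to the hang described in this report.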


Kernel 2.6.18-1.2725.el5


How reproducible:
100%


Steps to Reproduce:
1.  Install a kernel with the latest utrace patch.
2.  Install and build frysk.
3.  cd to the frysk build directory/frysk-core.
4.  Run ./TestRunner -c FINE frysk.proc.TestProcStopped
  
Actual results:
Test hangs after the testMultiThreadedStoppedAckDaemon case.

Expected results:
Test runs to completion.


Additional info:
May be related to bug 207674: PTRACE_DETACH doesn't deliver signals under
utrace. https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=207674, the
difference being that 207674 dealt with runnable procs but this bug deals with
stopped procs.  This is entirely my conjecture at this point, based on very
little investigation.
Comment 1 Chris Moller 2006-10-23 15:17:45 UTC
Created attachment 1387 [details]
C testcase for this bug

Anyway, what the test does is start a few procs, reparent them to init,
ptrace-attach to them, and then go through the frysk tearDown sequence
to detach and kill them.  This works: init seems to get notified that
the procs have been killed, but frysk doesn't.
Comment 2 Andrew Cagney 2006-10-24 14:36:03 UTC
Chris, I'm seeing this test also fail on FC-5?
FAIL: frysk3381/reparent
Comment 3 Andrew Cagney 2006-11-10 18:01:27 UTC
I saw notes on what fail/pass behavior this test was looking for; can that be
added here?

Also, on FC-5, this test appears to largely pass (or at least follow the expected
behavior).  It dies just at the end, probably a nit; I will create a separate bug
for that fix.
Comment 4 Chris Moller 2006-11-15 05:06:42 UTC
Created attachment 1421 [details]
Testcase for this bug

What appears to have been happening is that buggy kernels either don't deliver
kill(pid, SIGKILL) signals to attached processes, or prevent the process from
acting on that signal.  This testcase creates and attaches a few child procs,
waitpid()s to make sure the attach succeeds, then kill(pid, SIGKILL)s the
procs.  It then spins on a waitpid(-1, NULL, WNOHANG) until a non-positive pid
is returned.  If /no/ positive pids are returned, it is assumed that the
kill()s did not succeed and the test fails; otherwise it passes.
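
For readers without the attachment, the check described above can be sketched
roughly as follows.  This is not the attached testcase itself:
count_killed_attached_children(), MAX_CHILDREN and POLL_ATTEMPTS are
hypothetical names, and the bounded retry (about 2 seconds of 10 ms polls) is
an assumption added here so that a waitpid(-1, NULL, WNOHANG) call made before
the children have finished dying is not mistaken for a failed kill.

```c
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

#define MAX_CHILDREN 16     /* illustrative limit, not from the attachment */
#define POLL_ATTEMPTS 200   /* 200 polls x 10 ms = roughly 2 s budget */

/* Hypothetical helper: fork nchildren idle children, ptrace-attach to each,
 * SIGKILL them all, then count how many waitpid() reaps.
 * Returns the number reaped, or -1 if fork or attach failed. */
int count_killed_attached_children(int nchildren)
{
    pid_t pids[MAX_CHILDREN];
    struct timespec delay = { 0, 10 * 1000 * 1000 };   /* 10 ms */
    int started = 0, reaped = 0, ok = 1;

    assert(nchildren > 0 && nchildren <= MAX_CHILDREN);

    /* Create children that idle until a signal ends them. */
    for (; started < nchildren; started++) {
        pid_t pid = fork();
        if (pid < 0) {
            ok = 0;
            break;
        }
        if (pid == 0) {
            for (;;)
                pause();
        }
        pids[started] = pid;
    }

    /* Attach to each child and wait for the attach-induced stop. */
    for (int i = 0; ok && i < started; i++) {
        int status;
        if (ptrace(PTRACE_ATTACH, pids[i], NULL, NULL) < 0 ||
            waitpid(pids[i], &status, 0) != pids[i] || !WIFSTOPPED(status))
            ok = 0;
    }

    /* Kill every child that was started (this doubles as cleanup when
     * setup failed part way through). */
    for (int i = 0; i < started; i++)
        kill(pids[i], SIGKILL);

    /* Poll with WNOHANG; a positive return is a reaped child. */
    for (int attempt = 0; attempt < POLL_ATTEMPTS && reaped < started; attempt++) {
        pid_t r = waitpid(-1, NULL, WNOHANG);
        if (r > 0)
            reaped++;
        else if (r == 0)
            nanosleep(&delay, NULL);    /* children exist but are not dead yet */
        else
            break;                      /* no children left to wait for */
    }
    return ok ? reaped : -1;
}
```

With three children, a return value of 3 corresponds to a pass as described
above, and a return of 0 after the poll budget runs out corresponds to the
failing case where no positive pid is ever returned.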

This test works as expected by passing on FC5 machines and an FC6 machine with
a 2.6.18-1.2849.fc6 kernel, and failing otherwise.
Comment 5 Andrew Cagney 2006-11-23 21:00:55 UTC
Test f3381 is passing on broken FC 5 and FC 6 systems!
Comment 6 Chris Moller 2006-11-30 16:46:41 UTC
All f3381 does is check that a kill(pid, SIGKILL) to an attached stopped process
actually succeeds in killing the process.  Ptrace and some older versions of
utrace didn't do that and it appears that Roland did something to change the
behaviour, presumably because SIGKILLs should always work.  All that passing the
test means is that SIGKILLs work under the circumstances described--it doesn't
imply a thing about otherwise "broken FC 5 and FC 6 systems."

The test that demonstrated this failure mode
(frysk.proc.TestProcStopped.testMultiThreadedStoppedAckDaemon) is still failing,
but it fails only intermittently now and appears to be failing by a different
mechanism.  I'm trying to isolate the mechanism now--when I figure it out, I'll
try to come up with another C testcase that demonstrates it reliably.
Comment 7 Andrew Cagney 2006-11-30 20:04:18 UTC
This looks very similar to:
  http://sourceware.org/bugzilla/show_bug.cgi?id=3595
for which I created a test and it gets the results:
 fail: 2.6.18-1.2239.fc5 (my machine)
 fail: 2.6.18-1.2849.fc6  (towns)
 pass: 2.6.17-1.2174_FC5 (toadstool)
Comment 8 Andrew Cagney 2006-12-15 21:27:21 UTC

*** This bug has been marked as a duplicate of 3595 ***