Bug 10575 - occasional stapio hangs for -c CMD
Summary: occasional stapio hangs for -c CMD
Status: RESOLVED FIXED
Alias: None
Product: systemtap
Classification: Unclassified
Component: runtime
Version: unspecified
Importance: P2 normal
Target Milestone: ---
Assignee: David Smith
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2009-08-29 20:27 UTC by Frank Ch. Eigler
Modified: 2009-11-05 13:56 UTC

See Also:
Host:
Target:
Build:
Last reconfirmed:


Description Frank Ch. Eigler 2009-08-29 20:27:19 UTC
Intermittently, "stap .... -c FOOO" appears to hang, for normally
short-lived FOOO.  I don't know whether in these cases FOOO fails to
start, or whether its ending fails to be noticed by stapio, but
something is occasionally broken.
Comment 1 Frank Ch. Eigler 2009-08-31 17:03:06 UTC
In one scenario, the initial SIGUSR1 sent to the target_cmd-executing
stapio process appears to be lost (either not received, or sent before
the child program was listening for it, or perhaps not sent at all?!).
Comment 2 Josh Stone 2009-08-31 17:40:29 UTC
(In reply to comment #1)
> In one scenario, the initial SIGUSR1 sent to the target_cmd-executing
> stapio process appears to be lost (either not received, or sent before
> the child program was listening for it, or perhaps not sent at all?!).

Do your scripts have lots of output?  It could be related to #10189, where
STP_START gets lost in the overflow...

We also fork the child process before setting up signals, so we wouldn't see the
SIGCHLD if it died too soon (i.e. before the SIGUSR1/exec stuff, though that
would be an abnormal termination).  The fix here is to prepare for SIGCHLD
before starting the child.

Another race I see is if the main process sent the SIGUSR1 before the child had
set up its handler -- this would cause the child to abort.  We should get a
SIGCHLD in this case though, so while not desirable, it wouldn't cause your
hang.  We should probably set SIGUSR1 blocked before forking anyway.

There's a tighter race between the child's calls to sigaction-ignore-SIGUSR1 and
then pause -- the signal could be lost in-between.  I believe sigsuspend would
handle this more atomically.
Comment 3 Mark Wielaard 2009-09-08 09:51:14 UTC
I am seeing the opposite, -c doesn't hang, but seems to fail to see the process
run at all. This seems to be caused by the workaround in mainloop.c for PR6964.
Maybe related, maybe not?
Comment 4 Mark Wielaard 2009-09-08 09:54:02 UTC
(In reply to comment #3)
> I am seeing the opposite, -c doesn't hang, but seems to fail to see the process
> run at all. This seems to be caused by the workaround in mainloop.c for PR6964.
> Maybe related, maybe not?

Forgot to add: this seems to be the cause of spurious testsuite failures where
lots of user space programs are run and probed for a short time with -c, like
sdt.exp or exelib.exp. Those fail occasionally with "0 matches". But it is hard
to replicate by hand.
Comment 5 David Smith 2009-10-13 14:05:42 UTC
commit ba9abf3 should improve this situation by avoiding the pause()-based race
condition Josh mentioned in comment #2.  I've added code to use sigsuspend()
instead of pause() to avoid the race condition.

On the 2.6.31-tip kernel, I was seeing consistent failures from the
cmd_parse.exp testcase without this fix (but the specific test failures within
that testcase were random).  With commit ba9abf3, I get consistent passes with
cmd_parse.exp on that kernel.

However, because of the intermittent nature of this problem, it is possible
there are still other fixes to be made.  So, we'll leave this open for now.
Comment 6 Frank Ch. Eigler 2009-11-05 13:56:30 UTC
Issue not seen lately.