Intermittently, "stap .... -c FOOO" appears to hang even though FOOO is normally short-lived. I don't know whether FOOO fails to start in these cases or whether stapio fails to notice that it has ended, but something is occasionally broken.
In one scenario, the initial SIGUSR1 sent to the target_cmd-executing stapio process appears to be lost (either not received, or sent before the child program was listening for it, or perhaps not sent at all?!).
(In reply to comment #1)
> In one scenario, the initial SIGUSR1 sent to the target_cmd-executing
> stapio process appears to be lost (either not received, or sent before
> the child program was listening for it, or perhaps not sent at all?!).

Do your scripts have lots of output? It could be related to #10189, where STP_START gets lost in the overflow...

We also fork the child process before setting up signals, so we wouldn't see the SIGCHLD if it died too soon (i.e. before the SIGUSR1/exec stuff, but that would be an abnormal termination). The fix here is to prepare for SIGCHLD before starting the child.

Another race I see is if the main process sent the SIGUSR1 before the child had set up its handler; this would cause the child to abort. We should get a SIGCHLD in this case though, so while not desirable, it wouldn't cause your hang. We should probably set SIGUSR1 blocked before forking anyway.

There's a tighter race between the child's calls to sigaction-ignore-SIGUSR1 and then pause: the signal could be lost in between. I believe sigsuspend would handle this more atomically.
I am seeing the opposite, -c doesn't hang, but seems to fail to see the process run at all. This seems to be caused by the workaround in mainloop.c for PR6964. Maybe related, maybe not?
(In reply to comment #3)
> I am seeing the opposite, -c doesn't hang, but seems to fail to see the process
> run at all. This seems to be caused by the workaround in mainloop.c for PR6964.
> Maybe related, maybe not?

Forgot to add: this seems to be the cause of spurious testsuite failures in tests such as sdt.exp or exelib.exp, where lots of user space programs are run and probed for a short time with -c. Those fail occasionally with "0 matches". But it is hard to replicate by hand.
Commit ba9abf3 should improve this situation: it uses sigsuspend() instead of pause(), which avoids the pause()-based race condition Josh mentioned in comment #2.

On the 2.6.31-tip kernel, I was seeing consistent failures from the cmd_parse.exp testcase without this fix (although which specific tests failed within that testcase was random). With commit ba9abf3, I get consistent passes with cmd_parse.exp on that kernel.

However, because of the intermittent nature of this problem, it is possible there are still other fixes to be made. So, we'll leave this open for now.
Issue not seen lately.