Bug 6841 - system-wide utrace syscall tracing script don't respond signal
Summary: system-wide utrace syscall tracing script don't respond signal
Status: RESOLVED FIXED
Alias: None
Product: systemtap
Classification: Unclassified
Component: uprobes (show other bugs)
Version: unspecified
: P2 normal
Target Milestone: ---
Assignee: David Smith
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2008-08-13 16:40 UTC by Masami Hiramatsu
Modified: 2008-08-29 15:06 UTC (History)
0 users

See Also:
Host:
Target:
Build:
Last reconfirmed:


Attachments
signal sending trace logs (890 bytes, text/plain)
2008-08-18 17:44 UTC, Masami Hiramatsu
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Masami Hiramatsu 2008-08-13 16:40:19 UTC
below script didn't stop when I sent ^C from terminal or SIGTERM from kill command.

---<suspicious2.stp>---
#! stap
global proc,name

probe process("*").syscall {
	proc[pid()] <<< 1
	name[pid()] = execname()
}

probe end {
	foreach(p+ in proc) {
		printf("%s(%d) issues syscall %d times\n",
			name[p], p, @sum(proc[p]))
	}
}
---

but once I sent ^Z and bg, it could catch the signal.

---<terminal log>
$ stap -vvvv suspicious2.stp 
<snip>
stapio:stp_main_loop:290 in main loop
stapio:stp_main_loop:297 nb=4
stapio:init_relayfs:124 initializing relayfs
stapio:init_relayfs:148 attempting to open
/sys/kernel/debug/systemtap/stap_3cb8c5f4cd6b7e714b961ad41f71374b_755/trace0
stapio:init_relayfs:148 attempting to open
/sys/kernel/debug/systemtap/stap_3cb8c5f4cd6b7e714b961ad41f71374b_755/trace1
stapio:init_relayfs:154 ncpus=1, bulkmode = 0
stapio:init_relayfs:204 starting threads
stapio:stp_main_loop:297 nb=12
stapio:stp_main_loop:337 probe_start() returned 0
                                                         <<(here I sent ^C and ^Z)
[1]+  Stopped                 stap -vvvv suspicious2.stp
[mhiramat@kumesen utrace]$ bg
[1]+ stap -vvvv suspicious2.stp &
                                                         <<(after bg, it caught
signals) 
stapio:signal_thread:33 sigproc 2 (Interrupt)
stapio:stp_main_loop:297 nb=4
stapio:stp_main_loop:330 got STP_EXIT
stapio:cleanup_and_exit:248 detach=0
stapio:close_relayfs:221 closing
auditd(2318) issues syscall 2 times
audispd(2320) issues syscall 1 times
irqbalance(2362) issues syscall 20 times
rpc.idmapd(2448) issues syscall 20 times
pcscd(2549) issues syscall 343 times
setroubleshootd(2579) issues syscall 664 times
automount(2595) issues syscall 27 times
python(2632) issues syscall 3 times
sendmail(2678) issues syscall 15 times
gpm(2699) issues syscall 28 times
avahi-daemon(2876) issues syscall 7 times
gam_server(2900) issues syscall 9 times
hald-addon-stor(2913) issues syscall 28 times
gdm-rh-security(3111) issues syscall 121 times
sshd(3182) issues syscall 137 times
bash(3183) issues syscall 71 times
stapio(4018) issues syscall 68 times
stapio:close_relayfs:240 done
stapio:cleanup_and_exit:261 closing control channel
stapio:cleanup_and_exit:267 removing stap_3cb8c5f4cd6b7e714b961ad41f71374b_755
Pass 5: run completed in 10usr/30sys/8380real ms.
Running rm -rf /tmp/stapZRbjve
Comment 1 Masami Hiramatsu 2008-08-13 20:19:56 UTC
I checked that this happened on x86-64/kernel-2.6.18-92.el5 and
i686/2.6.25.14-108.fc9.i686.
Comment 2 Masami Hiramatsu 2008-08-14 14:35:11 UTC
I narrowed it down to the following script:

---
#! stap
global proc

probe process("/usr/libexec/systemtap/stapio").syscall {
	proc[pid()] <<< 1
}

probe end {
	foreach(p+ in proc) {
		printf("pid(%d) issues syscall %d times\n",
			p, @sum(proc[p]))
	}
}
---
Comment 3 David Smith 2008-08-14 20:12:52 UTC
I've reproduced this on 2.6.25.14-108.fc9.x86_64.  With the addition of a timer
probe that exists after 5 seconds, the script exits correctly.

  probe timer.sec(5) {
	exit()
  }

One possible fix would be to exclude stapio from system-wide probing.
Comment 4 Masami Hiramatsu 2008-08-14 20:52:09 UTC
(In reply to comment #3)
> I've reproduced this on 2.6.25.14-108.fc9.x86_64.  With the addition of a timer
> probe that exists after 5 seconds, the script exits correctly.

I checked too. stapio is working, just fails to receive signals.

> One possible fix would be to exclude stapio from system-wide probing.

I think we should dig this issue deeper, and find the root cause of the issue.
If we could ensure that this issue *only* affects to "host" stapio, excluding
stapio will be acceptable.
Comment 5 David Smith 2008-08-14 21:08:15 UTC
Here's a somewhat unrelated issue.  Here's the output of the original script
after 5 seconds on a system not really doing anything:

  audispd(1526) issues syscall 2 times
  rsyslogd(1547) issues syscall 11 times
  hald-addon-stor(1794) issues syscall 7 times
  automount(1814) issues syscall 20 times
  setroubleshootd(1832) issues syscall 6 times
  sendmail(1969) issues syscall 10 times
  kerneloops(2005) issues syscall 5 times
  stapio(4070) issues syscall 53 times

Here's the same output with the timer set to 30 seconds:

  auditd(1524) issues syscall 2 times
  audispd(1526) issues syscall 3 times
  rsyslogd(1547) issues syscall 11 times
  rpcbind(1565) issues syscall 3 times
  hald(1698) issues syscall 4 times
  hald-addon-stor(1794) issues syscall 13 times
  automount(1814) issues syscall 120 times
  setroubleshootd(1832) issues syscall 31 times
  ntpd(1949) issues syscall 40 times
  sendmail(1969) issues syscall 55 times
  crond(1996) issues syscall 28 times
  kerneloops(2005) issues syscall 13 times
  stapio(4288) issues syscall 303 times

Over 5 and 30 seconds, stapio was the biggest issuer of system calls by far. 
stapio may be polling too often.
Comment 6 Masami Hiramatsu 2008-08-17 20:22:55 UTC
After deep investigation by stap, I found a strange behaviour of utrace.
At utrace_set_flags(), it calls quiesce() and set TIF_SIGPENDING flag to target 
thread, but utrace doesn't wake it up. If the target is waiting signal at that
time(ex. sigwait), it fails to wakeup when a signal was sent because it already
has TIF_SIGPENDING (see __group_complete_signal() and wants_signal() in
kernel/signal.c).

I think utrace_set_flags()/quiesce() should not use TIF_SIGPENDING.
Comment 7 Masami Hiramatsu 2008-08-18 17:44:16 UTC
Created attachment 2913 [details]
signal sending trace logs

Here is a pair of trace logs of stapio's signal handling.

the normal part shows normal behaviour of signal sending.
A SIGINT(2) was sent to stapio and __group_complete_signal()
found stapio's signal thread(2035) and woke it up.
This log also shows signal thread didn't have TIF_SIGPENDING.
 sigthread(2035){.state=1, .blocked=0, .flags=400040, .sigpending=0}

the strange part shows abnormal behaviour of signal sending.
A SIGINT(2) was sent to stapio, but __group_complete_signal()
couldn't find stapio's signal thread(1984) because it had 
TIF_SIGPENDING.
 sigthread(1984){.state=1, .blocked=0, .flags=400040, .sigpending=1}

As far as I can see, signal sending process seems expect that 
signal pending process is already running, not in sleep.

Thank you,
Comment 8 David Smith 2008-08-18 20:56:34 UTC
I believe I can work around this problem by stopping/quiescing threads before
attaching syscall utrace engines to them.
Comment 9 David Smith 2008-08-19 17:13:14 UTC
This should be fixed by commit 930a179.
Comment 10 David Smith 2008-08-29 15:06:21 UTC
I've added a test for this bug in utrace_p5.exp to make sure it stays fixed.