below script didn't stop when I sent ^C from terminal or SIGTERM from kill command. ---<suspicious2.stp>--- #! stap global proc,name probe process("*").syscall { proc[pid()] <<< 1 name[pid()] = execname() } probe end { foreach(p+ in proc) { printf("%s(%d) issues syscall %d times\n", name[p], p, @sum(proc[p])) } } --- but once I sent ^Z and bg, it could catch the signal. ---<terminal log> $ stap -vvvv suspicious2.stp <snip> stapio:stp_main_loop:290 in main loop stapio:stp_main_loop:297 nb=4 stapio:init_relayfs:124 initializing relayfs stapio:init_relayfs:148 attempting to open /sys/kernel/debug/systemtap/stap_3cb8c5f4cd6b7e714b961ad41f71374b_755/trace0 stapio:init_relayfs:148 attempting to open /sys/kernel/debug/systemtap/stap_3cb8c5f4cd6b7e714b961ad41f71374b_755/trace1 stapio:init_relayfs:154 ncpus=1, bulkmode = 0 stapio:init_relayfs:204 starting threads stapio:stp_main_loop:297 nb=12 stapio:stp_main_loop:337 probe_start() returned 0 <<(here I sent ^C and ^Z) [1]+ Stopped stap -vvvv suspicious2.stp [mhiramat@kumesen utrace]$ bg [1]+ stap -vvvv suspicious2.stp & <<(after bg, it caught signals) stapio:signal_thread:33 sigproc 2 (Interrupt) stapio:stp_main_loop:297 nb=4 stapio:stp_main_loop:330 got STP_EXIT stapio:cleanup_and_exit:248 detach=0 stapio:close_relayfs:221 closing auditd(2318) issues syscall 2 times audispd(2320) issues syscall 1 times irqbalance(2362) issues syscall 20 times rpc.idmapd(2448) issues syscall 20 times pcscd(2549) issues syscall 343 times setroubleshootd(2579) issues syscall 664 times automount(2595) issues syscall 27 times python(2632) issues syscall 3 times sendmail(2678) issues syscall 15 times gpm(2699) issues syscall 28 times avahi-daemon(2876) issues syscall 7 times gam_server(2900) issues syscall 9 times hald-addon-stor(2913) issues syscall 28 times gdm-rh-security(3111) issues syscall 121 times sshd(3182) issues syscall 137 times bash(3183) issues syscall 71 times stapio(4018) issues syscall 68 times stapio:close_relayfs:240 done stapio:cleanup_and_exit:261 closing control channel stapio:cleanup_and_exit:267 removing stap_3cb8c5f4cd6b7e714b961ad41f71374b_755 Pass 5: run completed in 10usr/30sys/8380real ms. Running rm -rf /tmp/stapZRbjve
I checked that this happened on x86-64/kernel-2.6.18-92.el5 and i686/2.6.25.14-108.fc9.i686.
I narrowed it down to the following script: --- #! stap global proc probe process("/usr/libexec/systemtap/stapio").syscall { proc[pid()] <<< 1 } probe end { foreach(p+ in proc) { printf("pid(%d) issues syscall %d times\n", p, @sum(proc[p])) } } ---
I've reproduced this on 2.6.25.14-108.fc9.x86_64. With the addition of a timer probe that exists after 5 seconds, the script exits correctly. probe timer.sec(5) { exit() } One possible fix would be to exclude stapio from system-wide probing.
(In reply to comment #3) > I've reproduced this on 2.6.25.14-108.fc9.x86_64. With the addition of a timer > probe that exists after 5 seconds, the script exits correctly. I checked too. stapio is working, just fails to receive signals. > One possible fix would be to exclude stapio from system-wide probing. I think we should dig this issue deeper, and find the root cause of the issue. If we could ensure that this issue *only* affects to "host" stapio, excluding stapio will be acceptable.
Here's a somewhat unrelated issue. Here's the output of the original script after 5 seconds on a system not really doing anything: audispd(1526) issues syscall 2 times rsyslogd(1547) issues syscall 11 times hald-addon-stor(1794) issues syscall 7 times automount(1814) issues syscall 20 times setroubleshootd(1832) issues syscall 6 times sendmail(1969) issues syscall 10 times kerneloops(2005) issues syscall 5 times stapio(4070) issues syscall 53 times Here's the same output with the timer set to 30 seconds: auditd(1524) issues syscall 2 times audispd(1526) issues syscall 3 times rsyslogd(1547) issues syscall 11 times rpcbind(1565) issues syscall 3 times hald(1698) issues syscall 4 times hald-addon-stor(1794) issues syscall 13 times automount(1814) issues syscall 120 times setroubleshootd(1832) issues syscall 31 times ntpd(1949) issues syscall 40 times sendmail(1969) issues syscall 55 times crond(1996) issues syscall 28 times kerneloops(2005) issues syscall 13 times stapio(4288) issues syscall 303 times Over 5 and 30 seconds, stapio was the biggest issuer of system calls by far. stapio may be polling too often.
After deep investigation by stap, I found a strange behaviour of utrace. At utrace_set_flags(), it calls quiesce() and set TIF_SIGPENDING flag to target thread, but utrace doesn't wake it up. If the target is waiting signal at that time(ex. sigwait), it fails to wakeup when a signal was sent because it already has TIF_SIGPENDING (see __group_complete_signal() and wants_signal() in kernel/signal.c). I think utrace_set_flags()/quiesce() should not use TIF_SIGPENDING.
Created attachment 2913 [details] signal sending trace logs Here is a pair of trace logs of stapio's signal handling. the normal part shows normal behaviour of signal sending. A SIGINT(2) was sent to stapio and __group_complete_signal() found stapio's signal thread(2035) and woke it up. This log also shows signal thread didn't have TIF_SIGPENDING. sigthread(2035){.state=1, .blocked=0, .flags=400040, .sigpending=0} the strange part shows abnormal behaviour of signal sending. A SIGINT(2) was sent to stapio, but __group_complete_signal() couldn't find stapio's signal thread(1984) because it had TIF_SIGPENDING. sigthread(1984){.state=1, .blocked=0, .flags=400040, .sigpending=1} As far as I can see, signal sending process seems expect that signal pending process is already running, not in sleep. Thank you,
I believe I can work around this problem by stopping/quiescing threads before attaching syscall utrace engines to them.
This should be fixed by commit 930a179.
I've added a test for this bug in utrace_p5.exp to make sure it stays fixed.