Bug 23891 - stap and stapio process are stuck in signal processor and could not terminate properly
Summary: stap and stapio process are stuck in signal processor and could not terminate...
Status: RESOLVED FIXED
Alias: None
Product: systemtap
Classification: Unclassified
Component: translator (show other bugs)
Version: unspecified
: P2 normal
Target Milestone: ---
Assignee: Unassigned
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-11-16 04:56 UTC by Datong Sun
Modified: 2018-11-26 16:25 UTC (History)
2 users (show)

See Also:
Host:
Target:
Build:
Last reconfirmed:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Datong Sun 2018-11-16 04:56:42 UTC
Hi,

We have observed that sometimes the stap (and associated stapio) process could not be terminated using SIGTERM on one of our production systems:

strace shows the following:

+ timeout 5s strace -p 8596 -s 1024
Process 8596 attached
write(2, "Too many interrupts received, exiting.\n", 39Process 8596 detached
 <detached ...>
+ true
+ timeout 5s strace -p 9110 -s 1024
Process 9110 attached
write(2, "WARNING:", 8Process 9110 detached
 <detached ...>
+ true

The full backtrace of the stap and stapio process:

+ gstack 8596
#0  0x00007f8cc36bc420 in __write_nocancel () at ../sysdeps/unix/syscall-template.S:81
#1  0x0000000000412308 in handle_interrupt () at main.cxx:280
#2  <signal handler called>
#3  0x00007f8cc36bceca in __libc_waitpid (pid=9110, stat_loc=0x7ffe7df20abc, options=0) at ../sysdeps/unix/sysv/linux/waitpid.c:31
#4  0x0000000000597d8a in stap_waitpid (verbose=0, pid=9110) at util.cxx:749
#5  0x0000000000606cd9 in direct::finish (this=0x23fc830) at remote.cxx:108
#6  0x0000000000603222 in remote::run (remotes=std::vector of length 1, capacity 1 = {...}) at remote.cxx:1292
#7  0x00000000004126cc in pass_5 (s=..., targets=std::vector of length 1, capacity 1 = {...}) at main.cxx:1209
#8  0x000000000040fa54 in main (argc=<optimized out>, argv=<optimized out>) at main.cxx:1429
+ gstack 9110
Thread 3 (Thread 0x7f2ad693d700 (LWP 9112)):
#0  0x00007f2ad6d0e101 in do_sigwait (sig=0x7f2ad693cef4, set=<optimized out>) at ../sysdeps/unix/sysv/linux/sigwait.c:61
#1  __sigwait (set=set@entry=0x628170, sig=sig@entry=0x7f2ad693cef4) at ../sysdeps/unix/sysv/linux/sigwait.c:99
#2  0x0000000000402db5 in signal_thread (arg=0x628170) at mainloop.c:40
#3  0x00007f2ad6d06dc5 in start_thread (arg=0x7f2ad693d700) at pthread_create.c:308
#4  0x00007f2ad6a3573d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
Thread 2 (Thread 0x7f2ad613c700 (LWP 9113)):
#0  0x00007f2ad6d0d43d in write () at ../sysdeps/unix/syscall-template.S:81
#1  0x0000000000407618 in reader_thread (data=0x0) at relay.c:235
#2  0x00007f2ad6d06dc5 in start_thread (arg=0x7f2ad613c700) at pthread_create.c:308
#3  0x00007f2ad6a3573d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
Thread 1 (Thread 0x7f2ad712b740 (LWP 9110)):
#0  0x00007f2ad6a26c7d in write () at ../sysdeps/unix/syscall-template.S:81
#1  0x00007f2ad69b3f73 in _IO_new_file_write (f=0x7f2ad6cf91c0 <_IO_2_1_stderr_>, data=0x7ffe0aaa9210, n=8) at fileops.c:1301
#2  0x00007f2ad69b470f in new_do_write (to_do=8, data=0x7ffe0aaa9210 "WARNING:acing...\n", fp=0x7f2ad6cf91c0 <_IO_2_1_stderr_>) at fileops.c:537
#3  _IO_new_file_xsputn (f=0x7f2ad6cf91c0 <_IO_2_1_stderr_>, data=<optimized out>, n=8) at fileops.c:1383
#4  0x00007f2ad698a47d in buffered_vfprintf (s=s@entry=0x7f2ad6cf91c0 <_IO_2_1_stderr_>, format=format@entry=0x40a648 "WARNING:", args=args@entry=0x7ffe0aaab850) at vfprintf.c:2340
#5  0x00007f2ad698531e in _IO_vfprintf_internal (s=s@entry=0x7f2ad6cf91c0 <_IO_2_1_stderr_>, format=0x40a648 "WARNING:", ap=0x7ffe0aaab850) at vfprintf.c:1289
#6  0x00007f2ad6a4aefd in ___vfprintf_chk (fp=0x7f2ad6cf91c0 <_IO_2_1_stderr_>, flag=flag@entry=1, format=<optimized out>, ap=ap@entry=0x7ffe0aaab850) at vfprintf_chk.c:34
#7  0x0000000000405172 in vfprintf (__ap=0x7ffe0aaab850, __fmt=<optimized out>, __stream=<optimized out>) at /usr/include/bits/stdio2.h:127
#8  eprintf (fmt=<optimized out>) at common.c:693
#9  0x0000000000404c55 in stp_main_loop () at mainloop.c:810
#10 0x000000000040279d in main (argc=<optimized out>, argv=0x7ffe0aaadc68) at stapio.c:73

At this point, sending SIGTERM to neither of the process works, and the Kernel module remains loaded.

Best,
Datong
Comment 1 Frank Ch. Eigler 2018-11-16 14:02:20 UTC
Any idea how to reproduce this?
The two last thread backtraces seem to be sitting inside a write(2), which suggests a pipe-full or such condition on the stapio stdout files.
Comment 2 agentzh 2018-11-24 00:15:34 UTC
Already committed a fix to master. Feel free to close this PR.
Comment 3 Frank Ch. Eigler 2018-11-26 16:25:44 UTC
commit ab368ac2a