Bug 2725

Summary: function("*") probes sometimes crash & burn
Product: systemtap Reporter: Len DiMaggio <ldimaggi>
Component: kprobesAssignee: Unassigned <systemtap>
Status: SUSPENDED ---    
Severity: critical CC: fche
Priority: P2    
Version: unspecified   
Target Milestone: ---   
Host: Target:
Build: Last reconfirmed: 2006-11-20 21:44:38

Description Len DiMaggio 2006-06-01 18:45:24 UTC
Attempting to probe all kernel.function's hangs FC5

I'm not sure how realistic a test this is - but it is 100% reproducible on my
Fedora Core 5 system.

Invoking stap with this command hangs the system - power cycle required to clear:

stap -v -p5 -e 'probe kernel.function("*")  { printf ("%s  %s  %d  %s  ",
"test-> ", execname(), pid(),  probefunc() ) }'

I'm not yet sure if one probe is causing the hang - or the accumulation of 10K+
probes. 

Systemtap version (built from CVS head 20060601):
  SystemTap translator/driver (version 0.5.7 built 2006-06-01)
  (Using Red Hat elfutils 0.120 libraries.)

Kernel installed:
  kernel-devel-2.6.16-1.2122_FC5
  kernel-debuginfo-2.6.16-1.2122_FC5 
  kernel-2.6.16-1.2122_FC5
Comment 1 William Cohen 2006-06-01 19:39:30 UTC
Which architecture is this on? There are some differences between the i386 and
x86_64 in setting up the probes. The x86_64 need executable regions of memory
allocated. Depending on which architecture is being used there could be
different failure modes.

Is there any other information output? One trick is to just have a regular
console, no X so you can see any oops that come up. Are there any entries in
/var/log/messages related to this problem?

It could be that this experiment is bumping against some limit. You might take a
look at the output of the follow to see how many probes and the size of the
result .ko:

stap -v -p4 -k -e 'probe kernel.function("*")  { printf ("%s  %s  %d  %s  ",
"test-> ", execname(), pid(),  probefunc() ) }'

It is also possible that some function that should be blacklisted is getting
instrumented. You might try to expand
tests/testsuite/systemtap.stress/all_kernel_functions.exp to see if you can get
a narrow down the boundary between working and crashing.
Comment 2 Frank Ch. Eigler 2006-06-01 20:46:27 UTC
Broad wildcard probes are known to still trigger problems.  This is the reason
that the src/tapsets.cxx blacklist exists.  One needs to narrow down failing
wildcards as much as possible, perhaps to individual functions and/or files,
then add them to the blacklist.  When the job is done, function("*") will
automagically exclude blacklisted ones and still give one the thousands of
workable probes one might want (?).


*** This bug has been marked as a duplicate of 1836 ***
Comment 3 Josh Stone 2006-06-01 21:01:53 UTC
I think it has to do with printing within the * probe.  I tried the following on
i686 RHEL4:

  probe begin { log("begin") }
  probe kernel.function("*") { }
  probe end { log("end") }

This had no problem.  The system was a bit slow, but still usable and reasonably
responsive.  Then I tried this:

  probe begin { log("begin") }
  probe kernel.function("*") { print(".") }
  probe end { log("end") }

This hung the system.  The simple print(".") will still create a lot of data --
perhaps something in the buffer flush is being overwhelmed, or causing a recursion?
Comment 4 Frank Ch. Eigler 2006-06-01 22:21:42 UTC
Good point, we've had some progress since bug #1836.
Other than reentrancy, we may also trigger stack exhaustion (bug #2685).
Comment 5 Frank Ch. Eigler 2006-06-02 14:30:19 UTC
Additional data point: a few experiments probing kernel.function("*")
with FC5 kernels indicates improvements but also problems.  In all
my tests, the machine *appears to hang* even with probe handlers that
do nothing.  The machine is actually running, just extremely slowly.
Comment 6 Len DiMaggio 2006-06-02 15:33:29 UTC
Thanks for all the suggestions feedback - I need to narrow this down - some more
notes, answers to questions...

Architecture = i386, T41 thinkpad (1GB RAM), 
Linux dhcp83-56.boston.redhat.com 2.6.16-1.2122_FC5 #1 Sun May 21 15:01:01 EDT
2006 i686 i686 i386 GNU/Linux

It's a limit problem - 10K+ probes are being set:
------------
static char const * dwarf_kprobe_probe_613_location_names[10481] = {
  "kernel.function(\"run_init_process@init/main.c:666\")",
  "kernel.function(\"init@init/main.c:690\")",
  "kernel.function(\"rest_init@init/main.c:389\")",
------------

stap_2615.ko is large too:
------------
[root@dhcp83-56 stap0bMFBZ]# ll
total 6552
-rw-r--r-- 1 root root     127 Jun  2 11:05 Makefile
-rw-r--r-- 1 root root 1066809 Jun  2 11:05 stap_2615.c
-rw-r--r-- 1 root root 2188299 Jun  2 11:05 stap_2615.ko
-rw-r--r-- 1 root root    1929 Jun  2 11:05 stap_2615.mod.c
-rw-r--r-- 1 root root   34008 Jun  2 11:05 stap_2615.mod.o
-rw-r--r-- 1 root root 2155316 Jun  2 11:05 stap_2615.o
-rw-r--r-- 1 root root  692476 Jun  2 11:05 stap-symbols.h
-rw-r--r-- 1 root root  492534 Jun  2 11:05 symbols.sorted
------------

This script reliably hangs - or even reboots the system:
------------
#!/usr/bin/env stap 
probe begin { 
  log("begin") 
}
probe end { 
  log("end") 
}
probe kernel.function("*") {
}
------------

Nothing interesting is being written to /var/messages
Comment 7 Martin Hunt 2006-06-02 18:33:32 UTC
Subject: Re:  Attempting to probe all kernel functions
	hangs FC5

I ran your script on 2.6.16-1.2122_FC5smp i686.  It is running now on
this system as I am sending this. Besides a slight slowdown, nothing bad
is happening. Module size is 2.3M, or about half the size of the nvidia
module on my system. OK, this is interesting. I hit ^C to terminate the
script and  I lost all console input in every window for several
minutes. Firefox was running fine. I could read emails, but typing in a
window did nothing. Then the script exited and everything is fast and
responsive again.

I wonder if this is related to the bug where the console starts
repeating characters while scripts exit? Anyone else see that sometimes?




Comment 8 William Cohen 2006-11-17 15:40:33 UTC
Logged into each machine remotely and ran the 20061117 snapshot with the
following script:

  ./stap_testing_200611170930/install/bin/stap -v -p5 -e 'probe
kernel.function("*")  { printf ("%s  %s  %d  %s  ","test-> ", execname(), pid(),
 probefunc() ) }'



fc6 i686 UP
kernel: 2.6.18-1.2849.fc6 #1 SMP Fri Nov 10 12:36:14 EST 2006 i686 i686 i386
GNU/Linux

Got the following output to the console when running the mentioned script:

BUG: unable to handle kernel paging request atf printing eip:
8c60c02f
*pde = 00000000
Oops: 0000 [#1]
BUG: unable to handle kernel paging request at virtual address 8c60c02f
 printing eip:
c0629c0b
*pde = 00000000
Oops: 0000 [#2]
BUG: unable to handle kernel NULL pointer dereference at virtual address 0000000
printing eip:
c0629c0b
*pde = 00000000
Recursive die() failure, output suppressed
 <0>BUG: spinlock lockup on CPU#0, staprun/2683, c066fbac (Not tainted)

Got same output even when printf taken out of script.



rawhide i686 SMP
 2.6.18-1.2849.fc6PAE #1 SMP Fri Nov 10 13:27:10 EST 2006 i686 i686 i386 GNU/Linux

The machine hangs with the script with the printf in the body. No
output or oops or anything However, the following script didn't kill
the machine:

./stap_testing_200611170930/install/bin/stap -v -p5 -e 'probe
kernel.function("*")  { }'

It took significant time for the probes to be removed (about a minute).



rawhide x86-64

Linux dhcp59-198.rdu.redhat.com 2.6.18-1.2849.fc6 #1 SMP Fri Nov 10 12:34:46 EST
2006 x86_64 x86_64 x86_64 GNU/Linux

The machine oops with the printf script. Simplified the script. 
The following script killed the machine by setting off the watchdog timer:

./stap_testing_200611170930/install/bin/stap -v -p5 -e 'probe
kernel.function("*")  { }'
Following transcribed from the screen:

 [<ffffffff802636c0>] do_nmi+0x45/0x63
 [<ffffffff80262bb7>] nmi+0x7f/0x88
 [<ffffffff80262bb8>] nmi+0x80/0x88
 <<EOE>>   [<ffffffff80262bb8>] nmi+0x80/0x88

Kernel panic- not syncing: Aiee, killing interrupt handler!
 Bug: warning at kernel/panic.c:137/panic() (Not tainted)

Call Trace:
 [<ffffffff802691db>] show_trace+0x34/0x47
 [<ffffffff802691e2>] dump_stack+0x12/0x17
 [<ffffffff8028b933>] panic_0x1e3/0x1f4
 [<ffffffff80214ee0>] do_exit+0x8c/0x8c2
 [<ffffffff80262fcb>] sync_regs+0x0/0x67
 [<ffffffff80678290>]
DWARF2 unwinder stuck at 0xffffffff80678290
Leftover inexact backtrace:
 <NMI>  [<ffffffff802635a>] nmi_watchdog_tick+0x105/0x1a6
 [<ffffffff802632f7>] default_do_nmi+0x86/0x1de
 [<ffffffff802636c0>] do_nmi+0x45/0x63
 [<ffffffff80262bb7>] nmi+0x7f/0x88
 [<ffffffff80262bb8>] nmi+0x80/0x88
 <<EOE>>   [<ffffffff80262bb8>] nmi+0x80/0x88


Comment 9 Frank Ch. Eigler 2006-11-20 21:44:38 UTC
This includes a variety of kernels, including rhel5/fc5/fc6.

http://sources.redhat.com/ml/systemtap/2006-q4/msg00474.html
Comment 10 Frank Ch. Eigler 2006-11-20 21:47:17 UTC
*** Bug 1836 has been marked as a duplicate of this bug. ***
Comment 11 Frank Ch. Eigler 2008-02-19 19:31:22 UTC
Some recent findings:
- with blacklist-based blacklists, it's easy to exclude many suspect areas
- function(...).call probes are generally stabler than function(...) ones,
  which include (...).inline instances.
Comment 12 Frank Ch. Eigler 2009-03-20 19:01:16 UTC
See scripts/kprobes_test/* for some tools generating trial kprobe modules.
Comment 13 Frank Ch. Eigler 2012-08-02 16:30:10 UTC
Kernel bugs are believed to be responsible for the remaining occurrences of such crashes.  See for example https://bugzilla.redhat.com/show_bug.cgi?id=655904
Comment 14 Frank Ch. Eigler 2013-12-04 18:23:06 UTC
Some promising progress: https://lkml.org/lkml/2013/12/3/784 and related threads.
Comment 15 Jackie Rosen 2014-02-16 17:51:22 UTC Comment hidden (spam)