Attempting to probe all kernel.function's hangs FC5 I'm not sure how realistic a test this is - but it is 100% reproducible on my Fedora Core 5 system. Invoking stap with this command hangs the system - power cycle required to clear: stap -v -p5 -e 'probe kernel.function("*") { printf ("%s %s %d %s ", "test-> ", execname(), pid(), probefunc() ) }' I'm not yet sure if one probe is causing the hang - or the accumulation of 10K+ probes. Systemtap version (built from CVS head 20060601): SystemTap translator/driver (version 0.5.7 built 2006-06-01) (Using Red Hat elfutils 0.120 libraries.) Kernel installed: kernel-devel-2.6.16-1.2122_FC5 kernel-debuginfo-2.6.16-1.2122_FC5 kernel-2.6.16-1.2122_FC5
Which architecture is this on? There are some differences between the i386 and x86_64 in setting up the probes. The x86_64 need executable regions of memory allocated. Depending on which architecture is being used there could be different failure modes. Is there any other information output? One trick is to just have a regular console, no X so you can see any oops that come up. Are there any entries in /var/log/messages related to this problem? It could be that this experiment is bumping against some limit. You might take a look at the output of the follow to see how many probes and the size of the result .ko: stap -v -p4 -k -e 'probe kernel.function("*") { printf ("%s %s %d %s ", "test-> ", execname(), pid(), probefunc() ) }' It is also possible that some function that should be blacklisted is getting instrumented. You might try to expand tests/testsuite/systemtap.stress/all_kernel_functions.exp to see if you can get a narrow down the boundary between working and crashing.
Broad wildcard probes are known to still trigger problems. This is the reason that the src/tapsets.cxx blacklist exists. One needs to narrow down failing wildcards as much as possible, perhaps to individual functions and/or files, then add them to the blacklist. When the job is done, function("*") will automagically exclude blacklisted ones and still give one the thousands of workable probes one might want (?). *** This bug has been marked as a duplicate of 1836 ***
I think it has to do with printing within the * probe. I tried the following on i686 RHEL4: probe begin { log("begin") } probe kernel.function("*") { } probe end { log("end") } This had no problem. The system was a bit slow, but still usable and reasonably responsive. Then I tried this: probe begin { log("begin") } probe kernel.function("*") { print(".") } probe end { log("end") } This hung the system. The simple print(".") will still create a lot of data -- perhaps something in the buffer flush is being overwhelmed, or causing a recursion?
Good point, we've had some progress since bug #1836. Other than reentrancy, we may also trigger stack exhaustion (bug #2685).
Additional data point: a few experiments probing kernel.function("*") with FC5 kernels indicates improvements but also problems. In all my tests, the machine *appears to hang* even with probe handlers that do nothing. The machine is actually running, just extremely slowly.
Thanks for all the suggestions feedback - I need to narrow this down - some more notes, answers to questions... Architecture = i386, T41 thinkpad (1GB RAM), Linux dhcp83-56.boston.redhat.com 2.6.16-1.2122_FC5 #1 Sun May 21 15:01:01 EDT 2006 i686 i686 i386 GNU/Linux It's a limit problem - 10K+ probes are being set: ------------ static char const * dwarf_kprobe_probe_613_location_names[10481] = { "kernel.function(\"run_init_process@init/main.c:666\")", "kernel.function(\"init@init/main.c:690\")", "kernel.function(\"rest_init@init/main.c:389\")", ------------ stap_2615.ko is large too: ------------ [root@dhcp83-56 stap0bMFBZ]# ll total 6552 -rw-r--r-- 1 root root 127 Jun 2 11:05 Makefile -rw-r--r-- 1 root root 1066809 Jun 2 11:05 stap_2615.c -rw-r--r-- 1 root root 2188299 Jun 2 11:05 stap_2615.ko -rw-r--r-- 1 root root 1929 Jun 2 11:05 stap_2615.mod.c -rw-r--r-- 1 root root 34008 Jun 2 11:05 stap_2615.mod.o -rw-r--r-- 1 root root 2155316 Jun 2 11:05 stap_2615.o -rw-r--r-- 1 root root 692476 Jun 2 11:05 stap-symbols.h -rw-r--r-- 1 root root 492534 Jun 2 11:05 symbols.sorted ------------ This script reliably hangs - or even reboots the system: ------------ #!/usr/bin/env stap probe begin { log("begin") } probe end { log("end") } probe kernel.function("*") { } ------------ Nothing interesting is being written to /var/messages
Subject: Re: Attempting to probe all kernel functions hangs FC5 I ran your script on 2.6.16-1.2122_FC5smp i686. It is running now on this system as I am sending this. Besides a slight slowdown, nothing bad is happening. Module size is 2.3M, or about half the size of the nvidia module on my system. OK, this is interesting. I hit ^C to terminate the script and I lost all console input in every window for several minutes. Firefox was running fine. I could read emails, but typing in a window did nothing. Then the script exited and everything is fast and responsive again. I wonder if this is related to the bug where the console starts repeating characters while scripts exit? Anyone else see that sometimes?
Logged into each machine remotely and ran the 20061117 snapshot with the following script: ./stap_testing_200611170930/install/bin/stap -v -p5 -e 'probe kernel.function("*") { printf ("%s %s %d %s ","test-> ", execname(), pid(), probefunc() ) }' fc6 i686 UP kernel: 2.6.18-1.2849.fc6 #1 SMP Fri Nov 10 12:36:14 EST 2006 i686 i686 i386 GNU/Linux Got the following output to the console when running the mentioned script: BUG: unable to handle kernel paging request atf printing eip: 8c60c02f *pde = 00000000 Oops: 0000 [#1] BUG: unable to handle kernel paging request at virtual address 8c60c02f printing eip: c0629c0b *pde = 00000000 Oops: 0000 [#2] BUG: unable to handle kernel NULL pointer dereference at virtual address 0000000 printing eip: c0629c0b *pde = 00000000 Recursive die() failure, output suppressed <0>BUG: spinlock lockup on CPU#0, staprun/2683, c066fbac (Not tainted) Got same output even when printf taken out of script. rawhide i686 SMP 2.6.18-1.2849.fc6PAE #1 SMP Fri Nov 10 13:27:10 EST 2006 i686 i686 i386 GNU/Linux The machine hangs with the script with the printf in the body. No output or oops or anything However, the following script didn't kill the machine: ./stap_testing_200611170930/install/bin/stap -v -p5 -e 'probe kernel.function("*") { }' It took significant time for the probes to be removed (about a minute). rawhide x86-64 Linux dhcp59-198.rdu.redhat.com 2.6.18-1.2849.fc6 #1 SMP Fri Nov 10 12:34:46 EST 2006 x86_64 x86_64 x86_64 GNU/Linux The machine oops with the printf script. Simplified the script. The following script killed the machine by setting off the watchdog timer: ./stap_testing_200611170930/install/bin/stap -v -p5 -e 'probe kernel.function("*") { }' Following transcribed from the screen: [<ffffffff802636c0>] do_nmi+0x45/0x63 [<ffffffff80262bb7>] nmi+0x7f/0x88 [<ffffffff80262bb8>] nmi+0x80/0x88 <<EOE>> [<ffffffff80262bb8>] nmi+0x80/0x88 Kernel panic- not syncing: Aiee, killing interrupt handler! Bug: warning at kernel/panic.c:137/panic() (Not tainted) Call Trace: [<ffffffff802691db>] show_trace+0x34/0x47 [<ffffffff802691e2>] dump_stack+0x12/0x17 [<ffffffff8028b933>] panic_0x1e3/0x1f4 [<ffffffff80214ee0>] do_exit+0x8c/0x8c2 [<ffffffff80262fcb>] sync_regs+0x0/0x67 [<ffffffff80678290>] DWARF2 unwinder stuck at 0xffffffff80678290 Leftover inexact backtrace: <NMI> [<ffffffff802635a>] nmi_watchdog_tick+0x105/0x1a6 [<ffffffff802632f7>] default_do_nmi+0x86/0x1de [<ffffffff802636c0>] do_nmi+0x45/0x63 [<ffffffff80262bb7>] nmi+0x7f/0x88 [<ffffffff80262bb8>] nmi+0x80/0x88 <<EOE>> [<ffffffff80262bb8>] nmi+0x80/0x88
This includes a variety of kernels, including rhel5/fc5/fc6. http://sources.redhat.com/ml/systemtap/2006-q4/msg00474.html
*** Bug 1836 has been marked as a duplicate of this bug. ***
Some recent findings: - with blacklist-based blacklists, it's easy to exclude many suspect areas - function(...).call probes are generally stabler than function(...) ones, which include (...).inline instances.
See scripts/kprobes_test/* for some tools generating trial kprobe modules.
Kernel bugs are believed to be responsible for the remaining occurrences of such crashes. See for example https://bugzilla.redhat.com/show_bug.cgi?id=655904
Some promising progress: https://lkml.org/lkml/2013/12/3/784 and related threads.
*** Bug 260998 has been marked as a duplicate of this bug. *** Seen from the domain http://volichat.com Page where seen: http://volichat.com/adult-chat-rooms Marked for reference. Resolved as fixed @bugzilla.