#! stap global foo; probe kernel.function("*resume*") { foo ++ } it does produce an oops, i have a screen shot, thanks to running it in vmware http://www.blastwave.org/~jamesd/systemtap/systemtap_crash.png
Please specify exact kernel version, architecture, systemtap version, so we can attempt to recreate this problem.
Subject: Re: following non-guru mode script,crashes the system when ran. On 15 Oct 2005 01:00:54 -0000, fche at redhat dot com <sourceware-bugzilla@sourceware.org> wrote: > > ------- Additional Comments From fche at redhat dot com 2005-10-15 01:00 ------- > Please specify exact kernel version, architecture, systemtap version, > so we can attempt to recreate this problem. > Linux localhost.localdomain 2.6.12-1.1447_FC4 #1 Fri Aug 26 20:29:51 EDT 2005 i686 athlon i386 GNU/Linux fedora FC4, cvs up performed Friday 14, October. [jamesd@localhost ~]$ stap --version stap: invalid option -- - SystemTap translator/driver (version 0.4.1 built 2005-09-22) Copyright (C) 2005 Red Hat, Inc. This is free software; see the source for copying conditions. opteron running i386 kernel inside vmware, i have tested other scripts in the past and they worked fine. make check completes all tasks as expected. ======================================================== All 103 tests behaved as expected (38 expected failures) ======================================================== make[1]: Leaving directory `/root/src' [root@localhost src]# James > -- > What |Removed |Added > ---------------------------------------------------------------------------- > Summary|following non-guru mode |following non-guru mode > |script,crashes the system |script,crashes the system > |when ran. |when ran. > > > http://sourceware.org/bugzilla/show_bug.cgi?id=1482 > > ------- You are receiving this mail because: ------- > You reported the bug, or are watching the reporter. > >
Multiple problems here: probe kernel.function("*resume*") matches resume_execution() in kprobe.c. That probably shouldn't happen. But it does and the script attempts to set a kprobe on kprobe internals which correctly fails. Unfortunately systemtap_module_init doesn't handle probe registration failures correctly. (must have been a recent change) int systemtap_module_init () { int rc = 0; atomic_set (&session_state, STAP_SESSION_STARTING); global_foo = 0; rwlock_init (& global_foo_lock); /* register probe #0, 59 location(s) */ { int i; for (i = 0; i < 59; i++) { dwarf_kprobe_0[i].pre_handler = &dwarf_kprobe_0_enter; rc = register_kprobe (&(dwarf_kprobe_0[i])); if (unlikely (rc)) break; } if (unlikely (rc)) while (--i > 0) unregister_kprobe (&(dwarf_kprobe_0[i])); } if (unlikely (rc)) { atomic_set (&session_state, STAP_SESSION_ERROR); _stp_error ("probe 0 registration failed, rc=%d, %s\n", rc, "identifier 'probe' at crash.stp:5:1"); } if (atomic_read (&session_state) == STAP_SESSION_STARTING) atomic_set (&session_state, STAP_SESSION_RUNNING); goto out; out: return rc; } The problem is the stp_error() line. stp_error() should only be used after module_init() has succeeded. stp_warn() should be used instead. However, it is useless where it is because all information about which of the 59 probes failes to be registered has been lost by this point.
The problem wasn't really _stp_exit(). I assumed it was because I realized there could be a race condition in stp_exit and the module init code. Also, the problem seemed to disappear on one system when I removed the stp_exit() call. However the kernel crash was only delayed. I checked in a fix to prevent any possibility of stp_exit crashing the system by exiting while probe registrations were happening, even though it appeared to not be possible using systemtap. It did not fix this PR. The real reason for the crash was way too obvious. if (unlikely (rc)) while (--i > 0) unregister_kprobe (&(dwarf_kprobe_0[i])); Notice how dwarf_kprobe_0[0] is never unregistered??? I checked in a fix for that to tapsets.cxx and the problem went away. However I am not closing this because the _stp_error() really should be _stp_warn() and it really should give some useful message.
That new failure-handling logic in fact included a correction for another similar off-by-one bug. It's not easy enough to trigger those corner cases by hand. Anyway, I checked in an improvement to the error message printing for this case. While the original bug is now solved, I'm reassigning this PR back to hunt though for the following reason: stp_error and other runtime functions should be valid to call from within this (module_init) context, since "begin" probes run there. In fact, in this simple case: probe begin { error ("hello") } it's working fine. So please investigate whether in fact there is a problem, and if so, how it can be triggered by a test suite & fixed.
As I mentioned in comment #4, I already checked in a fix to stp_error() so it is safe. I could not trigger a crash in systemtap but was able to cause one using only the runtime. That is now fixed. I still think stp_warn() would be preferred in the module init code because there is no reason to initiate a shutdown when returning an error code from the function does the same thing more directly.