Bug 1482 - buggy (un)registration logic causes kernel crash
Summary: buggy (un)registration logic causes kernel crash
Status: RESOLVED FIXED
Alias: None
Product: systemtap
Classification: Unclassified
Component: translator (show other bugs)
Version: unspecified
: P2 normal
Target Milestone: ---
Assignee: Martin Hunt
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2005-10-15 00:00 UTC by James Dickens
Modified: 2005-10-18 16:17 UTC (History)
0 users

See Also:
Host:
Target:
Build:
Last reconfirmed:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description James Dickens 2005-10-15 00:00:22 UTC
#! stap

 global foo;

 probe kernel.function("*resume*") {
     foo ++
 }

it does produce an oops, i have a screen shot, thanks to running it in vmware

http://www.blastwave.org/~jamesd/systemtap/systemtap_crash.png
Comment 1 Frank Ch. Eigler 2005-10-15 01:00:54 UTC
Please specify exact kernel version, architecture, systemtap version,
so we can attempt to recreate this problem.
Comment 2 James Dickens 2005-10-15 06:58:30 UTC
Subject: Re:  following non-guru mode script,crashes the system when ran.

On 15 Oct 2005 01:00:54 -0000, fche at redhat dot com
<sourceware-bugzilla@sourceware.org> wrote:
>
> ------- Additional Comments From fche at redhat dot com  2005-10-15 01:00 -------
> Please specify exact kernel version, architecture, systemtap version,
> so we can attempt to recreate this problem.
>

Linux localhost.localdomain 2.6.12-1.1447_FC4 #1 Fri Aug 26 20:29:51
EDT 2005 i686 athlon i386 GNU/Linux

fedora FC4,
cvs up performed Friday 14, October.

[jamesd@localhost ~]$ stap --version
stap: invalid option -- -
SystemTap translator/driver (version 0.4.1 built 2005-09-22)
Copyright (C) 2005 Red Hat, Inc.
This is free software; see the source for copying conditions.

opteron running  i386 kernel inside vmware, i have tested other
scripts in the past and they worked fine.

make check completes all tasks as expected.

========================================================
All 103 tests behaved as expected (38 expected failures)
========================================================
make[1]: Leaving directory `/root/src'
[root@localhost src]#

James




> --
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>             Summary|following non-guru mode     |following non-guru mode
>                    |script,crashes the system   |script,crashes the system
>                    |when ran.                   |when ran.
>
>
> http://sourceware.org/bugzilla/show_bug.cgi?id=1482
>
> ------- You are receiving this mail because: -------
> You reported the bug, or are watching the reporter.
>
>
Comment 3 Martin Hunt 2005-10-17 18:05:12 UTC
Multiple problems here:

probe kernel.function("*resume*") matches resume_execution() in kprobe.c.
That probably shouldn't happen. But it does and the script attempts to set a
kprobe on kprobe internals which correctly fails.

Unfortunately systemtap_module_init doesn't handle probe registration failures
correctly. (must have been a recent change)

int systemtap_module_init () {
  int rc = 0;
  atomic_set (&session_state, STAP_SESSION_STARTING);
  global_foo = 0;
  rwlock_init (& global_foo_lock);
  /* register probe #0, 59 location(s) */
  {
    int i;
    for (i = 0; i < 59; i++) {
      dwarf_kprobe_0[i].pre_handler = &dwarf_kprobe_0_enter;
      rc = register_kprobe (&(dwarf_kprobe_0[i]));
      if (unlikely (rc)) break;
    }
    if (unlikely (rc)) while (--i > 0)
      unregister_kprobe (&(dwarf_kprobe_0[i]));
  }
  if (unlikely (rc)) {
    atomic_set (&session_state, STAP_SESSION_ERROR);
    _stp_error ("probe 0 registration failed, rc=%d, %s\n", rc, "identifier
'probe' at crash.stp:5:1");
  }
  if (atomic_read (&session_state) == STAP_SESSION_STARTING)
    atomic_set (&session_state, STAP_SESSION_RUNNING);
  goto out;
  
  
out:
  return rc;
}

The problem is the stp_error() line. stp_error() should only be used after
module_init() has succeeded. stp_warn() should be used instead. However, it is
useless where it is because all information about which of the 
59 probes failes to be registered has been lost by this point.

Comment 4 Martin Hunt 2005-10-18 04:42:03 UTC
The problem wasn't really _stp_exit(). I assumed it was because I realized there
could be a race condition in stp_exit and the module init code. Also, the
problem seemed to disappear on one system when I removed the stp_exit() call.
However the kernel crash was only delayed.

I checked in a fix to prevent any possibility of stp_exit crashing the system by
exiting while probe registrations were happening, even though it appeared to not
be possible using systemtap. It did not fix this PR.

The real reason for the crash was way too obvious.
if (unlikely (rc)) while (--i > 0)
      unregister_kprobe (&(dwarf_kprobe_0[i]));

Notice how dwarf_kprobe_0[0] is never unregistered???
I checked in a fix for that to tapsets.cxx and the problem went away.

However I am not closing this because the _stp_error() really should be
_stp_warn() and it really should give some useful message.




Comment 5 Frank Ch. Eigler 2005-10-18 12:46:11 UTC
That new failure-handling logic in fact included a correction for another
similar off-by-one bug.  It's not easy enough to trigger those corner cases by hand.

Anyway, I checked in an improvement to the error message printing for this case.

While the original bug is now solved, I'm reassigning this PR back to hunt
though for the following reason: stp_error and other runtime functions should be
valid to call from within this (module_init) context, since "begin" probes run
there.  In fact, in this simple case:
  probe begin { error ("hello") }
it's working fine.  So please investigate whether in fact there is a problem,
and if so, how it can be triggered by a test suite & fixed.
Comment 6 Martin Hunt 2005-10-18 16:17:48 UTC
As I mentioned in comment #4, I already checked in a fix to stp_error() so it is
safe.  I could not trigger a crash in systemtap but was able to cause one using
only the runtime. That is now fixed.

I still think stp_warn() would be preferred in the module init code because
there is no reason to initiate a shutdown when returning an error code from the
function does the same thing more directly.