Bug 6707 - oops crashes with 2.6.25 - onoffprobe
Status: RESOLVED WORKSFORME
Alias: None
Product: systemtap
Classification: Unclassified
Component: kprobes
Version: unspecified
Importance: P2 normal
Target Milestone: ---
Assignee: Unassigned
Reported: 2008-06-30 10:38 UTC by Mark Wielaard
Modified: 2009-03-20 19:17 UTC

Attachments
Various collected oopses on the affected kernel (4.51 KB, text/plain)
2008-06-30 10:42 UTC, Mark Wielaard
config-2.6.25.6-55.fc9.i686 (21.52 KB, text/plain)
2008-06-30 20:18 UTC, Mark Wielaard

Description Mark Wielaard 2008-06-30 10:38:42 UTC
On x86 kernel 2.6.25 I regularly get hardware lockups caused by oopses when
running make installcheck. This is in particular with fedora 9 -
2.6.25.6-55.fc9.i686. The fedora 8 - 2.6.24 kernel was fine.

After setting up netconsole as described on
http://sourceware.org/systemtap/wiki/DeveloperSetupTips I collected various
oopses, which I will attach.

I cannot replicate this in a qemu-kvm environment. There I see some regular
make installcheck failures, but no kernel oopses. The oopses only occur on the
real hardware.

The test that most likely (but not always) triggers these oopses is
testsuite/systemtap.base/onoffprobe.stp, which can be explicitly run with make
installcheck RUNTESTFLAGS=onoffprobe.exp
Comment 1 Mark Wielaard 2008-06-30 10:42:12 UTC
Created attachment 2805
Various collected oopses on the affected kernel
Comment 2 Mark Wielaard 2008-06-30 12:49:34 UTC
Some more data points.

There were also oopses/lockups with this branch of systemtap: version
0.7/0.133, git branch pr6429-comp-unwindsyms, commit 3c02e16c.

With systemtap 0.6.2, however (systemtap version: version 0.6.2/0.133 built
2008-06-30), onoffprobe seems to work flawlessly every time.
Comment 3 Mark Wielaard 2008-06-30 16:52:50 UTC
I narrowed it down to the following script:

global switch=-1

#begin probe
probe begin if (switch==-1) {
        log("begin1 probed");
}

probe begin if (switch==0) {
        log("begin2 probed");
}

#dwarf probe (return)
probe kernel.function("sys_write").return if (switch == 1) {
        log("function return probed")
        switch = 0
}

#dwarf probe (entry)
probe kernel.function("sys_write") if (switch == 2) {
        log("function entry probed")
        switch = 0
}

It looks like if I remove any of the probes, the conditions, or manipulate
switch in any other way, things don't hang.
So, what I expect is to see a bit more log output. But all I get when it hangs
is (run with /usr/local/systemtap/bin/stap -k -vv -DDEBUG_SYMBOLS=2
onoffprobe.stp -m onoffprobe):

begin1 probed
_stp_module_relocate:36: kernel, _stext, 805fd
_stp_module_relocate:36: kernel, _stext, 805fd

None of the other probes seem to log anything in that case.

This needs some more investigation.
Comment 4 Mark Wielaard 2008-06-30 19:49:10 UTC
(In reply to comment #3)
> I narrowed it down to the following script:
> 
> global switch=-1
> 
> #begin probe
> probe begin if (switch==-1) {
>         log("begin1 probed");
> }
> 
> probe begin if (switch==0) {
>         log("begin2 probed");
> }
> 
> #dwarf probe (return)
> probe kernel.function("sys_write").return if (switch == 1) {
>         log("function return probed")
>         switch = 0
> }
> 
> #dwarf probe (entry)
> probe kernel.function("sys_write") if (switch == 2) {
>         log("function entry probed")
>         switch = 0
> }
> 
> It looks like if I remove any of the probes, the conditions, or manipulate
> switch in any other way, things don't hang.

Also all the switch assignment statements and the log statements are necessary.
Remove any of them and things seem fine.
Comment 5 Masami Hiramatsu 2008-06-30 19:58:48 UTC
I tested it on my i686 PC with 2.6.25, but it didn't happen.
How frequently would it happen? and what kernel configuration would you set?
Comment 6 Mark Wielaard 2008-06-30 20:14:07 UTC
And the exact same script with systemtap-0.6.2 on the same setup/machine seems fine.
Comment 7 Mark Wielaard 2008-06-30 20:16:57 UTC
(In reply to comment #5)
> I tested it on my i686 PC with 2.6.25, but it didn't happen.
> How frequently would it happen?

It happens almost always.

> and what kernel configuration would you set?

This is with the stock/latest updated fedora 9 kernel. 2.6.25.6-55.fc9.i686
I'll attach the config file.
Comment 8 Mark Wielaard 2008-06-30 20:18:30 UTC
Created attachment 2809
config-2.6.25.6-55.fc9.i686
Comment 9 Ananth Mavinakayanahalli 2008-07-01 08:51:54 UTC
My system is running the exact same kernel, same config too, but I don't see the
crash. It just prints 'begin1 probed'. I can terminate the script and the system
is usable. No indication of any oops in dmesg either.

I've even been able to toggle switch in sys_write (to 1) and sys_write return
(to -1) to continue probing and the probes hit fine; the log does get printed
without problems.
Comment 10 Mark Wielaard 2008-07-01 08:58:29 UTC
(In reply to comment #9)
> My system is running the exact same kernel, same config too, but I don't see the
> crash. It just prints 'begin1 probed'. I can terminate the script and the system
> is usable. No indication of any oops in dmesg either.

Yeah :{ I am beginning to think this is just this one system. As I said, if I
set up the same system in a qemu-kvm environment I don't get any oops.

> I've even been able to toggle switch in sys_write (to 1) and sys_write return
> (to -1) to continue probing and the probes hit fine; the log does get printed
> without problems.

If I make a similar change, the script just works fine... Indeed, that
doesn't make sense, because then the script does even more stuff than the
original script...
Comment 11 Masami Hiramatsu 2008-07-03 21:34:05 UTC
(In reply to comment #9)
> My system is running the exact same kernel, same config too, but I don't see the
> crash. It just prints 'begin1 probed'. I can terminate the script and the system
> is usable. No indication of any oops in dmesg either.

I also could not reproduce this bug on my i686 (Pentium D SMP).
Based on the symptoms, I guess the module set up a timer handler the wrong way,
and the timer accessed a wrong address. But I'm not sure how that could happen.

So, would you checked your running kernel and kernel binary and kernel source are
same revision? Sometimes, executing make command after install kernel lose its
consistency.

Thank you,
Comment 12 Mark Wielaard 2008-07-04 10:29:23 UTC
(In reply to comment #11)
> So, would you checked your running kernel and kernel binary and kernel source are
> same revision? Sometimes, executing make command after install kernel lose its
> consistency.

It is the standard fc9 kernel and kernel-debuginfo packages. I have since
upgraded to the latest available:

# uname -a; rpm -q kernel kernel-debuginfo
Linux hermans.wildebeest.org 2.6.25.9-76.fc9.i686 #1 SMP Fri Jun 27 16:14:35 EDT
2008 i686 i686 i386 GNU/Linux
kernel-2.6.25.9-76.fc9.i686
kernel-debuginfo-2.6.25.9-76.fc9.i686
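
The consistency check asked for in comment #11 can be scripted. The following is only a minimal sketch: the helper name debuginfo_matches is hypothetical, and it assumes the Fedora naming shown above, where the debuginfo package is called kernel-debuginfo-<output of uname -r>.

```shell
# Hypothetical helper (not part of systemtap): succeeds when one of the
# installed kernel-debuginfo packages matches the given kernel release.
# Assumes Fedora naming: kernel-debuginfo-<uname -r>.
debuginfo_matches() {
    running="$1"    # kernel release, e.g. the output of `uname -r`
    packages="$2"   # package list, e.g. the output of `rpm -q kernel-debuginfo`
    # -x: match whole lines only, -F: fixed string, -q: no output
    printf '%s\n' "$packages" | grep -qxF "kernel-debuginfo-$running"
}

if debuginfo_matches "$(uname -r)" "$(rpm -q kernel-debuginfo 2>/dev/null)"; then
    echo "running kernel and kernel-debuginfo match"
else
    echo "running kernel and kernel-debuginfo differ (or debuginfo is not installed)"
fi
```

Matching whole lines means the check still works when several kernel-debuginfo versions are installed side by side.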

With this kernel the script from comment #3 does indeed work without freezing
the machine. Unfortunately the original script in
testsuite/systemtap.base/onoffprobe.stp still does.
Comment 13 Mark Wielaard 2009-03-20 19:17:54 UTC
I haven't seen this crash for a long time now on recent Fedora 10 kernels (e.g.
2.6.27.19-170.2.35.fc10.i686) with systemtap 0.9 or higher. onoffprobe.exp
always passes now.