On x86 kernel 2.6.25 I regularly get hardware lockups caused by oopses when running make installcheck. This is in particular with Fedora 9 (2.6.25.6-55.fc9.i686). The Fedora 8 2.6.24 kernel was fine. After setting up netconsole as described on http://sourceware.org/systemtap/wiki/DeveloperSetupTips I collected various oopses, which I will attach. I cannot replicate this in a qemu-kvm environment: there I see some regular make installcheck failures, but no kernel oopses. It only occurs on the real hardware. The test that most often (but not always) triggers these oopses is testsuite/systemtap.base/onoffprobe.stp, which can be run explicitly with:

make installcheck RUNTESTFLAGS=onoffprobe.exp
Created attachment 2805: Various collected oopses on the affected kernel
Some more data points. The oopses/lockups also occur with this systemtap version: version 0.7/0.133, git branch pr6429-comp-unwindsyms, commit 3c02e16c. With systemtap 0.6.2, however (systemtap version: version 0.6.2/0.133 built 2008-06-30), onoffprobe seems to work flawlessly every time.
I narrowed it down to the following script:

global switch=-1

#begin probe
probe begin if (switch==-1) {
  log("begin1 probed");
}

probe begin if (switch==0) {
  log("begin2 probed");
}

#dwarf probe (return)
probe kernel.function("sys_write").return if (switch == 1) {
  log("function return probed")
  switch = 0
}

#dwarf probe (entry)
probe kernel.function("sys_write") if (switch == 2) {
  log("function entry probed")
  switch = 0
}

It looks like if I remove any of the probes, the conditions, or manipulate switch in any other way, things don't hang. What I expect is to see a bit more log output. But all I get when it hangs (run with /usr/local/systemtap/bin/stap -k -vv -DDEBUG_SYMBOLS=2 onoffprobe.stp -m onoffprobe) is:

begin1 probed
_stp_module_relocate:36: kernel, _stext, 805fd
_stp_module_relocate:36: kernel, _stext, 805fd

None of the other probes seem to log anything in that case. This needs some more investigation.
(In reply to comment #3)
> I narrowed it down to the following script:
>
> global switch=-1
>
> #begin probe
> probe begin if (switch==-1) {
>   log("begin1 probed");
> }
>
> probe begin if (switch==0) {
>   log("begin2 probed");
> }
>
> #dwarf probe (return)
> probe kernel.function("sys_write").return if (switch == 1) {
>   log("function return probed")
>   switch = 0
> }
>
> #dwarf probe (entry)
> probe kernel.function("sys_write") if (switch == 2) {
>   log("function entry probed")
>   switch = 0
> }
>
> It looks like if I remove any of the probes, the conditions, or manipulate
> switch in any other way, things don't hang.

Also all the switch assignment statements and the log statements are necessary. Remove any of them and things seem fine.
I tested it on my i686 PC with 2.6.25, but it didn't happen. How frequently does it happen, and what kernel configuration are you using?
And the exact same script with systemtap-0.6.2 on the same setup/machine seems fine.
(In reply to comment #5)
> I tested it on my i686 PC with 2.6.25, but it didn't happen.
> How frequently would it happen?

It happens almost always.

> and what kernel configuration would you set?

This is with the stock/latest updated Fedora 9 kernel, 2.6.25.6-55.fc9.i686. I'll attach the config file.
Created attachment 2809: config-2.6.25.6-55.fc9.i686
My system is running the exact same kernel, same config too, but I don't see the crash. It just prints 'begin1 probed'. I can terminate the script and the system is usable. No indication of any oops in dmesg either. I've even been able to toggle switch in sys_write (to 1) and sys_write return (to -1) to continue probing and the probes hit fine; the log does get printed without problems.
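For reference, here is a toy model of how the on/off conditions in the script from comment #3 gate the handlers. It is plain Python, not SystemTap: Probe, fire() and run() are names made up for illustration, and it assumes a condition is simply checked each time a probe point is hit, which simplifies how stap really arms and disarms probes. Under that assumption, starting with switch=-1 only the first begin probe can ever log, which matches the 'begin1 probed' output seen on the working system.

```python
# Toy model of SystemTap on/off ("probe ... if (cond)") probes.
# Assumption: a probe's condition is checked whenever its probe point is hit.
from dataclasses import dataclass
from typing import Callable, Dict, List

State = Dict[str, int]


@dataclass
class Probe:
    point: str                                   # e.g. "begin", "sys_write"
    cond: Callable[[State], bool]                # the "if (...)" part
    handler: Callable[[State, List[str]], None]  # the probe body


def begin1(state: State, log: List[str]) -> None:
    log.append("begin1 probed")


def begin2(state: State, log: List[str]) -> None:
    log.append("begin2 probed")


def on_return(state: State, log: List[str]) -> None:
    log.append("function return probed")
    state["switch"] = 0


def on_entry(state: State, log: List[str]) -> None:
    log.append("function entry probed")
    state["switch"] = 0


# Same four probes and conditions as the script in comment #3.
PROBES = [
    Probe("begin", lambda s: s["switch"] == -1, begin1),
    Probe("begin", lambda s: s["switch"] == 0, begin2),
    Probe("sys_write.return", lambda s: s["switch"] == 1, on_return),
    Probe("sys_write", lambda s: s["switch"] == 2, on_entry),
]


def fire(point: str, state: State, log: List[str]) -> None:
    """Run every probe at `point` whose condition currently holds."""
    for probe in PROBES:
        if probe.point == point and probe.cond(state):
            probe.handler(state, log)


def run(events: List[str], switch: int = -1) -> List[str]:
    """Replay a sequence of probe-point hits and return the log lines."""
    state = {"switch": switch}
    log: List[str] = []
    for event in events:
        fire(event, state, log)
    return log


if __name__ == "__main__":
    print(run(["begin", "sys_write", "sys_write.return", "sys_write"]))
```

Nothing in the script ever sets switch to 1 or 2, so in this model the two sys_write probes stay disabled for the whole run. The model says nothing about why the real module hangs or oopses.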
(In reply to comment #9)
> My system is running the exact same kernel, same config too, but I don't see the
> crash. It just prints 'begin1 probed'. I can terminate the script and the system
> is usable. No indication of any oops in dmesg either.

Yeah :{ I am beginning to think this is just this one system. As I said, if I set up the same system in a qemu-kvm environment I don't get any oops.

> I've even been able to toggle switch in sys_write (to 1) and sys_write return
> (to -1) to continue probing and the probes hit fine; the log does get printed
> without problems.

If I make a similar change the script just works fine... Indeed, that doesn't make sense, because then the script does even more stuff than in the original script...
(In reply to comment #9)
> My system is running the exact same kernel, same config too, but I don't see the
> crash. It just prints 'begin1 probed'. I can terminate the script and the system
> is usable. No indication of any oops in dmesg either.

I also could not reproduce this bug on my i686 (Pentium D SMP). Based on the symptoms, I guess the module set up a timer handler the wrong way and the timer accessed a wrong address, but I'm not sure how that could happen. So, could you check that your running kernel, kernel binary, and kernel source are all the same revision? Sometimes running make after installing the kernel breaks their consistency.

Thank you,
(In reply to comment #11)
> So, would you checked your running kernel and kernel binary and kernel source are
> same revision? Sometimes, executing make command after install kernel lose its
> consistency.

It is the standard fc9 kernel and kernel-debuginfo packages. I have since upgraded to the latest available:

# uname -a; rpm -q kernel kernel-debuginfo
Linux hermans.wildebeest.org 2.6.25.9-76.fc9.i686 #1 SMP Fri Jun 27 16:14:35 EDT 2008 i686 i686 i386 GNU/Linux
kernel-2.6.25.9-76.fc9.i686
kernel-debuginfo-2.6.25.9-76.fc9.i686

With this kernel the script from comment #3 does indeed work without freezing the machine. Unfortunately the original script in testsuite/systemtap.base/onoffprobe.stp still freezes it.
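The consistency check asked for in comment #11 can be written down as a small helper. This is only a sketch: kernel_packages_match() is a made-up name, and it assumes the Fedora naming seen in the output above, where the package names are "kernel-" and "kernel-debuginfo-" followed by the running release string.

```python
# Sketch: does the running kernel release have matching kernel and
# kernel-debuginfo packages installed? Assumes Fedora-style package names.
from typing import Iterable


def kernel_packages_match(running_release: str, installed: Iterable[str]) -> bool:
    """Return True if both packages for `running_release` are installed.

    `running_release` is what `uname -r` prints; `installed` is the list of
    lines printed by `rpm -q kernel kernel-debuginfo`.
    """
    names = {line.strip() for line in installed}
    return (f"kernel-{running_release}" in names
            and f"kernel-debuginfo-{running_release}" in names)
```

With the values from the output above, the release 2.6.25.9-76.fc9.i686 and the two listed packages match. A leftover kernel-debuginfo from an older kernel would make the check fail.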
I haven't seen this crash for a long time now on recent Fedora 10 kernels (e.g. 2.6.27.19-170.2.35.fc10.i686) with recent systemtap 0.9 or higher. onoffprobe.exp always passes now.