On an up-to-date F10 x86-32 box running the latest systemtap git tree, the command

stap -vv functioncallcount.stp "*@mm/*.c" -c "sleep 1"

triggers a system reboot. The last bits of output before the reboot read:

Pass 5: starting run.
Running /usr/local/bin/staprun -v -c 'sleep 1' /tmp/stapQNWySg/stap_eb8754548b37b0a64ada0bfd68a4128b_596994.ko
stapio:start_cmd:175 blocking briefly

This problem has also been seen on powerpc. I haven't seen it happen on x86_64 yet.
Confirmed to hang my machine with 2.6.27.12-170.2.5.fc10.i686
Could this be another instance of the vunmap kernel bug?
(In reply to comment #1)
> Confirmed to hang my machine with 2.6.27.12-170.2.5.fc10.i686

Me too, and I've checked that the kernel hit a double fault.
---
stap_e49508536d0c61734a1be8954c26680a_596988: systemtap: 0.8/0.137, base: f8c92000, memory: 1057746+25950+4344+13600 data+text+ctx+net, probes: 4008
PANIC: double fault, gdt at ca00c000 [255 bytes]
---
However, I also checked that the 2.6.27.12-170.2.5.fc10.i686 kernel doesn't have Nick's change. I also confirmed that this bug doesn't happen on a vanilla 2.6.27 kernel.
A 2.6.29-rc3 + pr9740 kernel doesn't cause this bug either.
Well, I do see hangs with the 9740 fix, so there is more at play than just that. Also, this problem happens on powerpc, which doesn't use the code that's supposed to fix 9740.
With "__kmalloc" blacklisted, the upstream kernel (2.6.29-rc3) works fine, but we still have the problem with the F10 kernel.
I confirmed that the 2.6.27.12-170.2.5.fc10.i686.PAE kernel does NOT hang.
On the 2.6.29-0.78.rc3.git5.fc11.i686 kernel:

BUG: unable to handle kernel NULL pointer dereference at 00000075
IP: [<c07b6dcb>] .LC52+0x1634/0x12e73
Oops: 0002 [#1] SMP DEBUG_PAGEALLOC
last sysfs file: /sys/module/sunrpc/sections/__param
Modules linked in: stap_2203 iptable_nat nf_nat nfs lockd nfs_acl auth_rpcgss sunrpc ipv6 dm_multipath ppdev pcspkr 8139cp 8139too mii i2c_piix4 i2c_core parport_pc parport ata_generic pata_acpi [last unloaded: scsi_wait_scan]

Pid: 1858, comm: ntpd Not tainted (2.6.29-0.78.rc3.git5.fc11.i686 #1)
EIP: 0060:[<c07b6dcb>] EFLAGS: 00210283 CPU: 0
EIP is at .LC52+0x1634/0x12e73
EAX: acd0b904 EBX: f8bc294c ECX: 00000001 EDX: 00000000
ESI: ef58ce04 EDI: c1ff15e0 EBP: ef58cd68 ESP: ef58cd54
 DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
Process ntpd (pid: 1858, ti=ef58c000 task=ef61d480 task.ti=ef58c000)
Stack:
 00200246 ef58cd80 c0848a10 00000000 ffffffff ef58cd88 c06e4662 ef58cdcc
 00000002 c0849398 c0846690 ef58cdcc 00000002 ef58cdb4 c06e46c3 ffffffff
 00000000 00000002 00000001 00000000 c06e468a ef58ce04 00000000 ef58ce6c
Call Trace:
 [<c06e4662>] ? notifier_call_chain+0x49/0x71
 [<c06e46c3>] ? __atomic_notifier_call_chain+0x39/0x5c
 [<c06e468a>] ? __atomic_notifier_call_chain+0x0/0x5c
 [<c06e46f2>] ? atomic_notifier_call_chain+0xc/0xe
 [<c0444acd>] ? notify_die+0x2d/0x2f
 [<c06e29f9>] ? do_int3+0x1f/0x71
 [<c06e28f4>] ? int3+0x2c/0x34
 [<c048007b>] ? try_set_zone_oom+0x1f/0x151
 [<c048d49f>] ? might_fault+0x1/0x80
 [<c04b108b>] ? set_fd_set+0x17/0x2e
 [<c04b1e22>] ? core_sys_select+0x17d/0x1ce
 [<c041bed2>] ? pvclock_clocksource_read+0x48/0xa3
 [<c041bed2>] ? pvclock_clocksource_read+0x48/0xa3
 [<c041b551>] ? kvm_clock_read+0x16/0x18
 [<c0407f43>] ? sched_clock+0x8/0xb
 [<c044cc13>] ? lock_release_holdtime+0x30/0x131
 [<c044fc0d>] ? lock_release_non_nested+0xa8/0x1a9
 [<c044fc0d>] ? lock_release_non_nested+0xa8/0x1a9
 [<c041bed2>] ? pvclock_clocksource_read+0x48/0xa3
 [<c041b551>] ? kvm_clock_read+0x16/0x18
 [<c0447ec0>] ? getnstimeofday+0x56/0xe4
 [<c04b202a>] ? sys_select+0x6e/0x8c
 [<c0403af6>] ? syscall_call+0x7/0xb
Code: 31 3d 25 64 20 70 69 6e 31 3d 25 64 20 61 70 69 63 32 3d 25 64 20 70 69 6e 32 3d 25 64 0a 00 3c 33 3e 2e 2e 4d 50 2d 42 49 4f 53 <20> 62 75 67 3a 20 38 32 35 34 20 74 69 6d 65 72 20 6e 6f 74 20
EIP: [<c07b6dcb>] .LC52+0x1634/0x12e73 SS:ESP 0068:ef58cd54
pr9740 workaround fixed this problem on (at least) 2.6.27.12-170.2.5.fc10.i686
Subject: Re: functioncallcount.stp causes system crash

On Thu, Feb 05, 2009 at 09:07:34PM -0000, mhiramat at redhat dot com wrote:
>
> ------- Additional Comments From mhiramat at redhat dot com 2009-02-05 21:07 -------
> pr9740 workaround fixed this problem on (at least) 2.6.27.12-170.2.5.fc10.i686

I was able to see the crash even with pr9740 fix.
Subject: Re: functioncallcount.stp causes system crash

> I was able to see the crash even with pr9740 fix.

What I meant was that the alternatives.c fix is not sufficient to fix the problem. I have yet to test with the page fault fix.
Our observations so far:

a. Powerpc instruction emulation had a bug. A patch has been submitted to fix it (http://ozlabs.org/pipermail/linuxppc-dev/2009-February/068062.html).
b. functioncallcount.stp as it exists probes both calls and inlines. That is a problem -- at least on powerpc with the upstream kernel, set/clear/*_bit (part of arch/powerpc/include/asm/bitops.h) shouldn't be probed. It's not clear how to add a set of inline functions from a header file to the blacklist in one go.
c. Maybe it's best not to allow routines in bitops.h to be traced?
d. functioncallcount.stp needs to be modified to probe only .call and not inlines.

Comments?
With functioncallcount.stp restricted to .call, the test passes without problems on an upstream kernel.
(In reply to comment #12)
> Our observations so far:
> b. functioncallcount.stp as it exists probes calls and inlines. That is a
> problem -- at least on powerpc with the upstream kernel, set/clear/*_bit (part
> of arch/powerpc/include/asm/bitops.h) shouldn't be probed.

It would be nice to get a backport of this patch into older popular kernels.

> Its not clear how to
> add a set of inline functions in a header file in the blacklist in one go.

See tapsets.cxx:build_blacklist().

> d. The functioncallcount.stp needs to be modified to just probe .call and not
> inlines.

Sure.
Updated functioncallcount.stp. The blacklist in tapsets.cxx needs updates:

---
 tapsets.cxx | 4 ++++
 1 file changed, 4 insertions(+)

Index: systemtap-11feb/tapsets.cxx
===================================================================
--- systemtap-11feb.orig/tapsets.cxx
+++ systemtap-11feb/tapsets.cxx
@@ -3144,7 +3144,11 @@ dwarf_query::build_blacklist()
   blfile += "kernel/kprobes.c"; // first alternative, no "|"
   blfile += "|arch/.*/kernel/kprobes.c";
+  // Older kernels need ...
   blfile += "|include/asm/io.h";
+  // While newer ones need ...
+  blfile += "|arch/.*/include/asm/io.h";
+  blfile += "|arch/.*/include/asm/bitops.h";
   blfile += "|drivers/ide/ide-iops.c";

   // XXX: it would be nice if these blacklisted functions were pulled

I haven't committed the above. It still needs validation of whether bitops.h should be blacklisted for all archs or only for powerpc. Comments?
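For anyone reviewing the patch, here is a minimal standalone sketch of how a file blacklist built as a single "|"-separated alternation could be matched against source paths. It only reuses the blfile strings from the patch above. The helper names build_file_blacklist() and file_is_blacklisted(), the "^(...)$" anchoring, and the use of std::regex are assumptions made for illustration and are not taken from tapsets.cxx.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: assemble the file blacklist as one alternation,
// using the same strings as the patch above.
std::string build_file_blacklist()
{
  std::string blfile;
  blfile += "kernel/kprobes.c"; // first alternative, no "|"
  blfile += "|arch/.*/kernel/kprobes.c";
  // Older kernels keep the header here ...
  blfile += "|include/asm/io.h";
  // ... while newer ones keep the headers under arch/<arch>/include.
  blfile += "|arch/.*/include/asm/io.h";
  blfile += "|arch/.*/include/asm/bitops.h";
  blfile += "|drivers/ide/ide-iops.c";
  // Assumption: anchor the alternation so only whole paths match.
  return "^(" + blfile + ")$";
}

// Hypothetical helper: true if a kernel-tree-relative path is blacklisted.
bool file_is_blacklisted(const std::string& path)
{
  // POSIX extended syntax, compiled once on first use.
  static const std::regex re(build_file_blacklist(), std::regex::extended);
  return std::regex_search(path, re);
}
```

Under these assumptions, file_is_blacklisted("arch/powerpc/include/asm/bitops.h") is true for any architecture directory, which is exactly the open question about whether the bitops.h entry should be restricted to powerpc. A path such as "mm/slab.c" is not matched.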
Created attachment 3730
Updated patch for the blacklist...

Updated the patch above.
275a898f4 committed
Commit eef336189 in the upstream kernel is the companion fix for powerpc's instruction emulation.
Subject: Re: functioncallcount.stp causes system crash

On Tue, Feb 10, 2009 at 03:54:54PM -0000, fche at redhat dot com wrote:
> It would be nice to get a backport of this patch into older popular kernels.

Frank,

Were you referring to the powerpc instruction emulation bug? If so, that bug doesn't exist in any of the EL kernels so far AFAICS.

Ananth.