Bug 9816 - functioncallcount.stp causes system crash
Status: RESOLVED FIXED
Alias: None
Product: systemtap
Classification: Unclassified
Component: translator
Version: unspecified
Importance: P2 critical
Target Milestone: ---
Assignee: Ananth Mavinakayanahalli
Reported: 2009-02-04 11:34 UTC by Ananth Mavinakayanahalli
Modified: 2009-02-12 14:35 UTC

Attachments
Updated patch for the blacklist... (335 bytes, patch)
2009-02-11 06:18 UTC, Ananth Mavinakayanahalli

Description Ananth Mavinakayanahalli 2009-02-04 11:34:14 UTC
On an up-to-date F10 x86-32 box running the latest systemtap git tree, the
script 'stap -vv functioncallcount.stp "*@mm/*.c" -c "sleep 1"' triggers a
system reboot. The last bits of output before the reboot read:

Pass 5: starting run.
Running /usr/local/bin/staprun -v -c 'sleep 1'
/tmp/stapQNWySg/stap_eb8754548b37b0a64ada0bfd68a4128b_596994.ko
stapio:start_cmd:175 blocking briefly

This problem has also been seen on powerpc. Haven't seen it happen on x86_64 yet.
Comment 1 Mark Wielaard 2009-02-04 11:43:52 UTC
Confirmed to hang my machine with 2.6.27.12-170.2.5.fc10.i686
Comment 2 Frank Ch. Eigler 2009-02-04 17:09:39 UTC
Could this be another instance of the vunmap kernel bug?
Comment 3 Masami Hiramatsu 2009-02-04 23:14:21 UTC
(In reply to comment #1)
> Confirmed to hang my machine with 2.6.27.12-170.2.5.fc10.i686

Me too, and I've verified that the kernel hit a double fault.
---
stap_e49508536d0c61734a1be8954c26680a_596988: systemtap: 0.8/0.137, base:
f8c92000, memory: 1057746+25950+4344+13600 data+text+ctx+net, probes: 4008
PANIC: double fault, gdt at ca00c000 [255 bytes]
---

However, I also checked that the 2.6.27.12-170.2.5.fc10.i686 kernel doesn't
have Nick's change.

I also confirmed that this bug doesn't happen on a vanilla 2.6.27 kernel.
Comment 4 Masami Hiramatsu 2009-02-04 23:48:17 UTC
A 2.6.29-rc3 + pr9740 kernel doesn't trigger this bug either.
Comment 5 Ananth Mavinakayanahalli 2009-02-05 05:07:27 UTC
Well, I do see hangs with 9740, so there is more at play than just that. Also,
this problem happens on powerpc, which doesn't use the code that's supposed to
fix 9740.
Comment 6 Mahesh J Salgaonkar 2009-02-05 12:46:50 UTC
With "__kmalloc" blacklisted, the upstream kernel (2.6.29-rc3) works fine, but
we still have the problem with the F10 kernel.
Comment 7 Masami Hiramatsu 2009-02-05 17:25:40 UTC
I confirmed that the 2.6.27.12-170.2.5.fc10.i686.PAE kernel does NOT hang.
Comment 8 Frank Ch. Eigler 2009-02-05 19:02:10 UTC
On 2.6.29-0.78.rc3.git5.fc11.i686 kernel,

BUG: unable to handle kernel NULL pointer dereference at 00000075
IP: [<c07b6dcb>] .LC52+0x1634/0x12e73
Oops: 0002 [#1] SMP DEBUG_PAGEALLOC
last sysfs file: /sys/module/sunrpc/sections/__param
Modules linked in: stap_2203 iptable_nat nf_nat nfs lockd nfs_acl auth_rpcgss
sunrpc ipv6 dm_multipath ppdev pcspkr 8139cp 8139too mii i2c_piix4 i2c_core
parport_pc parport ata_generic pata_acpi [last unloaded: scsi_wait_scan]

Pid: 1858, comm: ntpd Not tainted (2.6.29-0.78.rc3.git5.fc11.i686 #1) 
EIP: 0060:[<c07b6dcb>] EFLAGS: 00210283 CPU: 0
EIP is at .LC52+0x1634/0x12e73
EAX: acd0b904 EBX: f8bc294c ECX: 00000001 EDX: 00000000
ESI: ef58ce04 EDI: c1ff15e0 EBP: ef58cd68 ESP: ef58cd54
 DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
Process ntpd (pid: 1858, ti=ef58c000 task=ef61d480 task.ti=ef58c000)
Stack:
 00200246 ef58cd80 c0848a10 00000000 ffffffff ef58cd88 c06e4662 ef58cdcc
 00000002 c0849398 c0846690 ef58cdcc 00000002 ef58cdb4 c06e46c3 ffffffff
 00000000 00000002 00000001 00000000 c06e468a ef58ce04 00000000 ef58ce6c
Call Trace:
 [<c06e4662>] ? notifier_call_chain+0x49/0x71
 [<c06e46c3>] ? __atomic_notifier_call_chain+0x39/0x5c
 [<c06e468a>] ? __atomic_notifier_call_chain+0x0/0x5c
 [<c06e46f2>] ? atomic_notifier_call_chain+0xc/0xe
 [<c0444acd>] ? notify_die+0x2d/0x2f
 [<c06e29f9>] ? do_int3+0x1f/0x71
 [<c06e28f4>] ? int3+0x2c/0x34
 [<c048007b>] ? try_set_zone_oom+0x1f/0x151
 [<c048d49f>] ? might_fault+0x1/0x80
 [<c04b108b>] ? set_fd_set+0x17/0x2e
 [<c04b1e22>] ? core_sys_select+0x17d/0x1ce
 [<c041bed2>] ? pvclock_clocksource_read+0x48/0xa3
 [<c041bed2>] ? pvclock_clocksource_read+0x48/0xa3
 [<c041b551>] ? kvm_clock_read+0x16/0x18
 [<c0407f43>] ? sched_clock+0x8/0xb
 [<c044cc13>] ? lock_release_holdtime+0x30/0x131
 [<c044fc0d>] ? lock_release_non_nested+0xa8/0x1a9
 [<c044fc0d>] ? lock_release_non_nested+0xa8/0x1a9
 [<c041bed2>] ? pvclock_clocksource_read+0x48/0xa3
 [<c041b551>] ? kvm_clock_read+0x16/0x18
 [<c0447ec0>] ? getnstimeofday+0x56/0xe4
 [<c04b202a>] ? sys_select+0x6e/0x8c
 [<c0403af6>] ? syscall_call+0x7/0xb
Code: 31 3d 25 64 20 70 69 6e 31 3d 25 64 20 61 70 69 63 32 3d 25 64 20 70 69 6e
32 3d 25 64 0a 00 3c 33 3e 2e 2e 4d 50 2d 42 49 4f 53 <20> 62 75 67 3a 20 38 32
35 34 20 74 69 6d 65 72 20 6e 6f 74 20 
EIP: [<c07b6dcb>] .LC52+0x1634/0x12e73 SS:ESP 0068:ef58cd54
Comment 9 Masami Hiramatsu 2009-02-05 21:07:33 UTC
pr9740 workaround fixed this problem on (at least) 2.6.27.12-170.2.5.fc10.i686
Comment 10 Ananth Mavinakayanahalli 2009-02-06 01:44:49 UTC
Subject: Re:  functioncallcount.stp causes system crash

On Thu, Feb 05, 2009 at 09:07:34PM -0000, mhiramat at redhat dot com wrote:
> 
> ------- Additional Comments From mhiramat at redhat dot com  2009-02-05 21:07 -------
> pr9740 workaround fixed this problem on (at least) 2.6.27.12-170.2.5.fc10.i686

I was able to see the crash even with pr9740 fix.
Comment 11 Ananth Mavinakayanahalli 2009-02-06 03:00:28 UTC
Subject: Re:  functioncallcount.stp causes system crash

> I was able to see the crash even with pr9740 fix.

What I meant was that the alternatives.c fix is not sufficient to fix
the problem. Have yet to test with the page fault fix.
Comment 12 Ananth Mavinakayanahalli 2009-02-10 10:17:23 UTC
Our observations so far:
a. Powerpc instruction emulation had a bug. Patch submitted to fix it
(http://ozlabs.org/pipermail/linuxppc-dev/2009-February/068062.html)
b. functioncallcount.stp as it exists probes both calls and inlines. That is a
problem -- at least on powerpc with the upstream kernel, set/clear/*_bit (part
of arch/powerpc/include/asm/bitops.h) shouldn't be probed. It's not clear how to
add a set of inline functions in a header file to the blacklist in one go.
c. Maybe it's best not to allow routines in bitops.h to be traced?
d. functioncallcount.stp needs to be modified to probe just .call and not
inlines.

Comments?
Comment 13 Ananth Mavinakayanahalli 2009-02-10 10:19:28 UTC
With functioncallcount.stp restricted to .call, the test passes without problems
on an upstream kernel.
Comment 14 Frank Ch. Eigler 2009-02-10 15:54:53 UTC
(In reply to comment #12)
> Our observations so far:
> b. functioncallcount.stp as it exists probes calls and inlines. That is a
> problem -- at least on powerpc with the upstream kernel, set/clear/*_bit (part
> of arch/powerpc/include/asm/bitops.h) shouldn't be probed. 

It would be nice to get a backport of this patch into older popular kernels.

> Its not clear how to
> add a set of inline functions in a header file in the blacklist in one go.

See tapsets.cxx:build_blacklist().


> d. The functioncallcount.stp needs to be modified to just probe .call and not
> inlines.

Sure.
Comment 15 Ananth Mavinakayanahalli 2009-02-11 06:16:39 UTC
Updated functioncallcount.stp.

The blacklist in tapsets.cxx needs updates:

---
 tapsets.cxx |    4 ++++
 1 file changed, 4 insertions(+)

Index: systemtap-11feb/tapsets.cxx
===================================================================
--- systemtap-11feb.orig/tapsets.cxx
+++ systemtap-11feb/tapsets.cxx
@@ -3144,7 +3144,11 @@ dwarf_query::build_blacklist()
 
   blfile += "kernel/kprobes.c"; // first alternative, no "|"
   blfile += "|arch/.*/kernel/kprobes.c";
+  // Older kernels need ...
   blfile += "|include/asm/io.h";
+  // While newer ones need ...
+  blfile += "|arch/.*/include/asm/io.h";
+  blfile += "|arch/.*/include/asm/bitops.h";
   blfile += "|drivers/ide/ide-iops.c";
 
   // XXX: it would be nice if these blacklisted functions were pulled

I haven't committed the above. It still needs validation of whether bitops.h
should be blacklisted for all archs or only for powerpc.

Comments?
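
To illustrate how the "|"-separated file blacklist in the patch above behaves,
here is a minimal, self-contained sketch. It is not the translator's code: the
helper names build_file_blacklist() and is_blacklisted() are hypothetical, the
use of std::regex is purely for illustration, and anchoring the pattern at a
directory boundary and at the end of the path is an assumption, since the
matching code in tapsets.cxx is not shown in this report.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: rebuilds the same alternation string as the patch.
std::string build_file_blacklist() {
    std::string blfile;
    blfile += "kernel/kprobes.c";               // first alternative, no "|"
    blfile += "|arch/.*/kernel/kprobes.c";
    blfile += "|include/asm/io.h";              // older kernel source layout
    blfile += "|arch/.*/include/asm/io.h";      // newer kernel source layout
    blfile += "|arch/.*/include/asm/bitops.h";
    blfile += "|drivers/ide/ide-iops.c";
    return blfile;
}

// Hypothetical check. Assumption: a source file is blacklisted when one of
// the alternatives matches the tail of its path, starting either at the
// beginning of the string or right after a '/'.
bool is_blacklisted(const std::string& path) {
    static const std::regex re("(^|/)(" + build_file_blacklist() + ")$");
    return std::regex_search(path, re);
}
```

Under these assumptions, is_blacklisted("arch/powerpc/include/asm/bitops.h")
is true for any architecture directory because of the "arch/.*/" wildcard,
while a path such as "mm/slab.c" is not matched. A single file-level entry
therefore covers every inline function defined in that header in one go.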
Comment 16 Ananth Mavinakayanahalli 2009-02-11 06:18:41 UTC
Created attachment 3730
Updated patch for the blacklist...

Updated the patch above.
Comment 17 Ananth Mavinakayanahalli 2009-02-12 14:34:04 UTC
275a898f4 committed
Comment 18 Ananth Mavinakayanahalli 2009-02-12 14:35:49 UTC
Commit eef336189 in the upstream kernel is the companion fix for powerpc's
instruction emulation.
Comment 19 Ananth Mavinakayanahalli 2009-02-12 14:52:28 UTC
Subject: Re:  functioncallcount.stp causes system crash

On Tue, Feb 10, 2009 at 03:54:54PM -0000, fche at redhat dot com wrote:
 
> It would be nice to get a backport of this patch into older popular kernels.

Frank,
Were you referring to the powerpc instruction emulation bug? If so, that
bug doesn't exist in any of the EL kernels so far AFAICS.

Ananth.