Bug 2091

Summary: system crash when running "./systemtap.stress/current.stp" on power
Product: systemtap Reporter: Jim Keniston <jkenisto>
Component: kprobes Assignee: Ananth Mavinakayanahalli <ananth>
Status: RESOLVED WORKSFORME    
Severity: normal    
Priority: P2    
Version: unspecified   
Target Milestone: ---   
Host: Target:
Build: Last reconfirmed:
Attachments: Patch against 2.6.9-42.EL

Description Jim Keniston 2005-12-22 02:42:33 UTC
"Gui,Jian" <guij@cn.ibm.com> reported:
-----
My environment is systemtap-snapshot1217, elfutils-0.118-0.1 and
redhat kernel 2.6.9-24.EL on a power4.

When I ran "tests/testsuite/systemtap.stress/current.stp", the system
always crashed. Here is the smallest code segment which will
cause the crash:

  probe kernel.function("*@kernel/sched.c"),
    kernel.function("*@kernel/sched.c").return {}

And here is more information at the breaking point:

0:mon> e
cpu 0x0: Vector: 700 (Program Check) at [c0000001b1313bb0]
    pc: d000000000570624: dwarf_kprobe_1+0x2a0c/0xfffffffffffaf828 [stap_5462]
    lr: c000000000047dac: .kretprobe_trampoline_holder+0x0/0x8
    sp: c0000001b1313e30
   msr: 8000000000089432
  current = 0xc00000000fc2e040
  paca    = 0xc0000000003f2400
    pid   = 5523, comm = stpd
0:mon> t
[link register   ] c000000000047dac .kretprobe_trampoline_holder+0x0/0x8
[c0000001b1313e30] c000000000011280 syscall_exit+0x0/0x18 (unreliable)
--- Exception: c01 (System Call) at 000000000fd4c8d4
SP (ffffe2e0) is in userspace
0:mon> r
R00 = 0000000000000008   R16 = 0000000008028c78
R01 = c0000001b1313e30   R17 = 0000000008028cb0
R02 = c0000000004ec980   R18 = 0000000000000000
R03 = 0000000000004000   R19 = 0000000008008fa8
R04 = 0000000000000028   R20 = 0000000008028c80
R05 = 0000000044000428   R21 = 0000000000000001
R06 = 0000000000000000   R22 = 0000000000000001
R07 = 0000000000000080   R23 = 0000000010010000
R08 = 000000000000d032   R24 = 000000000ffa69d0
R09 = c0000001b1310000   R25 = 0000000010000000
R10 = 8000000000009032   R26 = 0000000010010000
R11 = c0000000002f9c44   R27 = 0000000010010000
R12 = c0000001b1310000   R28 = 0000000010010000
R13 = c0000000003f2400   R29 = 0000000000000004
R14 = 0000000000000001   R30 = 0000000010010000
R15 = 0000000000000000   R31 = 0000000010010000
pc  = d000000000570624 dwarf_kprobe_1+0x2a0c/0xfffffffffffaf828 [stap_5462]
lr  = c000000000047dac .kretprobe_trampoline_holder+0x0/0x8
msr = 8000000000089432   cr  = 48000442
ctr = c0000000002f9c44   xer = 0000000000000000   trap = 700

Any suggestions about this?
Thanks in advance.
-----
My suggestions for diagnosing this bug include:
1. Try the same thing without the entry probes.
2. Try the same thing without the return probes.
3. Run "stap -p3 xxx.stp > xxx.c" and extract the list of kretprobe probe
addresses (dwarf_kprobe_1[]?).  Build a C module that establishes entry kprobes
and/or return probes for all these functions.  See if that crashes.  If so, keep
removing functions from the list until you get a module that doesn't cause a
crash.  Keep playing with the list until you figure out a minimal list to
demonstrate the bug.
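
The list-shrinking loop in step 3 can be sketched as follows. This is only an
illustration of the search strategy, not part of systemtap or the test suite:
the crashes() callback is hypothetical and stands in for "build a module that
probes exactly these functions, load it, and see whether the machine goes
down". Since a real crash takes the system with it, each trial would in
practice be a manual build/load/reboot cycle with the result recorded
elsewhere.

```python
from typing import Callable, List, Sequence


def minimize_probe_list(
    functions: Sequence[str],
    crashes: Callable[[List[str]], bool],
) -> List[str]:
    """Shrink `functions` to a shorter list for which `crashes` is still True.

    Chunks of decreasing size are removed; a chunk stays removed only if
    the remaining list still reproduces the crash.
    """
    current = list(functions)
    if not crashes(current):
        raise ValueError("the full list does not reproduce the crash")

    chunk = max(len(current) // 2, 1)
    while True:
        i = 0
        while i < len(current):
            candidate = current[:i] + current[i + chunk:]
            if candidate and crashes(candidate):
                current = candidate   # chunk was not needed; drop it
            else:
                i += chunk            # chunk is needed (for now); keep it
        if chunk == 1:
            break
        chunk = max(chunk // 2, 1)
    return current


if __name__ == "__main__":
    # Hypothetical stand-in: pretend the crash needs both of these probes.
    names = [f"func_{n:02d}" for n in range(40)]
    culprits = {"func_07", "func_31"}
    print(minimize_probe_list(names, lambda subset: culprits.issubset(subset)))
```

Removing large chunks first keeps the number of trials (and reboots) well below
one per function when only a few probes are involved. If the crash is
intermittent, each trial would need to be repeated before a chunk is declared
unnecessary.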
Comment 1 Gui,Jian 2005-12-22 03:07:10 UTC
Subject: Re:  New: system crash when running "./systemtap.stress/current.stp" on power

> My suggestions for diagnosing this bug include:
> 1. Try the same thing without the entry probes.
> 2. Try the same thing without the return probes.
> 3. Run "stap -p3 xxx.stp > xxx.c" and extract the list of kretprobe probe
> addresses (dwarf_kprobe_1[]?).  Build a C module that establishes entry kprobes
> and/or return probes for all these functions.  See if that crashes.  If so, keep
> removing functions from the list until you get a module that doesn't cause a
> crash.  Keep playing with the list until you figure out a minimal list to
> demonstrate the bug.

Thanks for your suggestions. I've tried 1 and 2, and neither crashed. I'll
try 3 to minimize the list.


Comment 2 Hien Nguyen 2006-01-17 18:18:27 UTC
With Anil's fix for bz#2071 applied to kernel v2.6.15-rc5, and with
systemtap.stress/current.stp modified (probe module("*") commented out, since
it does not work on ppc64), I was able to run the test on Power 5.

Here's the output of the test:

systemtap starting
systemtap ending probe
count = 6535502
sum = 22080034
min = 2
max = 15
avg = 3
systemtap test success
systemtap test success
WARNING: Number of errors: 0, skipped probes: 47933
Running rm -rf /tmp/stapU0gKN6


Comment 3 Ananth Mavinakayanahalli 2006-08-04 05:01:10 UTC
I tried to recreate this problem on a POWER4 LPAR:

[root@llm16 systemtap.stress]# cat /proc/cpuinfo
processor       : 0
cpu             : POWER4+ (gq)
clock           : 1200.791720MHz
revision        : 18.3 (pvr 0038 1203)

processor       : 1
cpu             : POWER4+ (gq)
clock           : 1200.791720MHz
revision        : 18.3 (pvr 0038 1203)

timebase        : 150098965
platform        : pSeries
machine         : CHRP IBM,7028-6C4

The test ran just fine, for at least two iterations:

[root@llm16 systemtap.stress]# stap -g current.stp
systemtap starting probe
systemtap ending probe
count = 5889610
sum = 40002649
min = 3
max = 11
avg = 6
systemtap test success
systemtap test success
[root@llm16 systemtap.stress]# stap -g current.stp
systemtap starting probe
systemtap ending probe
count = 16514488
sum = 112075474
min = 3
max = 15
avg = 6
systemtap test success
systemtap test success
WARNING: Number of errors: 0, skipped probes: 440
[root@llm16 systemtap.stress]#

This machine is running RHEL4 (U3):
[root@llm16 systemtap.stress]# cat /etc/redhat-release
Red Hat Enterprise Linux AS release 4 (Nahant Update 3)

But the kernel running at the time is upstream 2.6.18-rc3 (compiled with
pseries_defconfig):

[root@llm16 systemtap.stress]# uname -a
Linux llm16.in.ibm.com 2.6.18-rc3 #4 SMP Wed Aug 2 17:50:14 IST 2006 ppc64 ppc64 ppc64 GNU/Linux

I'll try to increase the test run duration to see if the problem can be recreated. 

Mike, Jian Gui, was the problem with just the RHEL4-Ux kernel? Can you please
try the same test with the upstream kernel?

Ananth
Comment 4 Gui,Jian 2006-08-04 09:43:03 UTC
We changed our machines soon after the original bug report,
so I had to run the same test on my Power5 LPAR instead.

I can run this test successfully. I think this bug has been fixed 
as Hien mentioned above and we can close this bug.

The environment is also RHEL4_U3 and kernel 2.6.18-rc3 (compiled
with pseries_defconfig).

root:systemtap.stress>cat /proc/cpuinfo
...
processor       : 7
cpu             : POWER5 (gr)
clock           : 1502.496000MHz
revision        : 2.2 (pvr 003a 0202)

timebase        : 188044000
platform        : pSeries
machine         : CHRP IBM,9124-720
root:systemtap.stress>stap -g current.stp
systemtap starting probe
systemtap ending probe
count = 53479304
sum = 316413369
min = 4
max = 11
avg = 5
systemtap test success
systemtap test success
root:systemtap.stress>stap -g current.stp
systemtap starting probe
systemtap ending probe
count = 13909012
sum = 74516691
min = 2
max = 15
avg = 5
systemtap test success
systemtap test success
WARNING: Number of errors: 0, skipped probes: 2020
Comment 5 Mike Mason 2006-08-04 16:44:26 UTC
This bug shouldn't be closed yet. The fix Hien mentions in comment #2 was not
accepted into the kernel.  We're currently seeing this problem with SLES 10 on
power4 and the latest systemtap snapshot.  We are not seeing the problem on
power5, although I don't think this is a power4 vs power5 issue per se. I think
it's partially related to which functions the wildcards resolve to, especially
for modules.  I'll update this report with more details later today.


Comment 6 Jim Keniston 2006-11-16 01:15:33 UTC
Somebody from IBM needs to follow up on this.
Comment 7 Ananth Mavinakayanahalli 2006-11-16 12:26:35 UTC
Subject: Re:  system crash when running "./systemtap.stress/current.stp" on power

On Thu, Nov 16, 2006 at 01:15:33AM -0000, jkenisto at us dot ibm dot com wrote:
> 
> ------- Additional Comments From jkenisto at us dot ibm dot com  2006-11-16 01:15 -------
> Somebody from IBM needs to follow up on this.
> 
> -- 
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>          AssignedTo|systemtap at sources dot    |ananth at in dot ibm dot com
>                    |redhat dot com              |
>              Status|NEW                         |ASSIGNED
> 
> 
> http://sourceware.org/bugzilla/show_bug.cgi?id=2091

Amit will be looking into this issue.

Ananth
Comment 8 Ananth Mavinakayanahalli 2006-11-17 06:11:10 UTC
Created attachment 1425
Patch against 2.6.9-42.EL

Patch that fixed the power4-only itrace bug. I don't know for sure if this is
needed on RHEL4, but it's worth a test.
Comment 9 Ananth Mavinakayanahalli 2007-06-07 10:10:03 UTC
A survey of the weekly snapshot tests for powerpc shows that current.stp is
working fine.

Closing this bug as WORKSFORME. If this problem is seen again, we can reopen
this bug.