Bug 2387 - system crash on ppc64/2.6.15.4
Summary: system crash on ppc64/2.6.15.4
Status: RESOLVED DUPLICATE of bug 2406
Alias: None
Product: systemtap
Classification: Unclassified
Component: kprobes
Version: unspecified
Importance: P2 enhancement
Target Milestone: ---
Assignee: Unassigned
Reported: 2006-02-23 09:40 UTC by Li Guanglei
Modified: 2006-03-02 06:16 UTC (History)
Description Li Guanglei 2006-02-23 09:40:47 UTC
When I run dbench while using systemtap to probe all syscalls on ppc64/2.6.15.4,
the system crashes shortly afterwards. The error given by xmon:

Unable to handle kernel paging request for data at address 0x00000010
Faulting instruction address: 0xd000000000270ee4
cpu 0x1: Vector: 300 (Data Access) at [c00000005d95a8c0]
    pc: d000000000270ee4: ._stp_print_flush+0xb8/0x164 [stap_13972]
    lr: d000000000272a94: .probe_1+0x374/0x400 [stap_13972]
    sp: c00000005d95ab40
   msr: 8000000000001032
   dar: 10
 dsisr: 40000000
  current = 0xc000000020739040
  paca    = 0xc000000000538400
    pid   = 25259, comm = hotplug
enter ? for help

1:mon> t
[c00000005d95abf0] d000000000272a94 .probe_1+0x374/0x400 [stap_13972]
[c00000005d95ac90] d000000000272cf4 .dwarf_kprobe_1_enter+0x13c/0x1d8 [stap_13972]
[c00000005d95ad10] c00000000041959c .kprobe_exceptions_notify+0x334/0x5e8
[c00000005d95add0] c00000000041a134 .notifier_call_chain+0x68/0x98
[c00000005d95ae60] c000000000418834 .program_check_exception+0x114/0x5d0
[c00000005d95af00] c000000000004348 program_check_common+0xc8/0x100
--- Exception: 700 (Program Check) at c0000000000b0b94
.__find_get_block_slow+0x0/0x174
[link register   ] c0000000000b1940 .__find_get_block+0x110/0x278
[c00000005d95b1f0] c00000000027c6b0 .put_device+0x1c/0x30 (unreliable)
[c00000005d95b2d0] c0000000000b5184 .__getblk+0x44/0x2cc
[c00000005d95b390] c00000000013d678 .__ext3_get_inode_loc+0x1b0/0x42c
[c00000005d95b450] c00000000013e568 .ext3_reserve_inode_write+0x58/0x11c
[c00000005d95b500] c00000000013e650 .ext3_mark_inode_dirty+0x24/0x5c
[c00000005d95b5b0] c000000000140df0 .ext3_dirty_inode+0x8c/0xbc
[c00000005d95b640] c0000000000ddcb4 .__mark_inode_dirty+0x70/0x1e8
[c00000005d95b6e0] c0000000000d105c .update_atime+0xa4/0xbc
[c00000005d95b770] c0000000000802e8 .do_generic_mapping_read+0x41c/0x474
[c00000005d95b8c0] c000000000082b4c .__generic_file_aio_read+0x1b4/0x21c
[c00000005d95b990] c000000000082d5c .generic_file_aio_read+0x44/0x54
[c00000005d95ba20] c0000000000ae520 .do_sync_read+0xcc/0x124
[c00000005d95bba0] c0000000000ae65c .vfs_read+0xe4/0x1b8
[c00000005d95bc40] c0000000000bd7a4 .kernel_read+0x34/0x58
[c00000005d95bce0] c0000000000e87b4 .compat_do_execve+0x15c/0x2c8
[c00000005d95bd90] c000000000012744 .compat_sys_execve+0x7c/0xf8
[c00000005d95be30] c000000000008600 syscall_exit+0x0/0x18
--- Exception: c01 (System Call) at 000000000fef6004
SP (ffc403c0) is in userspace
Comment 1 Frank Ch. Eigler 2006-02-23 12:34:26 UTC
If I read this correctly, .__find_get_block_slow suffered some kind of fault. 
Could you disassemble your kernel in its neighbourhood to figure out which part
of that function triggered it?

Also, I don't understand how the kprobe was entered.  The exception notification
stuff should not result in launching into a kprobe.  Systemtap does not set any
"kp_fault_handler" at present.  Does the "stap -p3" source code suggest any
linkage of dwarf_kprobe_1_enter to kprobe_exceptions_notify?  Might there simply
be a structure initialization issue?
Comment 2 Li Guanglei 2006-02-23 15:23:40 UTC
The following is the disassembly given by objdump:

Disassembly inside __find_get_block:
c0000000000b1934:    mr      r31,r6
c0000000000b1938:    bne-    cr7,c0000000000b1a68 <.__find_get_block+0x238>
c0000000000b193c:    bl      c0000000000b0b94 <.__find_get_block_slow>
c0000000000b1940:    mr.     r31,r3
c0000000000b1944:    beq-    c0000000000b1a68 <.__find_get_block+0x238>
c0000000000b1948:    li      r27,0
c0000000000b194c:    mfmsr   r0


Disassembly around __find_get_block_slow:
c0000000000b0b8c <.sys_fdatasync>:
c0000000000b0b8c:    li      r4,1
c0000000000b0b90:    b       c0000000000b0a10 <.do_fsync>

c0000000000b0b94 <.__find_get_block_slow>:
c0000000000b0b94:    mflr    r0
c0000000000b0b98:    std     r24,-64(r1)
c0000000000b0b9c:    std     r25,-56(r1)
c0000000000b0ba0:    std     r28,-32(r1)
c0000000000b0ba4:    std     r29,-24(r1)
c0000000000b0ba8:    mr      r24,r4

But I wonder whether the info given by xmon is useful. I tried several times,
and it crashed every time, each time showing a different exception and backtrace.
I noticed that all of these errors include:

Unable to handle kernel paging request for data at address ...


--------------- Testing One ---------------------------------

Unable to handle kernel paging request for data at address 0x00000010
Faulting instruction address: 0xd000000000270ee4
cpu 0x1: Vector: 300 (Data Access) at [c000000040dab3f0]
    pc: d000000000270ee4: ._stp_print_flush+0xb8/0x164 [stap_7259]
    lr: d000000000273cb4: .probe_4+0x374/0x400 [stap_7259]
    sp: c000000040dab670
   msr: 8000000000001032
   dar: 10
 dsisr: 40000000
  current = 0xc00000002a351040
  paca    = 0xc000000000538400
    pid   = 9179, comm = dbench
enter ? for help

1:mon> t
[c000000040dab720] d000000000273cb4 .probe_4+0x374/0x400 [stap_7259]
[c000000040dab7c0] d000000000273e6c .dwarf_kprobe_4_enter+0x12c/0x1c8 [stap_7259]
[c000000040dab840] c000000000419164 .trampoline_probe_handler+0xb0/0x150
[c000000040dab8e0] c00000000041959c .kprobe_exceptions_notify+0x334/0x5e8
[c000000040dab9a0] c00000000041a134 .notifier_call_chain+0x68/0x98
[c000000040daba30] c000000000418834 .program_check_exception+0x114/0x5d0
[c000000040dabad0] c000000000004348 program_check_common+0xc8/0x100
--- Exception: 700 (Program Check) at c00000000002a3bc kretprobe_trampoline+0x0/0x8
[c000000040dabe30] c00000000002a3bc kretprobe_trampoline+0x0/0x8
--- Exception: c01 (System Call) at 000000000ff201b8
SP (ff9000b0) is in userspace
1:mon> 

----------- Testing Two -----------------------------------

localhost.localdomain login: Unable to handle kernel paging request for data at address 0x00000010
Faulting instruction address: 0xd000000000270ee4
cpu 0x1: Vector: 300 (Data Access) at [c000000066eeb500]
    pc: d000000000270ee4: ._stp_print_flush+0xb8/0x164 [stap_3949]
    lr: d0000000002736dc: .probe_3+0x374/0x400 [stap_3949]
    sp: c000000066eeb780
   msr: 8000000000001032
   dar: 10
 dsisr: 40000000
  current = 0xc000000002423040
  paca    = 0xc000000000538400
    pid   = 17224, comm = env
enter ? for help
1:mon> t
[c000000066eeb830] d0000000002736dc .probe_3+0x374/0x400 [stap_3949]
[c000000066eeb8d0] d0000000002738a4 .dwarf_kprobe_3_enter+0x13c/0x1d8 [stap_3949]
[c000000066eeb950] c00000000041959c .kprobe_exceptions_notify+0x334/0x5e8
[c000000066eeba10] c00000000041a134 .notifier_call_chain+0x68/0x98
[c000000066eebaa0] c000000000418834 .program_check_exception+0x114/0x5d0
[c000000066eebb40] c000000000004348 program_check_common+0xc8/0x100
--- Exception: 700 (Program Check) at c00000000000ae38 .ppc_newuname+0x14/0x120
[link register   ] c00000000002a3bc kretprobe_trampoline+0x0/0x8
[c000000066eebe30] c000000000004760 .handle_page_fault+0x20/0x54 (unreliable)
--- Exception: c01 (System Call) at 000000000ffe2958
SP (fff6a970) is in userspace
1:mon> 

----------------------------------------------------------
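
All three traces above fault at data address 0x10 (dar: 10) inside _stp_print_flush. One pattern that is consistent with such an address, though the traces alone do not prove it, is a load through a NULL structure pointer whose accessed member sits at offset 0x10. The sketch below only illustrates that arithmetic; struct demo_buf and member_fault_address are hypothetical names and do not reproduce any real relayfs or systemtap layout.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout for illustration only, NOT a real relayfs/systemtap struct. */
struct demo_buf {
    uint64_t start;   /* offset 0x00 */
    uint64_t size;    /* offset 0x08 */
    uint64_t offset;  /* offset 0x10 */
};

/*
 * Address that an access to base->offset would touch, computed
 * arithmetically so that nothing is actually dereferenced.
 */
uintptr_t member_fault_address(uintptr_t base)
{
    return base + offsetof(struct demo_buf, offset);
}
```

With base equal to 0 (a NULL pointer), the computed address is 0x10, which is the kind of small faulting address xmon reports here.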


kprobe_exceptions_notify can be triggered by a breakpoint or a single-step trap.
It checks the cause and, if it was triggered by a breakpoint, invokes
kprobe_handler, which in turn invokes kprobe->pre_handler, i.e. the probe
handler. And stap -p3 shows:
 dwarf_kprobe_1[i].pre_handler = &dwarf_kprobe_1_enter;

So I think the exception notification stuff *could* result in launching into a
kprobe. Am I wrong about something?
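
The dispatch path described above can be modelled in a few lines of user-space C. This is a simplified sketch, not kernel code: the demo_ types and functions are hypothetical stand-ins, and the real kprobe_exceptions_notify does considerably more work.

```c
#include <stddef.h>

/* Simplified stand-ins for the kernel's event codes and struct kprobe. */
enum demo_event { DEMO_BREAKPOINT, DEMO_SINGLE_STEP };

struct demo_kprobe {
    int (*pre_handler)(struct demo_kprobe *p);
    int hits;  /* how many times the probe handler has run */
};

/* Plays the role of a generated handler such as dwarf_kprobe_1_enter. */
int demo_probe_enter(struct demo_kprobe *p)
{
    p->hits++;
    return 0;
}

/* Model of the notifier: only a breakpoint event reaches pre_handler. */
int demo_exceptions_notify(enum demo_event ev, struct demo_kprobe *p)
{
    if (ev == DEMO_BREAKPOINT && p != NULL && p->pre_handler != NULL) {
        p->pre_handler(p);
        return 1;  /* dispatched to the probe handler */
    }
    return 0;      /* not dispatched */
}
```

In this model a breakpoint event runs the registered pre_handler once, while a single-step event leaves the hit count unchanged, which matches the backtraces where the probe handler appears directly above kprobe_exceptions_notify.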

Comment 3 Li Guanglei 2006-03-01 14:41:34 UTC
I tried the 2.6.15.1-2.6.15.4 and 2.6.16-rc5 kernels, and all of them gave
almost the same error, like:
Unable to handle kernel paging request for data at address ...

If I don't use the -b option of systemtap, it seems to run for a
long time without a kernel panic.

I also noticed that the kernel reports I/O errors even when I am not
running systemtap and only do some simple write operations:
end_request: I/O error, dev sda, sector 17445
end_request: I/O error, dev sda, sector 17447
end_request: I/O error, dev sda, sector 17449
Aborting journal on device sda2.
ext3_abort called.
EXT3-fs error (device sda2): ext3_journal_start_sb: Detected aborted journal
Remounting filesystem read-only

The same version of systemtap runs very well with 2.6.9-30EL, so it is a
bug in the mainline kernel.
Comment 4 Jose R. Santos 2006-03-01 16:25:16 UTC
If you are seeing problems even when not using SystemTap, then this is probably
something outside of SystemTap.  I suggest following this up on the linux-kernel
and linuxppc64-dev mailing lists to see if the problem is located in the kernel.

We should mark this bug as rejected until it's proven that it is a SystemTap problem.
Comment 5 Li Guanglei 2006-03-01 16:36:18 UTC
(In reply to comment #4)
> If you are seeing problems even when not using SystemTap, then this is probably
> something outside of SystemTap.  I suggest following this up on the linux-kernel
> and linuxppc64-dev mailing lists to see if the problem is located in the kernel.
> 
> We should mark this bug as rejected until it's proven that it is a SystemTap
> problem.

The error: end_request: I/O error, dev sda, sector 17445 ...
happens without running systemtap. It occurs after I copy something
into that partition. But I am not sure whether it is what causes the kernel
panic when running systemtap.

The error:
Unable to handle kernel paging request for data at address
happens when running stap with the -b option.
But I agree with Jose that it may not be a systemtap bug, because systemtap
works quite well on the Red Hat shipped kernels (2.6.9-30.EL, 2.6.9-27.EL).

It should not be a hardware failure, because I tried it on different machines,
and even after reformatting the partition; all of them show the same error.

The 2.6.15 kernel has some changes to the power arch (moving ppc64 to the powerpc
directory), and its relayfs differs a lot from the RH shipped kernel. I tried not
compiling relayfs into 2.6.15* so that systemtap would compile its own, but that
failed: the relayfs shipped with systemtap can't be compiled because some function
signatures have changed. If I have time I'll try to replace relayfs.



Comment 6 Tom Zanussi 2006-03-01 16:56:49 UTC
(In reply to comment #5)

To get systemtap to use the relayfs in the 2.6.15 kernel, try putting #define
RELAYFS_VERSION_GE_4 at the top of src/runtime/transport/relayfs.h.

Tom
Comment 7 Tom Zanussi 2006-03-02 05:08:22 UTC
(In reply to comment #6)

I don't know if this is or isn't the cause of the problem, since I'm not seeing
it on my x86 test machine, but I do see that the wrong relayfs_fs.h header file
(the one in runtime/relayfs/linux/ rather than the one in the installed kernel
sources) is being used to generate the probe module, when running a 2.6.15
kernel without the RELAYFS_VERSION_GE_4 define in relayfs.h.

Can you go ahead and try adding that define and see if it helps? i.e. add
#define RELAYFS_VERSION_GE_4 to src/runtime/transport/relayfs.h and then do a
'make install' to get it installed.  Also make sure you have relayfs configured
into your kernel.

If that's the problem, then this bug could probably be closed and would be fixed
by 2406, which deals with autodetecting the proper relayfs version, including
this one.
Comment 8 Li Guanglei 2006-03-02 05:36:01 UTC
> I don't know if this is or isn't the cause of the problem, since I'm not seeing
> it on my x86 test machine, but I do see that the wrong relayfs_fs.h header file
> (the one in runtime/relayfs/linux/ rather than the one in the installed kernel
> sources) is being used to generate the probe module, when running a 2.6.15
> kernel without the RELAYFS_VERSION_GE_4 define in relayfs.h.
> 
> Can you go ahead and try adding that define and see if it helps? i.e. add
> #define RELAYFS_VERSION_GE_4 to src/runtime/transport/relayfs.h and then do a
> 'make install' to get it installed.  Also make sure you have relayfs configured
> into your kernel.
> 
> If that's the problem, then this bug could probably be closed and would be fixed
> by 2406, which deals with autodetecting the proper relayfs version, including
> this one.

I tried, and it worked. Thanks. It does not seem to crash any more.
But there are some errors (in fact, warnings) when stap compiles the module; I
bypassed them by deleting -Werror in buildrun.cxx:

Running grep " [tT] " /proc/kallsyms | sort -k 1,8 -s -o
/tmp/stap2iLdUc/symbols.sorted
Pass 3: translated to C into "/tmp/stap2iLdUc/stap_6318.c" in
280usr/1000sys/1294real ms.
Running make -C "/lib/modules/2.6.9-30.EL/build" M="/tmp/stap2iLdUc" modules V=1
make: Entering directory `/usr/src/kernels/2.6.9-30.EL-ppc64'
mkdir -p /tmp/stap2iLdUc/.tmp_versions
make -f scripts/Makefile.build obj=/tmp/stap2iLdUc
  gcc -m64 -Wp,-MD,/tmp/stap2iLdUc/.stap_6318.o.d -nostdinc -iwithprefix include
-D__KERNEL__ -Iinclude  -Wall -Wstrict-prototypes -Wno-trigraphs
-fno-strict-aliasing -fno-common -Os -g -Wdeclaration-after-statement
-msoft-float -pipe -mminimal-toc -mtraceback=none -mcall-aixdesc               
    -mtune=power4 -fno-unit-at-a-time -Wno-unused -Werror -I
"/usr/local/share/systemtap/runtime" -I
"/usr/local/share/systemtap/runtime/relayfs"   -DMODULE
-DKBUILD_BASENAME=stap_6318 -DKBUILD_MODNAME=stap_6318 -c -o
/tmp/stap2iLdUc/.tmp_stap_6318.o /tmp/stap2iLdUc/stap_6318.c
In file included from /usr/local/share/systemtap/runtime/transport/transport.c:20,
                 from /usr/local/share/systemtap/runtime/io.c:14,
                 from /usr/local/share/systemtap/runtime/print.c:16,
                 from /usr/local/share/systemtap/runtime/runtime.h:61,
                 from /tmp/stap2iLdUc/stap_6318.c:30:
/usr/local/share/systemtap/runtime/transport/relayfs.c: In function
`_stp_subbuf_start':
/usr/local/share/systemtap/runtime/transport/relayfs.c:33: warning: implicit
declaration of function `relay_buf_full'
/usr/local/share/systemtap/runtime/transport/relayfs.c:39: warning: implicit
declaration of function `subbuf_start_reserve'
/usr/local/share/systemtap/runtime/transport/relayfs.c: At top level:
/usr/local/share/systemtap/runtime/transport/relayfs.c:77: warning:
initialization from incompatible pointer type
/usr/local/share/systemtap/runtime/transport/relayfs.c: In function
`_stp_relayfs_open':
/usr/local/share/systemtap/runtime/transport/relayfs.c:129: warning: passing arg
5 of `relay_open' makes integer from pointer without a cast
/usr/local/share/systemtap/runtime/transport/relayfs.c:129: error: too few
arguments to function `relay_open'
In file included from /usr/local/share/systemtap/runtime/transport/transport.c:45,
                 from /usr/local/share/systemtap/runtime/io.c:14,
                 from /usr/local/share/systemtap/runtime/print.c:16,
                 from /usr/local/share/systemtap/runtime/runtime.h:61,
                 from /tmp/stap2iLdUc/stap_6318.c:30:
/usr/local/share/systemtap/runtime/transport/procfs.c: In function `_stp_proc_read':
/usr/local/share/systemtap/runtime/transport/procfs.c:35: error: incompatible
types in assignment
/usr/local/share/systemtap/runtime/transport/procfs.c:36: error: incompatible
types in assignment
In file included from /usr/local/share/systemtap/runtime/io.c:14,
                 from /usr/local/share/systemtap/runtime/print.c:16,
                 from /usr/local/share/systemtap/runtime/runtime.h:61,
                 from /tmp/stap2iLdUc/stap_6318.c:30:
/usr/local/share/systemtap/runtime/transport/transport.c: In function
`_stp_handle_buf_info':
/usr/local/share/systemtap/runtime/transport/transport.c:86: error: incompatible
types in assignment
/usr/local/share/systemtap/runtime/transport/transport.c:87: error: incompatible
types in assignment
make[1]: *** [/tmp/stap2iLdUc/stap_6318.o] Error 1
make: *** [_module_/tmp/stap2iLdUc] Error 2
make: Leaving directory `/usr/src/kernels/2.6.9-30.EL-ppc64'
Pass 4: compiled C into "stap_6318.ko" in 2820usr/220sys/2893real ms.
Pass 4: compilation failed.  Try again with more '-v' (verbose) options.
Running rm -rf /tmp/stap2iLdUc
Comment 9 Li Guanglei 2006-03-02 05:48:12 UTC
> I tried, and it worked. Thanks. It seems not crash any more.
> But there is some errors(in fact, warnings) when stap is compiling the module, I
> bypassed it by delete the -Werror in buildrun.cxx:
The error on the 2.6.15.3 kernel (with -Werror in buildrun.cxx) is:

Running grep " [tT] " /proc/kallsyms | sort -k 1,8 -s -o
/tmp/stap5mvGWl/symbols.sorted
Pass 3: translated to C into "/tmp/stap5mvGWl/stap_12492.c" in
220usr/90sys/313real ms.
Running make -C "/lib/modules/2.6.15.3/build" M="/tmp/stap5mvGWl" modules V=1
make: Entering directory `/usr/src/linux-2.6.15.3'
mkdir -p /tmp/stap5mvGWl/.tmp_versions
make -f scripts/Makefile.build obj=/tmp/stap5mvGWl
  gcc -m64 -Wp,-MD,/tmp/stap5mvGWl/.stap_12492.o.d  -nostdinc -isystem
/usr/lib/gcc/ppc64-redhat-linux/3.4.5/include -D__KERNEL__ -Iinclude  -include
include/linux/autoconf.h  -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs
-fno-strict-aliasing -fno-common -ffreestanding -Os     -fomit-frame-pointer -g
-msoft-float -pipe -mminimal-toc -mtraceback=none  -mcall-aixdesc -mtune=power4
-mno-altivec -funit-at-a-time -mstring -Wa,-maltivec
-Wdeclaration-after-statement  -Wno-unused -Werror -I
"/usr/local/share/systemtap/runtime" -I
"/usr/local/share/systemtap/runtime/relayfs"   -DMODULE
-DKBUILD_BASENAME=stap_12492 -DKBUILD_MODNAME=stap_12492 -c -o
/tmp/stap5mvGWl/.tmp_stap_12492.o /tmp/stap5mvGWl/stap_12492.c
In file included from /usr/local/share/systemtap/runtime/transport/transport.c:20,
                 from /usr/local/share/systemtap/runtime/io.c:14,
                 from /usr/local/share/systemtap/runtime/print.c:16,
                 from /usr/local/share/systemtap/runtime/runtime.h:61,
                 from /tmp/stap5mvGWl/stap_12492.c:30:
/usr/local/share/systemtap/runtime/transport/relayfs.c:77: warning:
initialization from incompatible pointer type
make[1]: *** [/tmp/stap5mvGWl/stap_12492.o] Error 1
make: *** [_module_/tmp/stap5mvGWl] Error 2
make: Leaving directory `/usr/src/linux-2.6.15.3'
Pass 4: compiled C into "stap_12492.ko" in 2210usr/250sys/2104real ms.
Pass 4: compilation failed.  Try again with more '-v' (verbose) options.
Running rm -rf /tmp/stap5mvGWl

So do we need to add some explicit type casts to eliminate such warnings?
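
On the cast question, a hedged sketch: assuming the warning at relayfs.c:77 comes from initializing a callback table with a function whose prototype differs from the member's type (the log does not confirm this), a cast would silence the compiler but the callback would still be invoked through a mismatched type, which is undefined behaviour in C. Giving the function the exact expected signature avoids both the warning and the cast. The demo_ names and the prototype below are hypothetical and do not reproduce the real relayfs callbacks.

```c
/* Hypothetical callback table; not the real relayfs callback structure. */
struct demo_callbacks {
    int (*subbuf_start)(void *buf, void *subbuf,
                        void *prev_subbuf, unsigned int prev_padding);
};

/* Defined with exactly the member's prototype, so no cast is required. */
int demo_subbuf_start(void *buf, void *subbuf,
                      void *prev_subbuf, unsigned int prev_padding)
{
    (void)buf;
    (void)subbuf;
    (void)prev_subbuf;
    /* Placeholder policy for the demo: accept only when there is no padding. */
    return prev_padding == 0;
}

/* Initializes cleanly under -Wall -Werror because the types match. */
struct demo_callbacks demo_cbs = { .subbuf_start = demo_subbuf_start };
```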
Comment 10 Tom Zanussi 2006-03-02 05:53:02 UTC
(In reply to comment #8)
> I tried, and it worked. Thanks. It seems not crash any more.
> But there is some errors(in fact, warnings) when stap is compiling the module, I
> bypassed it by delete the -Werror in buildrun.cxx:
> 

Hmm, where did you put the #define?

I get these warnings if I put it at the bottom of relayfs.h, but putting it at
the top, just above 

#ifdef RELAYFS_VERSION_GE_4
#include <linux/relayfs_fs.h>
...

it works fine for me...
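
Why placement matters can be shown with a small stand-alone sketch: the preprocessor evaluates #ifdef at the point where it appears, so a #define that comes later in the file has no effect on it. The example below records which header path would be chosen as a string instead of actually including it; RELAYFS_HEADER_CHOSEN and selected_relayfs_header are hypothetical names used only for this illustration.

```c
/* Defined BEFORE the conditional, as in the working relayfs.h. */
#define RELAYFS_VERSION_GE_4

#ifdef RELAYFS_VERSION_GE_4
/* Header from the installed kernel sources. */
#define RELAYFS_HEADER_CHOSEN "<linux/relayfs_fs.h>"
#else
/* Copy bundled with the systemtap runtime. */
#define RELAYFS_HEADER_CHOSEN "../relayfs/linux/relayfs_fs.h"
#endif

/* Reports which branch the preprocessor selected. */
const char *selected_relayfs_header(void)
{
    return RELAYFS_HEADER_CHOSEN;
}
```

Moving the #define below the #ifdef block would make the function report the bundled path instead, which corresponds to the case where the warnings appear.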
Comment 11 Li Guanglei 2006-03-02 05:58:49 UTC
> Hmm, where did you put the #define?
> 
> I get these warnings if I put it at the bottom of relayfs.h, but putting it at
> the top, just above 
> 
> #ifdef RELAYFS_VERSION_GE_4
> #include <linux/relayfs_fs.h>
> ...
> 
> it works fine for me...

the file I used:

#ifndef _TRANSPORT_RELAYFS_H_ /* -*- linux-c -*- */
#define _TRANSPORT_RELAYFS_H_
#define RELAYFS_VERSION_GE_4 

/** @file relayfs.h
 * @brief Header file for relayfs transport
 */

#ifdef RELAYFS_VERSION_GE_4
#include <linux/relayfs_fs.h>
#else
#include "../relayfs/linux/relayfs_fs.h"
#endif /* RELAYFS_VERSION_GE_4 */

struct rchan *_stp_relayfs_open(unsigned n_subbufs,
                                unsigned subbuf_size,
                                int pid,
                                struct dentry **outdir);
void _stp_relayfs_close(struct rchan *chan, struct dentry *dir);

#endif /* _TRANSPORT_RELAYFS_H_ */

So is it due to the gcc version? My gcc is:
gcc version 3.4.5 20051201 (Red Hat 3.4.5-2)
I checked the code, and it is just a warning about the assignment:
int *ptr <--- static int *ptr

But I hit another problem when I used my testcase to stress test systemtap:

-bash-3.00# ./test.sh -f  lgl.cfg  -I tapsets/tapsets1/           
The tapsets is tapsets/tapsets1/
don't probe app : dbench
TIMES : 1
TIMES : 2
probe app : dbench
TIMES : 1
TIMES : 2
error opening file stpd_cpu0.
ERROR: couldn't unlink percpu file stpd_cpu0: errcode = No such file or directory

Do you have any idea about such errors? I have never seen them before.
I raised MAXDSKIPPED when running my testcases.
Comment 12 Tom Zanussi 2006-03-02 06:13:36 UTC
(In reply to comment #11)
> > Hmm, where did you put the #define?
> > 
> > I get these warnings if I put it at the bottom of relayfs.h, but putting it at
> > the top, just above 
> > 
> > #ifdef RELAYFS_VERSION_GE_4
> > #include <linux/relayfs_fs.h>
> > ...
> > 
> > it works fine for me...
> 
> the file I used:
> 
> #ifndef _TRANSPORT_RELAYFS_H_ /* -*- linux-c -*- */
> #define _TRANSPORT_RELAYFS_H_
> #define RELAYFS_VERSION_GE_4 
> 
> /** @file relayfs.h
>  * @brief Header file for relayfs transport
>  */
> 
> #ifdef RELAYFS_VERSION_GE_4
> #include <linux/relayfs_fs.h>
> #else
> #include "../relayfs/linux/relayfs_fs.h"
> #endif /* RELAYFS_VERSION_GE_4 */
> 
> struct rchan *_stp_relayfs_open(unsigned n_subbufs,
>                                 unsigned subbuf_size,
>                                 int pid,
>                                 struct dentry **outdir);
> void _stp_relayfs_close(struct rchan *chan, struct dentry *dir);
> 
> #endif /* _TRANSPORT_RELAYFS_H_ */
> 
> So is it due to the gcc version? My gcc is:
> gcc version 3.4.5 20051201 (Red Hat 3.4.5-2)
> I checked the codes, and it is just a warning of the assignment:
> int *ptr <--- static int *ptr
> 

I'm using gcc 4.1.0

> But I met another problem, I use my testcase to stress test systemtap:
> 
> -bash-3.00# ./test.sh -f  lgl.cfg  -I tapsets/tapsets1/           
> The tapsets is tapsets/tapsets1/
> don't probe app : dbench
> TIMES : 1
> TIMES : 2
> probe app : dbench
> TIMES : 1
> TIMES : 2
> error opening file stpd_cpu0.
> ERROR: couldn't unlink percpu file stpd_cpu0: errcode = No such file or directory
> 
> Do you have any ideas of such errors? I never met it before.
> I raise the MAXDSKIPPED when running my testcases

No, I haven't seen that before either.
Comment 13 Li Guanglei 2006-03-02 06:16:55 UTC
> -bash-3.00# ./test.sh -f  lgl.cfg  -I tapsets/tapsets1/           
> The tapsets is tapsets/tapsets1/
> don't probe app : dbench
> TIMES : 1
> TIMES : 2
> probe app : dbench
> TIMES : 1
> TIMES : 2
> error opening file stpd_cpu0.
> ERROR: couldn't unlink percpu file stpd_cpu0: errcode = No such file or directory
> 
> Do you have any ideas of such errors? I never met it before.
> I raise the MAXDSKIPPED when running my testcases
It may be due to my testcase. I run stap in the background, and when the
benchmark tools finish running, I just do:
kill -s SIGINT -- stappid stpdpid
I should terminate stap & stpd in the right order. I think this is the cause.

I think this bug could be closed.

*** This bug has been marked as a duplicate of 2406 ***