3452 – stap crash ppc64

Bug 3452 - stap crash ppc64

Summary: stap crash ppc64

Status:	RESOLVED FIXED

Alias:	None

Product:	systemtap
Classification:	Unclassified
Component:	runtime (show other bugs)
Version:	unspecified

Importance:	P1 critical
Target Milestone:	---
Assignee:	Martin Hunt

URL:
Keywords:

Depends on:
Blocks:

Reported:	2006-11-03 09:16 UTC by Li Guanglei
Modified:	2006-11-09 18:30 UTC (History)
CC List:	0 users

See Also:
Host:
Target:
Build:
Last reconfirmed:

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Li Guanglei 2006-11-03 09:16:45 UTC

I tried the latest CVS, and found it will crash my ppc64 box on both
2.6.17/2.6.18 kernel immediately after Pass 5. This should be introduced during
recent checkin since I had no problems played with stap only one or two days ago.

Here is the xmon message:
7:mon> e
cpu 0x7: Vector: 300 (Data Access) at [c0000000ea05b8e0]
    pc: c000000000032018: .strcmp+0x10/0x24
    lr: d00000000011aefc: ._stp_proc_write_cmd+0x908/0x9d0
[stap_a11ddcd2df661cc3454d0dcaf7e4fa7c_424]
    sp: c0000000ea05bb60
   msr: 8000000000009032
   dar: cfffffff0015700a
 dsisr: 40010000
  current = 0xc0000000e869a000
  paca    = 0xc00000000056f200
    pid   = 15260, comm = staprun
7:mon> t
[link register   ] d00000000011aefc ._stp_proc_write_cmd+0x908/0x9d0
[stap_a11ddcd2df661cc3454d0dcaf7e4fa7c_424]
[c0000000ea05bb60] d00000000011ae48 ._stp_proc_write_cmd+0x854/0x9d0
[stap_a11ddcd2df661cc3454d0dcaf7e4fa7c_424] (unreliable)
[c0000000ea05bcf0] c0000000000c1768 .vfs_write+0xe0/0x1b0
[c0000000ea05bd90] c0000000000c1910 .sys_write+0x4c/0x8c
[c0000000ea05be30] c00000000000861c syscall_exit+0x0/0x40
--- Exception: c01 (System Call) at 000000000fc3c9d4
SP (ff8ba310) is in userspace
7:mon> r
R00 = 0000000000003a19   R16 = 0000000008028c78
R01 = c0000000ea05bb60   R17 = 0000000008028cb0
R02 = c00000000073f5a8   R18 = 0000000000000000
R03 = 000000000000005f   R19 = 0000000000000000
R04 = cfffffff00157009   R20 = 00000000f7f90008
R05 = d00000000011cf60   R21 = 00000000f7e8f00c
R06 = 7461727400000000   R22 = 000000001001bcd0
R07 = d000000000134b40   R23 = 0000000000003a1a
R08 = 0000000000003a1a   R24 = 00000000f7f9000c
R09 = d000000000124a28   R25 = d000000000134b40
R10 = d0000000000fe1a0   R26 = d00000000011cf60
R11 = c000000000032008   R27 = 0000000000041df5
R12 = d00000000011be40   R28 = c0000001e78dde80
R13 = c00000000056f200   R29 = 0000000000003a19
R14 = 0000000000000001   R30 = d00000000012bd00
R15 = 0000000000000000   R31 = d0000000000c4000
pc  = c000000000032018 .strcmp+0x10/0x24
lr  = d00000000011aefc ._stp_proc_write_cmd+0x908/0x9d0
[stap_a11ddcd2df661cc3454d0dcaf7e4fa7c_424]
msr = 8000000000009032   cr  = 44000444
ctr = c000000000032008   xer = 0000000020000020   trap =  300
dar = cfffffff0015700a   dsisr = 40010000
7:mon>

Comment 1 Li Guanglei 2006-11-03 13:26:21 UTC

I changed the subject. staprun shouldn't cause the system crash.

I tried the latest CVS on i386/2.6.18.1 and x86_64/2.6.18.1, and they all worked
well. So it seems that this is a ppc64 specific error.

Comment 2 Martin Hunt 2006-11-03 14:50:02 UTC

This is likely due to something I did. I have no idea why it would not work on
ppc64 but I'll try to find a test machine and see.

Comment 3 Martin Hunt 2006-11-03 15:09:50 UTC

It would be helpful if you could send me the output of 
"stap -DDEBUG foo.stp"
where foo.stp is any script. The output I am interested in will be in
/var/log/messages.

Comment 4 Li Guanglei 2006-11-04 01:02:43 UTC

(In reply to comment #3)
> It would be helpful if you could send me the output of 
> "stap -DDEBUG foo.stp"
> where foo.stp is any script. The output I am interested in will be in
> /var/log/messages.
> 

This the /var/log/message I got on 2.6.18.1:

Nov  4 08:45:34 ltc7 sshd(pam_unix)[11725]: session opened for user root by
root(uid=0)
Nov  4 08:46:25 ltc7 kernel: _stp_transport_init:303 transport_init from 11939 11940
Nov  4 08:46:25 ltc7 kernel: _stp_proc_write_cmd:111 STP_TRANSPORT_INFO 24 24
Nov  4 08:46:25 ltc7 kernel: _stp_transport_open:258 stp_transport_open: 0 Mb
buffer. target=0
Nov  4 08:46:25 ltc7 kernel: _stp_transport_open:263 transport_mode=1
Nov  4 08:46:25 ltc7 kernel: _stp_handle_start:102 stp_handle_start pid=11939
Nov  4 08:49:20 ltc7 syslogd 1.4.1: restart.

And xmon caught this error message when kernel crash:

ltc7.cn.ibm.com login: _stp_transport_init:303 transport_init from 15783 15784
_stp_proc_write_cmd:111 STP_TRANSPORT_INFO 24 24
_stp_transport_open:258 stp_transport_open: 0 Mb buffer. target=0
_stp_transport_open:263 transport_mode=1
_stp_handle_start:102 stp_handle_start pid=15783
_stp_do_symbols:181 num=14874 datasize=269813
_stp_do_symbols:193 got stap_symbols, count=118992
_stp_do_symbols:199 got symbol data, count=269813
Unable to handle kernel paging request for data at address 0xcfffffff0015700a
Faulting instruction address: 0xc000000000032018
cpu 0x3: Vector: 300 (Data Access) at [c0000000debdf8e0]
    pc: c000000000032018: .strcmp+0x10/0x24
    lr: d00000000011ae84: ._stp_proc_write_cmd+0xa3c/0xb54 [stap_2a1ed9956a52372
                                                                           
28fe996452ccd248b_1067]
    sp: c0000000debdfb60
   msr: 8000000000009032
   dar: cfffffff0015700a
 dsisr: 40010000
  current = 0xc0000000e31c1000
  paca    = 0xc00000000056ea00
    pid   = 15783, comm = staprun
enter ? for help

It looks like the memory pointed by the second argument of strcmp is invalid,
but I am not sure who calls strcmp.

Comment 5 Li Guanglei 2006-11-06 06:55:11 UTC

I tried another stp script:
  probe begin {}
and it will still crash my ppc64.

Further debugging showed the following call path will cause the crash:

send_data @ runtime/stpd/symbols.c:216
 --> write(_stp_proc_write_cmd) @ runtime/stpd/symbols.c:27
  --> _stp_do_symbols @ runtime/transport/procfs.c:100
   --> _stp_kallsyms_lookup_name @ runtime/transport/symbols.c:207
    --> strcmp @ runtime/sym.c:67

So the invalid memory access to s->symbol by strcmp caused this crash.
I added ab access_ok to check the validity of s->symbol:

RCS file: /cvs/systemtap/src/runtime/sym.c,v
retrieving revision 1.12
diff -u -r1.12 sym.c
--- runtime/sym.c       2 Nov 2006 18:37:00 -0000       1.12
+++ runtime/sym.c       6 Nov 2006 06:40:14 -0000
@@ -64,6 +64,11 @@
        unsigned num = _stp_modules[0]->num_symbols;

        while (num--) {
+               if(!access_ok(VERIFY_READ, s->symbol, 1))
+               {
+                       kbug("Access error for s->symbol:%p\n", s->symbol);
+                       continue;
+               }
                if (strcmp(name, s->symbol) == 0)
                        return s->addr;
                s++;

and it will print out a lot of error message like:
_stp_kallsyms_lookup_name:69 Access error for s->symbol:cfffffff000f700a

Comment 6 Li Guanglei 2006-11-07 05:44:17 UTC

I think the wrong setting of _stp_modules[0]->symbols caused the crash. I tried
kdump to debug this problem, but kdump can't even be triggered when kernel crashed.

Comment 7 Martin Hunt 2006-11-07 19:40:01 UTC

The problem is that stpd is compiled in 32-bit mode and is passing addresses to
the kernel, which thinks addresses are 64-bits.

Compiling staprun with -m64 works for me.

I'm working on a better solution where situations like this are handled
automatically between staprun and the kernel.  Until then, please try rebuilding
staprun with -m64.

Comment 8 Martin Hunt 2006-11-09 18:30:24 UTC

Fixed in CVS.