I tried the latest CVS and found that it crashes my ppc64 box on both the 2.6.17 and 2.6.18 kernels immediately after Pass 5. This was probably introduced by a recent check-in, since I had no problems running stap just one or two days ago. Here is the xmon output:

7:mon> e
cpu 0x7: Vector: 300 (Data Access) at [c0000000ea05b8e0]
    pc: c000000000032018: .strcmp+0x10/0x24
    lr: d00000000011aefc: ._stp_proc_write_cmd+0x908/0x9d0 [stap_a11ddcd2df661cc3454d0dcaf7e4fa7c_424]
    sp: c0000000ea05bb60
   msr: 8000000000009032
   dar: cfffffff0015700a
 dsisr: 40010000
  current = 0xc0000000e869a000
  paca    = 0xc00000000056f200
    pid   = 15260, comm = staprun
7:mon> t
[link register   ] d00000000011aefc ._stp_proc_write_cmd+0x908/0x9d0 [stap_a11ddcd2df661cc3454d0dcaf7e4fa7c_424]
[c0000000ea05bb60] d00000000011ae48 ._stp_proc_write_cmd+0x854/0x9d0 [stap_a11ddcd2df661cc3454d0dcaf7e4fa7c_424] (unreliable)
[c0000000ea05bcf0] c0000000000c1768 .vfs_write+0xe0/0x1b0
[c0000000ea05bd90] c0000000000c1910 .sys_write+0x4c/0x8c
[c0000000ea05be30] c00000000000861c syscall_exit+0x0/0x40
--- Exception: c01 (System Call) at 000000000fc3c9d4
SP (ff8ba310) is in userspace
7:mon> r
R00 = 0000000000003a19   R16 = 0000000008028c78
R01 = c0000000ea05bb60   R17 = 0000000008028cb0
R02 = c00000000073f5a8   R18 = 0000000000000000
R03 = 000000000000005f   R19 = 0000000000000000
R04 = cfffffff00157009   R20 = 00000000f7f90008
R05 = d00000000011cf60   R21 = 00000000f7e8f00c
R06 = 7461727400000000   R22 = 000000001001bcd0
R07 = d000000000134b40   R23 = 0000000000003a1a
R08 = 0000000000003a1a   R24 = 00000000f7f9000c
R09 = d000000000124a28   R25 = d000000000134b40
R10 = d0000000000fe1a0   R26 = d00000000011cf60
R11 = c000000000032008   R27 = 0000000000041df5
R12 = d00000000011be40   R28 = c0000001e78dde80
R13 = c00000000056f200   R29 = 0000000000003a19
R14 = 0000000000000001   R30 = d00000000012bd00
R15 = 0000000000000000   R31 = d0000000000c4000
pc  = c000000000032018 .strcmp+0x10/0x24
lr  = d00000000011aefc ._stp_proc_write_cmd+0x908/0x9d0 [stap_a11ddcd2df661cc3454d0dcaf7e4fa7c_424]
msr = 8000000000009032   cr  = 44000444
ctr = c000000000032008   xer = 0000000020000020   trap = 300
dar = cfffffff0015700a   dsisr = 40010000
7:mon>
I changed the subject. staprun shouldn't crash the system. I tried the latest CVS on i386/2.6.18.1 and x86_64/2.6.18.1, and both worked well. So this seems to be a ppc64-specific error.
This is likely due to something I did. I have no idea why it would not work on ppc64 but I'll try to find a test machine and see.
It would be helpful if you could send me the output of "stap -DDEBUG foo.stp" where foo.stp is any script. The output I am interested in will be in /var/log/messages.
(In reply to comment #3)
> It would be helpful if you could send me the output of
> "stap -DDEBUG foo.stp"
> where foo.stp is any script. The output I am interested in will be in
> /var/log/messages.
>

This is the /var/log/messages output I got on 2.6.18.1:

Nov 4 08:45:34 ltc7 sshd(pam_unix)[11725]: session opened for user root by root(uid=0)
Nov 4 08:46:25 ltc7 kernel: _stp_transport_init:303 transport_init from 11939 11940
Nov 4 08:46:25 ltc7 kernel: _stp_proc_write_cmd:111 STP_TRANSPORT_INFO 24 24
Nov 4 08:46:25 ltc7 kernel: _stp_transport_open:258 stp_transport_open: 0 Mb buffer. target=0
Nov 4 08:46:25 ltc7 kernel: _stp_transport_open:263 transport_mode=1
Nov 4 08:46:25 ltc7 kernel: _stp_handle_start:102 stp_handle_start pid=11939
Nov 4 08:49:20 ltc7 syslogd 1.4.1: restart.

And xmon caught these messages when the kernel crashed:

ltc7.cn.ibm.com login: _stp_transport_init:303 transport_init from 15783 15784
_stp_proc_write_cmd:111 STP_TRANSPORT_INFO 24 24
_stp_transport_open:258 stp_transport_open: 0 Mb buffer. target=0
_stp_transport_open:263 transport_mode=1
_stp_handle_start:102 stp_handle_start pid=15783
_stp_do_symbols:181 num=14874 datasize=269813
_stp_do_symbols:193 got stap_symbols, count=118992
_stp_do_symbols:199 got symbol data, count=269813
Unable to handle kernel paging request for data at address 0xcfffffff0015700a
Faulting instruction address: 0xc000000000032018
cpu 0x3: Vector: 300 (Data Access) at [c0000000debdf8e0]
    pc: c000000000032018: .strcmp+0x10/0x24
    lr: d00000000011ae84: ._stp_proc_write_cmd+0xa3c/0xb54 [stap_2a1ed9956a5237228fe996452ccd248b_1067]
    sp: c0000000debdfb60
   msr: 8000000000009032
   dar: cfffffff0015700a
 dsisr: 40010000
  current = 0xc0000000e31c1000
  paca    = 0xc00000000056ea00
    pid   = 15783, comm = staprun
enter ? for help

It looks like the memory pointed to by the second argument of strcmp is invalid, but I am not sure what calls strcmp.
I tried another stp script:

    probe begin {}

and it still crashes my ppc64 box. Further debugging showed that the following call path causes the crash:

send_data @ runtime/stpd/symbols.c:216
  --> write(_stp_proc_write_cmd) @ runtime/stpd/symbols.c:27
  --> _stp_do_symbols @ runtime/transport/procfs.c:100
  --> _stp_kallsyms_lookup_name @ runtime/transport/symbols.c:207
  --> strcmp @ runtime/sym.c:67

So the invalid memory access to s->symbol by strcmp caused this crash. I added an access_ok call to check the validity of s->symbol:

RCS file: /cvs/systemtap/src/runtime/sym.c,v
retrieving revision 1.12
diff -u -r1.12 sym.c
--- runtime/sym.c	2 Nov 2006 18:37:00 -0000	1.12
+++ runtime/sym.c	6 Nov 2006 06:40:14 -0000
@@ -64,6 +64,11 @@
 	unsigned num = _stp_modules[0]->num_symbols;
 	while (num--) {
+		if(!access_ok(VERIFY_READ, s->symbol, 1))
+		{
+			kbug("Access error for s->symbol:%p\n", s->symbol);
+			continue;
+		}
 		if (strcmp(name, s->symbol) == 0)
 			return s->addr;
 		s++;

With this check in place, it prints a lot of error messages like:

_stp_kallsyms_lookup_name:69 Access error for s->symbol:cfffffff000f700a
I think an incorrect value in _stp_modules[0]->symbols caused the crash. I tried to use kdump to debug this problem, but kdump is not even triggered when the kernel crashes.
The problem is that stpd is compiled in 32-bit mode and passes addresses to the kernel, which expects 64-bit addresses. Compiling staprun with -m64 works for me. I'm working on a better solution where situations like this are handled automatically between staprun and the kernel. Until then, please try rebuilding staprun with -m64.
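
For anyone hitting something similar, here is a minimal sketch of how a 32-bit/64-bit mismatch like this can produce bogus pointers. The struct names and fields (sym32, sym64) and both helper functions are hypothetical and only illustrate the idea; they are not copied from the SystemTap runtime. Fixed-width types stand in for the sizes that "unsigned long" and "char *" would have under each ABI.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record layouts, for illustration only (assumed, not
 * taken from the SystemTap sources). */
struct sym32 {            /* as built by a 32-bit userspace program */
    uint32_t addr;
    uint32_t symbol;      /* 4-byte pointer or offset */
};

struct sym64 {            /* as read by a 64-bit kernel */
    uint64_t addr;
    uint64_t symbol;      /* 8-byte pointer */
};

/* Interpret a buffer that was filled with sym32 records as if it held
 * sym64 records, and return the "symbol" field of record idx.
 * The caller must supply at least (idx + 1) * sizeof(struct sym64) bytes. */
uint64_t symbol_as_seen_by_64bit(const void *buf, size_t idx)
{
    struct sym64 rec;
    memcpy(&rec, (const unsigned char *)buf + idx * sizeof rec, sizeof rec);
    return rec.symbol;
}

/* Average bytes per record for a transfer, e.g. computed from the
 * counts printed in a debug log. Returns 0 if there are no records. */
size_t bytes_per_record(size_t total_bytes, size_t num_records)
{
    return num_records ? total_bytes / num_records : 0;
}
```

In this sketch the 64-bit reader's "symbol" field for record 0 is really the whole of the writer's record 1, so the resulting value is not a valid string pointer. The debug log in this report shows count=118992 for num=14874, which is 8 bytes per record and would be consistent with a 32-bit layout like sym32 rather than a 16-byte one like sym64. One common way to avoid this class of problem is to use fixed-width fields in anything exchanged between userspace and the kernel, so both sides agree on the layout regardless of how each is compiled.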
Fixed in CVS.