The systemtap.sample/poll_map.stp test built with the current systemtap translator (20061222) dies on FC6 machines. The test is getting a NULL dereference when running:

BUG: unable to handle kernel NULL pointer dere8
printing eip:
c0460579
*pde = 00000000
Oops: 0000 [#1]
SMP
last sysfs file: /module/uhci_hcd/sections/.text
Modules linked in: stap_a428a285aca60807a64f09f4e76207fd_35809(U) autofs4 hidp d
CPU: 0
EIP: 0060:[<c0460579>] Not tainted VLI
EFLAGS: 00210202 (2.6.18-1.2868.fc6 #1)

Commenting out the entire probe kernel.function( "sys_*" ) avoids crashing the machine. However, just commenting out the body of the probe kernel.function( "sys_*" ) still crashes.
I cannot reproduce this on i686 UP nor x86-64. Please specify your platform. Have you tried gathering a kdump image?
This is an IBM T41 Thinkpad with a Pentium M processor running FC6 updated via yum. The machine has 512M of memory. The problem is still triggered on the system with the latest checkout of systemtap:

BUG: unable to handle kernel NULL pointer dere8
printing eip:
c0460579
*pde = 06579067
Oops: 0000 [#1]
SMP
last sysfs file: /module/uhci_hcd/sections/.text
Modules linked in: stap_a058fcd7029f5c9bb738cfca0ac4c0fc_35799(U) autofs4 hidp d
CPU: 0
EIP: 0060:[<c0460579>] Not tainted VLI
EFLAGS: 00210202 (2.6.18-1.2868.fc6 #1)

I don't yet have a kdump from the problem.
I added some text to http://sourceware.org/systemtap/wiki/HowToReportBugs to help gather needed info. See the bottom few points of 'System crashes'.
I have thought about this a bit more and I am wondering if the elfutils-0.123 on the fc6 machine might be causing the problem. The systemtap is built using elfutils 0.124 and has its own shared library installed in a local directory.

Looking at the position of the EIP, it looks like it is in sys_munlockall of this kernel. However, having a probe that only instruments that one function doesn't crash. The probe is set at 0xc046054aUL; the EIP is reported at c0460579:

c0460549 <sys_munlockall>:
c0460549:  53                    push   %ebx
c046054a:  89 e0                 mov    %esp,%eax
c046054c:  25 00 f0 ff ff        and    $0xfffff000,%eax
c0460551:  8b 00                 mov    (%eax),%eax
c0460553:  8b 80 84 00 00 00     mov    0x84(%eax),%eax
c0460559:  83 c0 38              add    $0x38,%eax
c046055c:  e8 e7 8e fd ff        call   c0439448 <down_write>
c0460561:  31 c0                 xor    %eax,%eax
c0460563:  e8 65 fd ff ff        call   c04602cd <do_mlockall>
c0460568:  89 c3                 mov    %eax,%ebx
c046056a:  89 e0                 mov    %esp,%eax
c046056c:  25 00 f0 ff ff        and    $0xfffff000,%eax
c0460571:  8b 00                 mov    (%eax),%eax
c0460573:  8b 80 84 00 00 00     mov    0x84(%eax),%eax
c0460579:  83 c0 38              add    $0x38,%eax
c046057c:  e8 b1 8e fd ff        call   c0439432 <up_write>
c0460581:  89 d8                 mov    %ebx,%eax
c0460583:  5b                    pop    %ebx
c0460584:  c3                    ret
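For anyone repeating the EIP lookup by hand, here is a minimal sketch of resolving an address to symbol+offset from System.map-style text. The helper names (parse_system_map, resolve) are made up for illustration, and SAMPLE_MAP is a hypothetical excerpt that reuses only the four addresses visible in the disassembly; the "T" type letters are an assumption, and a real System.map has many more symbols in between.

```python
import bisect

# Hypothetical excerpt: addresses come from the disassembly in this report,
# the "T" symbol types are assumed.
SAMPLE_MAP = """\
c0439432 T up_write
c0439448 T down_write
c04602cd T do_mlockall
c0460549 T sys_munlockall
"""

def parse_system_map(text):
    """Return a sorted list of (address, symbol) pairs from System.map text."""
    entries = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        addr, _kind, name = parts
        entries.append((int(addr, 16), name))
    entries.sort()
    return entries

def resolve(entries, address):
    """Return (symbol, offset) for the nearest symbol at or below address."""
    addrs = [a for a, _ in entries]
    i = bisect.bisect_right(addrs, address) - 1
    if i < 0:
        return None  # address lies below the first known symbol
    base, name = entries[i]
    return name, address - base

entries = parse_system_map(SAMPLE_MAP)
eip = resolve(entries, 0xC0460579)    # faulting EIP from the oops
probe = resolve(entries, 0xC046054A)  # probe address chosen by stap
print("EIP:   %s+0x%x" % eip)
print("probe: %s+0x%x" % probe)
```

With this excerpt the faulting EIP resolves to sys_munlockall+0x30 and the probe to sys_munlockall+0x1, but only if the map really belongs to the running kernel.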
I get exactly the same crash using the stock systemtap rpm with poll_map.stp. The machine has virtually no load on it when the example is run (and crashes). This machine is a normal installation of Fedora Core 6 updated with "yum update". What was the i686 configuration used to attempt to replicate this problem? Below are details about the machine.

$ rpm -q systemtap gcc elfutils
systemtap-0.5.10-1.fc6
gcc-4.1.1-30
elfutils-0.123-1.fc6

$ uname -a
Linux montague.devel.redhat.com 2.6.18-1.2868.fc6 #1 SMP Fri Dec 15 17:31:29 EST 2006 i686 i686 i386 GNU/Linux

[wcohen@montague systemtap.samples]$ more /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 9
model name      : Intel(R) Pentium(R) M processor 1600MHz
stepping        : 5
cpu MHz         : 1594.855
cache size      : 1024 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr mce cx8 mtrr pge mca cmov pat clflush d ts acpi mmx fxsr sse sse2 tm pbe up est tm2
bogomips        : 3190.87

$ free
             total       used       free     shared    buffers     cached
Mem:        514652     236688     277964          0      32188     150120
-/+ buffers/cache:      54380     460272
Swap:      1050800          0    1050800

$ stap -v -k /home/wcohen/systemtap_write/src/testsuite/systemtap.samples/poll_map.stp
Pass 1: parsed user script and 53 library script(s) in 330usr/10sys/391real ms.
Pass 2: analyzed script: 3 probe(s), 2 function(s), 2 global(s) in 300usr/30sys/1889real ms.
Pass 3: translated to C into "/tmp/stapvZHsjn/stap_2669.c" in 80usr/30sys/204real ms.
Pass 4: compiled C into "stap_2669.ko" in 2860usr/260sys/4843real ms.
Pass 5: starting run.

The same message appears on the console:

BUG: unable to handle kernel NULL pointer dere8
printing eip:
c0460579
*pde = 0e445067
Oops: 0000 [#1]
SMP
last sysfs file: /block/hda/removable
Modules linked in: stap_2669(U) autofs4 hidp rfcomm l2cap bluetooth sunrpc ip_cd
CPU: 0
EIP: 0060:[<c0460579>] Not tainted VLI
EFLAGS: 00210202 (2.6.18-1.2868.fc6 #1)
The stock kdump in FC6 doesn't work on the Pentium M because the processor lacks PAE support, which the FC6 kernel-kdump requires. :( I suspect the problem is related to the Pentium M or the laptop environment.
Narrowed the problem down: the test works with the FC5 kernel but fails with the FC6 kernel.

Looking through the archive of test results, the last successful run of the poll_map.stp test was:

stap_testing_200611070930/obj/testsuite/systemtap.log:PASS: poll_map (1)

The source code for this is the same as the one that is currently crashing. I attempted to use the stap translator built at that time. It crashes in the same way:

/home/wcohen/stap_testing_200611070930/install/bin/stap -v /home/wcohen/stap_testing_200611070930/src/testsuite/systemtap.samples/poll_map.stp

The test stopped working when the machine was switched from FC5 to FC6. Trying the FC5 kernel, 2.6.18-1.2200.fc5, on the machine, things work. It appears to be an issue with the kernel: the 2.6.18-1.2200.fc5 i686 kernel works, but the 2.6.18-1.2869.fc6 i686 kernel fails.
Built a kernel locally with the same configuration as the stock kernel from the source code from kernel-2.6.18-1.2869.fc6.src.rpm. The test ran fine with the locally built kernel. There are differences in the locations of functions in the System.map files.
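A minimal sketch of how the System.map comparison could be automated, listing symbols whose addresses differ between two maps. The function names are hypothetical, and the sample text is illustrative only: the stock sys_munlockall address is the one from the disassembly in this report, while the "local" addresses are invented for the example.

```python
def load_symbols(text):
    """Map symbol name -> address from System.map-style text."""
    symbols = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        addr, _kind, name = parts
        # Keep the first occurrence if a name appears more than once.
        symbols.setdefault(name, int(addr, 16))
    return symbols

def moved_symbols(map_a, map_b):
    """Return {name: (addr_a, addr_b)} for symbols present in both maps
    whose addresses differ."""
    a = load_symbols(map_a)
    b = load_symbols(map_b)
    return {
        name: (a[name], b[name])
        for name in a.keys() & b.keys()
        if a[name] != b[name]
    }

# Illustrative data only; the local-build address for sys_munlockall is made up.
STOCK = "c04602cd T do_mlockall\nc0460549 T sys_munlockall\n"
LOCAL = "c04602cd T do_mlockall\nc0460561 T sys_munlockall\n"

for name, (old, new) in sorted(moved_symbols(STOCK, LOCAL).items()):
    print("%s: %08x -> %08x" % (name, old, new))
```

In practice the two inputs would be read from the System.map files of the stock and locally built kernels.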
This is caused by FC6 installing an i586 kernel and kernel-devel rather than an i686 kernel and kernel-devel on the laptop, while the i686 debuginfo was installed. Systemtap checks the debuginfo version but not the architecture. The mismatch causes some probes to be placed at the wrong addresses.

You can check that the architecture information is reasonable with:

rpm -qa --queryformat "%{NAME}-%{VERSION}-%{RELEASE}.%{ARCH}\n" | grep kernel

The kernel-debuginfo, kernel, and kernel-devel packages should be for the same arch. It appears that anaconda installed the incorrect architecture kernel on the machine:

https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=211941
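The architecture check can also be scripted. The following is a minimal sketch that parses output in the NAME-VERSION-RELEASE.ARCH format produced by the rpm query and flags any kernel build whose kernel, kernel-devel, and kernel-debuginfo packages disagree on architecture. The function names and the sample package list are hypothetical, chosen to mirror the mismatch described here; this is not something systemtap itself does.

```python
from collections import defaultdict

KERNEL_PACKAGES = {"kernel", "kernel-devel", "kernel-debuginfo"}

def split_package(line):
    """Split 'NAME-VERSION-RELEASE.ARCH' into its four fields."""
    nvr, arch = line.strip().rsplit(".", 1)
    name, version, release = nvr.rsplit("-", 2)
    return name, version, release, arch

def arch_mismatches(rpm_output):
    """Return {(version, release): {package: arch}} for kernel builds whose
    packages were installed for more than one architecture."""
    by_build = defaultdict(dict)
    for line in rpm_output.splitlines():
        try:
            name, version, release, arch = split_package(line)
        except ValueError:
            continue  # line is not in NAME-VERSION-RELEASE.ARCH form
        if name in KERNEL_PACKAGES:
            by_build[(version, release)][name] = arch
    return {
        build: pkgs
        for build, pkgs in by_build.items()
        if len(set(pkgs.values())) > 1
    }

# Hypothetical package list reproducing the i586/i686 mix-up.
SAMPLE = """\
kernel-2.6.18-1.2869.fc6.i586
kernel-devel-2.6.18-1.2869.fc6.i586
kernel-debuginfo-2.6.18-1.2869.fc6.i686
"""

for (version, release), pkgs in arch_mismatches(SAMPLE).items():
    print("arch mismatch for %s-%s: %s" % (version, release, pkgs))
```

To use it on a real system, feed it the text printed by the rpm command; an empty result means the three packages agree for every installed kernel build.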
It is unfortunate that there appears to be no way, based on the ELF files only, to tell whether they come from the i586 vs. i686 builds. If it were possible, then the recently added checks in tapsets.cxx (query_module) could look for it.