Bug 3785 - mismatched kernel and debuginfo architecture causing system crashes
Status: RESOLVED FIXED
Alias: None
Product: systemtap
Classification: Unclassified
Component: kprobes
Version: unspecified
Importance: P2 critical
Target Milestone: ---
Assignee: Unassigned
Reported: 2006-12-22 16:46 UTC by William Cohen
Modified: 2007-01-09 15:59 UTC


Description William Cohen 2006-12-22 16:46:17 UTC
The systemtap.sample/poll_map.stp test built with the current systemtap
translator (20061222) dies on FC6 machines. The test is getting a NULL
dereference when running:

BUG: unable to handle kernel NULL pointer dere8
 printing eip:
c0460579
*pde = 00000000
Oops: 0000 [#1]
SMP
last sysfs file: /module/uhci_hcd/sections/.text
Modules linked in: stap_a428a285aca60807a64f09f4e76207fd_35809(U) autofs4 hidp d
CPU:    0
EIP:    0060:[<c0460579>]    Not tainted VLI
EFLAGS: 00210202   (2.6.18-1.2868.fc6 #1)

Commenting out the entire probe kernel.function("sys_*") avoids crashing the
machine. However, commenting out only the body of the probe
kernel.function("sys_*") still crashes it.
Comment 1 Frank Ch. Eigler 2006-12-31 04:15:08 UTC
I cannot reproduce this on i686 UP nor x86-64.
Please specify your platform.  Have you tried
gathering a kdump image?
Comment 2 William Cohen 2007-01-01 16:51:04 UTC
This is an IBM T41 Thinkpad with a Pentium M processor running FC6 updated via
yum. The machine has 512M of memory.

The problem is still triggered on the system with the latest checkout of systemtap:

BUG: unable to handle kernel NULL pointer dere8
 printing eip:
c0460579
*pde = 06579067
Oops: 0000 [#1]
SMP
last sysfs file: /module/uhci_hcd/sections/.text
Modules linked in: stap_a058fcd7029f5c9bb738cfca0ac4c0fc_35799(U) autofs4 hidp d
CPU:    0
EIP:    0060:[<c0460579>]    Not tainted VLI
EFLAGS: 00210202   (2.6.18-1.2868.fc6 #1)

I don't yet have a kdump from the problem.
Comment 3 Frank Ch. Eigler 2007-01-01 17:37:10 UTC
I added some text to http://sourceware.org/systemtap/wiki/HowToReportBugs to
help gather needed info.  See the bottom few points of 'System crashes'.
Comment 4 William Cohen 2007-01-01 19:39:45 UTC
I have thought about this a bit more, and I wonder whether elfutils-0.123 on
the FC6 machine might be causing the problem. This systemtap is built against
elfutils 0.124 and has its own shared library installed in a local directory.

Looking at the position of the EIP, it appears to be in sys_munlockall of this
kernel. However, a probe that instruments only that one function doesn't
crash. The probe is set at 0xc046054aUL, while the EIP is reported at c0460579:

c0460549 <sys_munlockall>:
c0460549:	53                   	push   %ebx
c046054a:	89 e0                	mov    %esp,%eax
c046054c:	25 00 f0 ff ff       	and    $0xfffff000,%eax
c0460551:	8b 00                	mov    (%eax),%eax
c0460553:	8b 80 84 00 00 00    	mov    0x84(%eax),%eax
c0460559:	83 c0 38             	add    $0x38,%eax
c046055c:	e8 e7 8e fd ff       	call   c0439448 <down_write>
c0460561:	31 c0                	xor    %eax,%eax
c0460563:	e8 65 fd ff ff       	call   c04602cd <do_mlockall>
c0460568:	89 c3                	mov    %eax,%ebx
c046056a:	89 e0                	mov    %esp,%eax
c046056c:	25 00 f0 ff ff       	and    $0xfffff000,%eax
c0460571:	8b 00                	mov    (%eax),%eax
c0460573:	8b 80 84 00 00 00    	mov    0x84(%eax),%eax
c0460579:	83 c0 38             	add    $0x38,%eax
c046057c:	e8 b1 8e fd ff       	call   c0439432 <up_write>
c0460581:	89 d8                	mov    %ebx,%eax
c0460583:	5b                   	pop    %ebx
c0460584:	c3                   	ret    
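
For reference, the two addresses quoted in this comment can be expressed as offsets from the start of sys_munlockall. This is only a small arithmetic sketch using the addresses from the listing above; the helper names are made up for illustration.

```python
# Addresses copied from the disassembly and oops text in this report.
FUNC_START = 0xC0460549  # first byte of sys_munlockall in the listing
PROBE_ADDR = 0xC046054A  # address where the probe was set
OOPS_EIP = 0xC0460579    # EIP printed in the oops


def offset_in_function(addr, func_start=FUNC_START):
    """Return the byte offset of addr from the start of the function."""
    return addr - func_start


def describe(addr):
    """Format addr in the usual symbol+offset style (hypothetical helper)."""
    return "sys_munlockall+0x%x" % offset_in_function(addr)


print(describe(PROBE_ADDR))
print(describe(OOPS_EIP))
```

So the probe sits one byte into the function (just after the push), and the reported EIP is 0x30 bytes in, according to this particular disassembly.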
Comment 5 William Cohen 2007-01-01 20:10:30 UTC
I get exactly the same crash using the stock systemtap rpm with
poll_map.stp. The machine has virtually no load on it when the example is run
(and crashes). This machine is a normal installation of Fedora Core 6 updated
with "yum update". What was the i686 configuration used to attempt to replicate
this problem? Below are details about the machine.

$ rpm -q systemtap gcc elfutils
systemtap-0.5.10-1.fc6
gcc-4.1.1-30
elfutils-0.123-1.fc6
$ uname -a
Linux montague.devel.redhat.com 2.6.18-1.2868.fc6 #1 SMP Fri Dec 15 17:31:29 EST 2006 i686 i686 i386 GNU/Linux
[wcohen@montague systemtap.samples]$ more /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 9
model name      : Intel(R) Pentium(R) M processor 1600MHz
stepping        : 5
cpu MHz         : 1594.855
cache size      : 1024 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr mce cx8 mtrr pge mca cmov pat clflush dts acpi mmx fxsr sse sse2 tm pbe up est tm2
bogomips        : 3190.87
$ free
             total       used       free     shared    buffers     cached
Mem:        514652     236688     277964          0      32188     150120
-/+ buffers/cache:      54380     460272
Swap:      1050800          0    1050800

$ stap -v -k /home/wcohen/systemtap_write/src/testsuite/systemtap.samples/poll_map.stp
Pass 1: parsed user script and 53 library script(s) in 330usr/10sys/391real ms.
Pass 2: analyzed script: 3 probe(s), 2 function(s), 2 global(s) in 300usr/30sys/1889real ms.
Pass 3: translated to C into "/tmp/stapvZHsjn/stap_2669.c" in 80usr/30sys/204real ms.
Pass 4: compiled C into "stap_2669.ko" in 2860usr/260sys/4843real ms.
Pass 5: starting run.

The same messages appear on the console:

BUG: unable to handle kernel NULL pointer dere8
 printing eip:
c0460579
*pde = 0e445067
Oops: 0000 [#1]
SMP
last sysfs file: /block/hda/removable
Modules linked in: stap_2669(U) autofs4 hidp rfcomm l2cap bluetooth sunrpc ip_cd
CPU:    0
EIP:    0060:[<c0460579>]    Not tainted VLI
EFLAGS: 00210202   (2.6.18-1.2868.fc6 #1)
Comment 6 William Cohen 2007-01-02 21:31:37 UTC
The stock kdump in FC6 doesn't work on the Pentium M because the processor
lacks PAE support, which the FC6 kernel-kdump requires. :(

I suspect the problem is related to the Pentium M or the laptop environment.

Comment 7 William Cohen 2007-01-05 15:46:10 UTC
Narrowed the problem down: it works with the FC5 kernel but fails with the FC6
kernel.

Looking through the archive of test results, the last successful run of the
poll_map.stp test was

stap_testing_200611070930/obj/testsuite/systemtap.log:PASS: poll_map (1)

The test source from that run is the same as the one that is currently crashing.
I tried the stap translator built at that time; it crashes in the same way.

/home/wcohen/stap_testing_200611070930/install/bin/stap -v
/home/wcohen/stap_testing_200611070930/src/testsuite/systemtap.samples/poll_map.stp 

The test stopped working when the machine was switched from FC5 to FC6. With
the FC5 kernel, 2.6.18-1.2200.fc5, running on the machine, things work. It
appears to be an issue with the kernel.

2.6.18-1.2200.fc5 i686 kernel works, but the 2.6.18-1.2869.fc6 i686 kernel fails.



Comment 8 William Cohen 2007-01-05 20:42:36 UTC
I built a kernel locally with the same configuration as the stock kernel, using
the source from kernel-2.6.18-1.2869.fc6.src.rpm. The test ran fine with the
locally built kernel. The locations of functions differ between the two
System.map files.
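
A rough way to list which functions moved between two System.map files is sketched below. It assumes the usual "address type name" layout of System.map lines; the function names are illustrative, and the second address in the sample data is invented purely to show the output shape.

```python
def parse_system_map(text):
    """Map symbol name -> address from System.map text ("addr type name" per line).

    If a name occurs more than once (e.g. static functions), the last
    occurrence wins; good enough for a quick comparison.
    """
    symbols = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        addr, _sym_type, name = parts[0], parts[1], parts[2]
        symbols[name] = int(addr, 16)
    return symbols


def moved_symbols(map_a, map_b):
    """Return {name: (addr_a, addr_b)} for symbols present in both maps
    whose addresses differ."""
    a = parse_system_map(map_a)
    b = parse_system_map(map_b)
    return {n: (a[n], b[n]) for n in a.keys() & b.keys() if a[n] != b[n]}


# Sample data: the first map uses addresses from this report; the changed
# address in the second map is hypothetical.
STOCK = "c0439448 T down_write\nc0460549 T sys_munlockall\n"
LOCAL = "c0439448 T down_write\nc0460601 T sys_munlockall\n"

for name, (old, new) in sorted(moved_symbols(STOCK, LOCAL).items()):
    print("%s: %08x -> %08x" % (name, old, new))
```

On real files, the two arguments would be the contents of the stock and locally built System.map files.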
Comment 9 William Cohen 2007-01-09 15:13:58 UTC
This is caused by FC6 installing an i586 kernel and kernel-devel on the laptop
rather than the i686 ones, while the i686 debuginfo was installed.
Systemtap checks the debuginfo version but not the architecture. The
mismatch causes some probes to be placed at incorrect addresses. Whether the
architecture information is consistent can be checked with

 rpm -qa --queryformat "%{NAME}-%{VERSION}-%{RELEASE}.%{ARCH}\n" | grep kernel

The kernel-debuginfo, kernel, and kernel-devel should be for the same arch.
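
The comparison can also be automated over the output of the rpm query above. The following is a minimal sketch, not part of systemtap; the function names are made up, and the sample lines are illustrative (the version comes from this report, the arches follow the mismatch described here).

```python
from collections import defaultdict

# Packages that need to agree on architecture for a given kernel build.
KERNEL_PACKAGES = {"kernel", "kernel-devel", "kernel-debuginfo"}


def parse_nvra(line):
    """Split NAME-VERSION-RELEASE.ARCH into its four fields."""
    nvr, arch = line.strip().rsplit(".", 1)
    name, version, release = nvr.rsplit("-", 2)
    return name, version, release, arch


def arch_mismatches(rpm_output):
    """Return {(version, release): {package: arch}} for kernel builds whose
    installed packages disagree on architecture."""
    builds = defaultdict(dict)
    for line in rpm_output.splitlines():
        try:
            name, version, release, arch = parse_nvra(line)
        except ValueError:
            continue  # not in NAME-VERSION-RELEASE.ARCH form
        if name in KERNEL_PACKAGES:
            builds[(version, release)][name] = arch
    return {b: p for b, p in builds.items() if len(set(p.values())) > 1}


# Illustrative input in the format produced by the query above.
SAMPLE = """\
kernel-2.6.18-1.2868.fc6.i586
kernel-devel-2.6.18-1.2868.fc6.i586
kernel-debuginfo-2.6.18-1.2868.fc6.i686
kernel-headers-2.6.18-1.2868.fc6.i386
"""

for build, pkgs in arch_mismatches(SAMPLE).items():
    print("arch mismatch for %s-%s: %s" % (build[0], build[1], pkgs))
```

An empty result means every kernel build found has kernel, kernel-devel, and kernel-debuginfo packages of a single architecture (for whichever of those are installed).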

It appears that anaconda installed the incorrect architecture kernel on the machine.

https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=211941

Comment 10 Frank Ch. Eigler 2007-01-09 15:59:04 UTC
It is unfortunate that there appears to be no way, based on the ELF files only,
to tell whether they come from the i586 vs. i686 builds.  If it were possible,
then the recently added checks in tapsets.cxx (query_module) could look for it.
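
To illustrate the limitation: the architecture field in the ELF header, e_machine, is EM_386 (value 3) for any 32-bit x86 object, so it would read the same for i586 and i686 builds. The sketch below only reads that field; it is not the tapsets.cxx check, and the header bytes are synthetic.

```python
import struct

EM_386 = 3  # ELF e_machine value for 32-bit x86


def elf_machine(header):
    """Return the e_machine field from the first bytes of an ELF file."""
    if len(header) < 20 or header[:4] != b"\x7fELF":
        raise ValueError("not an ELF header")
    # e_ident[5] (EI_DATA): 1 = little-endian, 2 = big-endian
    endian = "<" if header[5] == 1 else ">"
    # e_type and e_machine are 16-bit fields right after the 16-byte e_ident
    _e_type, e_machine = struct.unpack_from(endian + "HH", header, 16)
    return e_machine


# Synthetic header: magic, 32-bit class, little-endian, version 1, padding,
# then e_type = 1 (relocatable) and e_machine = 3 (EM_386).
SYNTHETIC = b"\x7fELF" + bytes([1, 1, 1]) + bytes(9) + struct.pack("<HH", 1, EM_386)

print(elf_machine(SYNTHETIC) == EM_386)
```

For a real file, the first 20 bytes read in binary mode would be passed in place of the synthetic header.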