This is the mail archive of the mailing list for the GDB project.

Linux kernel problem -- food for thought

GDB currently has a 'little problem' backtracing out of system calls
on x86 kernels which support NPTL. I think the current public 2.5
kernel would make this problem show up.

Right now, if you are stopped inside a system call, the whole backtrace
shows up as:

 0xffffe002 in ??

Here is an explanation of the problem that Roland has provided:

Previously asm or C code in libc entered the kernel by setting some
registers and using the "int $0x80" instruction.  e.g.

00000000 <__getpid>:
   0:	b8 14 00 00 00       	mov    $0x14,%eax
   5:	cd 80                	int    $0x80
   7:	c3                   	ret    

That is the function called __getpid in libc, the pre-NPTL build.  (In the
shared library you will see this if you've run with LD_ASSUME_KERNEL=2.4.1
so that /lib/i686/ is what you are using.)

In the new libc (/lib/tls/), that function looks like this:

00000000 <__getpid>:
   0:	b8 14 00 00 00       	mov    $0x14,%eax
   5:	65 ff 15 10 00 00 00 	call   *%gs:0x10
   c:	c3                   	ret    

%gs:0x10 is a location that has been initialized to a kernel-supplied
special entry point address.  In the current kernels, that address is
always 0xffffe000.  But that is not part of the ABI, which is why it's
indirect instead of a literal "call 0xffffe000".  The kernel supplies the
actual entry point address to libc at startup time, and nothing in the
kernel-user interface prevents it from using a different address in each
process if it chose to.

The reason for this is that there can be multiple ways to enter the kernel,
not just the "int $0x80" trap instruction.  Some kernels on some hardware
may use a different method that performs better.  By using this
kernel-supplied entry point address, no user code has to be changed to
select the method.  It's entirely the kernel's choice.
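
To make the "kernel supplies the address at startup" step concrete, here
is a minimal sketch of how a startup routine might pick the entry point
out of the data the kernel hands over.  The message above does not say
which channel is used; this sketch ASSUMES it is the ELF auxiliary vector
with the AT_SYSINFO tag (value 32 in the Linux headers), laid out as
32-bit little-endian (type, value) pairs.  The function name and the
sample vector are made up for illustration.

```python
import struct

AT_NULL = 0       # end-of-vector marker
AT_PAGESZ = 6     # unrelated entry, present only to pad the sample
AT_SYSINFO = 32   # ASSUMED tag carrying the kernel entry point address


def find_entry_point(auxv_bytes):
    """Return the AT_SYSINFO value from a 32-bit auxv image, or None."""
    for offset in range(0, len(auxv_bytes) - 7, 8):
        a_type, a_val = struct.unpack_from("<II", auxv_bytes, offset)
        if a_type == AT_NULL:
            break                      # terminator reached, tag not present
        if a_type == AT_SYSINFO:
            return a_val
    return None


# Hypothetical vector: page size, then the entry point, then the terminator.
sample = struct.pack("<IIIIII",
                     AT_PAGESZ, 4096,
                     AT_SYSINFO, 0xFFFFE000,
                     AT_NULL, 0)
entry_point = find_entry_point(sample)
```

Whatever value comes back is what would be stored at %gs:0x10; nothing in
user code needs to know that it happens to be 0xffffe000 today.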

In all the RH kernels we have right now, the entry point page contains:

	0xffffe000:	int $0x80
	0xffffe002:	ret

But user code cannot presume what this code sequence looks like exactly.
It will be some sequence of register and stack moves and special trap
instructions, but you have to disassemble to know exactly.  In the case
above, the PC value seen while a thread is in the kernel is 0xffffe002.
You can disassemble the "ret" there and see that you have to pop the PC off
the stack to recover the caller's frame.  

Another example of what this code might look like when you disassemble it is:

	0xffffe000:	push   %ecx
	0xffffe001:	push   %edx
	0xffffe002:	push   %ebp
	0xffffe003: 	mov    %esp,%ebp
	0xffffe005: 	sysenter 
	0xffffe007:	nop    
	0xffffe008:	nop    
	0xffffe009:	nop    
	0xffffe00a:	nop    
	0xffffe00b:	nop    
	0xffffe00c:	nop    
	0xffffe00d:	nop    
	0xffffe00e: 	jmp    0xffffe003
	0xffffe010:	pop    %ebp
	0xffffe011:	pop    %edx
	0xffffe012:	pop    %ecx
	0xffffe013:	ret    

In this example, depending on what happened inside the kernel, the PC you
usually see may be either 0xffffe00e or 0xffffe010.  If the process gets a
signal, or you attach asynchronously, and so forth, the PC might be at any
of the earlier instructions as well.  You cannot rely on exactly what the
sequence is, so you must be able to disassemble from where you are and
cope.  In this case you will most often see 0xffffe010, in which case you
need to pop those three registers and the PC off the stack to restore the
caller's frame.

So, these cases are like a leaf function with no debugging info.  The
first solution idea was interpreting the epilogue code.  It will
probably be safe to assume that it looks like epilogue code normally
does, i.e. register pops and not any arbitrary instructions.
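
The epilogue-interpreting idea can be sketched roughly as follows.  This
is only an illustration under stated assumptions, not GDB code: it
recognizes just the one-byte "pop r32" opcodes (0x58 to 0x5f) and "ret"
(0xc3), works on byte images of the code and the stack instead of a live
process, and gives up on anything else.  The function name and the sample
stack contents are hypothetical.

```python
import struct

# One-byte "pop r32" opcodes on i386; pop %esp is deliberately left out.
POP_REGS = {0x58: "eax", 0x59: "ecx", 0x5A: "edx", 0x5B: "ebx",
            0x5D: "ebp", 0x5E: "esi", 0x5F: "edi"}
RET = 0xC3


def unwind_epilogue(code, pc_offset, stack, esp_offset, regs):
    """Interpret pops up to the next ret.

    code and stack are bytes images; the offsets index into them.
    Returns (caller_pc, caller_regs, new_esp_offset).
    Raises ValueError on any instruction other than pop r32 or ret.
    """
    regs = dict(regs)                  # do not modify the caller's copy
    while True:
        opcode = code[pc_offset]
        (word,) = struct.unpack_from("<I", stack, esp_offset)
        if opcode == RET:
            return word, regs, esp_offset + 4
        if opcode not in POP_REGS:
            raise ValueError(f"not an epilogue instruction: {opcode:#x}")
        regs[POP_REGS[opcode]] = word  # the pop restores this register
        esp_offset += 4
        pc_offset += 1


# Tail of the sysenter sequence, starting at 0xffffe010:
# pop %ebp; pop %edx; pop %ecx; ret
tail = bytes.fromhex("5d5a59c3")
# Made-up stack image: saved ebp, edx, ecx, then the return address.
stack_image = struct.pack("<IIII", 0xBFFFF000, 2, 3, 0x0804800C)
caller_pc, caller_regs, new_esp = unwind_epilogue(tail, 0, stack_image, 0, {})
```

The same routine handles the int $0x80 page when stopped at 0xffffe002:
the code image is the single "ret" byte and the return address is the
first word on the stack.  It cannot handle a stop at 0xffffe00e, because
the "jmp" there is not an epilogue instruction, which is one reason to
prefer real unwind info.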

Another solution I was considering is to have the system somewhere provide
DWARF unwind info matching the possible PC addresses in the vsyscall page.
I am now pretty sure this is the way to go.  The recent development is that
NPTL now needs .eh_frame information for these PCs as well, and Ulrich has
made a kernel change to provide it.  The .eh_frame info for the vsyscall
PCs is on the same read-only kernel page.  The C library now uses this as
if the vsyscall page were a DSO with .eh_frame info to register, so that
exception-style unwinding from any valid PC in a magic entry point works.
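
As a rough picture of what such unwind info buys you, here is a
table-driven sketch for the sysenter sequence shown earlier.  It is a
simplified stand-in for real .eh_frame data, not a DWARF parser: each row
says how far the canonical frame address (CFA) is above ESP for a range
of PCs, and the return address is taken to sit at CFA - 4.  The offsets
are derived by hand from the pushes and pops in that listing; the row
layout, the function name and the sample stack are assumptions made for
illustration.

```python
# (first_pc, last_pc, cfa_offset): CFA = ESP + cfa_offset.
UNWIND_ROWS = [
    (0xFFFFE000, 0xFFFFE000, 4),    # nothing pushed yet
    (0xFFFFE001, 0xFFFFE001, 8),    # after push %ecx
    (0xFFFFE002, 0xFFFFE002, 12),   # after push %edx
    (0xFFFFE003, 0xFFFFE010, 16),   # after push %ebp, until pop %ebp runs
    (0xFFFFE011, 0xFFFFE011, 12),   # after pop %ebp
    (0xFFFFE012, 0xFFFFE012, 8),    # after pop %edx
    (0xFFFFE013, 0xFFFFE013, 4),    # after pop %ecx, sitting on ret
]


def unwind(pc, esp, read_word):
    """Return (caller_pc, cfa) for a PC inside the entry point page.

    read_word(address) must return the 32-bit word at that address.
    """
    for first, last, cfa_offset in UNWIND_ROWS:
        if first <= pc <= last:
            cfa = esp + cfa_offset
            return read_word(cfa - 4), cfa
    raise LookupError(f"no unwind row covers pc {pc:#x}")


# Made-up stack while blocked in the kernel: ebp, edx, ecx, return address.
fake_stack = {0xBFFFE000: 0xBFFFF000, 0xBFFFE004: 2,
              0xBFFFE008: 3, 0xBFFFE00C: 0x0804800C}
caller_pc, cfa = unwind(0xFFFFE00E, 0xBFFFE000, fake_stack.__getitem__)
```

Unlike disassembling the epilogue, the lookup gives the same answer at
0xffffe00e and 0xffffe010 without having to understand the "jmp", and
the debugger never needs to know which instruction sequence the kernel
chose.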

So, there is a .eh_frame section available for this code, and getting it
from where it is into gdb can be done by hook or by crook.  I have the
impression that gdb turning an available .eh_frame section into happy
backtraces is something that might be expected real soon now.  
Sounds like a winner.

I think that elucidates all but the dreariest bits of the technical issues.
Now the practical questions.  Oh, one dreary bit: 83172 mostly talks about
the fact that ptrace refuses to read the 0xffffe000 page for you, which is
presumed a prerequisite for dealing with the real can of worms (unwinding).


I think right now the public 2.5 kernel has a fix to make the page
readable, and another one to provide the .eh_frame information. There
is no mechanism yet to make that debug info accessible to gdb.

