This is the mail archive of the systemtap@sourceware.org mailing list for the systemtap project.



user backtrace from kernel context status


Hi,

Here is a status update on our ability to produce user backtraces from
kernel space context. On the architectures that use the dwarf unwinder
(i686 and x86_64) it is now sometimes possible to get a (partial) user
space backtrace. There are some limitations though.

Some examples:

Syscalls.

$ stap -d /bin/ls --ldd -e 'probe syscall.getdents
  { log(pn()); print_ubacktrace(); }' -c /bin/ls
syscall.getdents
 0x000000384f0a2f65 : __getdents+0x15/0x90 [libc-2.12.so]
 0x000000384f0a2962 : readdir64+0x82/0xdf [libc-2.12.so]
 0x0000000000407f1f : print_dir+0x1df/0x6f0 [ls]
 0x000000000040898d : main+0x55d/0x1900 [ls]
 0x000000384f01ec5d : __libc_start_main+0xfd/0x1d0 [libc-2.12.so]
 0x0000000000402799 : _start+0x29/0x2c [ls]

This example works for x86_64, but not for i686 because we don't track
the vdso yet (PR10080).

Timers.

$ stap -d /bin/sort --ldd -e 'probe timer.profile
  { if (execname() == "sort")
    { log(pn()); print_ubacktrace(); } }' \
  -c '/bin/sort /usr/share/dict/words > /dev/null'

timer.profile
 0x00913b18 : strcoll_l+0x158/0xeb0 [libc-2.12.so]
 0x0090f3f1 : strcoll+0x31/0x40 [libc-2.12.so]
 0x080568c3 : memcoll+0x73/0x150 [sort]
 0x08053d0b : xmemcoll+0x3b/0x150 [sort]
 0x0804a60b : compare+0xeb/0xf0 [sort]
 0x0804cb20 : sortlines+0xb0/0x1a0 [sort]
 0x0804caae : sortlines+0x3e/0x1a0 [sort]
 0x0804caae : sortlines+0x3e/0x1a0 [sort]
 0x0804caae : sortlines+0x3e/0x1a0 [sort]
 0x0804caae : sortlines+0x3e/0x1a0 [sort]
 0x0804c944 : sortlines_temp+0x54/0x180 [sort]
 0x0804c92e : sortlines_temp+0x3e/0x180 [sort]
 0x0804cac8 : sortlines+0x58/0x1a0 [sort]
 0x0804caae : sortlines+0x3e/0x1a0 [sort]
 0x0804caae : sortlines+0x3e/0x1a0 [sort]
 0x0804fcba : .L799+0xe9f/0x1a55 [sort]
[... lots more ...]

We do seem to lose track at the end of the trace; I don't know why yet.
On x86_64 things look even nicer (all the way down to _start), but we
seem unable to unwind through some glibc functions like strcoll_l (I
suspect bad unwind data, but haven't inspected it yet).

The above uses the fact that we now "know" when the full user register
set is available. The probe handlers set a new CONTEXT->regflags, except
for the perf probes, since I didn't know which events set which regs
(perhaps someone more knowledgeable about the perf events could take a
look). If that flag isn't set, we know to use task_pt_regs() and have a
new "sanitizing" mechanism in the dwarf unwinder to scrub any registers
that aren't reliable. For now this is just done by zeroing out a copy of
the pt_regs; it would be nicer to prime the unwinder state itself so it
marks those registers undefined. The heuristics are kind of crude:

/* Whether all user registers are valid. If not, the pt_regs needs
 * architecture specific scrubbing before usage (in the unwinder).
 * XXX Currently very simple heuristics, just check arch. Should
 * use task and user pt_regs state.
 *
 * See arch specific "scrubbing" code in runtime/unwind/<arch>.h
 */
static inline int _stp_task_pt_regs_valid(struct task_struct *task,
                                          struct pt_regs *uregs)
{
/* It would be nice to just use syscall_get_nr(task, uregs) < 0
 * but that might trigger false negatives or false positives
 * (bad syscall numbers or syscall tracing being in effect).
 */
#if defined(__i386__)
  return 1; /* i386 has so few registers, all are saved. */
#elif defined(__x86_64__)
  return 0;
#endif
  return 0;
}

But it seems to work OK in the little tests I did.
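
To make the scrubbing idea a bit more concrete, here is a minimal user
space sketch of the decision described above. Everything in it is
hypothetical and only for illustration: struct sketch_regs is a
cut-down stand-in for pt_regs, SKETCH_USER_REGS_FULL stands in for
whatever bit is kept in CONTEXT->regflags, and the choice to keep only
ip and sp when scrubbing is an assumption, not what the code in
runtime/unwind/<arch>.h actually does.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, cut-down stand-in for the x86_64 struct pt_regs. */
struct sketch_regs {
    unsigned long bx, bp, r12, r13, r14, r15; /* may be stale */
    unsigned long ip, sp;                     /* needed to start unwinding */
};

/* Hypothetical flag standing in for the bit kept in CONTEXT->regflags. */
#define SKETCH_USER_REGS_FULL 0x1u

/* Zero a copy of the registers, then keep only the ones we assume are
 * reliable (ip and sp here, purely as an illustration). */
static void sketch_scrub_regs(struct sketch_regs *dst,
                              const struct sketch_regs *src)
{
    assert(dst != NULL && src != NULL);
    memset(dst, 0, sizeof(*dst));
    dst->ip = src->ip;
    dst->sp = src->sp;
}

/* Fill dst with the register set the unwinder should start from.
 * regflags: probe-provided flags; task_regs_valid: result of an
 * arch check in the style of _stp_task_pt_regs_valid().
 * Returns 1 if the registers had to be scrubbed, 0 otherwise. */
static int sketch_prepare_regs(struct sketch_regs *dst,
                               const struct sketch_regs *src,
                               unsigned int regflags,
                               int task_regs_valid)
{
    if ((regflags & SKETCH_USER_REGS_FULL) || task_regs_valid) {
        *dst = *src;            /* full set is trustworthy, use as is */
        return 0;
    }
    sketch_scrub_regs(dst, src); /* only a partial set can be trusted */
    return 1;
}
```

A caller would pass the result in dst to the unwinder. Zeroing a copy is
the crude variant; marking registers as undefined in the unwinder state
would avoid confusing a genuine zero with "unknown".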

In theory we should now also try to get unwinding to "the red
line" (kernel space up to the border into user space) to work if the
above fails (there were several fixes to the dwarf unwinder and the
context.exp backtrace test now rejects any "inexact" frames). But I
haven't yet tested against a kernel that has all CFI built into the
debuginfo (Fedora rawhide should have it though). And it will need even
more cleanup of the unwinder/symbol/stack-printing mechanism. Printing
the stack and unwinding are still somewhat intertwined, but a lot of
progress has been made to make them more separate. There are now also
tapset functions that return the backtrace as strings for even more
powerful scripts.

The biggest hurdle for users is making the "task finder" keep track of
the vmas of the relevant processes and making sure the unwind data is
available. In the examples above, -d <mainprog> --ldd together with
-c <mainprog> does that trick. It is slightly harder to get it all set
up for "random" processes. umodname(uaddr()) can sometimes help to see
what stap would like. Also, the backtrace, if available, should end with
the module/shared library name (but that relies on the vma tracker
figuring out that this particular process's vma maps should be tracked).
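
As a rough illustration of why the vma tracking matters, here is a
small user space sketch of the kind of lookup that turns a user address
into a module name. It is not the stap runtime code; struct sketch_vma
and sketch_umodname() are hypothetical names, and the real tracker also
has to learn about the mappings in the first place, which is the hard
part described above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record of one tracked mapping of a process. */
struct sketch_vma {
    unsigned long start;  /* first address of the mapping (inclusive) */
    unsigned long end;    /* one past the last address (exclusive) */
    const char *module;   /* name of the program or shared library */
};

/* Return the module name covering addr, or NULL when no tracked
 * mapping contains it (in which case no symbol or unwind data can be
 * found for that frame). */
static const char *sketch_umodname(const struct sketch_vma *vmas,
                                   size_t n, unsigned long addr)
{
    assert(vmas != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (addr >= vmas[i].start && addr < vmas[i].end)
            return vmas[i].module;
    }
    return NULL;
}
```

If the lookup returns NULL for the address a probe fired at, that is a
hint that the mapping was never tracked, so neither the symbol name nor
the unwind data for that frame will be found.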

Cheers,

Mark

