This is the mail archive of the systemtap@sourceware.org mailing list for the systemtap project.



full dwarf backtracing kernel to user


Hi,

I have been cleaning up the dwarf unwinder a bit, and after some small
fixes (all in git trunk now) it is finally possible to unwind fully
from kernel space right into user space. This provides better user
backtraces when a probe point is triggered in kernel space. Up till now,
when we wanted a user backtrace while the probe point was in kernel
space, we took the set of registers saved by the kernel and erased all
others because they might have been changed by the kernel (and to not
leak possible private data). This often worked, but as soon as the
unwinder needed one of the registers possibly "trashed" and not saved by
the kernel we would be stuck. With the new setup it is now possible to
get a full user space register set to start the user space dwarf
unwinder and always do a full unwind. One example is setting a probe
point on syscall.close and then doing a print_ubacktrace(). Previously
(at least on x86_64) we would get stuck early on in a glibc wrapper
because we needed access to an unsaved register. With "full" dwarf
unwinding we can just continue unwinding into the user space
application.

We can go a couple of ways to take advantage of this. Currently I just
have a hack in runtime/stack-x86_64.c that checks, when a kernel
backtrace finishes, whether UNW_PC(info) == task_pt_regs(current)->ip.
If so, it sets info->regs.sp = task_pt_regs(current)->sp and then
continues unwinding the user backtrace.
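
To make the hand-off concrete, here is a small user-space model of that
check in C. It is only a sketch and not the code in
runtime/stack-x86_64.c: the toy_regs and toy_unwind_info types and the
toy_switch_to_user() helper are hypothetical stand-ins for struct
pt_regs, the unwinder's frame info and the hack described above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the saved user registers
 * (what task_pt_regs(current) would give in the kernel). */
struct toy_regs {
    uint64_t ip;
    uint64_t sp;
};

/* Hypothetical stand-in for the unwinder's frame info. */
struct toy_unwind_info {
    struct toy_regs regs;   /* register state after the last unwind step */
};

/*
 * Called once the kernel unwind has run out of kernel frames.
 * If the unwound PC equals the user-space IP saved at kernel entry, we
 * are at the kernel/user boundary: adopt the saved user SP and report
 * that unwinding may continue in user space. Otherwise leave the state
 * untouched and report failure.
 */
static bool toy_switch_to_user(struct toy_unwind_info *info,
                               const struct toy_regs *saved_user_regs)
{
    assert(info != NULL && saved_user_regs != NULL);

    if (info->regs.ip != saved_user_regs->ip)
        return false;       /* kernel unwind did not end at user entry */

    info->regs.sp = saved_user_regs->sp;
    return true;
}
```

A caller would run the kernel unwind to completion, call
toy_switch_to_user(), and only start the user space unwinder when it
returns true.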

My thinking is that the (kernel) backtrace related functions shouldn't
change at all (except to make sure that non-x86 arches also use the
dwarf unwinder, that is on my list next). All ubacktrace related
functions should check whether they are called from a user context, in
which case we already have a full register set. Otherwise they should do
a kernel unwind, without setting/printing anything except recording the
final register set, and do the ubacktrace using that set. This is
roughly equal to the sanitize logic in arch_unw_init_frame_info. We
should also introduce a full_backtrace function that gets/prints a full
kernel&user backtrace in one go (which people should use instead of a
print_backtrace(); print_ubacktrace() to save some work in the probe).
That last one should NOT be marked unprivileged since we don't do that
for kernel backtraces either.
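
The proposed ubacktrace behaviour can be sketched as a user-space model
in C. This is an illustration under assumptions, not runtime code: the
toy_context type, its fields and toy_ubacktrace_start() are
hypothetical, and the kernel unwind is simulated by a precomputed array
of register sets, one per unwind step.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct toy_regs {
    uint64_t ip;
    uint64_t sp;
};

/* Hypothetical probe context. kernel_frames simulates the register set
 * produced by each kernel unwind step; the last entry is the user
 * register set reached at the kernel/user boundary. */
struct toy_context {
    bool in_user_mode;
    struct toy_regs probe_regs;
    const struct toy_regs *kernel_frames;
    size_t n_kernel_frames;
};

/*
 * Returns the register set from which the user-space unwind should
 * start, and stores in *silent_steps how many kernel unwind steps were
 * taken to get there (none when the probe fired in user context).
 */
static struct toy_regs toy_ubacktrace_start(const struct toy_context *ctx,
                                            size_t *silent_steps)
{
    assert(ctx != NULL && silent_steps != NULL);
    *silent_steps = 0;

    if (ctx->in_user_mode)
        return ctx->probe_regs;     /* already a full user register set */

    /* Kernel context: walk the kernel frames without printing anything,
     * keeping only the register set produced by the final step. */
    struct toy_regs regs = ctx->probe_regs;
    for (size_t i = 0; i < ctx->n_kernel_frames; i++) {
        regs = ctx->kernel_frames[i];
        (*silent_steps)++;
    }
    return regs;
}
```

In this model a full_backtrace variant would do the same walk but print
each kernel frame on the way, instead of discarding it.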

One "risk" of doing a kernel-to-user unwind for a ubacktrace is that
it is slightly more work. But backtraces are already a lot of work, the
kernel portion often is not deep, and the recent cleanups made things
slightly faster. The alternative, detecting that we need a certain
register in a ubacktrace, backtracking, doing a kernel unwind anyway and
then redoing the user backtrace, is a little too tricky in my mind, and
might actually lead to more work inside the probe. The other risk is
that somehow bad kernel unwind data gets used and through the backtrace
some private register value gets exposed to unprivileged users. I think
we should be able to trust the kernel unwind data. And that risk seems
small since the register value then also needs to be somehow expressed
in the final PC value that the user gets access to. Opinions?

All but two of the [u]backtrace related tapset functions have been
marked EXPERIMENTAL, so IMHO we can change them to behave slightly
differently from what they do currently. There are two exceptions.

There is one function, task_backtrace(), which given a pointer (long)
to a task struct produces a hex backtrace of an arbitrary task, that I
am unsure about, because it doesn't really fit with the rest of the
backtrace functions, which all act on the current probe context. Sadly
this function is not marked EXPERIMENTAL and people might already rely
on it.

There is also print_stack(), also not marked EXPERIMENTAL, which really
does the same as sprint_stack(), turning a hex string list into a symbol
resolved kernel backtrace. They just differ slightly in how they print
out stuff.

There is also the print_ubacktrace_brief() function, which prints
things slightly differently from normal, and surprisingly doesn't have
any sprint counterpart, so it cannot be used in tapsets that split
ubacktrace collecting (space separated hex string list) from printing
(using the sprint functions).

A lot of code actually deals with the various formats. I am unsure if we
really want to maintain that. Opinions?

While we are looking at all this, are there any opinions on the whole
split of collection versus printing? That is, using [u]backtrace() to
get a space separated list of addresses, and then at a later time using
sprint_[u]backtrace() to get a string representing the symbol values
associated with those addresses. The problem with that whole approach is
that there are no checks at all whether those address strings actually
correspond to the current task. I haven't come up with a better
representation. It would help if we had some kind of "stack type",
because using strings to pass these things around and then having to
parse them is somewhat awkward. Ideas?
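
As a starting point for discussion, here is one way such a "stack type"
could look, modelled in plain user-space C. Everything here is an
assumption for illustration: toy_stack, TOY_MAX_DEPTH, toy_stack_parse()
and toy_stack_matches_task() are hypothetical names, and recording a pid
is just one possible way to tie a collected stack to a task.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define TOY_MAX_DEPTH 32    /* arbitrary limit chosen for this sketch */

/* Hypothetical "stack type": the addresses plus the task they belong to. */
struct toy_stack {
    int pid;                        /* task the addresses were taken from */
    size_t depth;                   /* number of valid entries in addrs */
    uint64_t addrs[TOY_MAX_DEPTH];
};

/*
 * Parse a space separated list of hex addresses (with or without a 0x
 * prefix) into a toy_stack tagged with pid.
 * Returns 0 on success, -1 on a malformed token or too many entries.
 */
static int toy_stack_parse(const char *s, int pid, struct toy_stack *out)
{
    assert(s != NULL && out != NULL);
    out->pid = pid;
    out->depth = 0;

    while (*s != '\0') {
        while (*s == ' ')
            s++;                    /* skip separators */
        if (*s == '\0')
            break;

        char *end;
        unsigned long long v = strtoull(s, &end, 16);
        if (end == s || (*end != ' ' && *end != '\0'))
            return -1;              /* not a hex number */
        if (out->depth == TOY_MAX_DEPTH)
            return -1;              /* no room left */

        out->addrs[out->depth++] = (uint64_t)v;
        s = end;
    }
    return 0;
}

/* The check the string form cannot do: does this stack belong to the
 * task we are about to resolve symbols against? */
static int toy_stack_matches_task(const struct toy_stack *st, int current_pid)
{
    assert(st != NULL);
    return st->pid == current_pid;
}
```

With something like this, the symbol printing side could refuse (or at
least warn) when handed a stack collected from a different task, which
the plain string form gives no way to detect.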

BTW, while hacking on this I hit a kernel crash a couple of times. After
some investigation it became clear that it is a bug in the transport
layer, which reacts badly to the buffers being full. Apparently this
has always been there, but since the backtrace functions create more
output than usual it triggers more often. I am working on a fix:
http://sourceware.org/bugzilla/show_bug.cgi?id=12960
The bug report contains a simple workaround (increase the buffer size to
something huge) that makes the issue almost never trigger for me. But it
can still happen (and has), so for a real fix I will remove the
usleep() and add separate buffers for control messages that must never
be lost.

Cheers,

Mark

