This is the mail archive of the
mailing list for the systemtap project.
Re: [ltt-dev] LTTng-UST vs SystemTap userspace tracing benchmarks
- From: Mark Wielaard <mjw at redhat dot com>
- To: Stefan Hajnoczi <stefanha at gmail dot com>
- Cc: "Frank Ch. Eigler" <fche at redhat dot com>, Julien Desfossez <julien dot desfossez at polymtl dot ca>, dominique dot toupin at ericsson dot com, ltt-dev at lists dot casi dot polymtl dot ca, Mathieu Desnoyers <mathieu dot desnoyers at efficios dot com>, systemtap at sources dot redhat dot com
- Date: Wed, 16 Feb 2011 11:56:18 +0100
- Subject: Re: [ltt-dev] LTTng-UST vs SystemTap userspace tracing benchmarks
- References: <4D5AA164.firstname.lastname@example.org> <email@example.com> <AANLkTi=Nsy6fXE9=Njxs9LPHuohHzf=q5kD+fK765Rht@mail.gmail.com>
On Tue, 2011-02-15 at 17:00 +0000, Stefan Hajnoczi wrote:
> On Tue, Feb 15, 2011 at 4:26 PM, Frank Ch. Eigler <fche at redhat dot com> wrote:
> > (One may imagine a future version of systemtap where scripts that
> > happen to independently probe single processes are executed with a
> > pure userspace backend, but this is not in our immediate roadmap.)
> What is the fundamental mechanism that UST and SystemTap use for tracing?
> e.g. Here's a guess:
> UST: a conditional function call within the same process
> SystemTap: a software interrupt on x86
> I don't know the implementation details but would be interested in
> understanding this.
I don't know the precise implementation details for ltt. But for
SystemTap you could divide the "tracing" process into a couple of steps:
1) The probe marking. This is how you mark where probes can be placed
and how to get at the arguments/context of each probe. For userspace
probes SystemTap mainly relies on two mechanisms:
- dwarf debuginfo. This is the same mechanism debuggers use. It is a
very low-level description of how the source program maps to the
binary. Through it you can determine probe locations based on
source lines, function names, etc. and get a description of how to
get at local variables and arguments. The advantage is that it is
already there (when compiled with -g), so you don't need to do
anything special. The downside is that it is pretty low level, so you
do need to know a bit about the program structure before you can
"trace" effectively. Recent improvements in gcc have made the dwarf
debuginfo pretty reliable.
- sdt markers. This is a mechanism also employed by dtrace (although
the way the markers and arguments are embedded is slightly different,
that is an implementation detail). A program includes <sys/sdt.h>
and places PROBE markers in its source code to indicate
"high-level events" and the relevant arguments for each event.
The macros get translated into special code that records the name,
the address and where to find the arguments in a special ELF note.
The advantage is that as a "trace user" you get an overview of the
high-level events that might be interesting to introspect. The
disadvantage is that the programmer needs to explicitly embed them in
the program (but since dtrace and now gdb can also hook onto them,
they are being used more and more).
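A minimal sketch of an sdt marker in C, assuming the <sys/sdt.h> header shipped with SystemTap's sdt development files is installed. The provider `myapp`, the probe name `request_start` and the function `handle_request` are hypothetical. `STAP_PROBE2(provider, name, arg1, arg2)` marks one high-level event with two arguments; the no-op fallback only exists so the sketch still compiles where the header is missing.

```c
#include <stddef.h>

/* Use the real markers when <sys/sdt.h> (from SystemTap's sdt development
 * headers) is available; otherwise fall back to a no-op so this sketch
 * still compiles. */
#if defined(__has_include)
#  if __has_include(<sys/sdt.h>)
#    include <sys/sdt.h>
#  endif
#endif

#ifndef STAP_PROBE2
#  define STAP_PROBE2(provider, name, arg1, arg2) \
     do { (void)(arg1); (void)(arg2); } while (0)
#endif

/* Hypothetical application state and function. */
static size_t total_bytes = 0;

size_t handle_request(int request_id, size_t length)
{
    /* High-level event "request_start" from provider "myapp", exposing two
     * arguments. With the real header this becomes a nop at this address
     * plus an entry in the .note.stapsdt ELF note recording the name, the
     * address and where the arguments live; no function is called here. */
    STAP_PROBE2(myapp, request_start, request_id, length);
    total_bytes += length;
    return total_bytes;
}
```

Once this is linked into a program, `readelf -n` on the binary should list a `stapsdt` note for `request_start`, and a script could attach with `probe process("./prog").mark("request_start")`, reading the arguments as `$arg1` and `$arg2` (`./prog` is a placeholder path).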
2) The probe and context selection. In a stap script you list all
the places/events you want to place a probe on. These can be
low-level kernel events (tracepoints, locations based on kernel
debuginfo, timers, perf events, etc.) or user-level events (based on
the dwarf debuginfo or sdt markers placed in the program). Then for
each (group of) probe events, you write a handler listing the context
you are interested in (variables, arguments, etc.). These can then be
used to filter and/or log the event (see step 5, the actual "trace").
4) Hooking onto the probe. Based on the stap script you provide, the
systemtap runtime decides which addresses to place probes on (or which
event notifiers to hook into). It also extracts the location of each
context variable and/or parameter used in the probe handler for that
location. Currently, for each derived user-space address (there can
be several if the probe point is inlined in various places), it uses
uprobes to place a breakpoint instruction at that location and
registers a callback that runs the handler responsible for that probe
event. All the nitty-gritty of placing the probes and handling the
software interrupt is delegated to uprobes (which avoids the full
user/kernel/user round trip that, for example, ptrace requires).
uprobes is being pushed into the upstream kernel so that others, like
perf and gdb, can use it in the future. But you could imagine hooking
being done through other mechanisms, like in-process function calls in
the user process. If the code injection techniques of ltt are reusable,
that would be a very cool idea.
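To make the contrast in Stefan's question concrete, here is a minimal sketch of the "in-process function call" style of hooking. It is not LTTng-UST's actual API; all names (`probe_context`, `attach_request_start`, `probe_request_start`, `count_events`) are hypothetical. The probe site checks a handler pointer and calls the handler directly, whereas with uprobes the probe address holds a breakpoint instruction, so each hit traps into the kernel before the handler runs.

```c
#include <stddef.h>

/* Context handed to a handler (hypothetical layout). */
struct probe_context {
    const char *name;
    long arg1;
    long arg2;
};

typedef void (*probe_handler)(const struct probe_context *ctx);

/* One slot per probe point; NULL means no tracer is attached.
 * Single-threaded sketch: a real tracer would publish this pointer
 * atomically (e.g. with C11 atomics) so other threads see it safely. */
static probe_handler request_start_handler = NULL;

void attach_request_start(probe_handler handler)
{
    request_start_handler = handler;
}

/* The probe site. Disabled: one load and one branch that is normally
 * not taken. Enabled: an ordinary call inside the traced process,
 * with no trap into the kernel. */
static inline void probe_request_start(long id, long len)
{
    probe_handler handler = request_start_handler;
    if (handler != NULL) {
        struct probe_context ctx = { "request_start", id, len };
        handler(&ctx);
    }
}

/* Hypothetical application code containing the probe site. */
long process_request(long id, long len)
{
    probe_request_start(id, len);
    return id + len;
}

/* Example handler: counts hits and remembers the last length seen. */
static unsigned long events_seen = 0;
static long last_len = 0;

void count_events(const struct probe_context *ctx)
{
    events_seen++;
    last_len = ctx->arg2;
}
```

The trade-off is that the check has to be compiled into (or injected into) the program, while breakpoint-based hooking can be applied to an unmodified binary.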
5) The actual "trace"/data gathering step. Depending on the stap
handler you wrote for the probe, the SystemTap runtime (called
through the probe hook) extracts the context variables
and/or parameters you are interested in. These are then used for
filtering (based on the conditionals in your handler), after which
you can either assign derived values to global (script) variables or
statistical containers, or log the event and/or some of its context.
Basically, you write a log or printf statement in your handler when
you want to "trace" the event. Depending on how you invoked stap, the
output is then placed in a file or in some buffer through procfs,
relayfs, debugfs or ring_buffers. Alternatively, you can write an "end"
handler that just spits out the data you accumulated in the script
variables and statistics (so that nothing needs to be output during
the probe event itself, saving data output and processing time).
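A minimal sketch, in plain C rather than the stap language, of the "accumulate in the handler, report in the end handler" pattern. The names `struct stats`, `request_sizes`, `on_request` and `on_end` are hypothetical; in an actual stap script the same idea would typically use a global statistic updated with the `<<<` operator and read in an `end` probe with extractors such as `@count`, `@avg`, `@min` and `@max`.

```c
#include <limits.h>
#include <stdio.h>

/* Running statistics: only a few numbers are updated per event. */
struct stats {
    long count;
    long sum;
    long min;
    long max;
};

static struct stats request_sizes = { 0, 0, LONG_MAX, LONG_MIN };

/* Per-event handler: filter first, then accumulate; nothing is printed. */
void on_request(long size)
{
    if (size <= 0)          /* filter: ignore empty requests */
        return;
    request_sizes.count++;
    request_sizes.sum += size;
    if (size < request_sizes.min)
        request_sizes.min = size;
    if (size > request_sizes.max)
        request_sizes.max = size;
}

/* "end" handler: called once when tracing stops, prints the summary. */
void on_end(FILE *out)
{
    if (request_sizes.count == 0) {
        fprintf(out, "no requests\n");
        return;
    }
    fprintf(out, "count=%ld avg=%ld min=%ld max=%ld\n",
            request_sizes.count, request_sizes.sum / request_sizes.count,
            request_sizes.min, request_sizes.max);
}
```

Because each event only updates four integers, the per-hit cost stays constant and no output is produced until `on_end` runs.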
Hope that helps. And if someone could give a similar overview of ltt,
then we could see how to more easily mix and match these various
steps in the future. Since it seems the mechanisms used are nicely