This is the mail archive of the systemtap@sources.redhat.com mailing list for the systemtap project.

Re: Some notes on translation

From: Tom Zanussi <zanussi at us dot ibm dot com>
To: "Frank Ch. Eigler" <fche at redhat dot com>
Cc: Tom Zanussi <zanussi at us dot ibm dot com>, systemtap at sources dot redhat dot com
Date: Fri, 25 Feb 2005 17:03:50 -0600
Subject: Re: Some notes on translation
References: <16925.3337.489506.767382@tut.ibm.com> <20050225210509.GC27468@redhat.com>

Frank Ch. Eigler writes:
 > Hi -
 > 
 > 
 > > [...]
 > > - self->xxx means xxx is a thread-local variable
 > 
 > I'm unfond of the pointer syntax in the script language (see below),
 > but this particular case can be mapped easily in the parser to an
 > array index operation like "xxx[$pid]".

Right, I wasn't really meaning to suggest a syntax for any of this - I
just wrote down what came to me naturally while being too lazy to
look at the actual grammar ;-)
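
For what it's worth, the mapping Frank describes can be modelled in a few
lines.  This is only a sketch in Python, not proposed script syntax:
`PerThread` and its methods are made-up names, and the pid is passed in
explicitly where a real runtime would take it from the current task.

```python
from collections import defaultdict


class PerThread:
    """Models `self->xxx` as an array indexed by pid, i.e. `xxx[$pid]`."""

    def __init__(self, default=0):
        # One slot per pid; pids not seen before start at the default value.
        self._slots = defaultdict(lambda: default)

    def get(self, pid):
        return self._slots[pid]

    def set(self, pid, value):
        self._slots[pid] = value

    def clear(self, pid):
        # Drop the slot once the thread is done so the array does not grow forever.
        self._slots.pop(pid, None)


my_count = PerThread()
my_count.set(101, 5)   # "self->my_count = 5" while running as pid 101
my_count.set(202, 7)   # the same statement while running as pid 202
```

Each pid sees only its own slot, which is all that `self->` needs to mean.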

 > 
 > 
 > > - $xxx is shorthand for values to be substituted by runtime library
 > >   functions or probe variables, similar to Perl interpolation
 > 
 > We may need to consider a naming system that can be composed into
 > richer identifiers.  There are several types of variables to access:
 > - "macros" like "$timestamp", which map to snippets of code
 > - target-side variables: local (function parameters, locals), global
 > - probe-side special variables like your "$syscall_name"
 > 
 > > [...]
 > > probe syscall:exit("read")
 > > 	read_times[$syscall_name]
 > > [...]
 > 
 >   I am aware of no plausible run-time library function that can return
 >   the name of the current system call.  Rather, I imagine this sort of
 >   facility working by having a library of systemtap script fragments
 >   that provide definitions for probe points or helper variables:
 > 
 >   probe syscall("read") = kernel:function("sys_read") {
 >      self->syscall_name = "read"
 >   }

Yeah, I think this makes sense - I sort of assumed each probe would
have a string name associated with it and actually meant to write
$funcname instead of $syscall_name.
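
A rough model of such a library of probe definitions, assuming each alias
expands to one underlying probe point plus a few variables set by its body.
The table layout, the `resolve` helper and the `sys_write` entry are
illustrative assumptions of mine; only the "read" alias comes from Frank's
example.

```python
# Hypothetical alias table: alias probe point -> (underlying probe point, prologue).
# The prologue stands for the alias body, e.g. { self->syscall_name = "read" }.
ALIASES = {
    'syscall("read")': ('kernel:function("sys_read")', {"syscall_name": "read"}),
    # Added purely for illustration; not part of the original example.
    'syscall("write")': ('kernel:function("sys_write")', {"syscall_name": "write"}),
}


def resolve(probe_point):
    """Expand an alias into its underlying probe point and the variables its
    prologue would set; probe points that are not aliases pass through."""
    target, prologue = ALIASES.get(probe_point, (probe_point, {}))
    return target, dict(prologue)


target, variables = resolve('syscall("read")')
```

With something like this, a name such as $syscall_name (or $funcname) is just
a variable the alias body sets, not something the runtime has to discover.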

 > 
 >   and
 > 
 >   $pid = [[ in_interrupt () ? 0 : current->pid ]]   # possible embedded C
 > 
 > 
 > > [...]
 > >   Will this still work if count isn't a int value but say an int *?
 > > 	self->my_count = *count;
 > >   Seems to - if jprobes is being used, it's just a straight pass-thru.
 > 
 > Passing through in this sense concerns me.  If the scripting
 > language's type system is to remain as minimal and implicit as
 > possible, then operations like pointer dereferences and especially
 > structure accesses need to be represented and analyzed.  (See more
 > below.)

Yes, this was something I thought might apply only in the cases where
jprobes was being used - in the end, I don't think we'll want to
special-case anything, so this example really wouldn't be very
interesting except maybe for initial prototyping.

 > 
 > 
 > > [...]
 > > - To set up the probes, this example loops over each syscall and
 > >   registers the single probe handler for each one.  [...]
 > >   It seems to me that we need a way to enable and disable
 > >   probes as needed or 'just in time'.  For example, here's a probe that
 > >   we should be able to write:
 > > 
 > > /* trace all functions called from open */
 > > probe syscall:entry("open")
 > > {
 > > 	self->trace_all = 1;
 > > 	enable(*:entry(*)); /* enable probes on _all_ functions */
 > > }
 > 
 > I don't know if this will be possible.  Among other reasons we
 > discussed yesterday, "all functions" in the kernel is far too wide a
 > net.  If instrumentation were to be inserted anew every time, imagine
 > the thousands of pages of kernel text being modified, when any process
 > runs "open".  Else if breakpoints were inserted en masse at startup
 > time, and enabled/disabled by having them each execute some predicate,
 > overall performance would still come to a crawl.

Well, I think we need to be able to support this use case - how we
actually accomplish the effect is anyone's guess at this point.  The
reason I suggested this in the context of instrumenting syscalls was
that I was starting to think that instrumenting even 300 syscalls at
once might already be too unwieldy.  Instrumenting all the functions
in the kernel, or even a significant fraction of them, at any one time
is clearly not something we'd ever want to do - thus my gut feeling
that there must be a more elegant way of accomplishing the same
effect.  One point though - even if you were to brute-force instrument
the entire kernel, the typical use cases would only need it for very
short periods, e.g. between entry and exit of a single function call.
It's interesting to note that DTrace allows probes to be set for
_every instruction_ in a certain range of addresses - I can't imagine
what that would make the system feel like.
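
To pin down the effect I'm after (not the mechanism), here is a toy model in
Python of tracing that is only live between entry and exit of open, per
thread.  It uses the predicate approach Frank is worried about, so it says
nothing about cost; the handler names and `func_a`/`func_b` are placeholders
I made up.

```python
trace_all = set()   # pids currently inside open(); stands in for self->trace_all
trace_log = []      # (pid, function) pairs recorded while tracing is on


def on_syscall_entry(pid, name):
    if name == "open":
        trace_all.add(pid)        # the enable(...) step in the script example


def on_syscall_exit(pid, name):
    if name == "open":
        trace_all.discard(pid)    # switch tracing off again on the way out


def on_function_entry(pid, func):
    # The predicate that every function probe would have to evaluate.
    if pid in trace_all:
        trace_log.append((pid, func))


# Simulated event stream for two threads.
on_function_entry(1, "func_a")    # ignored: pid 1 is not inside open()
on_syscall_entry(1, "open")
on_function_entry(1, "func_b")    # recorded
on_function_entry(2, "func_b")    # ignored: different thread
on_syscall_exit(1, "open")
on_function_entry(1, "func_a")    # ignored again
```

Only the call made by pid 1 while it was inside open ends up in the log.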

 > 
 > 
 > >   [...]  It should support the print() function from probe handlers,
 > >   and it should also support queries from userspace applications
 > >   such that they can retrieve data from the probe at any time [...]
 > >   a simple protocol built on top of netlink seems to me to be the
 > >   best fit.  [...]
 > 
 > I wonder what sort of tool would want to extract data piecemeal like
 > this.  Are you imagining someone actually writing some user-level C
 > code to pull out data snapshots from a specific running probe?  I
 > wonder if this situation is likely to become common enough to warrant
 > a two-way API.

I was expecting that the systemtap command itself would be using this
API to begin with, and it is two-way already, unless I'm missing your
meaning of two-way.  I imagine the systemtap command would request and
receive the data items pertaining to the probe when the probe is being
stopped (Control-C from the user, for instance) and then display the
results to the user.  I imagine there would be some function that
would basically just send all the data associated with a given probe
without having to specify each piece individually.

 > 
 > By the way, one reason I prototyped that /proc-based data snapshot
 > mechanism that way was in recognition of the problem of consistency.
 > It suspends the probes, takes a snapshot of all global variables
 > during the incoming open() syscall.  It then lets the probes run again
 > and streams the textual snapshot out during subsequent read()'s.
 > The snapshot is thrown away at close().
 > 

The same thing should be possible with the netlink API.
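
As a sketch of the open/read/close behaviour Frank describes, independent of
whether /proc or netlink carries the data: the copy is taken under a lock at
open, streamed out as text on each read, and dropped at close.  The class
name, the single lock and the `name=value` text format are my assumptions,
not anything from the prototype.

```python
import copy
import threading


class SnapshotSession:
    """open() takes a locked copy of all globals, read() streams it as text,
    close() throws it away."""

    def __init__(self, probe_globals, lock):
        self._globals = probe_globals
        self._lock = lock
        self._lines = None

    def open(self):
        # Probes are held off only while the copy is being made.
        with self._lock:
            frozen = copy.deepcopy(self._globals)
        self._lines = [f"{name}={value!r}" for name, value in sorted(frozen.items())]

    def read(self):
        # Return the next line of the snapshot, or None at end of stream.
        return self._lines.pop(0) if self._lines else None

    def close(self):
        self._lines = None


probe_lock = threading.Lock()
probe_globals = {"read_times": {"read": 3}, "total": 3}

session = SnapshotSession(probe_globals, probe_lock)
session.open()
# A probe fires after the snapshot; it takes the same lock before updating.
with probe_lock:
    probe_globals["read_times"]["read"] = 4
    probe_globals["total"] = 4
first, second = session.read(), session.read()
session.close()
```

Both lines read back still show the values from the instant of open, even
though the probe has updated its globals in between.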

 > If, as is likely, multiple pieces of data need to be pulled out of the
 > probes, it is important that those pieces be consistent with each
 > other: that they correspond to a locked snapshot taken at the same
 > instant.  Being able to pull out just one variable at a time would
 > make this property achievable only if it involved long-term suspension
 > of probe data collection between the adjacent pull operations.
 > 

Yes, the most important case is when a probe ends and you need to pull
out all the data associated with a probe, at which point there can be
no consistency problems.  I thought that generalizing this to any time
and to individual data items was a good idea, but it may be a case of
over-engineering... But now that I see that you're wanting to snapshot
at any time, I can imagine that individual data items might be
independent and might be independently retrievable, and it starts to
look like a good idea again. ;-)

References:
- Some notes on translation
  - From: Tom Zanussi
- Re: Some notes on translation
  - From: Frank Ch. Eigler