This is the mail archive of the
systemtap@sources.redhat.com
mailing list for the systemtap project.
Re: Some notes on translation
- From: Tom Zanussi <zanussi at us dot ibm dot com>
- To: "Frank Ch. Eigler" <fche at redhat dot com>
- Cc: Tom Zanussi <zanussi at us dot ibm dot com>, systemtap at sources dot redhat dot com
- Date: Fri, 25 Feb 2005 17:03:50 -0600
- Subject: Re: Some notes on translation
- References: <16925.3337.489506.767382@tut.ibm.com> <20050225210509.GC27468@redhat.com>
Frank Ch. Eigler writes:
> Hi -
>
>
> > [...]
> > - self->xxx means xxx is a thread-local variable
>
> I'm unfond of the pointer syntax in the script language (see below),
> but this particular case can be mapped easily in the parser to an
> array index operation like "xxx[$pid]".
Right, I wasn't really meaning to suggest a syntax for any of this - I
just wrote down what came to me naturally while being too lazy to
look at the actual grammar ;-)
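
To make the "xxx[$pid]" mapping concrete, here is a minimal sketch in Python rather than in any real script syntax. It assumes a thread-local self->xxx is nothing more than a global table indexed by the current pid. The names thread_locals, set_local and get_local are hypothetical and exist only for this illustration.

```python
from collections import defaultdict

# Hypothetical stand-in for the translator's storage: one global table per
# thread-local variable name, indexed by pid, so self->xxx == xxx[$pid].
thread_locals = defaultdict(dict)

def set_local(name, pid, value):
    """Store self->name for the thread identified by pid."""
    thread_locals[name][pid] = value

def get_local(name, pid, default=0):
    """Read self->name for the thread identified by pid."""
    return thread_locals[name].get(pid, default)

# Two threads write the "same" variable without seeing each other's value.
set_local("my_count", 101, 5)
set_local("my_count", 202, 9)
```

With this layout the parser only has to rewrite the variable reference; no pointer semantics are needed in the script language itself.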
>
>
> > - $xxx is shorthand for values to be substituted by runtime library
> > functions or probe variables, similar to Perl interpolation
>
> We may need to consider a naming system that can be composed into
> richer identifiers. There are several types of variables to access:
> - "macros" like "$timestamp", which map to snippets of code
> - target-side variables: local (function parameters, locals), global
> - probe-side special variables like your "$syscall_name"
>
> > [...]
> > probe syscall:exit("read")
> > read_times[$syscall_name]
> > [...]
>
> I am aware of no plausible run-time library function that can return
> the name of the current system call. Rather, I imagine this sort of
> facility working by having a library of systemtap script fragments
> that provide definitions for probe points or helper variables:
>
> probe syscall("read") = kernel:function("sys_read") {
> self->syscall_name = "read"
> }
Yeah, I think this makes sense - I sort of assumed each probe would
have a string name associated with it and actually meant to write
$funcname instead of $syscall_name.
>
> and
>
> $pid = [[ in_interrupt () ? 0 : current->pid ]] # possible embedded C
>
>
> > [...]
> > Will this still work if count isn't an int value but, say, an int *?
> > self->my_count = *count;
> > Seems to - if jprobes is being used, it's just a straight pass-thru.
>
> Passing through in this sense concerns me. If the scripting
> language's type system is to remain as minimal and implicit as
> possible, then operations like pointer dereferences and especially
> structure accesses need to be represented and analyzed. (See more
> below.)
Yes, this was something I thought might apply only in the cases where
jprobes was being used - in the end, I don't think we'll want to
special-case anything, so this example really wouldn't be very
interesting except maybe for initial prototyping.
>
>
> > [...]
> > - To set up the probes, this example loops over each syscall and
> > registers the single probe handler for each one. [...]
> > It seems to me that we need a way to enable and disable
> > probes as needed or 'just in time'. For example, here's a probe that
> > we should be able to write:
> >
> > /* trace all functions called from open */
> > probe syscall:entry("open")
> > {
> > self->trace_all = 1;
> > enable(*:entry(*)); /* enable probes on _all_ functions */
> > }
>
> I don't know if this will be possible. Among other reasons we
> discussed yesterday, "all functions" in the kernel is far too wide a
> net. If instrumentation were to be inserted anew every time, imagine
> the thousands of pages of kernel text being modified, when any process
> runs "open". Else if breakpoints were inserted en masse at startup
> time, and enabled/disabled by having them each execute some predicate,
> overall performance would still come to a crawl.
Well, I think we need to be able to support this use case - how we
actually accomplish the effect is anyone's guess at this point. The
reason I suggested this in the context of instrumenting syscalls was
because I was starting to think that instrumenting even 300 syscalls
at once might already be getting to be too unwieldy. Instrumenting
all the functions in the kernel or even a significant fraction of them
at any time is clearly not what we'd ever want to do - thus my gut
feeling that there must be a more elegant way of accomplishing the
same effect. One point though - even if you were to brute-force
instrument the entire kernel, the typical use cases wanting to do this
would only want to do it for very short periods e.g. between entry and
exit of a single function call. It's interesting to note that DTrace
allows probes to be set for _every instruction_ in a certain range of
addresses - I can't imagine what that would make the system feel like.
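
One way to picture the "short period between entry and exit" case is the predicate approach Frank mentions: every probe is installed up front and checks a per-thread flag before doing any work. The Python sketch below only models that control flow. It says nothing about the real cost of the breakpoints, and the handler names and the function names in the simulated event stream are made up for illustration.

```python
# Hypothetical model: every function entry probe is installed up front and
# runs a cheap per-thread predicate before doing any real work.
trace_all = {}   # pid -> bool, plays the role of self->trace_all
trace_log = []   # (pid, function name) records collected while tracing

def on_syscall_entry(pid, name):
    if name == "open":
        trace_all[pid] = True     # "enable" tracing for this thread only

def on_syscall_exit(pid, name):
    if name == "open":
        trace_all.pop(pid, None)  # "disable" again when open returns

def on_function_entry(pid, func):
    # The predicate: probes outside an open() call fall through immediately.
    if trace_all.get(pid):
        trace_log.append((pid, func))

# Simulated event stream; the function names are illustrative only.
on_function_entry(7, "schedule")      # ignored, not inside open
on_syscall_entry(7, "open")
on_function_entry(7, "do_filp_open")  # recorded
on_function_entry(8, "schedule")      # other thread, ignored
on_syscall_exit(7, "open")
on_function_entry(7, "vfs_read")      # ignored again
```

Only the call made by pid 7 while it is inside open ends up in trace_log, which is the effect wanted from enable(*:entry(*)), even though every probe still pays for the predicate check.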
>
>
> > [...] It should support the print() function from probe handlers,
> > and it should also support queries from userspace applications
> > such that they can retrieve data from the probe at any time [...]
> > a simple protocol built on top of netlink seems to me to be the
> > best fit. [...]
>
> I wonder what sort of tool would want to extract data piecemeal like
> this. Are you imagining someone actually writing some user-level C
> code to pull out data snapshots from a specific running probe? I
> wonder if this situation is likely to become common enough to warrant
> a two-way API.
I was expecting that the systemtap command itself would be using this
API to begin with, and it is two-way already, unless I'm missing your
meaning of two-way. I imagine the systemtap command would request and
receive the data items pertaining to the probe when the probe is being
stopped (Control-C from the user for instance) and then display the
results to the user. I imagine there would be some function that
would basically just send all the data associated with a given probe
without having to specify each piece individually.
>
> By the way, one reason I prototyped that /proc-based data snapshot
> mechanism that way was in recognition of the problem of consistency.
> It suspends the probes, takes a snapshot of all global variables
> during the incoming open() syscall. It then lets the probes run again
> and streams the textual snapshot out during subsequent read()'s.
> The snapshot is thrown away at close().
>
The same thing should be possible with the netlink API.
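
The open/read/close sequence described above could be modeled roughly as follows. This is a Python sketch under stated assumptions, not the /proc prototype or a netlink implementation: the SnapshotFile class, the lock standing in for "suspending the probes", and the key=value text format are all hypothetical.

```python
import copy
import threading

class SnapshotFile:
    """Hypothetical model of the snapshot-on-open protocol."""

    def __init__(self, globals_table, lock):
        self._globals = globals_table
        self._lock = lock   # probes are assumed to hold this while updating
        self._text = None
        self._pos = 0

    def open(self):
        # Suspend probes (take the lock), copy everything, resume probes.
        with self._lock:
            snap = copy.deepcopy(self._globals)
        self._text = "".join(f"{k}={v}\n" for k, v in sorted(snap.items()))
        self._pos = 0

    def read(self, n):
        # Stream the frozen text; later probe updates are not visible.
        chunk = self._text[self._pos:self._pos + n]
        self._pos += len(chunk)
        return chunk

    def close(self):
        self._text = None   # the snapshot is thrown away

probe_lock = threading.Lock()
probe_globals = {"reads": 3, "writes": 1}

f = SnapshotFile(probe_globals, probe_lock)
f.open()
with probe_lock:
    probe_globals["reads"] = 4   # a probe keeps running after the snapshot
first = f.read(8)
rest = f.read(100)
f.close()
```

The reader sees the values as they were at open time even though a probe updated one of them before the first read, which is the consistency property in question. The same three steps would presumably map onto request, reply stream, and end-of-transfer messages over netlink.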
> If, as is likely, multiple pieces of data need to be pulled out of the
> probes, it is important that those pieces be consistent with each
> other: that they correspond to a locked snapshot taken at the same
> instant. Being able to pull out just one variable at a time would
> make this property achievable only if it involved long-term suspension
> of probe data collection between the adjacent pull operations.
>
Yes, the most important case is when a probe ends and you need to pull
out all the data associated with a probe, at which point there can be
no consistency problems. I thought that generalizing this to any time
and to individual data items was a good idea, but it may be a case of
over-engineering... But now that I see that you're wanting to snapshot
at any time, I can imagine that individual data items might be
independent and might be independently retrievable, and it starts to
look like a good idea again. ;-)