This is the mail archive of the systemtap@sourceware.org mailing list for the systemtap project.



Re: User-space probes: Plan B+


On 9/7/06, Richard J Moore <richardj_moore@uk.ibm.com> wrote:

systemtap-owner@sourceware.org wrote on 25/08/2006 21:14:44:


> On 25 Aug 2006 11:22:51 -0700, Jim Keniston <jkenisto@us.ibm.com> wrote:
> > On Fri, 2006-08-25 at 01:11, James Dickens wrote:
> > > On 24 Aug 2006 18:13:24 -0700, Jim Keniston <jkenisto@us.ibm.com> wrote:
> > ...
> > > >
> > > > I tried an approach based on ptrace, with no kernel enhancements, but
> > > > it lacked certain necessary features (e.g., #2-5 below), probe overhead
> > > > was 12-15x worse than Prasanna's approach, and I couldn't get it to
> > > > work when probing multiple processes.  (Frank Eigler independently
> > > > suggested this approach and termed it "Plan B from outer space.")
> > > >
> > >
> > > is 12-15x worse than the current solution used in strace?
> >
> > Slightly worse.  When just counting the occurrences of 1 system call, I
> > clocked strace at about 10 usec/hit.  See
> > http://sourceware.org/ml/systemtap/2006-q2/msg00572.html
> > And some folks reportedly consider strace too slow.
> >
> > ...
> > > > 1. Instrumentation can be coded entirely as a user-space app...
> > >
> > > sounds like a nightmare waiting to happen, if i want to trace
> > > something from userland into the kernel and back, i start writing
> > > userland code, then into kernel code, and quite possibly having kernel
> > > code access variables and statistics stored in userland, meaning lots
> > > of checks that the user remembers to call the routines that safely
> > > move data back and forth between the two?
> >
> > Well, sure, users could get confused and do things wrong.  And your
> > scenario below where you migrate a piece of instrumentation from user
> > space to kernel space would have to be managed carefully, just like any
> > other design change.
> >
> if you do it entirely in the kernel, then you don't have to deal with
> design changes based on how busy the target system is, so we can use
> the same script the developer used to analyze during debugging even
> when it's in production with 1000 times the workload.
>
> Probing a function that is called often would be a major slowdown, as

Possibly, but not necessarily. It depends on the execution time of the
probe handler compared with the mean time to return to the probe. That
some cases might not suit this technique is no reason to deny its use in
other cases.

It is when you change the entire programming environment. If you start
with one method and then have to change to another because the workload
has changed, you stand a good chance of breaking your algorithms.



> soon as you fire a probe the entire application stops, instrumenting
> something like malloc creates a huge slow down as your process, goes

Show us your measurements.

I didn't measure anything yet. But are you trying to say that jumping
from one process into the kernel and then into a second process and
back is a zero-cost event compared with just jumping into the kernel
and back? I have seen people working on this project spend months
changing probe type and implementation details to gain a few
millionths of a second; it seems they wouldn't settle for extraneous
context switches for very long.

Furthermore, no matter how fast you make the process of bouncing
between userland tasks, you end up with a serialized process, since
you can only handle one event at a time. Or are you proposing to add
a tracer thread for every running thread, instead of handling it in
the kernel without all this hassle?



> to the kernel, then back to userland to run the script, and then back
> even if the probe wasn't even interested in the particular event.
>
> It gets worse with a multithreaded task, not only do you have the

Not necessarily. Originally locking was global and would serialize all
probes across all processors and that of course would slow things up a bit
when a 2nd probe fired before the current handler had ended. But the code
has been enhanced quite considerably to ensure that locking is granular.
And there are further improvements that can be applied.


Any locking contention serializes the threads, thus hiding the problems
you were possibly looking for in the first place. And if you aren't
having any lock contention, why use locks?

> probe firing more often, the application becomes serialized, so the whole
> process slows down tremendously, making it not usable in a production
> environment; it would also eliminate races. So users will either say
> once I turned on the probes performance dies, or that the problem
> disappears, the race is gone. The more scalable the application the

Example?

Probing malloc on a multithreaded application, or worse yet object
allocation in Java. Here is a little script I created in dtrace; it's
attached to a multithreaded task (Secure Global Desktop) that is
essentially idle, with only one user trying to log in. By the way, it's
being executed on a ten-year-old system, a dual 300MHz Ultra 2
workstation; it would be much worse if I had hundreds of users logging
in concurrently on a modern dual-core Opteron box. As you can see, it
fires the probe over 13,000 times, and the machine is never more than
50% busy at worst. Are you ready to guarantee no lock contention when
you add 13,000 context switches to a production box? How about 130,000?
The object of this script was to count probe fires; I could just as
easily have added a predicate to count only certain probe fires, but in
the case of systemtap with the proposed userland script doing the
monitoring, it would still require a context switch for each probe fire.

dtrace: script 'malloc.d' matched 3 probes
CPU     ID                    FUNCTION:NAME
 1  49136                        :tick-60s

 malloc                                                        13614
enterprise:~# cat malloc.d
pid$target::malloc:entry
{
        @malloc[probefunc] = count();
}

tick-60s
{
        exit(0);
}
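As a sketch of the predicate variant mentioned above, here is a version of the same script that counts only larger allocations. The 1024-byte threshold is arbitrary, chosen purely for illustration; arg0 is malloc's requested-size argument.

```d
/* Count only malloc calls requesting more than 1024 bytes.
   arg0 is malloc's size argument; the threshold is arbitrary. */
pid$target::malloc:entry
/arg0 > 1024/
{
        @large[probefunc] = count();
}

tick-60s
{
        exit(0);
}
```

The predicate is evaluated at probe-fire time, so filtered-out events never reach the consumer, which is exactly the property a userland-script design would give up.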

Load averages at the time:
Total: 385 processes, 972 lwps, load averages: 0.30, 0.46, 0.50



> worse the slowdown.
> >
> > But I think it's better to provide a feature for which a need has been
> > identified -- even if the feature requires careful use and a few minutes
> > to understand -- than to withhold the feature to protect people from
> > failing.  (I consider asm statements in gcc an extreme example of this
> > philosophy. :-))
> >
> it's better to design the system with safety and security in mind. This

Not necessarily, but as it happens, we have done.

> can and has been done. They ended up with a solution that works for
> the expert programmer and overworked system administrator, as well as
> the weekend home user just hoping to help out a project find a
> bottleneck.
>
> > >
> > > how is this better than just enhancing a debugger such as gdb?
> >
> > Among other things, gdb -batch is relatively slow (I measured 111 usec
> > per hit just to count breakpoint hits) and has no facility for
> > interacting with kernel-space instrumentation.
> >
> > > how are
> > > stacks dealt with, since you quite possibly have one process
> > > investigating another? if you don't get everything perfect, the program
> > > being watched can corrupt the data of the second?
> >
> > Well, somebody with root privileges could register a handler that
> > scribbles just about anywhere, as is the case currently with kprobes.
> > But there's no reason to expect that there's any danger of the
> > particular problems you mention.
> >
> > >
> > > >
> > > > 2. ... but in situations where performance is critical, uprobes can
> > > > run a named kernel handler without waking up the tracer process.
> > > >
>
> To avoid the aforementioned multithreaded problem,  we have to resort
> to counting probe fires without any intelligence about when we record

If you're happy with such a significant limitation then fine. But I'm not.
Furthermore I distinguish two major needs:

1) application debugging - which is of less personal interest to me but
nonetheless important;

99.99% of your target audience will want to use systemtap for debugging userland applications, unless you want to surrender userland debugging to Solaris, FreeBSD, and OS X, which either now or shortly will support probing userland applications using dtrace.

2) system debugging - which may necessitate reference to user-space data or
indeed necessitate a probe to be triggered when code executes in
user-space. This is very much of interest to me and something that the
original design catered well for.

Keeping userland probing code in the kernel still works well for this
case as well. Implementing userland probing code adds more
complications, not fewer: if you know you will always have to use
functions that can access userland data from the kernel, you don't
have to change your code based on whether you are using it from probe
A, a kernel-space probe, or from probe B, a userland probe.


James Dickens uadmin.blogspot.com

> the information and what information to store when we are called, it
> may be beneficial to do time expensive things like a stack trace, if
> we meet a certain criteria, or to slow down one thread occasionally to
> look for races.
>
> James Dickens
> uadmin.blogspot.com
>
>
> > > now if we start out coding our script to only work in userland, then
> > > all of a sudden we decide we need better performance, we have to go
> > > back and recode parts to work in kernel land and quite possibly break
> > > our algorithms that were talking to kernel land, or probes in the
> > > kernel that accessed userland data that just moved back into the
> > > kernel?
> >
> > See above.
> >
> > >
> > > > 3. A user-mode tracer can invoke a previously registered kernel-mode
> > > > handler, so we have simple and efficient communication between user-
> > > > and kernel-mode instrumentation.
> > >
> > > how do you keep a userland program from exploiting systemtaps
> > > architecture and executing kernel probes from other active systemtap
> > > scripts, isn't this a huge back door for rootkits especially once
> > > people start using systemtaps methods for monitoring systems
> > > continuously?

No more a back door than allowing any system tool to be used by the
non-privileged user.

> >
> > I've certainly thought about the potential for abuse via
> > uprobe_run_khandler().  If you had the connivance of somebody with root
> > privileges who installed a pernicious handler, you could do all sorts of
> > bad stuff (and make it relatively hard to track).  That's a big if,
> > though.  If a bad guy has root privileges, you're toast anyway.

You don't need systemtap to do bad things if you have an untrustworthy
root user.


> > And if you're worried about the handler reading/writing the wrong
> > process's address space, you can specify when you register the handler

Isn't this scenario fantasy-land?


> > that it can apply only to the process in the caller-provided uprobe
> > object -- and only when the caller has permission to trace that process.
> >
> > ...
> >
> > > >
> > > > 8. Handlers run in process context -- the tracee's context (see
> > > > requirement 2) or the tracer's context while the tracee is stopped
> > > > (see requirement 3).
> > > >
> > >
> > > stack corruption or even slight stack placement differences, would
> > > severely limit the usefulness of the solution,
> >
> > Well, yes, both we and the user will have to be careful.  That's the
> > nature of programming.
> >
> > > it will have the same
> > > effect as debugging an app in gdb, the app only breaks when the
> > > userland debugger is not running.
> >
> > That (minimizing probe overhead) is one of the points of being able to
> > avoid unnecessary context switches, by just running a handler in the
> > kernel.  (See requirement #2.)
> >
> > >
> > > James Dickens
> > > uadmin.blogspot.com
> >
> > Thanks.
> > Jim
> >

--
Richard J Moore
IBM Advanced Linux Response Team - Linux Technology Centre
MOBEX: 264807; Mobile (+44) (0)7739-875237
Office: (+44) (0)1962-817072



