This is the mail archive of the
systemtap@sourceware.org
mailing list for the systemtap project.
Re: Evaluating SystemTap for Network Response Times
- From: Nathan DeBardeleben <ndebard at lanl dot gov>
- To: "Frank Ch. Eigler" <fche at redhat dot com>
- Cc: systemtap at sources dot redhat dot com
- Date: Tue, 31 Jan 2006 11:44:22 -0700
- Subject: Re: Evaluating SystemTap for Network Response Times
- References: <20060131182916.GB15048@redhat.com>
Frank Ch. Eigler wrote:
> Would you folks use static instrumentation too, if it were available
> for systemtap? That is, would you be willing to insert macro calls
> into your kernel sources, which would get roughly djprobes-level
> performance for enabled probes, and a slight slowdown for disabled
> ones?
> (http://sourceware.org/ml/systemtap/2005-q4/msg00415.html)
We probably wouldn't be willing to do any more static kernel
instrumentation. A number of our projects here started as kernel
patches and are moving away from that approach, if only because
tracking the kernel has been far too time-consuming. That probably
wouldn't be the case with this style of static instrumentation, but
there are other issues. In particular, one of our focuses is to attach
to a running machine after we start to observe problems, probe into the
kernel, figure out what's going on, and then detach.
This whole idea of attach / detach is really at the heart of it.
For instance, we've had problems before with certain network card
chipsets: cards that we tested in beta worked great under heavy load,
and then when we got the official versions the chipset had been
slightly changed. We would see that if we really hammered these cards
in a parallel machine, slamming the network as we often do, one of the
cards would randomly time out, the driver would reset it, and it would
continue. They FUNCTIONED, but their performance would periodically
drop to about (seriously) 20,000 times slower for 1 network operation -
then it would fix itself and move on. If you then consider that we've
got a couple hundred to thousands of these cards in a large machine,
the probability of this happening obviously goes up and we're in for a
lot more trouble. Code appears to start slowing down, and people try to
figure out where the problems are.
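To put rough numbers on how the odds grow with machine size, here is a
minimal sketch. It assumes each card stalls independently with the same
probability during some time window; the per-card figure of 0.001 is
invented for illustration and is not a measurement, and the function
name is hypothetical.

```python
def prob_any_stall(p_per_card: float, n_cards: int) -> float:
    """Chance that at least one of n_cards stalls in a given window.

    Assumes every card stalls independently with probability p_per_card
    (an illustrative assumption, not something we have measured).
    """
    # P(no card stalls) = (1 - p)^n, so P(at least one) is its complement.
    return 1.0 - (1.0 - p_per_card) ** n_cards


if __name__ == "__main__":
    # Hypothetical per-card stall probability of 0.1% per window.
    for n in (1, 200, 2000):
        print(n, round(prob_any_stall(0.001, n), 3))
```

Even with a small per-card probability, the chance that some card in
the machine stalls gets close to certain as the card count reaches the
thousands.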
Point being - if we had the type of instrumentation I'm hoping to build
with SystemTap, we'd see outlying socket operations and could correlate
those problems with our application's problems. We could then know a
lot more about what's going on with our system - from start to finish.
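As a sketch of what "seeing outlying socket operations" could look like
once per-operation timings have been collected, the snippet below flags
any operation far slower than the typical one. It is post-processing
only, not a SystemTap script; the function name, the sample latencies,
and the 1000x threshold are all assumptions made up for illustration.

```python
from statistics import median


def find_outliers(latencies_us, factor=1000.0):
    """Return (index, latency) pairs slower than factor * median latency.

    latencies_us: per-operation latencies in microseconds (hypothetical
    input; in practice these would come from whatever probe collects them).
    """
    if not latencies_us:
        return []
    # The median is used as the baseline because a few huge stalls
    # barely move it, unlike the mean.
    threshold = median(latencies_us) * factor
    return [(i, t) for i, t in enumerate(latencies_us) if t > threshold]


if __name__ == "__main__":
    # Invented samples: ~50 us operations with one stall 20,000x slower.
    samples = [50, 52, 48, 51, 1_000_000, 49, 50]
    for index, latency in find_outliers(samples):
        print(f"operation {index} took {latency} us")
```

The flagged indices (or timestamps, if recorded alongside) are what
would then be lined up against the application's own slow periods.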
Also - in a similar problem with network cards, we often see what
appears to be a hung system. We dynamically attach to a running app and
find it's sitting in a call waiting on a network operation to complete.
We then wonder - is the kernel waiting? Is the kernel stuck? Is our
network dead? So many questions, and we hope that with some careful
probing we can hook into this stuck application and really zero in on
where the problem is. Believe it or not - we find kernel bugs
semi-regularly, and tracking them down can waste tons of our time.
Sorry for the rambling - trying to paint a picture for you of where
we're coming from and where we're hoping SystemTap can help us.
I haven't had time yet to start digesting the scripts that have been
linked to me this morning so I'll need to do that before I can really
determine what further assistance / direction we need.
I really appreciate everyone's time. Take care.
-- Nathan
Correspondence
---------------------------------------------------------------------
Nathan DeBardeleben, Ph.D.
Los Alamos National Laboratory
Parallel Tools Team
High Performance Computing Environments
phone: 505-667-3428
email: ndebard@lanl.gov
---------------------------------------------------------------------