This is the mail archive of the systemtap@sourceware.org mailing list for the systemtap project.


Re: Evaluating SystemTap for Network Response Times

From: Nathan DeBardeleben <ndebard at lanl dot gov>
To: "Frank Ch. Eigler" <fche at redhat dot com>
Cc: systemtap at sources dot redhat dot com
Date: Tue, 31 Jan 2006 11:44:22 -0700
Subject: Re: Evaluating SystemTap for Network Response Times
References: <20060131182916.GB15048@redhat.com>

Frank Ch. Eigler wrote:

> Would you folks use static instrumentation too, if it were available
> for systemtap?  That is, would you be willing to insert macro calls
> into your kernel sources, which would get roughly djprobes-level
> performance for enabled probes, and a slight slowdown for disabled
> ones?
> (http://sourceware.org/ml/systemtap/2005-q4/msg00415.html)

We probably wouldn't be willing to do any more static kernel instrumentation. A number of our projects here began as kernel patches and are moving away from that approach, if only because tracking the kernel has been far too time-consuming. That probably wouldn't be the case with this style of static instrumentation, but there are other issues. In particular, one of our focuses is to attach to a running machine once we start to observe problems, probe into the kernel, figure out what's going on, and then detach.

This whole idea of attach / detach is really at the heart of it.

For instance, we've had problems before with certain network card chipsets: the cards we tested in beta worked great under heavy load, but the chipset in the official versions was slightly changed. When we really hammered these cards in a parallel machine, slamming the network as we often do, one of the cards would randomly time out, the driver would reset it, and it would continue. They FUNCTIONED, but their performance would periodically drop to about (seriously) 20,000 times slower for one network operation - then it would fix itself and move on. With a couple hundred to thousands of these cards in a large machine, the probability of this happening obviously goes up and we're in for a lot more trouble. Code appears to slow down, and people try to figure out where the problems are.

Point being - if we had the type of instrumentation I want to build with SystemTap, we'd see outlying socket operations and could correlate them with our application's problems. We would then know a lot more about what's going on in our system, from start to finish.
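
To make "outlying socket operations" a bit more concrete, here is a minimal sketch of how collected per-operation latencies could be screened afterwards. It assumes the latencies have already been gathered somehow (for example by a probe script) as plain microsecond values; the function name, the threshold factor, and the sample numbers are all made up for illustration and do not come from any real measurement.

```python
from statistics import median


def find_outliers(latencies_us, factor=1000.0):
    """Return (index, latency) pairs whose latency exceeds factor x the median.

    latencies_us: per-operation latencies in microseconds (hypothetical input).
    factor: how many times slower than the median counts as an outlier
            (an arbitrary choice for this sketch).
    """
    if not latencies_us:
        return []
    baseline = median(latencies_us)
    threshold = baseline * factor
    return [(i, lat) for i, lat in enumerate(latencies_us) if lat > threshold]


# Made-up sample: most operations take about 50 us, one stalls for about 1 s,
# i.e. roughly 20,000 times slower than normal.
samples = [48, 52, 50, 49, 1_000_000, 51, 50]
outliers = find_outliers(samples)
print(outliers)
```

The median is used as the baseline rather than the mean so that a single huge stall does not drag the baseline up and hide itself.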

Also, in a similar problem with network cards, we often see what appears to be a hung system. We dynamically attach to a running app and find it's sitting in a call waiting on a network operation to complete. We then wonder - is the kernel waiting? Is the kernel stuck? Is our network dead? So many questions, and we hope that with some careful probing we can hook into this stuck application and really zero in on where the problem is. Believe it or not, we find kernel bugs semi-regularly, and tracking them down can waste tons of our time.

Sorry for the rambling - trying to paint a picture for you of where we're coming from and where we're hoping SystemTap can help us.

I haven't had time yet to start digesting the scripts that were linked to me this morning, so I'll need to do that before I can really determine what further assistance / direction we need.

I really appreciate everyone's time. Take care.

-- Nathan
Correspondence
---------------------------------------------------------------------
Nathan DeBardeleben, Ph.D.
Los Alamos National Laboratory
Parallel Tools Team
High Performance Computing Environments
phone: 505-667-3428
email: ndebard@lanl.gov
---------------------------------------------------------------------

Follow-Ups:
- Re: Evaluating SystemTap for Network Response Times
  - From: Frank Ch. Eigler

References:
- Re: Evaluating SystemTap for Network Response Times
  - From: Frank Ch. Eigler
