This is the mail archive of the
systemtap@sources.redhat.com
mailing list for the systemtap project.
4/20/05 meeting - minutes from morning session
- From: "Chen, Brad" <brad dot chen at intel dot com>
- To: <systemtap at sources dot redhat dot com>
- Date: Fri, 22 Apr 2005 16:04:56 -0700
- Subject: 4/20/05 meeting - minutes from morning session
Morning notes and bin list
Systemtap F2F
20 April 2005
Santa Clara CA
Attendees:
Brad Chen (Intel)
Charles Spirakis (Intel)
Ulrich Drepper (RedHat)
John Healy (RedHat)
Frank Eigler (RedHat)
K. Sridharan (Intel)
Martin Hunt (RedHat)
Will Cohen (RedHat)
Vara Prasad (IBM)
Rohit Seth (Intel)
Douglas Armstrong (Intel)
Karen Bennett (RedHat)
Elena Zannoni (RedHat)
Andrew Kleen (RedHat)
Rusty Lynch (Intel)
Anil Keshavamurthy (Intel)
Jim Keniston (IBM)
Hiam Nguyen (IBM)
Josh Stone (Intel)
Sri Doddapaneni (Intel)
Morning Presentations
9:30 Run time and kernel to user data transport (Martin)
10:00 Kprobes/Jprobes and their enhancements (Jim and Hien)
10:30 Build and test (Will)
11:00 Safety (Brad)
11:30 Scheduler and lock tapset (Doug)
======================================================================
Frank - Systemtap overview and history
- Solaris DTrace has made a lot of noise in the press
- Translator-based system, using simple scripting language like awk or
DTrace
- compile into kernel modules
Progress since initial concept
- a number of disposable prototypes
- RedHat work
- Martin's runtime code
- translator work
- IBM side
- kprobes extensions
- contributions from Jim and others to study open-holes in design
Expectation: something shareable/workable starts in a few weeks.
Q: When did development start in earnest (Rusty)
Frank: three or four months ago; mid-late January. Coding has mostly
been happening in the background, partly because design is still
developing. Some of code is throw away, hopefully not all of it.
Q: Is translator optional or required?
Frank: required.
Karen: If you see key opportunities or things to fix, please speak up.
===========================================================
Martin - Systemtap Runtime
[see also slides Martin posted to the mailing list]
The Runtime is:
- C code used by translator
- compiled into kernel module
- could be used to write providers (an option that might or might not
be supported)
Goals
- safety. Don't crash or corrupt the system
- Low system impact. No kernel stack usage. (limited stack usage)
- High performance
- flexible
- Ease of use - least of goals; mostly used by translator.
Uli - May need to use signal stack.
Martin - exists for all architectures?
Uli - Yes.
[Some detailed discussion with Uli and Rohit about use of signal
stack and how to use it and avoid abusing it. Note binlist item
on figuring out what to do for a stack...]
Rohit: we might also want to talk about branches vs. breakpoints,
at some point.
Possible kprobes work item: worry about use of the signal stack.
Will: haven't run into this problem in oprofile yet; routine is simple
and
light; 2-3 call levels deep at most.
Charles: VTune driver is similar in not using too much stack.
Uli: Some device drivers are il-behaved. Safest bet is to use a separate
stack
TEST CASE: find the worst-case case signal-stack kernel code
(probably a driver), and instrument it kprobes/systemtap to see what
happens.
Uli: safest plan: make a new private stack of limited size, don't use
the kernel or signal stack.
Vara: Will kernel folks be okay with a new stack?
Rohit: kprobes is already there; they shouldn't be overly concerned.
Jim voluteers to go the next step on this issue.
[back to Martin]
What's in the Runtime:
- associative arrays
- "maps"
- keys can be 1 or 2 strings or longs
- values: int64, string, or statistics
- maps have a max size, preallocated
- K to U transport
- relayfs or netlink, both very fast
- Q: Security issues for uplink? (Rusty) [BIN LIST]
- output formatting
- relatively straightforward
- backtraces
- backtrace kernel and user space works today
- issue: don't want to do symbolic lookups in user space
- might use dcookies; oprofile style
-
- copy from userspace funcs
- backtrace and register dump
- safe strings
- strings should not be allocated dynamically
- can't use the stack either
- soln: runtime uses statically allocated string space
Arch diagram:
- stpd daemon interacts with netlink or relayfs, generates files
Frank: Why a daemon rather than a command-line tool?
Martin: historical
- one stpd per loadable systemtap module
- multiple stpds can run simultaneously
- stpd loads and unloads modules
- data saved to per-cpu data files
- note: two users can't load the same module at the same time
potential issue: two users who want to run the same module,
with
pre-compiled modules
Q: How big would the maps for modules be? Space is limited (GB)
Frank: not nearly that big
- Postprocessing:
- done by stpd
- deferred symbolic lookups
- per-cpu data files need to be integrated and saved, etc.
- other things needed by translator
=================================================================
Jim - KProbes
- a facility for inserting a breakpoint in the kernel; specify a handler
function when the breakpoint is hit.
- currently suports x86, x86-64, ppc-64
- Rusty: IPF work underway; no important differences so far between
AMD64 and EM64T
- Key new features for Systemtap: multi probes, return probes
Multi-probes: multiple handlers for the same address
- two competing implementations: Prasanna:
- Ananth has posted a patch which seems to be okay to most people
Will be submitting this patch to MM. Patch is out on lkml.
Return probes: fires when a function returns
- Basic implementation: probe point at entry replaces RA on stack
with address of "trampoline." On return, transfer to trampoline
where a probe is waiting. Handler is invoked etc. Trampoline
code is a no-op; just a place to install a probe-point (kprobe)
- Q: Has this been tried on IA64? [Jim - No]. The IA64 might need
some special attention...
- GDB might provide a relevant point of comparison for IA64 support.
- Vara - x86, x86-64, and PPC all supportable.
- Brad: Intel folks need to check that API is supportable on Itanium.
- Jim - refer to design document on systemtap mailing list "return
probes" or "exit probes".
- Ananth on phone: maintains kprobes; in residence at RedHat (Westford)
for six months.
- Performance improvements: smarter locking for SMP support.
Ananth talks about specific work he's doing. Simpler locking
mechanisms to address outstanding issue...
Last items on Jim's list. Ongoing discussion on mailing list for
these issues (x86-64 RIP-relative addressing, x86_64 wart removal.)
Rohit: holding a semaphore in the kernel? how would you instrument
the interrupt path?
When you register a probe, must acquire a spin-lock...
Only in the probes setup, not use.
Current solution: pre-allocate the memory, one "slot" or cache-line per
CPU.
- eliminates need for semaphore
- concern about cache ping-pong: don't want to have processors bouncing
cache-lines back and forth as a part of kprobe setup, or (?)
singlestepping of probed instructions.
- basic problem is self-modifying code. Want all self-modifying code
to happen in a single unique address that is not shared by other CPUs.
- issue of allocating multiple locations instead of a single location:
execute bit, and 2G relative addressing limitation for x86-64
- Rohit taking the lead on understanding this issue for Intel
=======================================================================
Will - Testing
Goals
- ensure systemtap doesn't crash machines
- show customers it's tested and safe
- find and quickly fix faults
- provide guidiance on instrumentation costs and limitations
Rohit - customer visits. Everybody mentions DTrace, everybody
expects Systemtap to be as robust as DTrace.
Will -
What's checked in:
Components
- runtime, parser/translater, kprobes
system
- integrated components
Testing Issues
- User space environment different than kernel
- limited kernel stack space
- copy from user
- some operations not available in user space, such as accessing
PMU hardware
- need testing on various hardware and kernel configs, big systems and
apps
Frank: (to IBM) To what degree do you expect to be able to stress-test
system?
Frank: to what degree is IBM and Intel interested in this system working
on
"vanilla upstream" kernels?
Rohit: we are very interested
Vara: provides some detail on what kind of testing...
Rusty: the more public the test repository, the easier it is for Intel
to
help with the testing.
- non-determinism - fact that kernel is non-deterministic.
Runtime test plan
- Runtime library user-space function testing
- verify arg checking
- verify in user space that all paths through code exercised
(use GCC profiling support)
- check that memory leaks don't occur
- Runtime library kernel space testing
- individual functions
- data transport (only works in the kernel)
- failure modes (out of memory, etc.)
Parser/Translator component testing
- identify invalid input
- make sure parser/translator generate valid constructs
[tap validator might have a role on these]
KProbe and Kernel Instrumentation Component Testing
- Verify expected functionality of instrumentation
- torture tests
- maximum rate calling instrumented function
- repeated activation and deactivation
- attempt to create race conditions
- attempt to create recursive kprobes
- non-instrumentable locations
- get overhead estimates, max number of probes, etc.
RedHat Test Grid
- internal system to reserve and manage test resources
- collection of machines
- remote KVM - allows bios and boot-up config, power-cycle
- CVS repository for tests (gdb, oprofile, kernel torture (e.g.
cerberus)
- Most of the machines are in Westford.
- not accessible outside
- problem: testing on really large systems (64-way, terabyte memory...)
- Roland: might be good to figure out how to share kernel test suites
- Intel and IBM folks on-site at RedHat do have access to these test
suites
RPM and RedHat build system
...
========================================================================
Brad on Safety
Brad presented the Systemtap plans for safety. Much of
this has been worked through on the mailing list already.
Two undecided items:
- Tap Validator. Disassemble the tap and analyze for
safety properties. Brad did a simple demo of how illegal
instructions and confusing control flow could be exposed.
- Memory Portal, and, separating Policy from Mechanism.
Brad presented a number of simple policies, including
proposals for definitions of "safety" and "guru" modes.
separating Policy and Mechanism could be done with or
without the memory portal; Brad will be posting to the
mailing list to start a discussion on this topic.
Frank liked this table; suggested we put it into the arch paper.
Language Design
| Language Implementation
| | insmod checks
| | | Runtime Checks
| | | | Memory Portal
| | | | | Static Validator
| | | | | |
-------------------------------------------------------
infinite loops x x o
recursion x x o
division by zero x x o
-------------------------------------------------------
array bounds errors x x x o
invalid pointer errors x o
heap memory bugs x o
-------------------------------------------------------
illegal instructions x o
privileged instructions x o
memory read/write restrictions x x o o
-------------------------------------------------------
memory execute restrictions x x o o
version alignment x
end-to-end safety x x
separate policy from mechanism x
-------------------------------------------------------
There are lines to add to this table...
Another suggestion to draft some material for a security
section; that Systemtap doesn't change security of system,
possibility of someday supporting unpriviledged users with
appropriate memory protection etc.
========================================================================
Douglas Armstrong
Synchronization and locking tapset
Intel Thread Profiler team
[other tool is Thread Checker]
Most requirements derived from Intel Thread Profiler; others
from Thread Profiler or other general parallel programming needs.
Tool Interface:
- kernel tools: safety, hard to maintain
- expensive ring transitions
- tap events buffered and dumped to file: space issues
- tap events buffered and sent to user level tools in batches
- double buffering/memory mapped....
Vara: relayfs is a memory-mapped kernel buffer
Q from Will: User threads or Kernel threads?
- We'd like to catch both...
- Even with kernel threads, there are parts of synchronization that
are only implemented in user space.
Q: Frank - why do you want so much raw data, as opposed to aggregation?
- critical path analysis: requires script
- why not in kernel? memory usage, existing user-space implementation
- memory requirements: 500 GB of raw events
Q: Frank, Vara - it would be very helpful to understand the high-level
goals, instead of just talking about low-level details. Not
understanding
why traces are required.
Douglas: events are commonly calls into pthreads
Synchronization of Interest
- system synch objects
- user synch objects implemented by kernel
- user synch objects implemented in user space
- user synch objects; multi-level implementation
User synch objects
- mutex
- ...
- some objects have implementation split across user and kernel level
- 3rd party synchronization libraries
Blocking operations of interest
- File I/O
- Standard I/O
- ...
Synchronization taps:
- spin
- block
- ...
- notion of a tap that might correspond to a union of a set of
kernel events
vs. DTrace:
- Dtrace break down is based on object type. It would be nicer to
have a generic set, with top of object and particular object
available as data
- different flavors of acquire and release locks to expose contention.
- also a "cancel" event (timeout waiting on lock)
Sync Tap Data - What you would want (partial list)
- thread
- signaled thread
- call stack
- pid
- ...
Frequency of events?
- mutexes - 10s of thousands per second
- mostly user-only, uncontended
- would not commonly be seen in well-written applications
Q: How do you confirm that your measurement is not adversely impacting
the application behavior?
- have seen cases where 100s of microseconds causes big problems
- in other cases, milliseconds of events are not a problem
Roland: Overall being able to have zero or tiny overhead for
uncontended cases is important.
Frank: Relation of fast instrumentation of non-contested locks and
DTrace predicates? Are predicates fast enough? Worthy of immitating
in Systemtap? Possibility of using a predicate to identify
contended/uncontended, with predicate running at user level?
Process and thread create/destroy/manage events
- thread needs separate creation and start events
- internal scheduler events: when scheduler makes internal
decisions that impact scheduler policy.
Frank: so far, all this looks easy to implement in script
Scheduler perspective on active/inactive/on/off/yield/switch events
- contrast to synch perspective
- aggregating to reduce instrumentation points is a useful optimization
Will: how about interrupts?
- Worried about accurate accounting for interrupt handling time?
- Probably should be included in the list
Brad: also would like to expose affinity requests
Random taps:
- I/O
- Sampling (with randomization)
- Module taps - loading/unloading of modules
- Signal taps - events for signals and handlers
- begin/end taps
- Spin - ability to have the user-level probe
- User function entry/exit
- all, rather than selective
- it appears that DTrace star notation lets you express this
- User-code traps...
- all, rather than selective
Sri: recently, JVMs exporting hooks that support DTrace
======================================================================
Bin List
- Stack - "minimal stack usage" vs. alternate stack for kprobles [JK]
- container/keys larger than long for associative array indexing? [MH]
Physical address extension - 36 bits.
- Security model - user transport [MH]
SE Linux, more complex than root vs. !root (one possible model)
deferred; won't be complete for U2
- kernel models - implications of trying to load same module [FchE]
(in some form) twice. Also, other limitations of module address space,
vmalloc, static area, etc.
- IA64 stack & instruction architecture is "unique". Be aware [RS]
that traditional IA32 style assumptions will lead to issues.