How to report bugs usefully
Systemtap operates close to the hardware, within the operating system kernel. As such, bugs can manifest themselves in ways that are unusually hard to debug. Developers need a considerable amount of information to have a fighting chance. This page enumerates the kinds of information that are useful and explains how to submit it all. The stap-report script may help gather data.
All problems
We need to know which version of systemtap you're running, and on what script.
- systemtap version (stap -V; if you didn't build it from git, where did your sources come from?; is that the only copy of systemtap installed?)
- directory where systemtap is installed
- the values for all environment variables
- system compiler version (gcc -v)
- the systemtap script you were running, along with any custom tapset scripts
- kernel version (exactly; if from a distribution package, identify exact version number; uname -a; kernel configuration options; from dmesg the version of gcc used to compile the kernel); any custom patches, which by the way are likely to make problem reproduction problematic!
- machine architecture (exactly, i686 vs. x86_64, xeon vs. celeron vs. opteron, SMP or UP, hugemem or hugesmp)
- the version of all related packages installed: rpm -qa --qf %{name}-%{version}-%{release}.%{arch}\\n | egrep 'systemtap|elfutils|kernel|gcc' | sort
- if an older version of systemtap works on the same system, then identify that version too and prepare a diff of their -p3 outputs (generated C code), and also perhaps of their respective runtime/ directories
- any related messages found in a kernel log file (/var/log/messages) or console
- any related messages printed by systemtap on the standard output / error channels. Use more -v (verbose) flags for more information.
- if a probe module .ko file was generated, the output of modinfo FILE.ko
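Several items in the checklist above can be gathered in one pass. The following is a minimal sketch, not a replacement for stap-report: the names collect_basics, section, and stap-bug-info.txt are made up for illustration, and the rpm query only produces output on RPM-based distributions. Review the environment dump before posting it, since it may contain private values.

```shell
#!/bin/sh
# Sketch only: gather basic facts from the checklist into one text file.
# collect_basics, section and stap-bug-info.txt are illustrative names,
# not part of systemtap.

# Print a header, then run the command; note failures instead of aborting.
section() {
    echo "== $* =="
    "$@" 2>&1 || echo "(command failed or not installed)"
}

collect_basics() {
    out=${1:-stap-bug-info.txt}
    {
        section stap -V
        section command -v stap
        section env
        section gcc -v
        section uname -a
        # /proc/version repeats the boot banner also found in dmesg,
        # including the compiler that built the kernel.
        section cat /proc/version
        echo "== related packages =="
        rpm -qa --qf '%{name}-%{version}-%{release}.%{arch}\n' 2>/dev/null |
            grep -E 'systemtap|elfutils|kernel|gcc' | sort ||
            echo "(no rpm data)"
    } > "$out"
}
```

Running `collect_basics my-report.txt` writes everything to my-report.txt, which can then be attached to the report alongside the script and any tapsets.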
Translation-time problems
The systemtap tutorial contains some information on common errors detected during translation/compile time. Errors in this stage may be mystifying, but at least they are safe: the system won't have run anything privileged or dangerous yet.
If the translator (stap) crashes outright (SIGSEGV, assertion failure), the chances are good that the problem can be identified and fixed. Run it under a debugger, and report the stack traceback: gdb -args stap ...., then backtrace at the (gdb) prompt after the crash.
If the probe module loader (staprun) crashes, run it under gdb the same way. Since stap runs staprun via sudo, you may need to run gdb -args staprun .... as root.
GDB will produce better backtraces if your systemtap binary is compiled with debugging (-g). If you are running a prepackaged copy of systemtap, try installing its own -debuginfo package (if available for your distribution).
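For a crash that reproduces reliably, the backtrace can also be captured without an interactive session. This is a sketch that assumes gdb is installed; stap_backtrace is a hypothetical helper name. The -batch and -ex options tell gdb to run the program, print the backtrace once it stops, and then exit.

```shell
# Sketch: run stap under gdb non-interactively and print a backtrace
# once the program stops. stap_backtrace is an illustrative name.
stap_backtrace() {
    gdb -batch -ex run -ex backtrace --args stap "$@"
}
```

For example, `stap_backtrace -v myscript.stp > backtrace.txt 2>&1` (with myscript.stp standing in for your own script) leaves a transcript that can be attached to the report. The same pattern applies to staprun, run as root.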
System crashes
The most unfortunate problems occur when systemtap compiles your probe script properly, but then the machine dies. It may be a problem with systemtap, or the kernel, or the compiler, or something totally different. The following information tends to help narrow down the list of possible suspects:
- machine vital statistics (memory size, number of processors, general system activity level)
- what happened at the moment of crash (hang; oops message; panic; smoke rising through the vents; anything readable on a console?; can you try to use a serial console?)
- if you tried it multiple times, does the failure occur intermittently or every time? does it change from time to time? did it change with different concurrent system activity?
- a transcript of the systemtap build process (stap -p4 -vv ....) used to translate/compile the script; don't worry, -p4 means not to run the script, just to translate it
- the contents of /proc/modules
- the contents of /proc/kallsyms, or at least the symbols that are also listed by the stap -p4 -vv .... run.
- a kdump image, or at least the backtrace produced from a subsequent run of the crash analysis tool
- the contents of a panic/bug/oops message, if any, with particular attention to the eip/pc value
- if the faulting address is suspected to be within a stap module, insmod its cached copy by hand and check /proc/modules to find a module-relative eip offset. Given the stap .ko file and that number, one can find the faulting area.
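The offset calculation in the last item is a plain subtraction: the faulting address minus the module's load address. Here is a minimal sketch, assuming the load address is the sixth field of the module's line in /proc/modules (reading real addresses there usually requires root). The helper names module_base and module_offset are hypothetical.

```shell
# Sketch: compute a module-relative offset from a faulting address.
# module_base and module_offset are illustrative names.

# module_base NAME [FILE]: print the load address of module NAME,
# read from FILE (default /proc/modules).
module_base() {
    awk -v m="$1" '$1 == m { print $6 }' "${2:-/proc/modules}"
}

# module_offset FAULT_ADDR BASE_ADDR: print FAULT_ADDR - BASE_ADDR in hex.
module_offset() {
    printf '0x%x\n' "$(( $1 - $2 ))"
}
```

With made-up addresses, `module_offset 0xf8a3c12e 0xf8a3c000` prints 0x12e, the number to include in the report along with the .ko file.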
Submitting the problem report
To get started, it may help to read Eric Raymond's essay on asking questions the smart way. Imagine that the developers are robotic information eaters, and need to be talked to in a crisp and information-dense way. It's nothing personal.
The bluntest way to get developers' attention is to file a report directly into our bug tracking system. Use the product "systemtap". Describe what you saw with as much detail as is reasonable. There is no need for multi-megabyte trace files, but numerous small attachments would be fine. Each new bugzilla entry sends an instant email to our mailing list, so the developers get notified right away. They may ask you further questions, to submit more specific information; or they may apologize, grovel, and swear that the bug will be fixed tomorrow. Or not (sorry).
If the bug tracking system seems too formal for you, you may just send the essentials to the mailing list directly. Or you can strike up a conversation on IRC, just in case one of the developers is taking a break from the whipping-induced coding frenzy (irc.freenode.net, #systemtap).