Bug: runtime: taskfinder2: stap_stop_task_finder() might busy-wait a spinlock forever
The __stp_inuse_count counter might get out of sync when kernel memory
allocations fail. This causes stap_stop_task_finder() to wait for the
counter forever.
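A minimal user-space sketch of the invariant involved, not the actual runtime code: a counter that is incremented before an allocation has to be decremented again when the allocation fails, otherwise anything waiting for it to reach zero never finishes. The names `inuse_count`, `try_alloc`, `begin_work` and `end_work` are hypothetical, and a plain int stands in for whatever atomic type the runtime uses.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the runtime's in-use counter. */
static int inuse_count = 0;

/* Hypothetical switch so the allocation-failure path can be exercised. */
static int fail_next_alloc = 0;

static void *try_alloc(size_t n)
{
    if (fail_next_alloc) {
        fail_next_alloc = 0;
        return NULL;
    }
    return malloc(n);
}

/* Take a reference, then allocate; drop the reference if allocation fails. */
static void *begin_work(size_t n)
{
    void *p;

    inuse_count++;
    p = try_alloc(n);
    if (p == NULL)
        inuse_count--;  /* without this, a waiter on inuse_count == 0 spins forever */
    return p;
}

/* Release the memory and the reference taken by begin_work(). */
static void end_work(void *p)
{
    assert(p != NULL);
    free(p);
    inuse_count--;
}
```

With the decrement on the failure path, the counter returns to zero whether or not the allocation succeeded.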
This might happen on systems short of available memory. The stapio
process might get stuck at almost 100% CPU usage (though it is not a CPU
soft-lockup, due to the schedule() calls inside the lock waiting loop).
The hottest kernel backtrace for such stapio processes looks like this:
Bug: procfs: NULL ptr deref might happen in relay_file_open()
inode->i_private might occasionally be NULL in relay_file_open()
(which is triggered by stapio's openat() syscall) due to a race
condition in our __stp_procfs_relay_create_buf_file_callback()
function.
Add a wrapper around kernel's relay_file_open() for our procfs's
open operation so that we always check if inode->i_private is NULL.
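The wrapper idea can be sketched in user space as follows. The `mock_*` types, `wrapped_open` and `checked_open` are hypothetical stand-ins for the kernel structures and functions, and the `-ENOENT` return value is an assumption, since the message does not say which error the real wrapper returns.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Minimal mock types; the kernel's struct inode and struct file are far larger. */
struct mock_buf   { int refs; };
struct mock_inode { void *i_private; };
struct mock_file  { void *private_data; };

/* Stand-in for the wrapped open routine, which dereferences i_private. */
static int wrapped_open(struct mock_inode *inode, struct mock_file *filp)
{
    struct mock_buf *buf = inode->i_private;

    buf->refs++;                /* would crash if i_private were NULL */
    filp->private_data = buf;
    return 0;
}

/* Hypothetical wrapper: refuse to open until i_private has been populated. */
static int checked_open(struct mock_inode *inode, struct mock_file *filp)
{
    assert(inode != NULL && filp != NULL);
    if (inode->i_private == NULL)
        return -ENOENT;         /* assumed error code, not taken from the source */
    return wrapped_open(inode, filp);
}
```

The caller sees an ordinary open failure during the race window instead of a NULL pointer dereference.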
Stan Cox [Fri, 9 Sep 2022 20:09:10 +0000 (16:09 -0400)]
Initial python 3.11 backtrace support
Use @defined to handle PyFrameObject members moved to PyInterpreterFrame by
python 3.11 such as: f_code f_back f_globals, f_localsplus, f_lasti. Support
3.11 dictionaries: me_value, me_key, dk_size, dk_entries, dk_kind. Support
3.11 local variable accessing: co_varnames.
Optimize: runtime: context: avoid allocating context structs for offline CPUs
We used to allocate context structs for all the "possible CPUs", which
is quite wasteful.
Some VM hypervisors like VMware assign a large number of "possible CPUs"
to their guests by default, which might lead to a huge amount of memory
allocated in the stap ko module.
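As a rough user-space illustration of the saving, the sketch below allocates a context only for CPUs whose bit is set in an "online" mask and leaves the other slots NULL. `MAX_CPUS`, `struct ctx`, the bitmask representation and both function names are hypothetical. The sketch also ignores CPUs that come online later, which a real implementation would have to handle.

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_CPUS 64  /* hypothetical bound on "possible" CPUs */

struct ctx { char scratch[4096]; };  /* stand-in for a per-CPU context */

static struct ctx *contexts[MAX_CPUS];

/* Free every allocated context and reset the table. */
static void free_contexts(void)
{
    int cpu;

    for (cpu = 0; cpu < MAX_CPUS; cpu++) {
        free(contexts[cpu]);  /* free(NULL) is a no-op */
        contexts[cpu] = NULL;
    }
}

/* Allocate one context per CPU whose bit is set in online_mask.
 * Returns the number allocated, or -1 on allocation failure
 * (after freeing whatever had been allocated so far). */
static int alloc_online_contexts(unsigned long long online_mask)
{
    int cpu, n = 0;

    for (cpu = 0; cpu < MAX_CPUS; cpu++) {
        if (!(online_mask & (1ULL << cpu)))
            continue;  /* offline: no allocation */
        assert(contexts[cpu] == NULL);
        contexts[cpu] = calloc(1, sizeof(struct ctx));
        if (contexts[cpu] == NULL) {
            free_contexts();
            return -1;
        }
        n++;
    }
    return n;
}
```

With 64 possible CPUs and two online, this allocates two contexts instead of 64.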
Optimize: runtime: print: avoid allocating string buffers for offline CPUs
We used to allocate string buffers for all the "possible CPUs", which
is quite wasteful.
Some VM hypervisors like VMware assign a large number of "possible CPUs"
to their guests by default, which might lead to a huge amount of memory
allocated in the stap ko module.
Frank Ch. Eigler [Fri, 19 Aug 2022 19:00:22 +0000 (15:00 -0400)]
PR29507: generalize sample python tapset for loose python{2,3} library versions
We can rely on stap 4.2+'s probe-context passing to functions to make
it unnecessary to decorate each @cast() with a libpython path name.
This lets these tests work on a range of python libraries.
These helper functions really should go into the standard python tapset,
rather than sit here in the examples, but that's for later.
Martin Cermak [Wed, 20 Jul 2022 10:50:00 +0000 (12:50 +0200)]
Fix failing nfsd.createv3 in testsuite/buildok/nfsd-all-probes.stp
* tapset/linux/nfsd.stp: Make nfsd.createv3 and nfsd.createv3.return
optional in nfsd.entries, since the underlying probe point no longer
exists in kernels 5.19+ per kernel commit 1c388f27759c5d9271d4fca0 .
This fixes `stap -p4 testsuite/buildok/nfsd-all-probes.stp`.
* testsuite/buildok/nfsd-detailed.stp: Make nfsd.createv3 tests
optional.
Note 1: testsuite/buildok/nfsd-all-probes.stp tries to compile
something like:
probe nfsd.* , nfsd.*.* , nfsd.*.*.* { ... }
which means that the testcase overrides the tree level ? optionality, and
forces each level of the tree to carry ? also.
Note 2: this update is analogous to PR18856 / 3fc11ed07bad37 .
Stan Cox [Wed, 13 Jul 2022 13:49:51 +0000 (09:49 -0400)]
python 3.11 removed direct access to PyFrameObject members
Take into account the change in PyFrameObject definition to allow
building systemtap with python 3.11. Additional support for python
3.11 is forthcoming.
William Cohen [Wed, 13 Jul 2022 16:09:26 +0000 (12:09 -0400)]
Make variable initializer work with RHEL6 compiler
The gcc 4.4 compiler in RHEL 6 does not understand initializers that
use ".field=". Adjusted the variable initialization to work with the
older compiler.
William Cohen [Tue, 12 Jul 2022 01:08:46 +0000 (21:08 -0400)]
Update sleeptime.stp to work with newer kernels and tracepoint syscalls
Newer kernels use syscall.clock_nanosleep instead of
syscall.nanosleep. In some cases tracepoint implementations of
syscall.* are used, which do not allow the use of @entry(). The revised
code uses an explicit associative array to track the syscall entry time
rather than @entry() in the syscall.*.return handler.
William Cohen [Mon, 11 Jul 2022 22:10:01 +0000 (18:10 -0400)]
Extract the exit_reason from trace_kvm_exit vcpu argument on newer kernels
For x86_64 processors, newer kernels change where the exit_reason
information is located. In older kernels the exit_reason was a
parameter of trace_kvm_exit. In newer kernels exit_reason
is a field buried in a member field of the vcpu argument. Make
kvm_service_time.stp pick the appropriate location for exit_reason.
William Cohen [Fri, 24 Jun 2022 21:09:39 +0000 (17:09 -0400)]
PR29037 Handling gcc11 bitfields
The newer DWARF5 output provided by GCC11 no longer has a
DW_AT_data_member_location attribute describing where the bitfield is
located. This information needs to be extracted from the
DW_AT_data_bit_offset.
The patch maps the newer DWARF5 DW_AT_data_bit_offset information
internally to a format that matches up with the
DW_AT_data_member_location information because the dwarf_getlocation_addr
function does not understand DW_AT_data_bit_offset. An equivalent
DW_AT_data_member_location attribute is generated from the size of the
underlying type used to store the bitfield and the
DW_AT_data_bit_offset information.
The get_bitfield function was also modified to determine
the shifts and masking operations using the DW_AT_data_bit_offset.
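The arithmetic described above can be sketched as follows. This is an illustration under stated assumptions, not the patch itself: it assumes a little-endian target and a bitfield that does not straddle its storage unit, and `struct bitfield_loc`, `locate_bitfield` and `extract_bitfield` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

struct bitfield_loc {
    uint64_t byte_offset;  /* start of the storage unit, like DW_AT_data_member_location */
    unsigned shift;        /* bits to shift right after loading the storage unit */
    uint64_t mask;         /* applied after the shift */
};

/* Derive a byte offset, shift and mask from a DW_AT_data_bit_offset value,
 * the bitfield width in bits, and the byte size of the underlying type.
 * Little-endian only (an assumption of this sketch). */
static struct bitfield_loc locate_bitfield(uint64_t data_bit_offset,
                                           unsigned bit_size,
                                           unsigned storage_bytes)
{
    struct bitfield_loc loc;
    uint64_t unit_bits = (uint64_t)storage_bytes * 8;

    assert(storage_bytes > 0 && storage_bytes <= 8);
    assert(bit_size > 0 && bit_size <= unit_bits);

    /* Round the bit offset down to the start of its storage unit. */
    loc.byte_offset = (data_bit_offset / unit_bits) * storage_bytes;
    loc.shift = (unsigned)(data_bit_offset % unit_bits);
    assert(loc.shift + bit_size <= unit_bits);  /* no straddling */
    loc.mask = (bit_size == 64) ? ~(uint64_t)0
                                : (((uint64_t)1 << bit_size) - 1);
    return loc;
}

/* Pull the bitfield value out of an already-loaded storage unit. */
static uint64_t extract_bitfield(uint64_t unit, struct bitfield_loc loc)
{
    return (unit >> loc.shift) & loc.mask;
}
```

For a 3-bit field at bit offset 37 stored in a 4-byte type, this yields byte offset 4, shift 5 and mask 0x7.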
William Cohen [Thu, 26 May 2022 20:45:52 +0000 (16:45 -0400)]
Filter out aarch64 mapping symbols
Like 32-bit ARM, aarch64 also has mapping symbols in
its binaries to mark the start of A64 code ("$x") and data ("$d").
The code for 32-bit ARM has been extended to handle aarch64.
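A minimal sketch of such a filter, assuming symbol names are available as C strings. The function name `is_mapping_symbol` is hypothetical. Besides "$x" and "$d" it also accepts the 32-bit ARM "$a" and "$t" markers and an optional ".suffix", which follow the ARM ELF conventions rather than anything stated in this commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Return true if name looks like an ARM/aarch64 mapping symbol:
 * "$a", "$t", "$d" or "$x", optionally followed by ".<anything>". */
static bool is_mapping_symbol(const char *name)
{
    assert(name != NULL);
    if (name[0] != '$' || name[1] == '\0')
        return false;
    if (strchr("atdx", name[1]) == NULL)
        return false;
    return name[2] == '\0' || name[2] == '.';
}
```

A symbol lookup would skip any entry for which this returns true, so that backtraces show the enclosing function name instead of "$x".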
This improves the backtraces from:
Sultan Alsawaf [Wed, 27 Apr 2022 01:36:05 +0000 (18:36 -0700)]
runtime: remove stap_utrace_detach_ops() from task_finder2
This function isn't actually needed because it's guaranteed that all the
utrace engines will be detached by the time stap_utrace_detach_ops() runs,
since utrace_exit() always runs before stap_utrace_detach_ops(). This is
evidenced by the utrace derived probe group being ordered above the
task_finder derived probe group in all_session_groups(): the utrace group
runs stap_utrace_detach_ops() on cleanup, and the task_finder group runs
utrace_exit() on cleanup, and since DOONE(task_finder) comes *after*
DOONE(utrace), the task_finder group's cleanup will always come *before*
the utrace group's cleanup.
The motivation for removing this is twofold: stap_utrace_detach_ops()
incurs cubic runtime complexity (it is called once for each utrace probe,
iterates across every process in the system, and then iterates across every
thread for each process), and it has two serious bugs that'd surface if it
did ever actually find a utrace engine to detach (which will never happen
since all the utrace engines are guaranteed to be detached beforehand, as
explained above).
The two bugs are related to the following call chains:
stap_stop_task_finder
  utrace_exit
    stp_task_work_exit <-- Dangerous to add task workers after this...
stap_utrace_detach_ops
  rcu_read_lock <-- RCU read lock acquired...
  stap_utrace_detach(tsk = do_each_thread())
    utrace_control(UTRACE_DETACH)
      utrace_do_stop
        stp_task_notify_resume
          stp_task_work_add <-- BUG: we won't wait for this worker to finish!
    utrace_barrier(tsk != current)
      schedule_timeout_interruptible <-- BUG: scheduling under RCU read lock!
  rcu_read_unlock <-- RCU read lock released...
Since stap_stop_task_finder() always comes before stap_utrace_detach_ops(),
that means stp_task_work_exit() also happens beforehand, which is bad
because stap_utrace_detach_ops() may add a task worker. The added task
worker can thus run after the stap module is unloaded.
The other bug involves sleeping under an RCU read lock, which is expressly
forbidden.
Neither of these bugs can ever actually occur though because there will
never be any utrace engines to detach by the time stap_utrace_detach_ops()
runs. As such, remove stap_utrace_detach_ops() since it is not only buggy,
but also a CPU hog that slows down module unload. The old task_finder.c
used by ancient kernels still has its own copy of stap_utrace_detach_ops()
that's needed, so only skip emitting the stap_utrace_detach_ops() call when
task_finder2.c is used.
Sultan Alsawaf [Tue, 24 May 2022 03:13:30 +0000 (20:13 -0700)]
runtime: clean up procfs directory when transport init fails
When _stp_transport_data_fs_init() fails, the module's procfs directory
under /proc/systemtap isn't removed. Fix it by adding the missing call to
_stp_rmdir_proc_module() on error.
Noah Sanci [Wed, 11 May 2022 18:00:16 +0000 (14:00 -0400)]
stap-profile-annotate.in:
- Added context when context width is requested for readability
- Fixed potential issue where some paths taken from a
debuginfod server may cause a security leak by accessing
../ multiple times.
Martin Cermak [Mon, 9 May 2022 18:00:15 +0000 (20:00 +0200)]
refix PR28634 for rhel8+ kernels
The rhel kernel backports do not always align to upstream, so that
the KERNEL_VERSION() based gate needs to be updated using a version
that does the expected thing for the rhel{7,8,9} kernels.
Martin Cermak [Wed, 4 May 2022 09:21:17 +0000 (11:21 +0200)]
Testsuite: Prevent the hwcaps based dynamic loader search
The s390x syscall testsuite started to experience a problem where
the test logs were flooded with hundreds of newfstatat and openat
syscalls. The reason for this was that the dynamic loader was
searching for shared objects based on hwcaps:
This doesn't happen if the LD_LIBRARY_PATH isn't set. Another
approach to avoid the flood would be to export LD_HWCAP_MASK=0.
This patch only unsets LD_LIBRARY_PATH because it looks good
enough.
Instead of doing a full autoreconf, which, in my case, would
produce a huge messy patch, this is only a targeted change
to the testsuite Makefile.{am,in}.
This update doesn't seem to break the dyninst part of the
testsuite, regardless of systemtap configure --prefix.
- reduce STP_RELAY_TIMER_INTERVAL to approx. 1ms, so that userspace is informed
within about 1ms of a buffer being given content
- suppress the "bufhdr corrupt" warning from staprun, instead make the
runtime's "There were N transport failures" message a little more quantitative
With the previous change re. bulk mode, at least the occurrence of
transport failures falls off with these changes, especially for -b
bulk mode. More work is ongoing.
Before this patch, stapio needlessly synchronized on bufhdr sequence
numbers across CPUs, and then neglected to write those headers into
the stpd_cpuN files so stap-merge could do the sort. Now stap-merge
works again, and a bit faster.
Sultan Alsawaf [Thu, 28 Apr 2022 01:59:53 +0000 (18:59 -0700)]
buildrun.cxx: skip objtool processing for tracequery and typequery modules
The tracequery and typequery modules are never loaded, so objtool's
instruction rewrites for things like jump targets aren't needed. Since
objtool is slow and uses a lot of memory, skip it when compiling the
tracequery and typequery modules.
William Cohen [Wed, 27 Apr 2022 18:14:17 +0000 (14:14 -0400)]
PR29094: Include rpm/rpmcrypto.h when required
rpm-4.18.0 moved the prototypes for rpmFreeCrypto() into a new header,
/usr/include/rpm/rpmcrypto.h. Have the configure check for it
and include it when required.
Sultan Alsawaf [Wed, 27 Apr 2022 01:24:10 +0000 (18:24 -0700)]
runtime: fix tracepoint entry leak on error when add_probe() fails
When add_probe() in stp_tracepoint_probe_register() fails on a tracepoint
entry that's just been created, the refcount of the freshly-made tracepoint
entry will be zero by the time stp_tracepoint_exit() runs, at which point
stp_kernel_tracepoint_remove() will skip freeing the tracepoint because its
refcount won't be one. Furthermore, since stp_tracepoint_probe_unregister()
isn't called for a stp_tracepoint_probe_register() that fails, tracepoints
which are registered for internal stap use (like the utrace ones) cannot
be cleaned up on error by stp_tracepoint_exit(), so removing the refcount
check in stp_kernel_tracepoint_remove() won't always fix this.
As such, fix the leak by removing the tracepoint entry immediately on error
when it has a refcount of zero.
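The fix can be pictured with a small user-space model. It is only a sketch: the one-slot table, `struct tp_entry`, `probe_register` and the `add_probe_ok` flag that simulates add_probe() succeeding or failing are all hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

struct tp_entry { int refcount; };

static struct tp_entry *the_entry;  /* one-slot "table" keeps the sketch small */
static int live_entries;            /* number of entries currently allocated */

/* Register a probe on the (single) tracepoint entry, creating it on demand.
 * add_probe_ok simulates whether the add_probe step succeeds. */
static int probe_register(int add_probe_ok)
{
    struct tp_entry *e = the_entry;

    if (e == NULL) {
        e = calloc(1, sizeof(*e));
        if (e == NULL)
            return -1;
        the_entry = e;
        live_entries++;
    }
    assert(e->refcount >= 0);

    if (!add_probe_ok) {
        /* Nobody holds a reference to a freshly created entry, so remove
         * it right away instead of leaving it for exit-time cleanup. */
        if (e->refcount == 0) {
            the_entry = NULL;
            free(e);
            live_entries--;
        }
        return -1;
    }

    e->refcount++;
    return 0;
}
```

A failed registration on a new entry leaves nothing behind, while a failed registration on an entry that already has users leaves that entry and its refcount untouched.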
William Cohen [Tue, 26 Apr 2022 15:56:45 +0000 (11:56 -0400)]
PR29028: Support Linux kernels with CONFIG_RETHOOK set
Linux 5.18.0 kernels added a function exit_handler to fprobe
(https://lkml.org/lkml/2022/1/28/616). kretprobe makes use of that
infrastructure if it is available. However, this use of fprobe
infrastructure changes the member field location depending on
CONFIG_RETHOOK. Access to ret_addr field needs to be done through a
William Cohen [Tue, 26 Apr 2022 14:11:19 +0000 (10:11 -0400)]
Adjust ioblock.stp tapset includes for Linux 5.18.0
Linux kernel commit 322cbb50de711814c42fb088f6d31901502c711a moved the
contents of genhd.h into blkdev.h and eliminated genhd.h. Use genhd.h
for pre-5.18.0 kernels and blkdev.h for 5.18.0 and later.
William Cohen [Mon, 25 Apr 2022 19:02:15 +0000 (15:02 -0400)]
Avoid gcc-12 -Werror=format= issues in staprun/monitor.c
The %*s format in the wprintw takes a pair of arguments, an int and a
pointer to a string. The width array supplying the first argument
was declared as size_t. On rawhide, gcc-12 flags those with
errors like the following:
monitor.c:450:27: error: field width specifier ‘*’ expects argument of type ‘int’, but argument 3 has type ‘size_t’ {aka ‘long unsigned int’} [-Werror=format=]
450 | wprintw(status, "\n%*s\t%*s\t%*s\t%*s\t%*s\t%*s\t%s\n",
| ~^~
| |
| int
451 | width[p_index], HIGHLIGHT("index", p_index, comp_fn_index),
| ~~~~~~~~~~~~~~
| |
| size_t {aka long unsigned int}
The %*s makes use of the integer's sign to indicate whether to left
justify or right justify the output, so the cautious compiler flags
passing in the long unsigned int. To follow the %*s conventions, the
width array was made an int array, which eliminates these errors.
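The convention can be shown with plain snprintf(), which uses the same format rules as wprintw(). The sketch below is illustrative only; the `width` values and the `format_row` helper are hypothetical and unrelated to the real monitor.c layout.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Column widths kept as int, the type that the '*' field width expects. */
static const int width[] = { 5, 8 };

/* Format two columns: the first right-justified, the second left-justified
 * by passing a negative width. Returns what snprintf() returns. */
static int format_row(char *out, size_t outsz, const char *a, const char *b)
{
    assert(out != NULL && a != NULL && b != NULL);
    return snprintf(out, outsz, "%*s|%*s", width[0], a, -width[1], b);
}
```

Because a negative width selects left justification, an unsigned type such as size_t cannot express the full range of valid arguments, which is why the compiler insists on int.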
High-message-rate stap scripts more easily lose message synch or bog
down if the subbuf size is large. PAGE_SIZE appears to be a sweet
spot, so let's fix that. (At least one subbuf is used per probe hit
that produces output. Allocation occurs at the subbuf granularity, so
making it smaller is apparently of no advantage.) stap -s and
free-memory still affect transport memory allocation, but only as to
the number of subbufs.
Sultan Alsawaf [Fri, 22 Apr 2022 23:06:45 +0000 (16:06 -0700)]
runtime: fix timing stat leaks when module init fails partway through
When systemtap_module_init() fails partway through, cleanup isn't done for
stp_session_init(), which allocates memory for probe and refresh timing
stat collection. Fix it by adding the appropriate cleanup on error to
systemtap_module_init().
Sultan Alsawaf [Thu, 21 Apr 2022 20:58:58 +0000 (13:58 -0700)]
runtime: use RCU-protected get_mm_exe_file() on old kernels that have it
Some old kernels (such as the one in CentOS 7) have the RCU-protected
get_mm_exe_file() patch backported to them, in which case it's preferable
to make use of the RCU optimization to avoid sporadic failures from the
down_read_trylock() due to mmap_sem contention. Since the commit that adds
the RCU protection to get_mm_exe_file() also adds a get_file_rcu() macro,
we can just check for the existence of get_file_rcu() on kernels < 4.1. If
the macro doesn't exist for some reason despite the old kernel having the
RCU optimization, we just fall back to using down_read_trylock() the same
as before. If the old kernel has get_file_rcu() despite lacking the RCU
protection that goes along with it, then said kernel has bigger problems.
Sultan Alsawaf [Thu, 21 Apr 2022 00:11:37 +0000 (17:11 -0700)]
staprun: interpret a non-zero systemtap_module_init() return as an error
Errors returned from systemtap_module_init() can often be positive, and
tracking down all sources of the positive return values is error-prone.
Instead, simply interpret any non-zero return from systemtap_module_init()
as an error so that staprun doesn't poll forever on waiting for a dead
stap module to do something.
Sultan Alsawaf [Wed, 20 Apr 2022 23:49:40 +0000 (16:49 -0700)]
runtime: clean up when starting the task finder fails partway through
When the task finder fails to start, systemtap_module_exit() won't be
called to handle the cleanup because systemtap_module_init() will have
returned an error. This becomes lethal when the task finder errors out
*after* initializing utrace, since that means utrace won't be stopped and
thus the utrace tracepoint callbacks will remain registered after the stap
module is unloaded, causing the kernel to explode spectacularly upon
executing code in memory that's been freed.
To fix this, make stap_start_task_finder() handle partial cleanup itself
when there's an error, since systemtap_module_exit() won't be the one to do
it. This also reorders the task finder starting process to make the hardest
item to clean up (utrace init) come last, and removes a bogus decrement on
the task finder state variable on error since we now know the hard way that
stap_stop_task_finder() won't actually be called to do cleanup when there's
a failure partway through stap_start_task_finder().
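The "clean up after yourself on partial failure" pattern described above is commonly written in C with goto-based unwinding. The following is a generic user-space sketch of that pattern, not the task finder code; the three numbered steps, `do_step`, `undo_step`, `start_all` and the `fail_at` parameter are hypothetical.

```c
#include <assert.h>

/* Records which of three hypothetical startup steps are currently active. */
static int step_done[3];

/* Perform step i; fail_at selects the 1-based step that fails (0 = none). */
static int do_step(int i, int fail_at)
{
    if (fail_at == i + 1)
        return -1;
    step_done[i] = 1;
    return 0;
}

/* Undo a step that previously succeeded. */
static void undo_step(int i)
{
    assert(step_done[i]);
    step_done[i] = 0;
}

/* Run all steps; on failure, undo the completed ones in reverse order,
 * because no exit handler will be called to do it later. */
static int start_all(int fail_at)
{
    int rc;

    rc = do_step(0, fail_at);
    if (rc)
        goto out;
    rc = do_step(1, fail_at);
    if (rc)
        goto undo0;
    rc = do_step(2, fail_at);  /* the hardest step to undo goes last */
    if (rc)
        goto undo1;
    return 0;

undo1:
    undo_step(1);
undo0:
    undo_step(0);
out:
    return rc;
}
```

Placing the step that is hardest to undo last means no unwinding label ever has to undo it.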
Sultan Alsawaf [Tue, 12 Apr 2022 21:00:47 +0000 (14:00 -0700)]
runtime: fix race between different stap modules creating /proc/systemtap
Since stap modules operate independently of one another, there's a race
between the first stap modules loaded on a system where they try to create
/proc/systemtap and all but one fail, leading to the losing stap modules
either failing to load on 3.19+ kernels or loading successfully on <3.19
kernels but leaking an inode and directory refcount, with both cases
additionally producing a WARN.
To fix this, we abuse `module_mutex` in the kernel to synchronize between
all stap modules, which resolves the race completely. However, on 5.12+
kernels, `module_mutex` is no longer an exported symbol and therefore we
cannot find its address and use it unless the host kernel is built with
CONFIG_KALLSYMS_ALL=y and the address of kallsyms_lookup_name() is resolved
in a way that doesn't require the transport to be active (since, right now,
staprun sends the address of kallsyms_lookup_name() via the transport).
This lack of coverage on 5.12+ turns out to be alright though because the
only real issues we're concerned about fixing are the leaks on <3.19
kernels and the module load failure on 3.19+ kernels. Since the lack of
synchronization on 5.12+ kernels will only lead to a cosmetic WARN at
worst, we simply ignore any error from proc_mkdir() when making
/proc/systemtap and thus the module load failure is avoided. Nonetheless,
we still optimistically avoid the cosmetic WARN on kernels >3.19 and <5.12
by using `module_mutex` if it's exported.
Since we don't own `module_mutex`, we elide it after the race window passes
in order to limit the scope of our abuse. Once the race window passes, the
overhead in _stp_mkdir_proc_module() goes back to exactly how it was prior
to this change; i.e., the average case will still be just the single check
for the existence of /proc/systemtap and nothing more.
Stan Cox [Tue, 12 Apr 2022 15:21:01 +0000 (11:21 -0400)]
Have the stap server mok sign modules using stap --sign-module=PATH
Add --sign-module=PATH for use by stap server to pass a specific client
fingerprint to stap for mok signing a module. Add mok path to mok_sign_file,
sign_module, and mok_dir_valid_p. Use mok path to differentiate --sign-module
vs --sign-module=PATH. Without PATH, the fingerprints considered are those
present in $SYSTEMTAP_DIR/.systemtap/ssl/server/moks that are also listed by
'mokutil -l'.
New tool to profile a process or userspace generally, then produce a
hit-counted annotated version of all the relevant sources.
Downloading all the debuginfo & source files requires a working
debuginfod-find with a set $DEBUGINFOD_URLS.
Includes tests and man page.
Signed-off-by: Noah Sanci <nsanci@redhat.com>
Signed-off-by: Frank Ch. Eigler <fche@redhat.com>
William Cohen [Wed, 6 Apr 2022 19:12:55 +0000 (15:12 -0400)]
Adjust threadstacks.stp to work with newer versions of glibc
Newer versions of glibc have moved the allocate_stack function from
libpthread.so.* to libc.so.*. Similarly, the default stack size has
been moved to a different target variable. The threadstacks.stp
script needed to be adjusted to use the new probe point and target
variable.
Martin Cermak [Tue, 5 Apr 2022 19:48:20 +0000 (21:48 +0200)]
The faccessat2 and adjtimex syscall updates
- compat_unistd.h: Add missing defines for faccessat2.
- compile_flags.exp: Omit -m64 on aarch64, where GCC doesn't recognize
such a cmdline switch (using it causes a compile
time error).
- adjtimex.c: Testcase update for modern glibc and kernel.
- systemtap.syscall/tapset/syscall.stp: clock_adjtime64 user persp alias.
William Cohen [Mon, 4 Apr 2022 23:23:30 +0000 (19:23 -0400)]
Add riscv specific ptrace support functions
The riscv linux kernel does not add any other ptrace functionality in
addition to the kernel's base ptrace_request function. Thus, the
_arch_ptrace_argstr and _ptrace_return_arch_prctl_addr functions do
very little. They are defined to allow systemtap scripts
instrumenting ptrace syscalls to compile on riscv.
Stan Cox [Tue, 29 Mar 2022 01:08:34 +0000 (21:08 -0400)]
Add --sign-module to enable users to mok sign their own modules
Add sign-module option. Move MOK_CONFIG_TEXT, mok_dir_valid_p, mok_sign_file,
generate_mok from stap-serverd.cxx to cscommon.cxx. Add sign_module function
to cscommon.cxx. Move MOK_PRIVATE_CERT_NAME, MOK_PRIVATE_CERT_FILE,
MOK_CONFIG_FILE to cscommon.h. Add report_error parameter to generate_mok,
sign_module, mok_dir_valid_p so they can be called from server or client. If
sign-module is requested then call sign_module from passes_0_4. stap-server
continues to mok sign using the same code path.