Frank Ch. Eigler [Mon, 14 Dec 2020 02:05:23 +0000 (21:05 -0500)]
PR23512: fix staprun/stapio operation via less-than-root privileges
Commit 7615cae790c899bc8a82841c75c8ea9c6fa54df3 for PR26665 introduced
a regression in handling stapusr/stapdev/stapsys gid invocation of
staprun/stapio. This patch simplifies the relevant code in
staprun/ctl.c, init_ctl_channel(), to rely on openat/etc. to populate
and use the relay_basedir_fd as much as possible. Also, we now avoid
unnecessary use of access(), which was checking against the wrong
(real rather than effective) uid/gid.
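
A minimal sketch of the openat()-based approach described above, not the actual staprun/ctl.c code. The helper names open_basedir and open_ctl_file are hypothetical; only the idea of keeping a base-directory fd and skipping the access() pre-check comes from the message. Since access() tests the real uid/gid, the sketch simply attempts the open and lets the kernel apply the effective credentials.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Open the directory holding the relay/.cmd files and keep its fd.
 * Returns -1 with errno set on failure. (Hypothetical helper.) */
int open_basedir(const char *path)
{
    return open(path, O_RDONLY | O_DIRECTORY | O_CLOEXEC);
}

/* Open a file relative to the base directory fd. There is no access()
 * pre-check: the open itself is the permission test, and it is evaluated
 * against the effective uid/gid. Returns -1 with errno set on failure. */
int open_ctl_file(int basedir_fd, const char *name)
{
    return openat(basedir_fd, name, O_RDWR | O_CLOEXEC);
}
```

A caller would open the base directory once, then resolve every per-module file through the returned fd and report errno if an open fails.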
Frank Ch. Eigler [Fri, 11 Dec 2020 23:06:36 +0000 (18:06 -0500)]
staprun: handle more and fewer cpus better
NR_CPUS was a hard-coded minimum and maximum on the number of CPUs
worth of trace$N files staprun/stapio would open at startup. While a
constant is useful for array sizing (and so might as well be really
large), the actual iteration should be informed by get_nprocs_conf(3).
This patch replaces NR_CPUS with MAX_NR_CPUS (now 1024, why not), and
limits open/thread iterations to the actual number of processors. It
even prints an error if a behemoth >1K-core machine comes into being.
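
The sizing-versus-iteration split can be sketched as follows. This is an illustration under assumptions, not the staprun source: trace_file_count and trace_file_count_here are hypothetical names, and only MAX_NR_CPUS (1024) and get_nprocs_conf(3) come from the message.

```c
#include <stdio.h>
#include <sys/sysinfo.h>

#define MAX_NR_CPUS 1024  /* array-sizing constant; value taken from the message */

/* Return how many per-CPU trace files to open for a machine with nprocs
 * configured processors, or -1 if that count cannot be handled. */
int trace_file_count(int nprocs)
{
    if (nprocs < 1)
        return -1;
    if (nprocs > MAX_NR_CPUS) {
        fprintf(stderr, "too many cpus: %d > %d\n", nprocs, MAX_NR_CPUS);
        return -1;
    }
    return nprocs;
}

/* Convenience wrapper that asks glibc for the configured processor count. */
int trace_file_count_here(void)
{
    return trace_file_count(get_nprocs_conf());
}
```

Arrays stay statically sized by the constant, while loops over files and threads run only up to the returned count.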
Frank Ch. Eigler [Fri, 11 Dec 2020 20:39:29 +0000 (15:39 -0500)]
relay transport: comment on STP_BULK message
While we've eliminated any STP_BULKMODE effects from the way relayfs
files are used ("always bulkmode"), staprun/stapio still need to know
whether the user intended "stap -b" or not, so they can save the
stpd_cpu* files separately.
Sultan Alsawaf [Thu, 10 Dec 2020 01:22:20 +0000 (17:22 -0800)]
always use per-cpu bulkmode relayfs files to communicate with userspace
Using a mutex_trylock() in __stp_print_flush() leads to a lot of havoc,
for numerous reasons. Firstly, since __stp_print_flush() can be called from IRQ
context, holding the inode mutex from here would make the mutex owner
become nonsense, since mutex locks can only be held in contexts backed
by the scheduler. Secondly, the mutex_trylock implementation has a
spin_lock() inside of it that leads to two issues: IRQs aren't disabled
when acquiring this spin_lock(), so using it from IRQ context can lead
to a deadlock, and since spin locks can have tracepoints via
lock_acquire(), the spin_lock() can recurse on itself inside a stap
probe and deadlock, like so:
The reason the mutex_trylock() was needed in the first place was because
staprun doesn't properly use the relayfs API when reading buffers in
non-bulk mode. It tries to read all CPUs' buffers from a single thread,
when it should be reading each CPU's buffer from a thread running on
said CPU in order to utilize relayfs' synchronization guarantees, which
are made by disabling IRQs on the local CPU when a buffer is modified.
This change makes staprun always use per-CPU threads to read print
buffers so that we don't need the mutex_trylock() in the print flush
routine, which resolves a wide variety of serious bugs.
We also need to adjust the transport sub-buffer count to accommodate for
frequent print flushing. The sub-buffer size is now reduced to match the
log buffer size, which is 8192 by default, and the number of sub-buffers
is increased to 256. This uses exactly the same amount of memory as
before.
Frank Ch. Eigler [Thu, 10 Dec 2020 03:29:43 +0000 (22:29 -0500)]
PR27044: fix lock loop for conditional probes
Emit a nested block carefully so that the "goto out;" from a failed
stp_lock_probe() call in that spot near the epilogue of a
probe-handler goes downward, not upward.
Sultan Alsawaf [Wed, 9 Dec 2020 20:55:10 +0000 (12:55 -0800)]
PR26844: fix off-by-one error when copying printed backtraces
Since log->buf isn't null-terminated, log->len represents the total
number of bytes present in the log buffer for copying. The use of
strlcpy() here with log->len as its size results in log->len - 1 bytes
being copied, with the log->len'th byte of the output buffer being set
to zero to terminate the string. Use memcpy() instead to remedy this,
while ensuring that the output buffer has space for null termination,
since the output buffer needs to be terminated.
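
A small userspace sketch of the memcpy()-plus-terminator pattern described above. The function name copy_log and its signature are hypothetical and do not mirror the runtime code; the point is only that all len bytes are copied and the terminator goes after them, with the destination size checked first.

```c
#include <stddef.h>
#include <string.h>

/* Copy len bytes of non-terminated log data into dst and terminate it.
 * dst_size is the full size of dst. Returns the number of data bytes
 * copied, not counting the terminator. */
size_t copy_log(char *dst, size_t dst_size, const char *buf, size_t len)
{
    if (dst_size == 0)
        return 0;
    if (len > dst_size - 1)
        len = dst_size - 1;   /* keep one byte for the terminator */
    memcpy(dst, buf, len);    /* copies exactly len bytes, none dropped */
    dst[len] = '\0';
    return len;
}
```

With a size argument of len, a strlcpy()-style copy would have stopped one byte short; here the last data byte survives whenever the destination has room for len + 1 bytes.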
This test case stresses nesting of heavy duty processing (backtrace
printing) within kernel interrupt processing paths. It seems to
sometimes trigger problems - so let's make the test harder, to make
latent problems more likely to show up. Instead of quitting after the
first irq_* function hit, stick around for 10 seconds.
Guillaume Morin [Fri, 4 Dec 2020 17:18:44 +0000 (12:18 -0500)]
PR27001: fix runtime/transport/transport.c lockdown build problem
On some kernel/configs, CONFIG_SECURITY_LOCKDOWN_LSM !=
STAPCONF_LOCKDOWN_DEBUGFS, which broke the runtime build.
Use the matching macro as detected by autoconf to fix this.
Sultan Alsawaf [Thu, 3 Dec 2020 20:57:34 +0000 (12:57 -0800)]
runtime: fix print races in IRQ context and during print cleanup
Prints can race when there's a print called from IRQ context or a print
called while print cleanup takes place, which can lead to garbled print
messages, out-of-bounds memory accesses, and memory use-after-free. This
is one example of racy modification of the print buffer len in IRQ
context which caused a panic due to an out-of-bounds memory access:
This patch resolves the IRQ print races by disabling IRQs on the local
CPU when accessing said CPU's print buffer, and resolves the cleanup
races with a lock. We also protect against data corruption and panics
from prints inside NMIs now by checking if the current CPU was accessing
the log buffer when an NMI fired; in this case, the NMI's prints will be
dropped, as there is no way to safely service them without creating a
dedicated log buffer for them. This is achieved by forbidding reentrancy
with respect to _stp_print_trylock_irqsave() when the runtime context
isn't held. Reentrancy is otherwise allowed when the runtime context is
held because the runtime context provides reentrancy protection.
Note the deadlock due to _stp_transport_trylock_relay_inode recursing
onto itself via mutex_trylock.
This is a temporary fix for the issue until a proper patch is made to
remove the mutex_trylock from __stp_print_flush. This should be reverted
when that patch lands (it will have something to do with bulkmode).
Sultan Alsawaf [Wed, 2 Dec 2020 19:27:47 +0000 (11:27 -0800)]
task_finder_vma: add kfree_rcu() compat for old kernels
Newer RHEL 6 kernels have kfree_rcu(), but older ones do not. Using
kfree_rcu() is beneficial because it lets the RCU subsystem know that
the queued RCU callback is low-priority, and can be deferred, hence why
we don't replace kfree_rcu() with call_rcu() outright. Luckily,
kfree_rcu() is a macro so we can just #ifdef with it.
Alice Zhang [Fri, 27 Nov 2020 18:45:41 +0000 (13:45 -0500)]
Conscious language initiatives: replaced whitelist->passlist, blacklist->blocklist, master->main/primary. Some occurrences of master and slave cannot be replaced at this point, e.g. where they are established terminology or part of another program's interface.
Sultan Alsawaf [Wed, 2 Dec 2020 02:47:04 +0000 (18:47 -0800)]
runtime_context: replace _stp_context_lock with an atomic variable
We can't use any lock primitives here, such as spin locks or rw locks,
because lock_acquire() has tracepoints inside of it. This can cause a
deadlock, so we have to roll our own synchronization mechanism using an
atomic variable.
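
One way such a lock-free scheme can look, sketched in portable C11 atomics rather than kernel atomic_t. This is an assumption-laden illustration, not the runtime_context code: the names ctx_users, ctx_tryget, ctx_put and ctx_try_shutdown are hypothetical. A non-negative counter tracks current holders, and a negative value means the contexts are being torn down.

```c
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

/* >= 0: number of current holders; negative: shut down, no new holders. */
atomic_int ctx_users = 0;

/* Try to take a reference. Never blocks and calls no lock primitive,
 * so it cannot recurse into lock tracepoints. */
bool ctx_tryget(void)
{
    int old = atomic_load(&ctx_users);
    do {
        if (old < 0)
            return false;            /* teardown in progress or done */
    } while (!atomic_compare_exchange_weak(&ctx_users, &old, old + 1));
    return true;
}

/* Drop a reference taken by a successful ctx_tryget(). */
void ctx_put(void)
{
    atomic_fetch_sub(&ctx_users, 1);
}

/* Succeeds only when nobody holds a reference; afterwards every
 * ctx_tryget() fails. A caller would retry until this returns true. */
bool ctx_try_shutdown(void)
{
    int expected = 0;
    return atomic_compare_exchange_strong(&ctx_users, &expected, INT_MIN);
}
```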
Sultan Alsawaf [Tue, 1 Dec 2020 17:54:07 +0000 (09:54 -0800)]
runtime_context: synchronize _stp_context_stop more strictly
We're only reading _stp_context_stop while the read lock is held, so we
can move the modification of it to inside the write lock to ensure
strict memory ordering. As such, it no longer needs to be an atomic_t
variable.
We also don't need to disable IRQs when holding the write lock because
only read_trylock is used from IRQ context, not read_lock, so there's no
possibility of a deadlock occurring.
Sultan Alsawaf [Tue, 24 Nov 2020 18:50:10 +0000 (10:50 -0800)]
runtime_context: factor out RCU usage using a rw lock
We can factor out the RCU insanity in here by just adding in a rw lock
and using that to synchronize _stp_runtime_contexts_free() with any code
that has the runtime context held.
Frank Ch. Eigler [Tue, 17 Nov 2020 21:34:59 +0000 (16:34 -0500)]
PR26665 detect rhel8 (4.18) era kernel_is_locked_down() as procfs trigger
A different older kernel API needs to be probed for rhel8 era detection
of lockdown in effect. Added an (undocumented) $SYSTEMTAP_NOSIGN env
var to override automatic --use-server on lockdown, so that one can
inspect runtime/autoconf* operation locally, without stap-server.
Sultan Alsawaf [Tue, 17 Nov 2020 19:03:53 +0000 (11:03 -0800)]
task_finder: call _stp_vma_done() upon error to fix memory leak
The memory allocated inside stap_initialize_vma_map() is not freed upon
error when the task finder is started because a call to _stp_vma_done()
in the error path is missing. Add it to fix the leak.
Jamie Bainbridge [Tue, 17 Nov 2020 17:50:04 +0000 (12:50 -0500)]
examples: add timestamp to dropwatch.stp
When using dropwatch.stp to troubleshoot packet drops, it is often done
with additional troubleshooting such as packet captures and collections
of other commands like "ethtool -S" or "netstat -s".
To correlate traffic loss events across the various outputs, these
should all have timestamps.
Add ctime timestamp to dropwatch to enable this. Update documentation to
show example timestamp collection.
Frank Ch. Eigler [Mon, 16 Nov 2020 23:54:11 +0000 (18:54 -0500)]
PR26665: mokutil output parsing tweaks
We encountered secureboot keys in the wild that didn't live up
to the expectations of the current little state machine. Tweaked
regexps to accept Issuer: O= as well as Issuer: CN= lines. With
more verbosity, output about the parsing process is produced.
Alice Zhang [Tue, 10 Nov 2020 18:11:13 +0000 (13:11 -0500)]
PR13838: Add float32 support and corresponding test cases
runtime/softfloat.* & runtime/softfloat/: add f32 support and f32 to f64
conversion
tapset/floatingpoint.stp: fixed some documentation typos & add f32_tp_f64
tapset function
testsuite/buildok/floatingpoint.stp: add f32 related test cases
main.cxx: add float parameter to sdt_benchmark_thread function for test purpose
runtime/softfloat.c & tapset/floatingpoint.stp : delete unnecessary functions
to keep the code concise
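
For readers unfamiliar with what an f32 to f64 conversion involves, here is an integer-only sketch of the widening step. It is written from the IEEE 754 layouts and is not the Berkeley SoftFloat code; the function name f32_to_f64_bits is hypothetical. Unlike SoftFloat, it does not quiet signaling NaNs or raise exception flags.

```c
#include <stdint.h>

/* Widen an IEEE 754 binary32 bit pattern to binary64 using integers only.
 * binary32: 1 sign, 8 exponent (bias 127), 23 fraction bits.
 * binary64: 1 sign, 11 exponent (bias 1023), 52 fraction bits. */
uint64_t f32_to_f64_bits(uint32_t a)
{
    uint64_t sign = (uint64_t)(a >> 31) << 63;
    uint32_t exp = (a >> 23) & 0xFF;
    uint32_t frac = a & 0x7FFFFF;
    uint64_t exp64;

    if (exp == 0xFF) {
        exp64 = 0x7FF;                 /* infinity or NaN */
    } else if (exp == 0) {
        int k = 0;
        if (frac == 0)
            return sign;               /* signed zero */
        while (!(frac & 0x800000)) {   /* normalize a subnormal input */
            frac <<= 1;
            k++;
        }
        frac &= 0x7FFFFF;              /* drop the now-implicit leading 1 */
        exp64 = (uint64_t)(897 - k);   /* (-126 - k) + 1023 */
    } else {
        exp64 = (uint64_t)exp + 896;   /* rebias: 1023 - 127 */
    }
    /* The 23 fraction bits move to the top of the 52-bit field. */
    return sign | (exp64 << 52) | ((uint64_t)frac << 29);
}
```

Every binary32 value is exactly representable in binary64, so the conversion needs no rounding.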
In utrace_report_syscall_entry and _exit, there is a possibility of
dereferencing a NULL pointer, in case __stp_utrace_alloc_task_work
exhausts UTRACE_TASK_WORK_POOL_SIZE live elements. While OOM is
still a possibility, this patch handles it more gracefully.
Frank Ch. Eigler [Tue, 10 Nov 2020 00:18:19 +0000 (19:18 -0500)]
PR26665: relayfs-on-procfs megapatch
On platforms/configurations where debugfs is inaccessible (I'm
side-eyeing at you, secureboot + kernel_lockdown), the stap runtime
needs another way to hook up the relayfs / .cmd files to talk to
staprun/stapio in userspace. kernel relayfs users all rely on
debugfs (tied closely to struct dentry*), and filesystems where
dentry*'s are not immediately available are SOL.
Until now. This gigapatch forks pieces of runtime/transport/transport.c
into debugfs and procfs alternatives. The debugfs fork is just like
before. The procfs fork is new, and uses a proc_dir_entry <-> struct
path look-up table to map between procfs objects and the dentry*'s
that relayfs so loves.
The debugfs alternative is default, except when lockdown mode is
detected; then the runtime chooses procfs_p at the strategic moment.
stap -DSTAP_TRANS_PROCFS or -DSTAP_TRANS_DEBUGFS lets the user
override this heuristic. (Going to a procfs default is worth
considering at some point.)
The staprun/stapio userspace is updated to search both
/sys/kernel/debug/systemtap and /proc/systemtap for the relay/.cmd
file endpoints.
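
The two-location search can be sketched like this. It is an illustration only: open_first_dir is a hypothetical helper, and the real staprun code may structure the lookup differently. The candidate list would hold the two directories named above.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stddef.h>

/* Try each candidate base directory in order and return an fd for the
 * first one that opens, or -1 if none does. If chosen is non-NULL it
 * receives the path that matched. */
int open_first_dir(const char *const *candidates, size_t n,
                   const char **chosen)
{
    for (size_t i = 0; i < n; i++) {
        int fd = open(candidates[i], O_RDONLY | O_DIRECTORY | O_CLOEXEC);
        if (fd >= 0) {
            if (chosen)
                *chosen = candidates[i];
            return fd;
        }
    }
    return -1;
}
```

Called with { "/sys/kernel/debug/systemtap", "/proc/systemtap" }, the debugfs location is preferred when it is accessible and the procfs location is the fallback.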
Most of this gigapatch is moving code around in runtime/transport/ so
relay_v2 is agnostic to its enclosing filesystem, going through hooks
in transport.c to either procfs.c or debugfs.c. The old
runtime/procfs.c file is stripped down to move common bits around a
little.
William Cohen [Mon, 9 Nov 2020 18:01:06 +0000 (13:01 -0500)]
Initialize variable in runtime/softfloat.c to avoid RHEL8 -Werror issue
Make sure that the variable is initialized to something to avoid the
following error when running the testsuite on RHEL8:
attempting command stap -p4 floatingpoint.stp -c "stap --benchmark-sdt"
OUT In file included from /tmp/stapBRN9va/stap_825f154f474bfd5b2080a28426f65178_4743_src.c:37:
/usr/share/systemtap/runtime/softfloat.c: In function 'softfloat_shiftRightJamM':
/usr/share/systemtap/runtime/softfloat.c:132:34: error: 'ptr' may be used uninitialized in this function [-Werror=maybe-uninitialized]
uint32_t wordJam, wordDist, *ptr;
^~~
cc1: all warnings being treated as errors
make[3]: *** [scripts/Makefile.build:315: /tmp/stapBRN9va/stap_825f154f474bfd5b2080a28426f65178_4743_src.o] Error 1
make[2]: *** [Makefile:1544: _module_/tmp/stapBRN9va] Error 2
WARNING: kbuild exited with status: 2
Pass 4: compilation failed. [man error::pass4]
child process exited abnormally
RC 1
FAIL: systemtap.examples/general/floatingpoint build
Sultan Alsawaf [Thu, 5 Nov 2020 21:39:30 +0000 (13:39 -0800)]
PR26144: task_finder2: execute task workers in order
The task finder's task workers need to be executed in the order that
they are added, but the kernel's task_work API doesn't make any ordering
guarantees, so task workers end up getting executed out of order. This
becomes a problem when the mmap callback worker runs after the other two
workers the task finder uses, even though it gets queued beforehand.
We can make the task finder's task workers run in order by wrapping the
task worker API with our own routines to dequeue task workers from a
global list and run them in the correct order. A lot of the scaffolding
needed to achieve this is already present, so this change is not too
invasive.
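
The wrapping idea can be sketched in plain C as below. This is a single-threaded illustration with hypothetical names (work_queue, queue_add, queue_run_oldest); the real code additionally needs locking and ties into the kernel's task_work API. The key point is that the callback handed to the unordered executor ignores which item triggered it and always runs the oldest queued item.

```c
#include <stddef.h>

struct work {
    void (*func)(struct work *);
    struct work *next;
};

struct work_queue {
    struct work *head, *tail;
};

/* Append at the tail so the list keeps insertion order. */
void queue_add(struct work_queue *q, struct work *w)
{
    w->next = NULL;
    if (q->tail)
        q->tail->next = w;
    else
        q->head = w;
    q->tail = w;
}

/* Run the oldest queued item. Returns 1 if one ran, 0 if the queue
 * was empty. */
int queue_run_oldest(struct work_queue *q)
{
    struct work *w = q->head;

    if (!w)
        return 0;
    q->head = w->next;
    if (!q->head)
        q->tail = NULL;
    w->func(w);
    return 1;
}

/* Demo payload: records the order in which items actually ran. */
struct tagged_work {
    struct work w;   /* must stay the first member for the cast below */
    int id;
};

int run_log[8];
int run_count;

void record_id(struct work *w)
{
    struct tagged_work *t = (struct tagged_work *)w;

    if (run_count < 8)
        run_log[run_count++] = t->id;
}
```

However the executor chooses to fire its callbacks, three queued items with ids 1, 2, 3 are recorded in that order.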
Aaron Merey [Thu, 5 Nov 2020 17:46:31 +0000 (12:46 -0500)]
Makefile.am: Install runtime/softfloat/
Previously the runtime/softfloat directory was not installed when
building systemtap. This led to errors when trying to use systemtap's
floating point facilities.
Modify Makefile.am so that this directory is installed during a build.
Sultan Alsawaf [Mon, 2 Nov 2020 23:53:09 +0000 (15:53 -0800)]
PR26846: task_finder2: fix kernel panics by eliminating in_atomic() usage
With non-PREEMPT kernels (i.e., kernels with CONFIG_PREEMPT=n),
in_atomic() cannot detect when the current context is within a spin lock
or RCU read-side critical section. Since the syscall tracepoints are
executed from within an RCU read-side critical section (see
__DO_TRACE()), this means that in_atomic() won't know that the current
context doesn't allow sleeping. When this happens, we see kernel panics
occurring in stap's registered tracepoints, like this one:
Panics like this occur in all of stap's registered tracepoints. To fix
them, just defer the mmap callbacks to a task worker all the time. That
way, we never need to worry about handling them in a safe context.
If routine runtime errors occur during execution, the c->last_stmt
variable is printed to the user as the best-suspected script location
of the failure. As an optimization, this variable is not set at every
little point during statement/expression evaluation that is not
likely to cause errors. But we overlooked one spot where it's
absolutely needed: around function calls, especially into synthetic
embedded-c functions that process $context variables. That meant that
error messages could misidentify some other recent but nonspecific
point for an error.
Now we add a c->last_stmt set immediately before each function call,
after its actual arguments are executed. This placement also covers
the case where the arguments themselves might fail during evaluation.
Serhei Makarov [Wed, 4 Nov 2020 20:50:49 +0000 (15:50 -0500)]
PR26811 WIP: adapt to set_fs() removal in linux 5.10+
WIP since there are still a few faults in evidence e.g. on check.exp whythefail
Introduce STAPCONF_SET_FS to identify if set_fs is present.
After kernel 5.10 on arches removing set_fs(), kernel
addresses should be read/written with get_kernel_nofault and
copy_to_kernel_nofault while user addresses are still read/written
with __get_user and __put_user. So we have wrapper macros
__stp_{get,put}_either which do the right thing on all kernel
versions.
Also, since KERNEL_DS and USER_DS are no longer available, introduce
STP_KERNEL_DS and STP_USER_DS. These map to KERNEL_DS and USER_DS on
older kernels.
Also, modify loc2c-runtime.h dereferencing functions and lookup_bad_addr
to take STP_KERNEL_DS/STP_USER_DS parameters specifying the address space
to dereference in.
Sultan Alsawaf [Sat, 31 Oct 2020 07:02:12 +0000 (00:02 -0700)]
stp_task_work: don't busy poll in stp_task_work_exit()
Instead of doing a busy poll and forcefully sleeping for one jiffy every
time stp_task_work_exit() checks to see if all the task workers are
finished, just use a wait event and have the last task worker wake up
stp_task_work_exit() when it's finished. This is faster and more
efficient, since there's no uninterruptible sleeping for exactly one
jiffy at a time, and there's no polling involved.
Alice Zhang [Fri, 30 Oct 2020 06:33:01 +0000 (02:33 -0400)]
PR13838: Added basic floating point support to systemtap
runtime/softfloat.*: including floating point type definition
runtime/softfloat/*: all other required auxiliary functions
These are from https://github.com/ucb-bar/berkeley-softfloat-3
by John R. Hauser, thanks!
tapset/floatingpoint.stp: including fp conversion, fp arithmetic and
comparison functions
testsuite/buildok/floatingpoint.stp: including testcase for the
corresponding floatingpoint tapset
main.cxx: changed sdt_benchmark part of code for a demo of extracting
floating point
SystemTap supports 64-bit floating point (double type) under IEEE 754.
Conversions (fp <-> long, fp <-> string), arithmetic (add, sub, div, mul, sqrt)
and comparisons between fp values (less than, less than or equal to, equal)
are supported; corresponding tapset functions and test cases are provided
as well.
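
To illustrate what it means to carry a double through a 64-bit integer, here is a userspace sketch. It uses the host's floating point unit, which the softfloat-based runtime deliberately avoids, and all function names (long_to_bits, bits_to_long, bits_add, bits_lt) are hypothetical stand-ins rather than the tapset's actual names.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret between a double and the 64-bit integer holding its bits. */
double from_bits(int64_t b)
{
    double d;
    memcpy(&d, &b, sizeof d);
    return d;
}

int64_t to_bits(double d)
{
    int64_t b;
    memcpy(&b, &d, sizeof b);
    return b;
}

/* Conversion: integer value <-> fp bit pattern. */
int64_t long_to_bits(int64_t n) { return to_bits((double)n); }
int64_t bits_to_long(int64_t a) { return (int64_t)from_bits(a); }

/* Arithmetic and comparison operate on bit patterns, not on numeric longs. */
int64_t bits_add(int64_t a, int64_t b)
{
    return to_bits(from_bits(a) + from_bits(b));
}

int bits_lt(int64_t a, int64_t b)
{
    return from_bits(a) < from_bits(b);
}
```

The integer 1 converts to the pattern 0x3FF0000000000000, which is why fp values must go through the fp functions instead of the ordinary integer operators.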
amerey [Fri, 7 Aug 2020 22:58:33 +0000 (18:58 -0400)]
PR26015: Make syscall arguments writable again
Make syscall arguments writable again in non-DWARF probes on kernels
that use syscall wrappers to pass arguments via pt_regs (currently
x86_64 4.17+ and aarch64 4.19+).
For non-DWARF syscall probes also add an additional probe variable
for each syscall string parameter that holds an unquoted version
of the string parameter. Modifying this variable within the probe
will cause the string it holds to be written to the userspace string
buffer that was passed to the syscall.
Sagar Patel [Thu, 29 Oct 2020 23:34:41 +0000 (19:34 -0400)]
PR26015: Add @probewrite predicate.
The @probewrite predicate checks whether an identifier has been
written to in the probe handler body. The identifier can be either
a script variable or target variable. @probewrite(var) returns 1
if var has been written to in the probe handler body, else 0.
For example,
probe foo = begin { var = 0 }, { if (@probewrite(var)) println(var) }
probe foo { var = 1 }
The @probewrite predicate would resolve to 1 in this case and the
new value of var would be printed.
1) Added probewrite_op.
2) Designed probewrite_evaluator to resolve @probewrite checks.
3) Designed symuse_collecting_visitor (similar to varuse_collecting_visitor).
4) Updated several other visitors accordingly.
5) Added test cases.
6) Updated NEWS.
Sultan Alsawaf [Thu, 29 Oct 2020 18:25:53 +0000 (11:25 -0700)]
task_finder2: change the default engine action to UTRACE_INTERRUPT
There is a race condition where, right after an engine is attached, a
reporting pass will occur before the engine can actually request what it
wants from the target process. In this case, the action that the engine
used when it was first attached will be carried out during the reporting
pass. When the default action is UTRACE_STOP, this means that the
reporting pass will think the newly-attached engine wants to stop the
target process, at which point the target process will be moved into the
TASK_TRACED state (visible via `ps aux | grep ' t '`) and will be
halted forever (until it receives a SIGKILL) because the engine will
never send a UTRACE_RESUME request to bring the target process back to
life. This seems to be an issue with the UTRACE_STOP machinery; it's not
clear how *any* process entering the UTRACE_STOP state can exit that
state naturally. It's also dubious whether the UTRACE_STOP state is even
needed, since tracing is done from within task workers that run inside
the context of the process we're trying to analyze, which allows us to
safely analyze the process without needing to stop it.
Regardless, it's clear that a newly-attached engine would definitely not
want to stop the process it's trying to analyze; after all, there's
nothing interesting to see if the process is just halted. The common
engine action seems to be UTRACE_INTERRUPT, so let's set that to be the
default instead of UTRACE_STOP.
task_finder2: don't attach to forked children when the target PID is specified
When we have a PID specified for tracing and a fork occurs from our
target PID, the forked child will have the same exe as our target and
will subsequently get matched and attached to by
__stp_utrace_attach_match_filename(). Attaching to these children is not
productive though, since we are only interested in a specific process.
Therefore, as an optimization, only bother trying to attach to forked
children when the target PID is *not* specified. When the target PID is
specified (via -x PID) and match_tsk != path_tsk, we know that a fork
just occurred and match_tsk is the child of path_tsk, in which case
we should just skip attaching to match_tsk.
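
The decision described above reduces to a small predicate. This sketch uses a hypothetical function name and treats the task pointers as opaque; it also assumes that a target PID of 0 means no -x PID was given.

```c
#include <stdbool.h>
#include <sys/types.h>

/* Decide whether to attach to match_tsk.
 * target_pid == 0 is assumed to mean "no -x PID specified". */
bool should_attach(pid_t target_pid, const void *match_tsk,
                   const void *path_tsk)
{
    /* With a target PID, differing tasks mean match_tsk is a freshly
     * forked child of path_tsk: skip it. */
    if (target_pid != 0 && match_tsk != path_tsk)
        return false;
    return true;
}
```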
Sultan Alsawaf [Thu, 29 Oct 2020 08:24:43 +0000 (01:24 -0700)]
task_finder: error out when we cannot attach to _stp_target
In order to avoid sleeping, stap_find_exe_file() does a trylock attempt
on an mm's mmap semaphore and returns NULL when the lock is contended.
When this happens, it can cause the task finder to not attach to a
desired target process. This is especially noticeable when a target PID
is specified, in which case the target PID itself can get skipped over
by the task finder.
Therefore, we should treat failures to get the exe file for a specific
target PID as fatal, since that means the target PID will never get
attached. Note that we must return a negative value from
stap_start_task_finder() in order for the fatal error to be honored, so
we shouldn't negate PTR_ERR(mmpath).
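
The sign convention is easier to see with a userspace re-creation of the kernel's error-pointer helpers. The lowercase names err_ptr, ptr_err and is_err are stand-ins written for this sketch, and start_finder is a hypothetical caller; the encoding itself (a negative errno stored in the pointer value, within the top 4095 addresses) follows the kernel convention.

```c
#include <errno.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Encode a negative errno value in a pointer, and decode it again. */
void *err_ptr(long err)      { return (void *)(intptr_t)err; }
long ptr_err(const void *p)  { return (long)(intptr_t)p; }

/* Error pointers occupy the last MAX_ERRNO values of the address space. */
int is_err(const void *p)
{
    return (uintptr_t)p >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical start routine: pass the error code through unchanged. */
int start_finder(void *mmpath)
{
    if (is_err(mmpath))
        return (int)ptr_err(mmpath);  /* already negative; do not negate */
    return 0;
}
```

Negating the decoded value would turn -ENOENT into a positive number, which a caller checking for a negative return would not treat as fatal.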
Frank Ch. Eigler [Wed, 28 Oct 2020 00:02:18 +0000 (20:02 -0400)]
testsuite: current.stp module("*") defang
Like for server_concurrency*, the current.stp test case has excessive
debuginfo requirements. We still want -some- decent workload, so
chose usb* as the module wildcard. Far smaller than the "*" there
formerly.
Sultan Alsawaf [Tue, 27 Oct 2020 22:00:49 +0000 (15:00 -0700)]
stp_utrace: replace task_utrace_lock with non-blocking RCU read locks
The global task_utrace_lock is highly contended and results in a lot of
CPU time wasted spinning on it, especially since it's not a r/w lock.
It turns out we can replace all of the task_utrace_lock usage with
non-blocking RCU read locks instead to improve performance. Now, reads
to any of the hash list buckets containing the utrace entries do not
block and can occur concurrently with other readers, and writes to any
hash list won't block readers thanks to the magic of RCU. The only
locking needed is between concurrent writes to a single hash list, and
a per-bucket spin lock is used to achieve this instead of a sprawling
global lock.
Stan Cox [Tue, 27 Oct 2020 15:23:06 +0000 (11:23 -0400)]
Merge branch 'scox/tls': Add tls support.
This merges support for accessing implicit tls variables.
Given a DW_OP_GNU_push_tls_address dwarf entry,
tls.stp::__push_tls_address handles navigating the tls data
structures. stp_tls.h contains minimal versions of a few essential
tls data structures.
Frank Ch. Eigler [Tue, 27 Oct 2020 02:01:24 +0000 (22:01 -0400)]
stap-prep: check for debuginfod capability
Test /usr/bin/debuginfod-find for a vdso*so in the kernel. If
successful, avoid downloading big kernel debuginfo files now,
assuming that the debuginfo server(s) will remain available.
fs/kernel_file_read: Add "offset" arg for partial reads
To perform partial reads, callers of kernel_read_file*() must have a
non-NULL file_size argument and a preallocated buffer. The new "offset"
argument can then be used to seek to specific locations in the file to
fill the buffer to, at most, "buf_size" per call.
Where possible, the LSM hooks can report whether a full file has been
read or not so that the contents can be reasoned about.
This and the preceding commits changed the function signature from:
To fight the oobleck of giant test log files (ok to store with Bunsen
but hard to display in a browser without pagination tricks), reduce
the most verbose testcases.
unprivileged_embedded_C.exp bombards us with pass2 output.
(Always printed even without -v flag.)
Options:
1) suppress stap stdout but not stderr with sh -c wrapper + redirection
2) separate and drop stdout in the expect script (not feasible)
3) add a secret --silent-p2 option to stap which drops pass2 output
Sultan Alsawaf [Thu, 22 Oct 2020 04:50:16 +0000 (21:50 -0700)]
task_finder_vma: add autoconf check for atomic_fetch_add_unless()
Some kernels have atomic_fetch_add_unless() backported, such as
4.18.0-240.5.el8.aarch64 and 4.18.0-193.28.1.el8_2.x86_64 on rhel8,
so we cannot rely on the kernel version to determine whether or not
the function is present. Add an autoconf stub to check for it.
Optimize: increase the default size of the vma map hash table
Hash conflicts are severe when the vma tracker keeps track of many
processes, given the current hash table size of a mere 16 buckets.
Increase it to 256 buckets by default and also make it tunable
via the macro __STP_TF_HASH_BITS.
In our test, this significantly reduces hash conflicts when
stap/staprun's -x PID option is not specified.
Sultan Alsawaf [Wed, 21 Oct 2020 19:27:24 +0000 (12:27 -0700)]
task_finder_vma: rewrite using RCU to fix performance issues
The use of a single global rwlock to protect this file's hash table
results in significantly degraded performance when there are many
processes using the vma tracker in flight. A lot of time is spent
spinning on the rwlock when this happens. For example, it is using
most of the CPU time in the following kernel-space CPU flame graph:
There are other code paths which would invoke the same spinlock, as in
_stp_umodule_relocate().
To remedy this, make the hash table RCU safe so we'll never block upon
reading a hash list.
We now use the hash_ptr() function to generate the hashes, and the task
pointers themselves are hashed now instead of their PID for reliability,
since PIDs are not a stable anchor point to a task struct.
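
Hashing a task pointer into a power-of-two table can be sketched as follows. This imitates the multiplicative scheme behind the kernel's hash_ptr() on 64-bit systems but is a standalone approximation, not the kernel function. TF_HASH_BITS is a stand-in for the tunable mentioned elsewhere in this log, and the #ifndef override is an assumption about how such a tunable would be wired up.

```c
#include <stdint.h>

#ifndef TF_HASH_BITS
#define TF_HASH_BITS 8                       /* stand-in tunable */
#endif
#define TF_TABLE_SIZE (1u << TF_HASH_BITS)   /* 256 buckets by default */

/* 64-bit golden-ratio multiplier, as used by the kernel's hash_64(). */
#define GOLDEN_RATIO_64 0x61C8864680B583EBull

/* Map a task pointer to a bucket index by multiplying and keeping the
 * top TF_HASH_BITS bits, which mixes the low, alignment-dominated
 * pointer bits into the result. */
unsigned int bucket_for(const void *task)
{
    uint64_t v = (uint64_t)(uintptr_t)task;

    return (unsigned int)((v * GOLDEN_RATIO_64) >> (64 - TF_HASH_BITS));
}
```

Keying on the pointer means a recycled PID cannot alias an entry that belongs to a different task struct.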
While we're at it, clean up the rest of this file to bring it up to
current Linux kernel coding standards as well.
This leads to dramatic CPU time reduction when
1. the current system has a lot of running processes, or
2. some processes have a lot of DSO dependencies, and
3. also -x PID is not used for stap or staprun, and
4. there are quite a few CPU cores.
For a typical test run, we have the following CPU utilization changes:
Serhei Makarov [Wed, 21 Oct 2020 20:06:48 +0000 (16:06 -0400)]
PR26755 temporary kprobes_onthefly.exp: also disable m* on ppc
XXX Need to investigate which of the tracepoints under m* is failing. XXX
For now, blacklist these in kprobes_onthefly.exp to allow the buildbot
testsuite to finish running.
An onthefly testcase started freezing the kernel on the 'hardcore'
case which grabs lock:lock_{acquire,acquired,release,contended} tracepoints.
I'm still tracking down which kernel change caused this
since the tracepoints exist on older codebases (probably lockdep kernels),
but for now it seems that we should avoid probing these in onthefly testing.
Stan Cox [Fri, 9 Oct 2020 20:51:57 +0000 (16:51 -0400)]
Support tls on ppcle
Support tls on ppcle as described in "OpenPOWER ABI for Linux Supplement
Power Architecture 64-Bit ELF V2 ABI." Use glibc debuginfo instead of constant
offset to access link_map->l_tls_modid. Remove set_tls_module_by_addr.
Sultan Alsawaf [Thu, 1 Oct 2020 22:19:47 +0000 (15:19 -0700)]
PR26697: fix NULL pointer deref in get_utrace_lock()
task_utrace_struct() can return NULL via __task_utrace_struct(). This fixes
the following crash:
BUG: unable to handle kernel NULL pointer dereference at (null)
#9 [ffff8843e56ffd20] get_utrace_lock at ffffffffc08258c6 [stap_X_40544]
The reason why it can return NULL is because engine->ops is protected by
utrace->lock, but we don't have the utrace pointer, and the purpose of
get_utrace_lock() is to get the utrace pointer. Therefore, there's no way
to ensure engine->ops remains unchanged inside get_utrace_lock(), so
get_utrace_lock()'s checks on engine->ops can be incorrect/stale, which
leads to the NULL pointer dereference.
Previous logic for detecting whether module-signing was required
was broken by the presence of non-systemtap MOK keys. This led
the client to find no matching stap-servers, leading to no
compilation attempt or MOK assignment. New code filters local
MOK keys for Systemtap ones only, and even in their absence,
communicates the need for a signature via a "missing" marker.
The server eagerly passes back (new or old) MOK keys to such a
client now.
Tested on a rhel8 uefi/secureboot kvm vm. Transport /sys/debug
dependencies are still blocking full function due to kernel_lockdown,
but that's coming next.