sourceware.org Git - systemtap.git/log

stp_utrace: disable IRQs when holding the bucket spin lock

This lock can be acquired from inside an IRQ, leading to a deadlock:

WARNING: inconsistent lock state
4.14.35-1902.6.6.el7uek.x86_64.debug #2 Tainted: G           OE
--------------------------------
inconsistent {HARDIRQ-ON-W} -> {IN-HARDIRQ-W} usage.
sh/15779 [HC1[1]:SC0[0]:HE0:SE1] takes:
(&(lock)->rlock#3){?.+.}, at: [<ffffffffc0c080b0>] _stp_mempool_alloc+0x35/0xab [orxray_lj_lua_fgraph_XXXXXXX]
{HARDIRQ-ON-W} state was registered at:
  lock_acquire+0xe0/0x238
  _raw_spin_lock+0x3d/0x7a
  utrace_task_alloc+0xa4/0xe3 [orxray_lj_lua_fgraph_XXXXXXX]
  utrace_attach_task+0x136/0x194 [orxray_lj_lua_fgraph_XXXXXXX]
  __stp_utrace_attach+0x57/0x216 [orxray_lj_lua_fgraph_XXXXXXX]
  stap_start_task_finder+0x12e/0x33f [orxray_lj_lua_fgraph_XXXXXXX]
  systemtap_module_init+0x114d/0x11f0 [orxray_lj_lua_fgraph_XXXXXXX]
  _stp_handle_start+0xea/0x1c5 [orxray_lj_lua_fgraph_XXXXXXX]
  _stp_ctl_write_cmd+0x28d/0x2d1 [orxray_lj_lua_fgraph_XXXXXXX]
  full_proxy_write+0x67/0xbb
  __vfs_write+0x3a/0x170
  vfs_write+0xc7/0x1c0
  SyS_write+0x58/0xbf
  do_syscall_64+0x7e/0x22c
  entry_SYSCALL_64_after_hwframe+0x16e/0x0
irq event stamp: 9454
hardirqs last  enabled at (9453): [<ffffffffa696c960>] _raw_write_unlock_irqrestore+0x40/0x67
hardirqs last disabled at (9454): [<ffffffffa6a05417>] apic_timer_interrupt+0x1c7/0x1d1
softirqs last  enabled at (9202): [<ffffffffa6c00361>] __do_softirq+0x361/0x4e5
softirqs last disabled at (9195): [<ffffffffa60aeb76>] irq_exit+0xf6/0x102

other info that might help us debug this:
Possible unsafe locking scenario:

       CPU0
       ----
  lock(&(lock)->rlock#3);
  <Interrupt>
    lock(&(lock)->rlock#3);

*** DEADLOCK ***

no locks held by sh/15779.

stack backtrace:
CPU: 16 PID: 15779 Comm: sh Tainted: G           OE   4.14.35-1902.6.6.el7uek.x86_64.debug #2
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
Call Trace:
<IRQ>
dump_stack+0x81/0xb6
print_usage_bug+0x1fc/0x20d
? check_usage_backwards+0x130/0x12b
mark_lock+0x1f8/0x27b
__lock_acquire+0x6e7/0x165a
? sched_clock_local+0x18/0x81
? perf_swevent_hrtimer+0x136/0x151
lock_acquire+0xe0/0x238
? _stp_mempool_alloc+0x35/0xab [orxray_lj_lua_fgraph_XXXXXXX]
_raw_spin_lock_irqsave+0x55/0x97
? _stp_mempool_alloc+0x35/0xab [orxray_lj_lua_fgraph_XXXXXXX]
_stp_mempool_alloc+0x35/0xab [orxray_lj_lua_fgraph_XXXXXXX]
_stp_ctl_get_buffer+0x69/0x215 [orxray_lj_lua_fgraph_XXXXXXX]
_stp_ctl_send+0x4e/0x169 [orxray_lj_lua_fgraph_XXXXXXX]
_stp_vlog+0xac/0x143 [orxray_lj_lua_fgraph_XXXXXXX]
? _stp_utrace_probe_cb+0xa4/0xa4 [orxray_lj_lua_fgraph_XXXXXXX]
_stp_warn+0x6a/0x88 [orxray_lj_lua_fgraph_XXXXXXX]
function___global_warn__overload_0+0x60/0xac [orxray_lj_lua_fgraph_XXXXXXX]
probe_67+0xce/0x10e [orxray_lj_lua_fgraph_XXXXXXX]
_stp_hrtimer_notify_function+0x2db/0x55f [orxray_lj_lua_fgraph_XXXXXXX]
__hrtimer_run_queues+0x132/0x5c5
hrtimer_interrupt+0xb7/0x1ca
smp_apic_timer_interrupt+0xa5/0x35a
apic_timer_interrupt+0x1cc/0x1d1
</IRQ>
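
The report above comes down to one rule: a lock that is ever taken with hard IRQs enabled must not also be taken from hard-IRQ context. The following user-space sketch models only that bookkeeping. The names lock_class_model, record_acquire and the two usage bits are hypothetical and are not lockdep's real data structures.

```c
#include <stdbool.h>

/* Hypothetical model of the two usage states named in the report. */
enum {
    USED_HARDIRQ_ON = 1 << 0,  /* taken while hard IRQs were enabled */
    USED_IN_HARDIRQ = 1 << 1   /* taken from inside a hard IRQ handler */
};

struct lock_class_model {
    unsigned usage;
};

/*
 * Record one acquisition and return true if the lock class is now in an
 * inconsistent state, i.e. it has been seen both with IRQs enabled and
 * from hard-IRQ context.
 */
static bool record_acquire(struct lock_class_model *cls,
                           bool in_hardirq, bool irqs_enabled)
{
    if (in_hardirq)
        cls->usage |= USED_IN_HARDIRQ;
    else if (irqs_enabled)
        cls->usage |= USED_HARDIRQ_ON;

    return (cls->usage & USED_HARDIRQ_ON) && (cls->usage & USED_IN_HARDIRQ);
}
```

If process-context code always takes the lock with interrupts disabled, the first bit is never set and a later acquisition from a timer probe is not flagged.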

PR13838: Add float32 support and corresponding test cases

runtime/softfloat.* & runtime/softfloat/: add f32 support and f32 to f64
conversion
tapset/floatingpoint.stp: fixed some documentation typos & add f32_to_f64
tapset function
testsuite/buildok/floatingpoint.stp: add f32 related test cases
main.cxx: add float parameter to sdt_benchmark_thread function for test purpose

runtime/softfloat.c & tapset/floatingpoint.stp : delete unnecessary functions
to keep the code concise

RHBZ1892179: handle exhausted stp_task_work structs

In utrace_report_syscall_entry and _exit, there is a possibility of
dereferencing a NULL pointer, in case __stp_utrace_alloc_task_work
exhausts UTRACE_TASK_WORK_POOL_SIZE live elements. While OOM is
still a possibility, this patch handles it more gracefully.
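
A minimal user-space sketch of the pattern, assuming a fixed-size pool whose allocator returns NULL once every slot is live. POOL_SIZE, pool_alloc and report_event are hypothetical names that stand in for the runtime's own pool and reporting functions.

```c
#include <stddef.h>

#define POOL_SIZE 4  /* hypothetical; stands in for UTRACE_TASK_WORK_POOL_SIZE */

struct work_item {
    int in_use;
    int payload;
};

static struct work_item pool[POOL_SIZE];

/* Return a free item, or NULL when every slot is live. */
static struct work_item *pool_alloc(void)
{
    for (size_t i = 0; i < POOL_SIZE; i++) {
        if (!pool[i].in_use) {
            pool[i].in_use = 1;
            return &pool[i];
        }
    }
    return NULL;
}

static void pool_free(struct work_item *w)
{
    w->in_use = 0;
}

/*
 * The caller checks for NULL instead of dereferencing blindly.
 * Returns 0 on success, -1 if the report had to be skipped.
 */
static int report_event(int payload)
{
    struct work_item *w = pool_alloc();

    if (w == NULL)
        return -1;
    w->payload = payload;
    return 0;
}
```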

releng: update-po

regenerate po/* files

PR26665: relayfs-on-procfs megapatch, rhel6 tweaks

A few more compatibility macros needed to be moved over to transport/procfs.c.

pre-release: version timestamping, NEWS tweaks

pre-release: regenerate example index

pre-release: update-docs

Regenerate man pages and pdf docs.

testsuite tweak: buildok/floatingpoint.stp chmod a+x

PR26665: relayfs-on-procfs megapatch

On platforms/configurations where debugfs is inaccessible (I'm
side-eyeing at you, secureboot + kernel_lockdown), the stap runtime
needs another way to hook up the relayfs / .cmd files to talk to
staprun/stapio in userspace.  kernel relayfs users all rely on
debugfs (tied closely to struct dentry*), and filesystems where
dentry*'s are not immediately available are SOL.

Until now.  This gigapatch forks pieces of runtime/transport/transport.c
into debugfs and procfs alternatives. The debugfs fork is just like
before. The procfs fork is new, and uses a proc_dir_entry <-> struct
path look-up table to map between procfs objects and the dentry*'s
that relayfs so loves.

The debugfs alternative is default, except when lockdown mode is
detected; then the runtime chooses procfs_p at the strategic moment.
stap -DSTAP_TRANS_PROCFS or -DSTAP_TRANS_DEBUGFS lets the user
override this heuristic.  (Going to a procfs default is worth
considering at some point.)
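
The selection rule described above can be sketched as a small pure function. The enum and function names are hypothetical; only the policy (explicit override first, otherwise procfs under lockdown, otherwise debugfs) is taken from the text.

```c
/* Hypothetical names; the runtime's real identifiers may differ. */
enum transport_fs { TRANS_DEBUGFS, TRANS_PROCFS };
enum user_override { OVERRIDE_NONE, OVERRIDE_DEBUGFS, OVERRIDE_PROCFS };

/*
 * An explicit user override (as with -DSTAP_TRANS_PROCFS or
 * -DSTAP_TRANS_DEBUGFS) wins; otherwise fall back to procfs only when
 * lockdown mode was detected.
 */
static enum transport_fs choose_transport(enum user_override user,
                                          int lockdown_detected)
{
    if (user == OVERRIDE_PROCFS)
        return TRANS_PROCFS;
    if (user == OVERRIDE_DEBUGFS)
        return TRANS_DEBUGFS;
    return lockdown_detected ? TRANS_PROCFS : TRANS_DEBUGFS;
}
```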

The staprun/stapio userspace is updated to search both
/sys/kernel/debug/systemtap and /proc/systemtap for the relay/.cmd
file endpoints.

Most of this gigapatch is moving code around in runtime/transport/ so
relay_v2 is agnostic to its enclosing filesystem, going through hooks
in transport.c to either procfs.c or debugfs.c.  The old
runtime/procfs.c file is stripped down to move common bits around a
little.

Signed-off-by: Frank Ch. Eigler <fche@redhat.com>

transport relay_v2: drop "dropped" facility

Nothing's consuming the "dropped" debugfs file as per
-D_STP_USE_DROPPED_FILE, so drop this logic for simplicity.

Signed-off-by: Frank Ch. Eigler <fche@redhat.com>

Initialize variable in runtime/softfloat.c to avoid RHEL8 -Werror issue

Make sure that the variable is initialized to something to avoid the
following error when running the testsuite on RHEL8:

attempting command stap -p4 floatingpoint.stp -c "stap --benchmark-sdt"
OUT In file included from /tmp/stapBRN9va/stap_825f154f474bfd5b2080a28426f65178_4743_src.c:37:
/usr/share/systemtap/runtime/softfloat.c: In function 'softfloat_shiftRightJamM':
/usr/share/systemtap/runtime/softfloat.c:132:34: error: 'ptr' may be used uninitialized in this function [-Werror=maybe-uninitialized]
     uint32_t wordJam, wordDist, *ptr;
                                  ^~~
cc1: all warnings being treated as errors
make[3]: *** [scripts/Makefile.build:315: /tmp/stapBRN9va/stap_825f154f474bfd5b2080a28426f65178_4743_src.o] Error 1
make[2]: *** [Makefile:1544: _module_/tmp/stapBRN9va] Error 2
WARNING: kbuild exited with status: 2
Pass 4: compilation failed.  [man error::pass4]
child process exited abnormally
RC 1
FAIL: systemtap.examples/general/floatingpoint build

task_finder2: fix memory leak when task workers fail to get added

None of the error paths for the __stp_tf_task_work_add() calls free the
tf_work allocation when the task_work_add fails. This fixes that.

This also makes a nitpick to __stp_tf_task_worker_fn() to reduce the
critical section of __stp_tf_task_work_list_lock.
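
The shape of the fix, as a user-space sketch: when the queueing call fails, the error path frees the allocation before returning. tf_work_model, queue_work_model and the live_allocs counter are hypothetical stand-ins used only to make the leak observable.

```c
#include <stdlib.h>

struct tf_work_model {
    int id;
};

static int live_allocs;  /* allocations not yet freed by this sketch */

/* Stand-in for a queueing call that may fail; returns 0 on success. */
static int queue_work_model(struct tf_work_model *w, int should_fail)
{
    (void)w;
    return should_fail ? -1 : 0;
}

/* On success *out owns the allocation; on failure nothing is leaked. */
static int add_worker(int id, int should_fail, struct tf_work_model **out)
{
    struct tf_work_model *w = malloc(sizeof(*w));
    int rc;

    *out = NULL;
    if (w == NULL)
        return -1;
    live_allocs++;
    w->id = id;

    rc = queue_work_model(w, should_fail);
    if (rc != 0) {
        free(w);          /* the step the error paths were missing */
        live_allocs--;
        return rc;
    }
    *out = w;
    return 0;
}
```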

Reported-by: Frank Ch. Eigler <fche@redhat.com>
Signed-off-by: Yichun Zhang (agentzh) <yichun@openresty.com>

man/stapprobes.3stap: Mention nd_syscall argument writing.

prerelease: update-docs

tapset/uconversions.stp: Fix format of user_string_n_nofault

Function description needs to be on one line in order for
doc generation to work.

tapset/uconversions.stp: Fix user_string_n_nofault description.

Fix description to correctly state that the empty string is returned
when userspace data is not accessible

prerelease: AUTHORS bump

prerelease: update-po

PR26144: task_finder2: execute task workers in order

The task finder's task workers need to be executed in the order that
they are added, but the kernel's task_work API doesn't make any ordering
guarantees, so task workers end up getting executed out of order. This
becomes a problem when the mmap callback worker runs after the other two
workers the task finder uses, even though it gets queued beforehand.

We can make the task finder's task workers run in order by wrapping the
task worker API with our own routines to dequeue task workers from a
global list and run them in the correct order. A lot of the scaffolding
needed to achieve this is already present, so this change is not too
invasive.
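
A rough user-space model of the wrapping idea: work is appended to a global FIFO, and every callback invocation runs whatever sits at the head of that FIFO rather than "its own" item, so execution order follows queueing order regardless of the order in which the callbacks fire. All names here are hypothetical, and the locking a real implementation needs is omitted.

```c
#include <stddef.h>

#define MAX_WORK 8

typedef void (*work_fn)(int arg);

struct queued_work {
    work_fn fn;
    int arg;
};

static struct queued_work fifo[MAX_WORK];
static size_t head, tail;  /* monotonically increasing ring indices */

/* Append to the global list; returns -1 if the ring is full. */
static int ordered_work_add(work_fn fn, int arg)
{
    if (tail - head == MAX_WORK)
        return -1;
    fifo[tail % MAX_WORK] = (struct queued_work){ fn, arg };
    tail++;
    return 0;
}

/* Invoked once per queued item, in whatever order the underlying API picks. */
static void ordered_work_run_one(void)
{
    struct queued_work w;

    if (head == tail)
        return;
    w = fifo[head % MAX_WORK];
    head++;
    w.fn(w.arg);
}

/* Small recorder so the execution order can be inspected. */
static int trace[MAX_WORK];
static size_t trace_len;

static void record(int arg)
{
    if (trace_len < MAX_WORK)
        trace[trace_len++] = arg;
}
```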

Signed-off-by: Yichun Zhang (agentzh) <yichun@openresty.com>

PR13838: update fp systemtap example

testsuite/systemtap.examples/general/floatingpoint.stp&floatingpoint.txt: fixed
typo and add print for initial fp a, b, c

PR13838: Fix previous commit message (c80f1453eba9430921edd4dc10e93f8d993042da)

PR13838: add floating point to systemtap.examples

testsuite/systemtap.examples/general/floatingpoint.*: add a demo for extracting
fp and performing some basic fp operations.

instead of printing out results of every operation, print out combinations of two to three
operations.

PR13838: add floating point to systemtap.examples

testsuite/systemtap.examples/general/floatingpoint.*: add a demo for extracting
fp and performing some basic fp operations.

printing out results of every operation, print out combinations of two to three
operations.

Makefile.am: Install runtime/softfloat/

Previously the runtime/softfloat directory was not installed when
building systemtap. This led to errors when trying to use systemtap's
floating point facilities.

Modify Makefile.am so that this directory is installed during a build.

PR26846: task_finder2: fix kernel panics by eliminating in_atomic() usage

With non-PREEMPT kernels (i.e., kernels with CONFIG_PREEMPT=n),
in_atomic() cannot detect when the current context is within a spin lock
or RCU read-side critical section. Since the syscall tracepoints are
executed from within an RCU read-side critical section (see
__DO_TRACE()), this means that in_atomic() won't know that the current
context doesn't allow sleeping. When this happens, we see kernel panics
occurring in stap's registered tracepoints, like this one:

kernel tried to execute NX-protected page - exploit attempt? (uid: 99)
BUG: unable to handle kernel paging request at ffffffffc1ea7040
IP: [<ffffffffc1ea7040>] _stp_module_3+0x0/0xffffffffffed9fc0 [orxray_c_fgraph_XX_3673]
PGD 1c1814067 PUD 1c1816067 PMD 486e4067 PTE 8000000164606063
Oops: 0011 [#1] SMP
CPU: 39 PID: 6934 Comm: sh Kdump: loaded Tainted: G           OE  ------------ T 3.10.0-1062.4.2.el7.x86_64 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-2.fc30 04/01/2014
task: ffff943dc3d5b150 ti: ffff943dc27d4000 task.ti: ffff943dc27d4000
RIP: 0010:[<ffffffffc1ea7040>]  [<ffffffffc1ea7040>] _stp_module_3+0x0/0xffffffffffed9fc0 [orxray_c_fgraph_XX_3673]
RSP: 0018:ffff943dc27d7ea8  EFLAGS: 00010282
RAX: ffffffffc1ea7040 RBX: ffff943dc3d5b150 RCX: ffff943d537f4300
RDX: 0000000000001b16 RSI: ffff943dc3d5b150 RDI: 0000000000000000
RBP: ffff943dc27d7f28 R08: 0000000000000000 R09: 0000000180490016
R10: ffff943d537f4300 R11: ffff943d5cd62930 R12: ffff943dc4e38000
R13: 0000000000001b16 R14: 0000000000001b16 R15: ffff943e519351d0
FS:  0000000000000000(0000) GS:ffff943f76fc0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffffc1ea7040 CR3: 000000016d4b8000 CR4: 0000000000340fe0
Call Trace:
[<ffffffffa6e52c64>] ? do_execve_common.isra.24+0x7e4/0x880
[<ffffffffa6e52f99>] SyS_execve+0x29/0x30
[<ffffffffa738d478>] stub_execve+0x48/0x80

Note that the panic occurs from the execve syscall, where stap has a
tracepoint registered:

rc = STP_TRACE_REGISTER(sched_process_exec, utrace_report_exec);

Panics like this occur in all of stap's registered tracepoints. To fix
them, just defer the mmap callbacks to a task worker all the time. That
way, we never need to worry about handling them in a safe context.

Signed-off-by: Yichun Zhang (agentzh) <yichun@openresty.com>

translator: disambiguate runtime errors better

If routine runtime errors occur during execution, the c->last_stmt
variable is printed to the user as the best suspected script location
of the failure.  As an optimization, this variable is not set at every
little point during statement/expression evaluation that are not
likely to cause errors.  But we overlooked one spot where it's
absolutely needed: around function calls, especially into synthetic
embedded-c functions that process $context variables.  That meant that
error messages could misidentify some other recent but nonspecific
point for an error.

Now we add a c->last_stmt set immediately before each function call,
after its actual arguments are executed.  This placement also covers
the case where the arguments themselves might fail during evaluation.
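
A simplified sketch of the effect, not the translator's actual output: the call site records its own location in the context just before calling, so an error raised inside the callee is attributed to that call. ctx_model, deref_model and call_site are hypothetical names.

```c
#include <stddef.h>

struct ctx_model {
    const char *last_stmt;   /* best-known script location */
    const char *last_error;  /* NULL while no error has occurred */
};

/* Stand-in for an embedded-C function that can fail at run time. */
static long deref_model(struct ctx_model *c, const long *p)
{
    if (p == NULL) {
        c->last_error = "read fault";
        return 0;
    }
    return *p;
}

/* Set last_stmt immediately before the call, then make the call. */
static long call_site(struct ctx_model *c, const long *arg, const char *where)
{
    c->last_stmt = where;
    return deref_model(c, arg);
}
```

Without the assignment in call_site(), a failure in deref_model() would be reported against whatever statement last happened to set last_stmt.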

PR26811 WIP: adapt to set_fs() removal in linux 5.10+

WIP since there are still a few faults in evidence e.g. on check.exp whythefail

Introduce STAPCONF_SET_FS to identify if set_fs is present.

After kernel 5.10 on arches removing set_fs(), kernel
addresses should be read/written with get_kernel_nofault and
copy_to_kernel_nofault while user addresses are still read/written
with __get_user and __put_user. So we have wrapper macros
__stp_{get,put}_either which do the right thing on all kernel
versions.

Also, since KERNEL_DS and USER_DS are no longer available, introduce
STP_KERNEL_DS and STP_USER_DS. These map to KERNEL_DS and USER_DS on
older kernels.

Also, modify loc2c-runtime.h dereferencing functions and lookup_bad_addr
to take STP_KERNEL_DS/STP_USER_DS parameters specifying the address space
to dereference in.

stp_task_work: don't busy poll in stp_task_work_exit()

Instead of doing a busy poll and forcefully sleeping for one jiffy every
time stp_task_work_exit() checks to see if all the task workers are
finished, just use a wait event and have the last task worker wake up
stp_task_work_exit() when it's finished. This is faster and more
efficient, since there's no uninterruptible sleeping for exactly one
jiffy at a time, and there's no polling involved.
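
The core of the idea is that the worker which drops the outstanding count to zero is the one that wakes the waiter. The sketch below models that with C11 atomics; the wake-up is reduced to a flag, and all names are hypothetical rather than taken from the runtime.

```c
#include <stdatomic.h>
#include <stdbool.h>

static atomic_int workers_pending;
static atomic_bool exit_woken;  /* stands in for waking a wait queue */

static void worker_start(void)
{
    atomic_fetch_add(&workers_pending, 1);
}

/* Returns true if this call retired the last outstanding worker. */
static bool worker_finish(void)
{
    /* atomic_fetch_sub returns the value before the decrement. */
    if (atomic_fetch_sub(&workers_pending, 1) == 1) {
        atomic_store(&exit_woken, true);
        return true;
    }
    return false;
}

/* Condition the exit path would wait on instead of polling every jiffy. */
static bool exit_may_proceed(void)
{
    return atomic_load(&workers_pending) == 0;
}
```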

Signed-off-by: Yichun Zhang (agentzh) <yichun@openresty.com>

stp_utrace: reset the correct atomic var when resume work fails to queue

Signed-off-by: Yichun Zhang (agentzh) <yichun@openresty.com>

Adapt debugpath.exp to the debuginfod feature.

PR13838: Added basic floating point support to systemtap

runtime/softfloat.*: including floating point type definition
runtime/softfloat/*: all other required auxiliary functions

These are from https://github.com/ucb-bar/berkeley-softfloat-3
by John R. Hauser, thanks!

tapset/floatingpoint.stp: including fp conversion, fp arithmetic and
comparison functions
testsuite/buildok/floatingpoint.stp: including testcase for the
corresponding floatingpoint tapset
main.cxx: changed sdt_benchmark part of code for a demo of extracting
floating point

Systemtap supports 64-bit floating point (double type) under IEEE 754.
Conversions (fp <-> long, fp <-> string), arithmetic (add, sub, div, mul,
sqrt) and comparisons between fp values (less than, less than or equal to,
equal) are supported. Corresponding tapset functions and test cases are
provided as well.
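
To illustrate the fp <-> long idea, here is a user-space sketch that assumes a double is carried as its raw IEEE 754 bit pattern in a 64-bit integer. It uses the host FPU for the arithmetic, whereas the runtime relies on the softfloat code; the function names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret a double as its 64-bit pattern (no numeric conversion). */
static int64_t double_to_bits(double d)
{
    int64_t bits;

    memcpy(&bits, &d, sizeof(bits));
    return bits;
}

/* Reinterpret a 64-bit pattern as a double. */
static double bits_to_double(int64_t bits)
{
    double d;

    memcpy(&d, &bits, sizeof(d));
    return d;
}

/* Add two packed doubles and return the packed result. */
static int64_t fp_add_bits(int64_t a, int64_t b)
{
    return double_to_bits(bits_to_double(a) + bits_to_double(b));
}
```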

PR26015: Make syscall arguments writable again

Make syscall arguments writable again in non-DWARF probes on kernels
that use syscall wrappers to pass arguments via pt_regs (currently
x86_64 4.17+ and aarch64 4.19+).

For non-DWARF syscall probes also add an additional probe variable
for each syscall string parameter that holds an unquoted version
of the string parameter. Modifying this variable within the probe
will cause the string it holds to be written to the userspace string
buffer that was passed to the syscall.

PR26015: Add @probewrite predicate.

The @probewrite predicate checks whether an identifier has been
written to in the probe handler body. The identifier can be either
a script variable or target variable. @probewrite(var) returns 1
if var has been written to in the probe handler body, else 0.

For example,

probe foo = begin { var = 0 }, { if (@probewrite(var)) println(var) }
probe foo { var = 1 }

The @probewrite predicate would resolve to 1 in this case and the
new value of var would be printed.

1) Added probewrite_op.
2) Designed probewrite_evaluator to resolve @probewrite checks.
3) Designed symuse_collecting_visitor (similar to varuse_collecting_visitor).
4) Updated several other visitors accordingly.
5) Added test cases.
6) Updated NEWS.

Allow individual probes to have both a prologue and epilogue.

Also add new syntax for defining combined prologue and epilogue:
'probe ALIAS = PROBE { <prologue> }, { <epilogue> }'

NEWS: mentioned the utrace task hash table optimization

Also mentioned the default hash table size increase.

task_finder2: change the default engine action to UTRACE_INTERRUPT

There is a race condition where, right after an engine is attached, a
reporting pass will occur before the engine can actually request what it
wants from the target process. In this case, the action that the engine
used when it was first attached will be carried out during the reporting
pass. When the default action is UTRACE_STOP, this means that the
reporting pass will think the newly-attached engine wants to stop the
target process, at which point the target process will be moved into the
TASK_TRACED state (visible via `ps aux | grep ' t '`) and will be
halted forever (until it receives a SIGKILL) because the engine will
never send a UTRACE_RESUME request to bring the target process back to
life. This seems to be an issue with the UTRACE_STOP machinery; it's not
clear how *any* process entering the UTRACE_STOP state can exit that
state naturally. It's also dubious whether the UTRACE_STOP state is even
needed, since tracing is done from within task workers that run inside
the context of the process we're trying to analyze, which allows us to
safely analyze the process without needing to stop it.

Regardless, it's clear that a newly-attached engine would definitely not
want to stop the process it's trying to analyze; after all, there's
nothing interesting to see if the process is just halted. The common
engine action seems to be UTRACE_INTERRUPT, so let's set that to be the
default instead of UTRACE_STOP.

Signed-off-by: Yichun Zhang (agentzh) <yichun@openresty.com>

task_finder2: don't attach to forked children when the target PID is specified

When we have a PID specified for tracing and a fork occurs from our
target PID, the forked child will have the same exe as our target and
will subsequently get matched and attached to by
__stp_utrace_attach_match_filename(). Attaching to these children is not
productive though, since we are only interested in a specific process.

Therefore, as an optimization, only bother trying to attach to forked
children when the target PID is *not* specified. When the target PID is
specified (via -x PID) and match_tsk != path_tsk, we know that a fork
just occurred and match_tsk is the child of path_tsk, in which case
we should just skip attaching to match_tsk.
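
The decision described above reduces to a small predicate. In this sketch tasks are modelled as integer ids and a target PID of 0 means "no -x PID given"; both are assumptions made for illustration.

```c
#include <stdbool.h>

/*
 * target_pid == 0: no -x PID was specified (assumption of this sketch).
 * match_task / path_task: ids standing in for match_tsk and path_tsk.
 */
static bool should_attach(int target_pid, int match_task, int path_task)
{
    /* A specific target plus differing tasks means match_task is a
     * freshly forked child of path_task, so skip it. */
    if (target_pid != 0 && match_task != path_task)
        return false;
    return true;
}
```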

Signed-off-by: Yichun Zhang (agentzh) <yichun@openresty.com>

Bug: deadlocks might happen in the spinlocks when -DDEBUG_MEM is specified

Now we always save the irq state in our debug mem allocator's spinlocks.

One sample CPU soft lockup backtrace in the stap ko:

https://gist.github.com/agentzh/68d4ef9574f69595c5d19da3688b8981

Signed-off-by: Yichun Zhang (agentzh) <yichun@openresty.com>

task_finder: error out when we cannot attach to _stp_target

In order to avoid sleeping, stap_find_exe_file() does a trylock attempt
on an mm's mmap semaphore and returns NULL when the lock is contended.
When this happens, it can cause the task finder to not attach to a
desired target process. This is especially noticeable when a target PID
is specified, in which case the target PID itself can get skipped over
by the task finder.

Therefore, we should treat failures to get the exe file for a specific
target PID as fatal, since that means the target PID will never get
attached. Note that we must return a negative value from
stap_start_task_finder() in order for the fatal error to be honored, so
we shouldn't negate PTR_ERR(mmpath).
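
The sign convention can be shown with a user-space re-creation of the kernel's error-pointer helpers. err_ptr, ptr_err, is_err and start_finder_model are hypothetical names written for this sketch; the point is only that the value extracted from an error pointer is already negative and must be returned unchanged.

```c
#include <errno.h>
#include <stdint.h>

#define MAX_ERRNO_MODEL 4095

/* Encode a negative errno value in a pointer. */
static void *err_ptr(long err)
{
    return (void *)(intptr_t)err;
}

/* Recover the (negative) errno value from an error pointer. */
static long ptr_err(const void *p)
{
    return (long)(intptr_t)p;
}

/* Error pointers occupy the top MAX_ERRNO_MODEL addresses. */
static int is_err(const void *p)
{
    return (uintptr_t)p >= (uintptr_t)-MAX_ERRNO_MODEL;
}

/* Return the error as-is: negating it would yield a positive value. */
static int start_finder_model(const void *mmpath)
{
    if (is_err(mmpath))
        return (int)ptr_err(mmpath);
    return 0;
}
```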

Signed-off-by: Yichun Zhang (agentzh) <yichun@openresty.com>

testsuite: current.stp module("*") defang

Like for server_concurrency*, the current.stp test case has excessive
debuginfo requirements. We still want -some- decent workload, so
chose usb* as the module wildcard. Far smaller than the "*" there
formerly.

stp_utrace: replace task_utrace_lock with non-blocking RCU read locks

The global task_utrace_lock is highly contended and results in a lot of
CPU time wasted spinning on it, especially since it's not a r/w lock.

It turns out we can replace all of the task_utrace_lock usage with
non-blocking RCU read locks instead to improve performance. Now, reads
to any of the hash list buckets containing the utrace entries do not
block and can occur concurrently with other readers, and writes to any
hash list won't block readers thanks to the magic of RCU. The only
locking needed is between concurrent writes to a single hash list, and
a per-bucket spin lock is used to achieve this instead of a sprawling
global lock.
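
A much-reduced user-space model of the scheme, using C11 atomics instead of the kernel's RCU primitives: writers serialize on a per-bucket spin lock and publish new entries with a release store, while readers walk the list with acquire loads and take no lock. Entry removal and grace periods, which real RCU code must handle, are deliberately left out, and every name is hypothetical.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

#define NBUCKETS 16

struct entry {
    int key;
    int value;
    struct entry *_Atomic next;
};

struct bucket {
    atomic_flag lock;            /* taken by writers only */
    struct entry *_Atomic head;
};

static struct bucket table[NBUCKETS];

static void table_init(void)
{
    for (size_t i = 0; i < NBUCKETS; i++) {
        atomic_flag_clear(&table[i].lock);
        atomic_init(&table[i].head, NULL);
    }
}

static struct bucket *bucket_for(int key)
{
    return &table[(unsigned)key % NBUCKETS];
}

/* Writer: contends only with other writers of the same bucket. */
static int table_insert(int key, int value)
{
    struct bucket *b = bucket_for(key);
    struct entry *e = malloc(sizeof(*e));

    if (e == NULL)
        return -1;
    e->key = key;
    e->value = value;

    while (atomic_flag_test_and_set_explicit(&b->lock, memory_order_acquire))
        ;  /* spin until the bucket lock is free */
    atomic_init(&e->next,
                atomic_load_explicit(&b->head, memory_order_relaxed));
    /* Publish: readers see a fully initialized entry or none at all. */
    atomic_store_explicit(&b->head, e, memory_order_release);
    atomic_flag_clear_explicit(&b->lock, memory_order_release);
    return 0;
}

/* Reader: lock-free traversal; returns 1 and fills *value_out if found. */
static int table_lookup(int key, int *value_out)
{
    struct entry *e =
        atomic_load_explicit(&bucket_for(key)->head, memory_order_acquire);

    for (; e != NULL;
         e = atomic_load_explicit(&e->next, memory_order_acquire)) {
        if (e->key == key) {
            *value_out = e->value;
            return 1;
        }
    }
    return 0;
}
```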

Signed-off-by: Yichun Zhang (agentzh) <yichun@openresty.com>

man stapprobes.3stap: Document tls context variable support

Merge branch 'scox/tls': Add tls support.

This merges support for accessing implicit tls variables.

Given a DW_OP_GNU_push_tls_address dwarf entry,
tls.stp::__push_tls_address handles navigating the tls data
structures. stp_tls.h contains minimal versions of a few essential
tls data structures.

testsuite: reduce server_concurrency* debuginfo requirements

These tests were using super broad module-name wildcards,
which puts unnecessary stress on debuginfo provision.

NEWS: added an entry for the VMA map RCU lock changes

This is for commit 4b937c5e9.

stap-prep: check for debuginfod capability

Test /usr/bin/debuginfod-find for a vdso*so in the kernel. If
successful, avoid downloading big kernel debuginfo files now,
assuming that the debuginfo server(s) will remain available.

Support tls variables on s390

stapconf: adapt to kernel_read_file_from_path API change for v5.10-rc1

The following kernel commit changed the kernel_read_file_from_path()
function signature:

commit 0fa8e084648779eeb8929ae004301b3acf3bad84
Author: Kees Cook <keescook@chromium.org>
Date:   Fri Oct 2 10:38:25 2020 -0700

   fs/kernel_file_read: Add "offset" arg for partial reads

   To perform partial reads, callers of kernel_read_file*() must have a
   non-NULL file_size argument and a preallocated buffer. The new "offset"
   argument can then be used to seek to specific locations in the file to
   fill the buffer to, at most, "buf_size" per call.

   Where possible, the LSM hooks can report whether a full file has been
   read or not so that the contents can be reasoned about.

This and the preceding commits changed the function signature from:

int kernel_read_file_from_path(const char *path,
       void **buf, loff_t *size, loff_t max_size,
       enum kernel_read_file_id id);

to:

int kernel_read_file_from_path(const char *path, loff_t offset,
       void **buf, size_t buf_size,
       size_t *file_size,
       enum kernel_read_file_id id);

===

XXX kernel commit b89999d004931ab2e51236 also split
kernel_read_file_* functions into a separate header.

As both changes were merged for v5.10-rc1 within the same day,
we detect them with the same autoconf program.

Optimize: increased the default size of task_utrace_table to 1<<8

We also allow overriding this hash table size from the outside via
-DTASK_UTRACE_HASH_BITS=N.
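
A compile-time override of this kind is commonly written with an #ifndef guard, as in the sketch below. Only the macro name TASK_UTRACE_HASH_BITS and the default of 8 come from the text; the table type, TASK_UTRACE_TABLE_SIZE and table_size() are hypothetical.

```c
/* Keep a value supplied with -DTASK_UTRACE_HASH_BITS=N, else default to 8. */
#ifndef TASK_UTRACE_HASH_BITS
#define TASK_UTRACE_HASH_BITS 8
#endif

/* Hypothetical helper macro: number of buckets, 1 << 8 = 256 by default. */
#define TASK_UTRACE_TABLE_SIZE (1u << TASK_UTRACE_HASH_BITS)

struct bucket_model {
    void *head;
};

static struct bucket_model task_utrace_table_model[TASK_UTRACE_TABLE_SIZE];

/* Number of buckets actually compiled in. */
static unsigned table_size(void)
{
    return (unsigned)(sizeof(task_utrace_table_model) /
                      sizeof(task_utrace_table_model[0]));
}
```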

reduce bpf.exp reg_alloc3 verbosity (~70k lines)

More straightforward -- a testcase that prints something on every
context switch for 5s will be far too verbose.

reduce unprivileged_embedded_C.exp verbosity (~800k lines)

To fight the oobleck of giant test log files (ok to store with Bunsen
but hard to display in a browser without pagination tricks), reduce
the most verbose testcases.

unprivileged_embedded_C.exp bombards us with pass2 output.
(Always printed even without -v flag.)

Options:
1) suppress stap stdout but not stderr with sh -c wrapper + redirection
2) separate and drop stdout in the expect script (not feasible)
3) add a secret --silent-p2 option to stap which drops pass2 output

Let's go with option 1.

testsuite: fix buildok perms

Active testsuite/buildok/*.stp files should be executable.

RHBZ1890702: fix pretty-print conflict with --suppress-time-limits

Codegen for pretty-printed vars couldn't handle the absent
"c->actionremaining" counter.

task_finder_vma: add autoconf check for atomic_fetch_add_unless()

Some kernels have atomic_fetch_add_unless() backported, such as
4.18.0-240.5.el8.aarch64 and 4.18.0-193.28.1.el8_2.x86_64 on rhel8,
so we cannot rely on the kernel version to determine whether or not
the function is present. Add an autoconf stub to check for it.

Signed-off-by: Yichun Zhang (agentzh) <yichun@openresty.com>

Add support for external tls variables

Add tls_module parameter to __push_tls_address to locate external tls variable.
Add errno test.

Optimize: increase the default size of the vma map hash table

Hash conflicts are severe when the vma tracker keeps track of many
processes, given the current hash table size of a mere 16 buckets.
Increase it to 256 buckets by default and also make it tunable
via the macro __STP_TF_HASH_BITS.

In our test, this significantly reduces hash conflicts when
stap/staprun's -x PID option is not specified.

task_finder_vma: rewrite using RCU to fix performance issues

The use of a single global rwlock to protect this file's hash table
results in significantly degraded performance when there are many
processes using the vma tracker in flight. A lot of time is spent
spinning on the rwlock when this happens. For example, it is using
most of the CPU time in the following kernel-space CPU flame graph:

https://openresty.org/misc/flamegraph/vma-hash-table-spinlock-cpu-flamegraph.png

The middle 3 grey frames with the label `-` are actually these:

  7a88b0: _raw_spin_lock[0]
  7277: adjustStartLoc[15]
  7277: adjustStartLoc[15]

There are other code paths which would invoke the same spinlock, as in
_stp_umodule_relocate().

To remedy this, make the hash table RCU safe so we'll never block upon
reading a hash list.

We now use the hash_ptr() function to generate the hashes, and the task
pointers themselves are hashed now instead of their PID for reliability,
since PIDs are not a stable anchor point to a task struct.
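
For illustration, a pointer can be hashed into a bucket index with a multiplicative hash, in the spirit of the kernel's hash_ptr(). This is a user-space approximation: the function name and the constant are assumptions of the sketch, and bits must lie between 1 and 32.

```c
#include <stdint.h>

/* Multiplier assumed for this sketch (a 64-bit golden-ratio constant). */
#define GOLDEN_RATIO_64_MODEL 0x61C8864680B583EBull

/*
 * Map a pointer to an index in [0, 1 << bits) by multiplying and keeping
 * the top bits. Requires 1 <= bits <= 32.
 */
static unsigned hash_ptr_model(const void *ptr, unsigned bits)
{
    uint64_t v = (uint64_t)(uintptr_t)ptr;

    return (unsigned)((v * GOLDEN_RATIO_64_MODEL) >> (64 - bits));
}
```

Keying on the task pointer rather than the PID means the key stays tied to one task struct for as long as that entry lives in the table.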

While we're at it, clean up the rest of this file to bring it up to
current Linux kernel coding standards as well.

This leads to dramatic CPU time reduction when

1. the current system has a lot of running processes, or
2. some processes have a lot of DSO dependencies, and
3. also -x PID is not used for stap or staprun, and
4. there are quite a few CPU cores.

For a typical test run, we have the following CPU utilization changes:

Before: http://openresty.org/download/before-lru-optimization.png
After: http://openresty.org/download/after-lru-optimization.png

Signed-off-by: Yichun Zhang (agentzh) <yichun@openresty.com>

PR26755 temporary kprobes_onthefly.exp: also disable m* on ppc

XXX Need to investigate which of the tracepoints under m* is failing. XXX
For now, blacklist these in kprobes_onthefly.exp to allow the buildbot
testsuite to finish running.

PR26755 kprobes_onthefly.exp: skip lock_* tracepoints pending investigation

An onthefly testcase started freezing the kernel on the 'hardcore'
case which grabs lock:lock_{acquire,acquired,release,contended} tracepoints.
I'm still tracking down which kernel change caused this
since the tracepoints exist on older codebases (probably lockdep kernels),
but for now it seems that we should avoid probing these in onthefly testing.

Support tls on ppcle

Support tls on ppcle as described in "OpenPOWER ABI for Linux Supplement
Power Architecture 64-Bit ELF V2 ABI." Use glibc debuginfo instead of constant
offset to access link_map->l_tls_modid. Remove set_tls_module_by_addr.

Find module in link_map via module_name

Add regex_module_compare to match various forms of sonames.
Add errno test to tls.exp

Use module_container_of to find kernel header.

@cast() no longer defaults to "kernel" if module is absent so make it explicit.

procfs tapset: compute STP_MAX_PROCFS_FILES

runtime/procfs.c left a plea that the translator should supply an
array-sizing parameter for the runtime's use. So now it does.

PR26697: fix NULL pointer deref in get_utrace_lock()

task_utrace_struct() can return NULL via __task_utrace_struct(). This fixes
the following crash:
BUG: unable to handle kernel NULL pointer dereference at (null)
#9 [ffff8843e56ffd20] get_utrace_lock at ffffffffc08258c6 [stap_X_40544]

The reason why it can return NULL is because engine->ops is protected by
utrace->lock, but we don't have the utrace pointer, and the purpose of
get_utrace_lock() is to get the utrace pointer. Therefore, there's no way
to ensure engine->ops remains unchanged inside get_utrace_lock(), so
get_utrace_lock()'s checks on engine->ops can be incorrect/stale, which
leads to the NULL pointer dereference.

Signed-off-by: Yichun Zhang (agentzh) <yichun@openresty.com>

stap -V: note function with kernel 5.9-rc

PR26665: fix secureboot/mok -> stap-server signing

Previous logic for detecting whether module-signing was required
was broken by the presence of non-systemtap MOK keys.  This led
the client to find no matching stap-servers, leading to no
compilation attempt or MOK assignment.  New code filters local
MOK keys for Systemtap ones only, and even in an absence,
communicates the need for a signature via a "missing" marker.
The server eagerly passes back (new or old) MOK keys to such a
client now.

Tested on a rhel8 uefi/secureboot kvm vm.  Transport /sys/debug
dependencies are still blocking full function due to kernel_lockdown,
but that's coming next.

Add preliminary systemtap tls support

Add tls.stp, a preliminary systemtap tls tapset. Add stp_tls.h, a
censored version of glibc structs needed by the tapset. Add tls.exp,
a simple test.

Doc: documented the @var(NAME, EXE) notation in the beginners guide and langref docs

PR26673: ko gcc compilation error would happen when -DMAXBACKTRACE=N and N > 60

We should allocate the entries[] array in struct context instead of on
the kernel stack.

Also added a test case to cover this case.

Thanks to my colleague Junlong Li for the original patch.

man stapprobes.3stap: add @var("var","module") docs

This neat feature was added years ago and mentioned in NEWS.
Now it is documented in the man page too.

tentative fix: update hand-written Makefiles to use KBUILD_EXTMOD=

kernel 5.3+ deprecated and then removed the SUBDIRS=
flag for building an external module. In conjunction with KDIR=,
this led to unpredictable results (i.e. a build failure which
would delete the system kernel-devel package's config files (!)).

Commit 0126be38d9 counsels to use M= or KBUILD_EXTMOD= instead.
No harm in specifying both since newer kernels should ignore SUBDIRS=

PR26660: probe kernel.statement(HEX).absolute incorrectly required kernel debuginfo

Now we pass down the debuginfo_needed argument value to dwflpp instead
of always using its default value (which is true).

PR26658: initscript onboot capability for rhel8 era linuxes

Be able to use /usr/bin/kernel-install vs. /usr/bin/new-kernel-pkg
to update the bootloader ramdisk.

configury: complete auto* regeneration for AM_PROG_AR use

Harmless cleanup/followup from 6188d14a3487fd5.

deprecate STP_TRANSPORT_VERSION=1 (rhel4 relayfs)

The rhel4 era (<2.6.15) relayfs transport hasn't been tested for stap
releases in many years, and appears to have no compelling reason to
keep the code around. Let's remove the code.

Fix get_user_pages{,_remote} for 5.9 kernels.

The API changes again, dropping the task_struct parameter.

deprecate STP_TRANSPORT_VERSION=3 (ring_buffer)

The non-default ring_buffer transport hasn't even compiled in some
time, and appears to have no compelling reason to keep the code
around. Let's remove the code.

Update emacs/systemtap-mode.el for emacs 27

The emacs cl common lisp compatibility has been deprecated in favor of cl-lib

configure.ac: use AM_PROG_AR to autodetect prefixed 'ar'

Before the change
./configure --host=x86_64-pc-linux-gnu
was using 'ar' tool to generate archives.

After the change it uses x86_64-pc-linux-gnu-ar tool.

It's useful for selecting one of multiple available
prefixed toolchains.

WARNING: Did not regenerate autotools files.

Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org>

man stap.1: belatedly mention type auto-casting

Way back in 2014, starting with 0fb0cac98ad48, an auto-casting
facility was added.  This allows context-variable derived pointers to
retain their type information as the pointers are moved around inside
systemtap integer variables & expressions.  This was only documented
in the 2.6 release NEWS.  Now it's in the man page too.

stapbpf: fix module name

PR26511: fix probe-condition synthetic begin{} probe compilability

Fix two bugs:

1) In semantic_pass_optimize1, preserve synthetic probes even with
   empty bodies, which would normally be elided.  This allows
   probe-conditions to be more properly initialized, and
   initially-false probes to be disarmed early.

2) In c_unparser::emit_probe(), call emit_lock() based on the
   needs_global_locks() lockworthiness of the proper probe (the
   derived_probe being translated, rather than the other derived_probe
   whose conditions are being evaluated)

This lets the new lock-pushdown.* tests work with or without -u (now
tested).

PR26296: lock pushdown optimization

Implements an algorithm to push lock/unlock operations downward in the
syntax tree, to just enclose the smallest possible region that deals
with global variables.  This means two common patterns run with much more
concurrency than before:

global a
probe foo {
  if (condition)
     { a++ }
  else
     { something_else() }
}

will only lock globals -if- the condition is true, so something_else()
would run unlocked.  Also:

global a
probe foo {
  if (a)
    { long_twisty_operation(); }
}

will unlock globals right after the condition is evaluated, so
long_twisty runs unlocked.  Previous behaviour is available with
--compatible=4.3.  New test case lock-pushdown.stp asserts locking
conditions throughout various relevant constructs.
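
As a rough C sketch of the first pattern (an illustration of the effect, not the translator's actual output): lock_globals(), unlock_globals() and something_else() are hypothetical stand-ins, and the lock only counts acquisitions so the difference is observable.

```c
/* Stand-ins for the runtime's global-variable locking; both names are
   hypothetical and only count acquisitions for illustration. */
static int lock_acquisitions;
static long a;                      /* plays the role of "global a" */

static void lock_globals(void)   { lock_acquisitions++; }
static void unlock_globals(void) { }
static void something_else(void) { /* touches no globals */ }

/* Old shape: the lock encloses the whole probe body. */
static void probe_foo_whole_body(int condition)
{
    lock_globals();
    if (condition)
        a++;
    else
        something_else();
    unlock_globals();
}

/* Pushed-down shape: only the branch that touches "a" is locked. */
static void probe_foo_pushed_down(int condition)
{
    if (condition) {
        lock_globals();
        a++;
        unlock_globals();
    } else {
        something_else();
    }
}
```

With a false condition the pushed-down variant never takes the lock, which is where the extra concurrency comes from.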

testsuite: have control_limits.stp trigger MAXNESTING as intended

Newer optimization sequencing requires some real work to be
done inside recursive functions to prevent their elision.

Reported-By: Martin Cermak <mcermak@redhat.com>

at_var_unresolved.exp: Adapt to new error msg wording.

PR26392: removed the "stable" flag from unwinding tapset functions

By design, stable function calls should be lightweight ones like pid()
and target(). Unwinding functions like ubacktrace() are fairly expensive,
and marking them as stable would make them always get evaluated, no
matter what.

Thanks fche for the help here.

print formatting fixes for stapbpf rlim_t

Explicit %llu and (unsigned long long) is required on some platforms.

Based on: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=968327
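
A small sketch of the portable pattern, assuming a POSIX system: rlim_t has no fixed width across platforms, so the value is cast to unsigned long long and printed with %llu. The helper name format_rlimit is hypothetical.

```c
#include <stdio.h>
#include <sys/resource.h>

/* Formats an rlim_t portably: the explicit cast makes the argument match
   the %llu conversion whatever the platform's rlim_t width is. */
static int format_rlimit(char *buf, size_t len, rlim_t value)
{
    return snprintf(buf, len, "%llu", (unsigned long long) value);
}
```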

PR26379: Formatting directive type fix.

translator: probe-condition error handling

Ensure generated code for runtime errors in probe condition
expressions has a proper exit label to jump to.

--enable-http: add backward compatibility for pre-MHD_Result libmicrohttpd

Prior to the API-breaking v0.9.71, libmicrohttpd used "int" for
callback results. We have to be conditional with the switch.
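
One way such a conditional switch can look, as a sketch only: the typedef name handler_result, the function access_handler_sketch and the exact version threshold are assumptions, and the stand-in definitions exist solely so the fragment compiles without <microhttpd.h>.

```c
/* Stand-ins so the sketch compiles without <microhttpd.h>; with the real
   header, MHD_VERSION, MHD_YES and MHD_NO come from the library. */
#ifndef MHD_VERSION
#define MHD_VERSION 0x00097100
enum MHD_Result { MHD_NO = 0, MHD_YES = 1 };
#endif

/* Assumed threshold: 0.9.71, per the commit message. */
#if MHD_VERSION >= 0x00097100
typedef enum MHD_Result handler_result;   /* newer API: enum result */
#else
typedef int handler_result;               /* older API: plain int */
#endif

/* Hypothetical callback body that compiles against either API. */
static handler_result access_handler_sketch(int allowed)
{
    return allowed ? MHD_YES : MHD_NO;
}
```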

Make dtrace generated code work with LTO (take 2)

LTO needs to know which variables might be accessed by code.  In sdt.h
there is a fair amount of assembly code which LTO cannot analyze.  As
a result LTO would assume that the semaphore variables for various
userspace probes were not accessed by the generated code.  The linking
would fail because LTO would optimize away the semaphore variables,
causing undefined references.

The fix adds the semaphore variables as input operands to the assembly
statements.  This gives the LTO analyzer information that the
variables are used within the assembly statements and should not be
removed.
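
A minimal sketch of the technique for GCC-compatible compilers, not the exact generated code: the empty asm template stands in for the probe assembly, and demo_semaphore and probe_site are hypothetical names.

```c
/* Hypothetical probe semaphore; real ones are emitted per probe. */
unsigned short demo_semaphore;

static int probe_site(int x)
{
    /* The "m" input operand tells the compiler that this statement
       references demo_semaphore, even though the template is opaque
       to the optimizer. */
    __asm__ __volatile__ ("" : : "m" (demo_semaphore));
    return x + 1;
}
```

Because the variable now appears as an operand, the compiler sees a use of it and has no grounds to discard it during link-time optimization.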

Fix --enable-http build errors by always using MHD_Result

Return MHD_Result instead of int for: get_key_values,
connection_info::postdataiterator, server::access_handler,
server::queue_response

PR26307: rhel6 porting tweak redux

(So many #if kernel version branches in this part of the code.)

PR26307: rhel6 porting tweak

(So many #if kernel version branches in this part of the code.)

java/HelperSDT.c: correct 32-bit pointer cast warnings

Some casts through (uintptr_t) do the job.
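
A sketch of the general pattern, assuming the warnings came from converting between 64-bit integers and pointers on a 32-bit target; the helper names are hypothetical and not taken from HelperSDT.c.

```c
#include <stdint.h>

/* Going through uintptr_t makes the width change explicit, which avoids
   pointer/integer size-mismatch warnings on 32-bit targets. */
static void *long_to_ptr(int64_t v)
{
    return (void *) (uintptr_t) v;
}

static int64_t ptr_to_long(const void *p)
{
    return (int64_t) (uintptr_t) p;
}
```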

PR26307: adapt to kernel module_sect_attr changes in 5.8+

Linux v5.8 rc5ish, commit ed66f991bb19, introduced a change to the
internal module_sect_attr types.  It caused a kernel BUG: with
pr14546.exp.  This showed it's futile to keep chasing it this
particular way in runtime/transport/symbols.c, so for new enough
kernels (4.7+), a method based on
kernel_read_file_from_path("/sys/module/$MODULE/sections/$SECTION") is
used to extract module section base-addresses.

This is done by a new paper-thin abstraction.  Tested on RHEL7, F32,
rawhide.

systemtap.spec: let -testsuite subrpm require elfutils-debuginfod

.... for sake of the sdt_buildid.exp test case

PR26249: "%p" -> "0x%lx" pointer formatting in *conversions.stp error messages

function::user_string() and function::user_string_warn() in uconversions.stp
were passing an "unsigned long" to the "%p" format-specifier, which upset gcc10.

Switch to 0x%lx. In the process, change other uses of %p to use 0x%lx instead,
changing casts from (void*) to (unsigned long) where necessary.
Fixes the following error on gcc10 (Fedora 32):

    In function ‘function___global_user_string_n_warn__overload_1’:
        /tmp/stapXXXXXX/stap_[...]_src.c: error: format ‘%lx’ expects
        argument of type ‘long unsigned int’, but argument 5 has type
        ‘void *’ [-Werror=format=]

Analogous fixes for kernel_string() etc.
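
For illustration, a user-space sketch of the resulting formatting style: an address held in an unsigned long is printed with 0x%lx, so argument and conversion agree without a cast. The helper name format_addr is hypothetical.

```c
#include <stdio.h>

/* Formats an address kept in an unsigned long; %lx matches the argument
   type exactly, unlike %p, which expects a void pointer. */
static int format_addr(char *buf, size_t len, unsigned long addr)
{
    return snprintf(buf, len, "0x%lx", addr);
}
```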

PR25568 / RHBZ1857749: sdt_buildid.exp test case

Add new test that checks for combinations of buildid and pathname
based uprobes for executables and shared libraries.