Bug 31006 - systemtap.contex/context.exp test causes reboot on RHEL8 ppc64le
Summary: systemtap.contex/context.exp test causes reboot on RHEL8 ppc64le
Status: RESOLVED WORKSFORME
Alias: None
Product: systemtap
Classification: Unclassified
Component: runtime (show other bugs)
Version: unspecified
: P2 normal
Target Milestone: ---
Assignee: Unassigned
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2023-10-30 00:13 UTC by William Cohen
Modified: 2023-10-30 07:37 UTC (History)
1 user (show)

See Also:
Host:
Target:
Build:
Last reconfirmed:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description William Cohen 2023-10-30 00:13:08 UTC
When running the tests on various machine with the latest git checkout of systemtap I found that the following test was causing reboots on RHEL8 machines:

  make installcheck RUNTESTFLAGS="systemtap.context/context.exp"

The machine is running the following kernel:
# rpm -q kernel
kernel-4.18.0-372.78.1.el8_6.ppc64le

See the tail end of the "make installcheck ..." test fail with:

Running /root/systemtap_write/systemtap/testsuite/systemtap.context/context.exp ...
FAIL: early test shutdown (m1 eof)
FAIL: print_syms of yyy_func2 (0)
FAIL: backtrace of yyy_func3 (0)
FAIL: print_syms of yyy_func3 (0)
FAIL: backtrace of yyy_func4 (0)
FAIL: print_syms of yyy_func4 (0)
FAIL: print_syms didn't find systemtap_test_module1 (1)
FAIL: print_syms didn't find [kernel] (0)
FAIL: backtrace.stp didn't call exit (0)


Then see following fault on the console and the machine reboots:


[  456.849576] systemtap_test_module2: loading out-of-tree module taints kernel. 
[  456.851419] systemtap_test_module2: module verification failed: signature and/or required key missing - tainting kernel 
[  492.972999] Kprobes globally unoptimized 
[  506.452744] Unable to handle kernel paging request for data at address 0xd00000000213fffc 
[  506.452756] Faulting instruction address: 0xc00000000005f60c 
[  506.452761] Oops: Kernel access of bad area, sig: 11 [#1] 
[  506.452764] LE SMP NR_CPUS=2048 NUMA pSeries 
[  506.452771] Modules linked in: stap_930f8f23a61453c40ca1c17d02dbb00b__7993(OE) systemtap_test_module1(OE) systemtap_test_module2(OE) bonding tls xts pseries_rng uio_pdrv_genirq uio vmx_crypto xfs libcrc32c sd_mod t10_pi sg ibmvscsi ibmveth scsi_transport_srp dm_mirror dm_region_hash dm_log dm_mod [last unloaded: stap_0faf39736bc24bb3175d9e070d8b7d3e_7449] 
[  506.452801] CPU: 0 PID: 7993 Comm: stapio Kdump: loaded Tainted: G           OE    --------- -  - 4.18.0-372.78.1.el8_6.ppc64le #1 
[  506.452808] NIP:  c00000000005f60c LR: c0000000002e2a28 CTR: 0000000000000000 
[  506.452813] REGS: c000000005aa7790 TRAP: 0300   Tainted: G           OE    --------- -  -  (4.18.0-372.78.1.el8_6.ppc64le) 
[  506.452818] MSR:  800000000280b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 24008824  XER: 00000000 
[  506.452834] CFAR: c00000000005f604 DAR: d00000000213fffc DSISR: 40000000 IRQMASK: 0  
[  506.452834] GPR00: c0000000002e2a28 c000000005aa7a20 c000000002174900 d000000001faa4a0  
[  506.452834] GPR04: d000000002140000 d000000002140820 c0000000021a3c00 c000000007bac000  
[  506.452834] GPR08: c000000007bac000 d000000002140000 000000000000001f 0000000000000000  
[  506.452834] GPR12: 0000000000002000 c000000002e00000 0000000010015090 0000000010015398  
[  506.452834] GPR16: 00000000100153b8 00000000100153c0 0000000000000001 0000000000000000  
[  506.452834] GPR20: 0000000010014ef0 0000000000000000 c0000000021b2e88 d000000001fa9b18  
[  506.452834] GPR24: c000000001c70748 d000000002140680 d000000001faa4b0 c000000002493580  
[  506.452834] GPR28: c000000002019460 00000000000000ff d000000001faa4a0 000000007c0802a6  
[  506.452892] NIP [c00000000005f60c] arch_prepare_kprobe+0x4c/0x1f0 
[  506.452899] LR [c0000000002e2a28] register_kprobe+0x798/0x9b0 
[  506.452904] Call Trace: 
[  506.452907] [c000000005aa7a20] [c000000005aa7a60] 0xc000000005aa7a60 (unreliable) 
[  506.452913] [c000000005aa7aa0] [c0000000002e2a28] register_kprobe+0x798/0x9b0 
[  506.452919] [c000000005aa7b10] [c0000000002e2dc0] register_kretprobe.part.16+0xc0/0x3d0 
[  506.452926] [c000000005aa7b60] [c0000000002e3428] register_kretprobe+0x128/0x190 
[  506.452932] [c000000005aa7ba0] [d000000001f921c0] stapkp_register_probe+0xe8/0x1f0 [stap_930f8f23a61453c40ca1c17d02dbb00b__7993] 
[  506.452940] [c000000005aa7be0] [d000000001f9b2ec] systemtap_module_init+0x8d4/0xac0 [stap_930f8f23a61453c40ca1c17d02dbb00b__7993] 
[  506.452949] [c000000005aa7cc0] [d000000001f9b6d8] _stp_ctl_write_cmd+0x200/0x10f0 [stap_930f8f23a61453c40ca1c17d02dbb00b__7993] 
[  506.452958] [c000000005aa7d90] [c000000000659360] proc_reg_write+0x90/0x1a0 
[  506.452964] [c000000005aa7dc0] [c000000000573ae4] sys_write+0x134/0x3a0 
[  506.452971] [c000000005aa7e20] [c00000000000b408] system_call+0x5c/0x70 
[  506.452977] Instruction dump: 
[  506.452980] fbc1fff0 fbe1fff8 f8010010 f821ff81 3ba000ff e9230028 7c7e1b78 83e90000  
[  506.452991] 57ea36be 2f8a0001 409e0008 83a90004 <8149fffc> 7fe7fb78 554a36be 2b8a0001  
[  506.453003] ---[ end trace 21503c792de92f64 ]--- 
[  506.460194]  
[  506.460201] Sending IPI to other CPUs 
[  506.461497] IPI complete 
[  506.482748] kexec: Starting switchover sequence. 
I'm in purgatory 


I attempted to replicate the problem using perf kprobe based events described by https://www.kernel.org/doc/html/v4.19/trace/kprobetrace.html on the test_module2 yyy.
The perf probe was triggered events but did not crash the machine.  The yyy function is at the very beginning of the module. There seems to be an access before the the start of module that is triggering the problem.

#  addr2line -ie /usr/lib/debug/lib/modules/4.18.0-372.78.1.el8_6.ppc64le/vmlinux 0xc00000000005f60c
/usr/src/debug/kernel-4.18.0-372.78.1.el8_6/linux-4.18.0-372.78.1.el8_6.ppc64le/./arch/powerpc/include/asm/inst.h:54
/usr/src/debug/kernel-4.18.0-372.78.1.el8_6/linux-4.18.0-372.78.1.el8_6.ppc64le/arch/powerpc/kernel/kprobes.c:124

The code in arch_prepare_kprobe(struct kprobe *p) appears to be accessing before the beginning of the function:

int arch_prepare_kprobe(struct kprobe *p)
{
        int ret = 0;
        struct kprobe *prev;
        struct ppc_inst insn = ppc_inst_read((struct ppc_inst *)p->addr);
        struct ppc_inst prefix = ppc_inst_read((struct ppc_inst *)(p->addr - 1)); //<---


I wasn't able to replicate this on rhel9 as the test does not currently compile because changes in a function that is called.
Comment 1 William Cohen 2023-10-30 00:37:21 UTC
Looking at the /sys/kernel/debug/tracing/trace I see why now it didn' trigger the problem as the probe was put on yyy_int+0x4 rather than yyy_int+0x0 as systemtap is doing:

echo 'p:myprobe systemtap_test_module2:yyy_int $comm' > /sys/kernel/debug/tracing/kprobe_events
echo 1 > /sys/kernel/debug/tracing/events/kprobes/myprobe/enable
echo  1 > /proc/stap_test_cmd 
echo  1 > /proc/stap_test_cmd

# cat /sys/kernel/debug/tracing/trace
# tracer: nop
#
#                              _-----=> irqs-off
#                             / _----=> need-resched
#                            | / _---=> hardirq/softirq
#                            || / _--=> preempt-depth
#                            ||| /     delay
#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
#              | |       |   ||||       |         |
            bash-1976  [006] ....  1525.064248: myprobe: (yyy_int+0x4/0x20 [systemtap_test_module2]) arg1="bash"
            bash-1976  [006] ....  1617.000845: myprobe: (yyy_int+0x4/0x20 [systemtap_test_module2]) arg1="bash"
Comment 2 William Cohen 2023-10-30 01:32:20 UTC
Did some digging around and found that this has been fixed in the kernel-4.18.0-520.el8.ppc64le kernel.  With that kernel the test just fails rather than causing the system to reboot:

unning /usr/share/systemtap/testsuite/systemtap.context/context.exp ...
FAIL: backtrace - could not build modules
FAIL: args - could not build modules
FAIL: pid - could not build modules
FAIL: num_args - could not build modules
FAIL: symfileline - could not build modules

		=== systemtap Summary ===

# of unexpected failures	5

Looking through the red hat issue looks like there might also be some RHEL9 kernels version that get similar oops (RHELPLAN-142467) and appears to be fixed in RHEL9.2.
Comment 3 William Cohen 2023-10-30 01:33:10 UTC
RHELPLAN-163851 is the RHEL issue that describe the RHEL8.9 kernel issue.
Comment 4 Frank Ch. Eigler 2023-10-30 07:37:27 UTC
(tweaking status field to indicate that it was not a systemtap bug)