Bug 23493 - Test suite makes all CPU stuck forever on kernel 4.16.16 (Fedora 27)
Summary: Test suite makes all CPU stuck forever on kernel 4.16.16 (Fedora 27)
Status: RESOLVED OBSOLETE
Alias: None
Product: systemtap
Classification: Unclassified
Component: testsuite (show other bugs)
Version: unspecified
: P2 normal
Target Milestone: ---
Assignee: Unassigned
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-08-07 17:23 UTC by agentzh
Modified: 2023-05-25 16:35 UTC (History)
2 users (show)

See Also:
Host:
Target:
Build:
Last reconfirmed:


Attachments
stap-report output on the same box (12.04 KB, text/plain)
2018-08-07 17:42 UTC, agentzh
Details
Full /var/log/messages for the full period of test running (10.22 KB, text/plain)
2018-08-10 18:41 UTC, agentzh
Details
Full output of the command `sudo time make -j8 installcheck-parallel |& tee result.txt` (15.39 KB, text/plain)
2018-08-10 18:43 UTC, agentzh
Details
Full output of 'sudo make -j4 installcheck-parallel' for the find_get_pid kernel oops run (4.20 KB, text/plain)
2018-08-29 18:02 UTC, agentzh
Details

Note You need to log in before you can comment on or make changes to this bug.
Description agentzh 2018-08-07 17:23:27 UTC
I tried running the tests for systemtap release 3.3 (both the tests and the source code were from that release) on my Fedora 27 with kernel 2.6.16, but after running the tests for one hour, all the 8 logical CPU cores stuck for more than 8 hours without going any further and I had to reboot the VM with force (even ssh cannot work).

Below is all the error messages after the stuck happens:

Aug  7 00:27:28 work2 kernel: Systemtap Error at _stp_mempool_init:50
Aug  7 00:27:28 work2 kernel: Memory allocation failed.
Aug  7 00:27:28 work2 kernel: Systemtap Error at _stp_register_ctl_channel:738
Aug  7 00:27:28 work2 kernel: Error creating systemtap control channel.
Aug  7 00:31:29 work2 dhclient[1149]: DHCPREQUEST on ens36 to 192.168.46.254 port 67 (xid=0x2772f551)
Aug  7 00:31:29 work2 dhclient[1149]: DHCPACK from 192.168.46.254 (xid=0x2772f551)
Aug  7 00:31:29 work2 NetworkManager[839]: <info>  [1533627089.6530] dhcp4 (ens36):   address 192.168.46.170
Aug  7 00:31:29 work2 NetworkManager[839]: <info>  [1533627089.6533] dhcp4 (ens36):   plen 24 (255.255.255.0)
Aug  7 00:31:29 work2 NetworkManager[839]: <info>  [1533627089.6533] dhcp4 (ens36):   lease time 1800
Aug  7 00:31:29 work2 NetworkManager[839]: <info>  [1533627089.6533] dhcp4 (ens36):   nameserver '192.168.46.1'
Aug  7 00:31:29 work2 NetworkManager[839]: <info>  [1533627089.6533] dhcp4 (ens36):   domain name 'localdomain'
Aug  7 00:31:29 work2 dbus-daemon[724]: [system] Activating via systemd: service name='org.freedesktop.nm_dispatcher' unit='dbus-org.freedesktop.nm-dispatcher.service' requested by ':1.5' (uid=0 pid=839 comm="/usr/sbin/NetworkManager --no-daemon " label="system_u:system_r:NetworkManager_t:s0")
Aug  7 00:31:29 work2 NetworkManager[839]: <info>  [1533627089.6533] dhcp4 (ens36): state changed bound -> bound
Aug  7 00:31:29 work2 dhclient[1149]: bound to 192.168.46.170 -- renewal in 683 seconds.
Aug  7 00:31:29 work2 systemd[1]: Starting Network Manager Script Dispatcher Service...
Aug  7 00:31:29 work2 dbus-daemon[724]: [system] Successfully activated service 'org.freedesktop.nm_dispatcher'
Aug  7 00:31:29 work2 audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=NetworkManager-dispatcher comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Aug  7 00:31:29 work2 systemd[1]: Started Network Manager Script Dispatcher Service.
Aug  7 00:31:29 work2 nm-dispatcher[47712]: req:1 'dhcp4-change' [ens36]: new request (6 scripts)
Aug  7 00:31:29 work2 nm-dispatcher[47712]: req:1 'dhcp4-change' [ens36]: start running ordered scripts...
Aug  7 00:31:40 work2 audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=NetworkManager-dispatcher comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Aug  7 00:32:33 work2 audit[59170]: ANOM_ABEND auid=1000 uid=0 gid=0 ses=3 subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 pid=59170 comm="test" exe="/mnt/home/agentzh/git/stap-build/testsuite/artifacts/systemtap.base/pr18649/test" sig=11 res=1
Aug  7 00:32:33 work2 kernel: test[59170]: segfault at 0 ip 0000000000000000 sp 00000000fff42e30 error 14 in test[8048000+1000]
Aug  7 00:32:33 work2 systemd[1]: Started Process Core Dump (PID 59186/UID 0).
Aug  7 00:32:33 work2 audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-coredump@5-59186-0 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Aug  7 00:32:34 work2 systemd-coredump[59191]: Process 59170 (test) of user 0 dumped core.#012#012Stack trace of thread 59170:#012#0  0x0000000000000000 n/a (n/a)
Aug  7 00:32:34 work2 audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-coredump@5-59186-0 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Aug  7 00:32:45 work2 kernel: test[62368]: segfault at 0 ip 0000000000000000 sp 00000000ffb53270 error 14 in test[8048000+1000]
Aug  7 00:32:45 work2 audit[62368]: ANOM_ABEND auid=1000 uid=0 gid=0 ses=3 subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 pid=62368 comm="test" exe="/mnt/home/agentzh/git/stap-build/testsuite/artifacts/systemtap.base/pr18649/test" sig=11 res=1
Aug  7 00:32:45 work2 systemd[1]: Started Process Core Dump (PID 62441/UID 0).
Aug  7 00:32:45 work2 audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-coredump@6-62441-0 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Aug  7 00:32:46 work2 systemd-coredump[62445]: Process 62368 (test) of user 0 dumped core.#012#012Stack trace of thread 62368:#012#0  0x0000000000000000 n/a (n/a)
Aug  7 00:32:46 work2 audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-coredump@6-62441-0 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Aug  7 00:32:51 work2 audit[64501]: ANOM_ABEND auid=1000 uid=0 gid=0 ses=3 subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 pid=64501 comm="test" exe="/mnt/home/agentzh/git/stap-build/testsuite/artifacts/systemtap.base/uprobe_nd_params/test" sig=11 res=1
Aug  7 00:32:51 work2 kernel: test[64501]: segfault at 0 ip 0000000000000000 sp 00000000ff877e2c error 14 in test[8048000+1000]
Aug  7 00:32:51 work2 systemd[1]: Started Process Core Dump (PID 64521/UID 0).
Aug  7 00:32:51 work2 audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-coredump@7-64521-0 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Aug  7 00:32:52 work2 systemd-coredump[64529]: Process 64501 (test) of user 0 dumped core.#012#012Stack trace of thread 64501:#012#0  0x0000000000000000 n/a (n/a)
Aug  7 00:32:52 work2 audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-coredump@7-64521-0 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Aug  7 00:33:25 work2 stapio[65395]: WARNING:
Aug  7 00:33:25 work2 stapio[65395]:
Aug  7 00:33:25 work2 stapio[65395]: Number of errors: 0, skipped probes: 23
Aug  7 00:34:16 work2 kernel: stap_e0ad497a0eaf9de8589606c16f0b585_79159: unknown parameter 'pt' ignored

And below is the last bunch of lines in the test run output.

Running target unix
Using /usr/share/dejagnu/baseboards/unix.exp as board description file for target.
Using /usr/share/dejagnu/config/unix.exp as generic interface file for target.
Using /mnt/home/agentzh/git/systemtap/testsuite/config/unix.exp as tool-and-target-specific interface file.
Test run by root on Tue Aug  7 00:35:53 2018
Native configuration is x86_64-pc-linux-gnu

        === systemtap tests ===

Running /mnt/home/agentzh/git/systemtap/testsuite/systemtap.base/kallsyms_expand_symbol.exp ...
Schedule of variations:
    unix

Running target unix
Using /usr/share/dejagnu/baseboards/unix.exp as board description file for target.
Using /usr/share/dejagnu/config/unix.exp as generic interface file for target.
Using /mnt/home/agentzh/git/systemtap/testsuite/config/unix.exp as tool-and-target-specific interface file.
Running /mnt/home/agentzh/git/systemtap/testsuite/systemtap.base/perf.exp ...

        === systemtap Summary ===

# of expected passes        2
make[3]: Leaving directory '/mnt/home/agentzh/git/stap-build/testsuite'
if test -e artifacts/systemtap.base/sdt_varname ; then \
echo "Using existing test result for systemtap.base/sdt_varname.exp"; \
else \
mkdir -p artifacts/systemtap.base/sdt_varname ; \
MAKEFLAGS= make  check-DEJAGNU RUNTESTFLAGS="--outdir=artifacts/systemtap.base/sdt_varname systemtap.base/sdt_varname.exp  --tool_opts \'install \'" STAP_PARALLEL=yes; \
fi
make[3]: Entering directory '/mnt/home/agentzh/git/stap-build/testsuite'
srcdir='../../systemtap/testsuite'; export srcdir; \
EXPECT=expect; export EXPECT; \
if /bin/sh -c "env XDG_DATA_DIRS= SYSTEMTAP_SYNC=1 LANG=C SYSTEMTAP_TESTREMOTES= SYSTEMTAP_TESTAPPS= SYSTEMTAP_RUNTIME=/opt/stap/share/systemtap/runtime SYSTEMTAP_TAPSET=/opt/stap/share/systemtap/tapset LD_LIBRARY_PATH=/opt/stap/lib/systemtap:/opt/stap/lib/dyninst CRASH_LIBDIR=/opt/stap/lib/systemtap PATH=/opt/stap/bin:$PATH SYSTEMTAP_PATH=/opt/stap/bin SYSTEMTAP_INCLUDES=/opt/stap/include PKGLIBDIR=/opt/stap/libexec/systemtap PYTHON=/bin/python2 PYEXECDIR=/opt/stap/lib64/python2.7/site-packages PYTHON3=/bin/python3 PY3EXECDIR=/opt/stap/lib64/python3.6/site-packages SYSCONFDIR=/opt/stap/etc ../../systemtap/testsuite/execrc runtest --version" > /dev/null 2>&1; then \
  exit_status=0; l='systemtap'; for tool in $l; do \
    if env XDG_DATA_DIRS= SYSTEMTAP_SYNC=1 LANG=C SYSTEMTAP_TESTREMOTES= SYSTEMTAP_TESTAPPS= SYSTEMTAP_RUNTIME=/opt/stap/share/systemtap/runtime SYSTEMTAP_TAPSET=/opt/stap/share/systemtap/tapset LD_LIBRARY_PATH=/opt/stap/lib/systemtap:/opt/stap/lib/dyninst CRASH_LIBDIR=/opt/stap/lib/systemtap PATH=/opt/stap/bin:$PATH SYSTEMTAP_PATH=/opt/stap/bin SYSTEMTAP_INCLUDES=/opt/stap/include PKGLIBDIR=/opt/stap/libexec/systemtap PYTHON=/bin/python2 PYEXECDIR=/opt/stap/lib64/python2.7/site-packages PYTHON3=/bin/python3 PY3EXECDIR=/opt/stap/lib64/python3.6/site-packages SYSCONFDIR=/opt/stap/etc ../../systemtap/testsuite/execrc runtest  --tool $tool --tool_opts \'\' --srcdir /mnt/home/agentzh/git/systemtap/testsuite --outdir=artifacts/systemtap.base/sdt_varname systemtap.base/sdt_varname.exp  --tool_opts \'install \'; \
    then :; else exit_status=1; fi; \
  done; \
else echo "WARNING: could not find 'env XDG_DATA_DIRS= SYSTEMTAP_SYNC=1 LANG=C SYSTEMTAP_TESTREMOTES= SYSTEMTAP_TESTAPPS= SYSTEMTAP_RUNTIME=/opt/stap/share/systemtap/runtime SYSTEMTAP_TAPSET=/opt/stap/share/systemtap/tapset LD_LIBRARY_PATH=/opt/stap/lib/systemtap:/opt/stap/lib/dyninst CRASH_LIBDIR=/opt/stap/lib/systemtap PATH=/opt/stap/bin:$PATH SYSTEMTAP_PATH=/opt/stap/bin SYSTEMTAP_INCLUDES=/opt/stap/include PKGLIBDIR=/opt/stap/libexec/systemtap PYTHON=/bin/python2 PYEXECDIR=/opt/stap/lib64/python2.7/site-packages PYTHON3=/bin/python3 PY3EXECDIR=/opt/stap/lib64/python3.6/site-packages SYSCONFDIR=/opt/stap/etc ../../systemtap/testsuite/execrc runtest'" 1>&2; :;\
fi; \
exit $exit_status
WARNING: Couldn't find the global config file.
Checking for 64-bit support...
spawn -ignore SIGHUP gcc hello.c -g -m64 -lm -o hello-m64^M
x86_64 64-bit support available
Checking for 32-bit support...
spawn -ignore SIGHUP gcc hello.c -g -m32 -lm -o hello-m32^M
x86_64 32-bit support available
kernel location: /usr/lib/debug/lib/modules/4.16.16-200.fc27.x86_64/vmlinux
kernel version: 4.16.16-200.fc27.x86_64
systemtap location: /opt/stap/bin/stap
systemtap version: version 3.3/0.173, commit release-3.3-0-g48867d1cface
gcc location: /usr/bin/gcc
gcc version: gcc (GCC) 7.3.1 20180712 (Red Hat 7.3.1-6)
Test run by root on Tue Aug  7 00:35:57 2018
Native configuration is x86_64-pc-linux-gnu

        === systemtap tests ===

Schedule of variations:
    unix

Running target unix
Using /usr/share/dejagnu/baseboards/unix.exp as board description file for target.
Using /usr/share/dejagnu/config/unix.exp as generic interface file for target.
Using /mnt/home/agentzh/git/systemtap/testsuite/config/unix.exp as tool-and-target-specific interface file.
Running /mnt/home/agentzh/git/systemtap/testsuite/systemtap.base/sdt_varname.exp ...
Comment 1 agentzh 2018-08-07 17:26:05 UTC
Sorry, the kernel version is 4.16.16:

$ uname -r
4.16.16-200.fc27.x86_64
Comment 2 agentzh 2018-08-07 17:42:17 UTC
Created attachment 11167 [details]
stap-report output on the same box
Comment 3 Frank Ch. Eigler 2018-08-08 23:03:46 UTC
Two more bits of data should help identify the problem:

- a segment of the dmesg from -before- the crash, in order to figure out which scripts were running most recently or at the time

- the exact make-check type invocation you were using (a parallel-installcheck?)
Comment 4 agentzh 2018-08-10 18:41:22 UTC
Created attachment 11175 [details]
Full /var/log/messages for the full period of test running
Comment 5 agentzh 2018-08-10 18:43:12 UTC
Created attachment 11176 [details]
Full output of the command `sudo time make -j8 installcheck-parallel |& tee result.txt`
Comment 6 agentzh 2018-08-10 18:50:05 UTC
The command I used to run the stap test suite was this:

```
sudo time make -j8 installcheck-parallel |& tee result.txt
```

I've attached the full `result.txt` produced that night to this PR for your reference.

I've also attached the full /var/log/messages file content during the test run period (also included messages right before I started running the stap test suite and right after I rebooted the machine on the next day to avoid any confusions).

Some observations from the result.txt and /var/log/messages files:

1. Aug 6 23:18 I started running the test with the command `sudo time make -j8 installcheck-parallel |& tee result.txt` (see result.txt)

2. Aug 7 00:35:57 or later: the stap test suite generates the last few lines of output to stderr/stdout (see result.txt)

3. Aug  7 00:34:16 The kernel generated the last line of output to /var/log/messages, which was "work2 kernel: stap_e0ad497a0eaf9de8589606c16f0b585_79159: unknown parameter 'pt' ignored" After that, no kernel dmesg output until I rebooted the machine with force.

4. Aug  7 08:28:03 I got up in the morning and rebooted the machine with force.

The virtual machine running the tests has 8GB of RAM:

```
$ free
              total        used        free      shared  buff/cache   available
Mem:        8143832      646244     1851164       98824     5646424     6997500
Swap:       2097148           0     2097148
```

It has 8 logical CPU cores (4 physical CPU cores with hyperthreading enabled).

Please let me know if you need any further information. Thanks!
Comment 7 agentzh 2018-08-26 06:26:48 UTC
It seems like that the stuck usually happens after running `uprobes_onthefly.exp` and/or `kprobes_onthefly.exp`. By checking the `/proc/buddyinfo` output during the nearby time window, the fragmentation becomes dramatically worse, for example:

```
$ cat /proc/buddyinfo
Node 0, zone      DMA      0      0      0      0      2      1      1      0      1      1      3
Node 0, zone    DMA32     14     23     27     27     23     12     17     16      9      4    739
Node 0, zone   Normal    379   4113   3506   3803   2461    733    221      0      0      0      0

$ cat /proc/buddyinfo
Node 0, zone      DMA      0      0      0      0      2      1      1      0      1      1      3
Node 0, zone    DMA32      5      5     10     20     15      5     16     16      9      4    700
Node 0, zone   Normal    344   1273   1414   1396   1047    473    143      0      0      0      0
```

Even though there's a lot of free memory in the system at the same times:

```
$ free -m
              total        used        free      shared  buff/cache   available
Mem:          11700        4897        3510          76        3293        5702
Swap:          2047           0        2047
```

I'm thinking about enforcing VM compaction every few minutes during the test suite runs as a work around...
Comment 8 agentzh 2018-08-29 01:20:15 UTC
Got another VM lockup today and found the following output in the serial port console:

```
Fedora 27 (Server Edition)
Kernel 4.16.16-200.fc27.x86_64 on an x86_64 (ttyS1)

Admin Console: …

work2 login: [199810.093706] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
[199810.098304] IP: idr_find+0x0/0x10
[199810.099452] PGD 0 P4D 0
[199810.100545] Oops: 0000 [#1] SMP PTI
[199810.101868] Modules linked in: stap_90d3fbe18ef964786450622403aff75f_53232(OE) stap_cb1c82c1a520c7e28bc23ff0583b723_26015(OE) stap_5546859fb0fb1fe0358665f08f8539_102168(OE) stap_1699a9c06989cded3b2bfdebd62fd6e9_7693(OE) s_97966(OE) tun binfmt_misc ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 fuse nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_raw iptable_security ebtable_filter ebtables ip6table_filter ip6_tables vmw_vsock_vmci_transport vsock sunrpc snd_seq_midi snd_seq_midi_event snd_ens1371 coretemp crct10dif_pclmul snd_ac97_codec ac97_bus btusb gameport crc32_pclmul btrtl
[199810.124418]  snd_rawmidi btbcm ghash_clmulni_intel snd_seq btintel snd_seq_device bluetooth snd_pcm snd_timer snd ecdh_generic rfkill soundcore intel_rapl_perf joydev vmw_balloon i2c_piix4 vmw_vmci shpchp xfs libcrc32c vmwgfx drm_kms_helper ttm crc32c_intel drm mptspi serio_raw scsi_transport_spi mptscsih e1000 mptbase ata_generic pata_acpi [last unloaded: stap_c0bb131f68fd6d96fe20bf726877cbfd_51549]
[199810.135177] CPU: 2 PID: 53244 Comm: cat Tainted: G           OE    4.16.16-200.fc27.x86_64 #1
[199810.137821] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[199810.141336] RIP: 0010:idr_find+0x0/0x10
[199810.142583] RSP: 0018:ffffbeddc3423d40 EFLAGS: 00010046
[199810.144310] RAX: 0000000000000000 RBX: ffffbeddc2191000 RCX: 00000000ffffffff
[199810.146450] RDX: 0000000000000000 RSI: 000000000000cffc RDI: 0000000000000008
[199810.148760] RBP: ffffbeddc21918a8 R08: 0000000000002710 R09: ffff9e8127a01f00
[199810.150956] R10: 0000000000002710 R11: 0000000000000000 R12: 0000000000000001
[199810.153205] R13: 0000000000000000 R14: 0000000000000000 R15: 00000000ffffffff
[199810.155436] FS:  0000000000000000(0000) GS:ffff9e825fa80000(0000) knlGS:0000000000000000
[199810.157981] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[199810.212544] CR2: 0000000000000018 CR3: 000000024120a001 CR4: 00000000001606e0
[199810.215011] DR0: ffffffff8427ff88 DR1: 0000000000000000 DR2: 0000000000000000
[199810.217433] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
[199810.219853] Call Trace:
[199810.233902]  find_get_pid+0x38/0x50
[199810.235436]  function___global_pid2task__overload_0+0x7a/0x180 [stap_90d3fbe18ef964786450622403aff75f_53232]
[199810.239019]  ? probe_4960+0xad6/0x1760 [stap_90d3fbe18ef964786450622403aff75f_53232]
[199810.241607]  ? enter_real_tracepoint_probe_0+0x198/0x2d0 [stap_90d3fbe18ef964786450622403aff75f_53232]
[199810.245162]  ? __schedule+0x2d3/0x860
[199810.246718]  ? do_task_dead+0x3f/0x50
[199810.248254]  ? do_exit+0x73b/0xbb0
[199810.249684]  ? do_group_exit+0x3a/0xa0
[199810.251200]  ? SyS_exit_group+0x10/0x10
[199810.252711]  ? do_syscall_64+0x74/0x180
[199810.254238]  ? entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[199810.256126] Code: e7 44 89 6c 24 04 e8 00 fe ff ff eb af e8 09 b9 80 ff 66 0f 1f 84 00 00 00 00 00 8b 47 10 31 d2 48 29 c6 e9 c3 55 00 00 0f 1f 00 <8b> 47 10 48 29 c6 e9 a5 55 00 00 0f 1f 44 00 00 55 53 48 89 f5
[199810.262366] RIP: idr_find+0x0/0x10 RSP: ffffbeddc3423d40
[199810.264261] CR2: 0000000000000018
[199810.265642] ---[ end trace 4e992f315e802219 ]---
[199810.267325] Fixing recursive fault but reboot is needed!
```

Is it a bug in the stap runtime or in the kernel?
Comment 9 Mark Wielaard 2018-08-29 12:59:22 UTC
(In reply to agentzh from comment #8)
> Is it a bug in the stap runtime or in the kernel?

It looks like this is caused by calling the kernel find_get_pid () function from the pid2task tapset function when in kernel mode, just after the current task died? I am not sure find_get_pid should/can be called from a kernel context. It might assume the current pid actually exists? Maybe current is NULL at this point and we should check for that first?

Do you have more context about which probe was hit when this happened?
Comment 10 agentzh 2018-08-29 17:47:19 UTC
(In reply to Mark Wielaard from comment #9)
> 
> Do you have more context about which probe was hit when this happened?

Hi Mark, thanks for your reply. I tried hard to pinpoint that stap_90d3fbe18ef964786450622403aff75f_53232 kernel module after rebooting the system with force but it seems that this ko is long gone and there is no record for this thing in the test suite log files either. Anyway, I'll provide the full test suite output file in an attachment here.
Comment 11 agentzh 2018-08-29 18:02:06 UTC
Created attachment 11215 [details]
Full output of 'sudo make -j4 installcheck-parallel' for the find_get_pid kernel oops run
Comment 12 Frank Ch. Eigler 2020-02-20 14:26:50 UTC
(In reply to Mark Wielaard from comment #9)
> It might assume the current pid actually exists? Maybe
> current is NULL at this point and we should check for that first?

current is nonzero (I think by definition), but
task_active_pid_ns(current) (called within find_get_pid->find_vpid)
looks like it was 0.

So when we get called for a schedule tracepoint, the invoking task
might just be a mostly-dead one which doesn't even have a pid any more.
I'm not sure why this should occur only under heavy load
(installcheck-parallel), vs. all the time.  Maybe task garbage collection
occurs more in the former case.

I'm tempted to put in a probe-prologue detection of this
(maybe via (current->flags & PF_EXITING)) and reject the probe
hit entirely.
Comment 13 agentzh 2020-02-20 20:41:05 UTC
fche, thanks for the suggestion! We will try doing such experiments on our side when we have a chance.
Comment 14 Frank Ch. Eigler 2023-05-25 16:35:27 UTC
old news