Summary: | uprobes_onthefly.exp causing hang on ppc64 | ||
---|---|---|---|
Product: | systemtap | Reporter: | David Smith <dsmith> |
Component: | runtime | Assignee: | Unassigned <systemtap> |
Status: | RESOLVED FIXED | ||
Severity: | normal | CC: | mcermak |
Priority: | P2 | ||
Version: | unspecified | ||
Target Milestone: | --- | ||
Host: | Target: | ||
Build: | Last reconfirmed: | ||
Attachments: |
stap-report output
test log |
Description
David Smith
2014-08-14 19:43:40 UTC
One of the things the test does is turn on on-the-fly debugging using stap's '-D DEBUG_STP_ON_THE_FLY' flag. Without this flag on, I get a different behaviour. The tests fail of course, but the test runs until the system hangs. Here's the console output: ==== [ 260.667249] stap_dcf37cae5885f9b67f9894a335ed6337_3602: module verification failed: signature and/or required key missing - tainting kernel [ 347.786372] stap module notifier triggered in unexpected state 3 [ 356.786209] stap module notifier triggered in unexpected state 3 [ 362.386105] stap module notifier triggered in unexpected state 3 [ 440.444905] INFO: rcu_sched detected stalls on CPUs/tasks: { 0 1 6 7} (detected by 3, t=6002 jiffies, g=1634, c=1633, q=0) [ 440.445177] Task dump for CPU 0: [ 440.445267] swapper/0 R running task 0 0 0 0x00000000 [ 440.445362] Call Trace: [ 440.445403] [c000000001283820] [c00000000011d6c8] .idle_balance+0x128/0x240 (unreliable) [ 440.445521] [c0000000012839f0] [c0000000000a4194] .p970_compute_mmcr+0x8d4/0x9f0 [ 440.445630] Task dump for CPU 1: [ 440.445679] swapper/1 R running task 0 0 1 0x00000800 [ 440.445775] Call Trace: [ 440.445811] [c00000022df6b870] [c0000000004511c4] .cpumask_next_and+0x94/0xa0 (unreliable) [ 440.445929] [c00000022df6ba40] [c0000000000a4194] .p970_compute_mmcr+0x8d4/0x9f0 [ 440.446039] Task dump for CPU 6: [ 440.446085] swapper/6 R running task 0 0 1 0x00000800 [ 440.446180] Call Trace: [ 440.446215] [c00000022df7f870] [c0000000004511c4] .cpumask_next_and+0x94/0xa0 (unreliable) [ 440.446330] [c00000022df7fa40] [c0000000007683cc] .cmsghdr_from_user_compat_to_kern+0x5c/0x350 [ 440.446452] Task dump for CPU 7: [ 440.446503] swapper/7 R running task 0 0 1 0x00000800 [ 440.446595] Call Trace: [ 440.446630] [c00000022c003870] [c0000000008ad548] .__schedule+0x898/0xae0 (unreliable) [ 440.446745] [c00000022c003a40] [c0000000000a4194] .p970_compute_mmcr+0x8d4/0x9f0 ==== The uprobes_onthefly.exp testcase is composed of several subtests. All the tests run with 'run_subtest_valid' work fine: otf_finish_at_start_disabled, otf_finish_at_start_enabled, otf_start_disabled_iter_1, otf_start_enabled_iter_1, otf_start_disabled_iter_2, otf_start_enabled_iter_2, otf_start_disabled_iter_3, otf_start_enabled_iter_3, otf_start_disabled_iter_4, otf_start_enabled_iter_4, otf_start_disabled_iter_5, otf_start_enabled_iter_5, otf_timer_100ms, and otf_timer_50ms The stress tests (run with 'run_subtset_stress') are where the problem lies. Here is what happens when you run each individually: - otf_stress_10ms_iter_50: passes - otf_stress_5ms_iter_50: passes - otf_stress_1ms_iter_50: passes - otf_stress_500us_iter_50: passes, but causes the "stap module notifier triggered in unexpected state 3" message on the console - otf_stress_100us_iter_50: ditto - otf_stress_prof_iter_2000: ditto - otf_stress_hard_iter_2000: hangs the system with the following on the console: [ 1902.414132] INFO: rcu_sched detected stalls on CPUs/tasks: { 0 2 5 6} (detected by 4, t=6002 jiffies, g=4654, c=4653, q=0) [ 1902.414422] Task dump for CPU 0: [ 1902.414471] swapper/0 R running task 0 0 0 0x00000000 [ 1902.414570] Call Trace: [ 1902.414616] [c000000001283820] [c00000000046cc78] .find_next_bit+0x38/0xf0 (unreliable) [ 1902.414734] [c0000000012839f0] [c0000000000a4194] .p970_compute_mmcr+0x8d4/0x9f0 [ 1902.414844] Task dump for CPU 2: [ 1902.414889] swapper/2 R running task 0 0 1 0x00000800 [ 1902.414986] Call Trace: [ 1902.415022] [c00000022df6f870] [c0000000008ad580] .__schedule+0x8d0/0xae0 (unreliable) [ 1902.415131] [c00000022df6fa40] [c0000000000a4194] .p970_compute_mmcr+0x8d4/0x9f0 [ 1902.415243] Task dump for CPU 5: [ 1902.415290] swapper/5 R running task 0 0 1 0x00000800 [ 1902.415387] Call Trace: [ 1902.415429] [c00000022df7b870] [000001ac274bfdaa] 0x1ac274bfdaa (unreliable) [ 1902.415538] [c00000022df7ba40] [c0000000000a4194] .p970_compute_mmcr+0x8d4/0x9f0 [ 1902.415647] Task dump for CPU 6: [ 1902.415693] swapper/6 R running task 0 0 1 0x00000800 [ 1902.415785] Call Trace: [ 1902.415818] [c00000022df7f870] [c0000000008ad574] .__schedule+0x8c4/0xae0 (unreliable) [ 1902.415925] [c00000022df7fa40] [c0000000000a4194] .p970_compute_mmcr+0x8d4/0x9f0 - otf_stress_max_iter_5000: causes a system hang with no output on the console (BTW, the unexpected state 3 in the "stap module notifier triggered in unexpected state 3" message is STAP_SESSION_STOPPING.) OK, I think I know what is going on here. I think there are actually 2 separate problems here: 1) The "stap module notifier triggered in unexpected state 3" message happens when the session is shutting down but the uprobe hasn't been unregistered yet. I think we can change the code to not output this message if we in the STAP_SESSION_STOPPING state. 2) The hang happens in the 'otf_stress_hard_iter_2000' and 'otf_stress_max_iter_5000' subtests. These tests also probe 'kernel.trace("*")', and are hitting bug #17126 (tracepoints.exp testcase causing stalls/hang on ppc64). When that one gets solved, the hangs should stop here also. If I take the 'kernel.trace("*")' probing out of those subtests, they pass just fine. Hey David, Glad you found the cause of the issue. > 1) The "stap module notifier triggered in unexpected state 3" message happens > when the session is shutting down but the uprobe hasn't been unregistered > yet. This can also occur if there is still a work item left on the workqueue after we changed to the STAP_SESSION_STOPPING state. This is probably why this error message occurs more frequently during stress subtests. > I think we can change the code to not output this message if we in the > STAP_SESSION_STOPPING state. Agreed! Jonathan Not only uprobes_onthefly.exp, but also hrtimer_onthefly.exp seems to cause stalls on ppc64 pretty regualrly. For reference see e.g. https://url.corp.redhat.com/d8cbb4b. Created attachment 7826 [details]
stap-report output
hrtimer_onthefly.exp causes kernel crash on s390x too:
[ 2362.636690] stap module notifier triggered in unexpected state 3
[ 2365.185234] stap module notifier triggered in unexpected state 3
[ 2365.185331] stap module notifier triggered in unexpected state 3
[ 2365.195695] Unable to handle kernel pointer dereference at virtual kernel add
ress (null)
[ 2365.195737] Oops: 0004 [#1] SMP
[ 2365.195740] Modules linked in: stap_67b0151b97c8e1dde8dcd8264a8bbe13__32658(O
F) nfsv3 rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache binfmt_misc ipt_MASQUERA
DE iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack
nf_conntrack ipt_REJECT xt_CHECKSUM iptable_mangle tun bridge stp llc ip6table_f
ilter ip6_tables iptable_filter ip_tables ebtable_nat ebtables sg qeth_l2 vmur n
fsd auth_rpcgss nfs_acl lockd sunrpc xfs libcrc32c dasd_fba_mod lcs ctcm fsm das
d_eckd_mod qeth qdio dasd_mod ccwgroup dm_mirror dm_region_hash dm_log dm_mod [l
ast unloaded: stap_cd7de823793eceac21f7a774e5b77a25__27834]
[ 2365.195807] CPU: 0 PID: 32658 Comm: stapio Tainted: GF O------------
-- 3.10.0-123.8.1.el7.s390x #1
[ 2365.195813] task: 000000007ced8910 ti: 000000005dbd4000 task.ti: 000000005dbd
4000
[ 2365.195886] Krnl PSW : 0704d00180000000 00000000005af91e (mutex_lock+0x2e/0x6
0)
[ 2365.195892] R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:1 PM:0 EA:
3
Krnl GPRS: 0000000000000081 0000000000000001 00000000ffffffff 0000000000000000
[ 2365.195896] 0000000000999f70 0000000000000001 0000000000000000 000
0000000000000
[ 2365.195905] 0000000000000008 0000000000000020 0000000000000001 000
003ffffbebd74
[ 2365.195925] 000000005dbd7d68 0000000000000020 000000005dbd7ca0 000
000005dbd7c80
[ 2365.195936] Krnl Code: 00000000005af916: 5830d000 l %r3,0(%r
13)
00000000005af91a: 1823 lr %r2,%r3
#00000000005af91c: 1b21 sr %r2,%r1
>00000000005af91e: ba32d000 cs %r3,%r2,0(%r13)
00000000005af922: a744fffc brc 4,5af91a
00000000005af926: ec24000d007e cij %r2,0,4,5af940
00000000005af92c: e31003100004 lg %r1,784
00000000005af932: e310d018002401: HCPGSP2629I The virtual machine is
placed in CP mode due to a SIGP stop from
CPU 01.
stg %r1,24(%r13)
[ 2365.196063] Call Trace:
[ 2365.196065] ([<0000000000936ba8>] uaccess+0x0/0x50)
[ 2365.196069] [<0000000000150ff0>] flush_workqueue+0x78/0x5a8
[ 2365.196073] [<000003ff803050de>] _stp_cleanup_and_exit+0xe6/0x5f8 [stap_67b0
151b97c8e1dde8dcd8264a8bbe13__32658]
[ 2365.196094] [<000003ff8030581c>] _stp_ctl_write_cmd+0x22c/0x458 [stap_67b015
1b97c8e1dde8dcd8264a8bbe13__32658]
[ 2365.196111] [<000000000027650a>] vfs_write+0xa2/0x1c8
[ 2365.196116] [<0000000000277018>] SyS_write+0x60/0xb0
[ 2365.196119] [<00000000005b28a4>] sysc_tracego+0x14/0x1a
[ 2365.196124] [<000003fffd76f3e8>] 0x3fffd76f3e8
[ 2365.196129] Last Breaking-Event-Address:
[ 2365.196132] [<0000000000150fea>] flush_workqueue+0x72/0x5a8
[ 2365.196136]
[ 2365.196139] Kernel panic - not syncing: Fatal exception: panic_on_oops
[ 2365.196201] Stap trace buffer for processor 0 sub-buffer 0:
[ 2365.196205] The following may not have been sent to the display:
[ 2365.196208] toggling
[ 2365.196212] systemtap_module_init:59253: not starting (hrtimer) pidx 0
[ 2365.196236] toggling
[ 2365.196237] toggling
[ 2365.196242] toggling
[ 2365.196254] toggling
[ 2365.196255] toggling
[ 2365.196256] toggling
[ 2365.196257] toggling
[ 2365.196257] toggling
[ 2365.196258] toggling
[ 2365.196259] toggling
[ 2365.196259] toggling
[ 2365.196260] toggling
[ 2365.196261] toggling
[ 2365.196262] toggling
[ 2365.196262] toggling
[ ... ]
Another, very similar issue: kprobes_onthefly.exp causes kernel oops on s390x (stap-0f2e612, 3.10.0-123.8.1.el7.s390x): [-- MARK -- Mon Oct 13 10:05:00 2014] [ 2038.634784] stap_cdf292d12c8ba9d0963faee8c05fb1c7_10782: module verification failed: signature and/or required key missing - tainting kernel [-- MARK -- Mon Oct 13 10:10:00 2014] [ 2184.533935] Unable to handle kernel pointer dereference at virtual kernel add ress (null) [ 2184.534007] Oops: 0004 [#1] SMP [ 2184.534011] Modules linked in: stap_5053927516a0f778e4fccdbbd2ca2985__62883(O F) nfsv3 rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache ipt_MASQUERADE iptable_n at nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT xt_CHECKSUM iptable_mangle tun bridge stp llc ip6table_filter ip6_ta bles iptable_filter ip_tables ebtable_nat ebtables sg qeth_l2 vmur nfsd auth_rpc gss nfs_acl lockd sunrpc xfs libcrc32c dasd_fba_mod dasd_eckd_mod dasd_mod qeth qdio lcs ctcm ccwgroup fsm dm_mirror dm_region_hash dm_log dm_mod [last unloaded : stap_b8a904592537701ddfebecd680ea95e2__40242] [ 2184.534087] CPU: 0 PID: 62883 Comm: stapio Tainted: GF O------------ -- 3.10.0-123.8.1.el7.s390x #1 [ 2184.534093] task: 0000000075361220 ti: 00000000510c8000 task.ti: 00000000510c 8000 [ 2184.534097] Krnl PSW : 0704d00180000000 00000000005af91e (mutex_lock+0x2e/0x6 0) [ 2184.534151] R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:1 PM:0 EA: 3 Krnl GPRS: 0000000000000081 0000000000000001 00000000ffffffff 0000000000000000 [ 2184.534159] 0000000000999f70 0000000000000001 0000000000000000 000 0000000000000 [ 2184.534161] 0000000000000008 0000000000000020 0000000000000001 000 003ffff8ac4d4 [ 2184.534163] 00000000510cbd68 0000000000000020 00000000510cbca0 000 00000510cbc80 [ 2184.534196] Krnl Code: 00000000005af916: 5830d000 l %r3,0(%r 13) 00000000005af91a: 1823 lr %r2,%r3 #00000000005af91c: 1b21 sr %r2,%r1 >00000000005af91e: ba32d000 cs %r3,%r2,0(%r13) 00000000005af922: a744fffc brc 4,5af91a 00000000005af926: ec24000d007e cij %r2,0,4,5af940 00000000005af92c: e31003100004 lg %r1,784 00000000005af932: e310d0180024 stg %r1,24(%r13) [ 2184.534264] Call Trace: 01: HCPGSP2629I The virtual machine is placed in CP mode due to a SIGP stop from CPU 01. [ 2184.534300] ([<0000000000936ba8>] uaccess+0x0/0x50) [ 2184.534330] [<0000000000150ff0>] flush_workqueue+0x78/0x5a8 [ 2184.534334] [<000003ff80252ce6>] _stp_cleanup_and_exit+0xe6/0x608 [stap_505 927516a0f778e4fccdbbd2ca2985__62883] [ 2184.534391] [<000003ff80253434>] _stp_ctl_write_cmd+0x22c/0x458 [stap_505392 7516a0f778e4fccdbbd2ca2985__62883] [ 2184.534417] [<000000000027650a>] vfs_write+0xa2/0x1c8 [ 2184.534423] [<0000000000277018>] SyS_write+0x60/0xb0 [ 2184.534426] [<00000000005b28a4>] sysc_tracego+0x14/0x1a [ 2184.534431] [<000003fffd0483e8>] 0x3fffd0483e8 [ 2184.534440] Last Breaking-Event-Address: [ 2184.534443] [<0000000000150fea>] flush_workqueue+0x72/0x5a8 [ 2184.534448] [ 2184.534450] Kernel panic - not syncing: Fatal exception: panic_on_oops [ 2184.534528] Stap trace buffer for processor 0 sub-buffer 0: [ 2184.534533] hit [ 2184.534536] rethit [ 2184.534540] toggling [ 2184.534543] toggling [ 2184.534546] hit [ 2184.534549] rethit [ 2184.534553] toggling [ 2184.534556] toggling [ 2184.534560] hit [ 2184.534596] rethit [ 2184.534597] toggling [ 2184.534615] toggling [ ... ] ======= (gdb) list *(flush_workqueue+0x72) 0x150fea is in flush_workqueue (kernel/workqueue.c:2581). 2581 mutex_lock(&wq->mutex); Commit e6f437a works around this problem, by: 1) As mentioned in comment #4, the "stap module notifier triggered in unexpected state 3" message will not be printed in the STAP_SESSION_STOPPING state. 2) The fix for bug #17126 fixed the probing of 'kernel.trace("*")' issues by adding a tracepoint blacklist. However, the 'onthefly' tests use systemtap's '-g' option, which disables the tracepoint blacklist. So, the commit modifies the 'onthefly' tests to not probe the h* tracepoints on ppc64. Created attachment 8053 [details]
test log
Using kernel-3.10.0-220.el7.ppc64 and stap based on commit 772e206 I'm getting following results for the hrtimer testcases:
=======
make RUNTESTFLAGS='hrtimer_onthefly.exp kprobes_onthefly.exp uprobes_onthefly.exp' installcheck
... stuff deleted ...
Running ./systemtap.onthefly/hrtimer_onthefly.exp ...
Running ./systemtap.onthefly/kprobes_onthefly.exp ...
FAIL: kprobes_onthefly - otf_finish_at_start_disabled (invalid output)
FAIL: kprobes_onthefly - otf_start_disabled_iter_1 (invalid output)
FAIL: kprobes_onthefly - otf_start_enabled_iter_1 (invalid output)
FAIL: kprobes_onthefly - otf_start_disabled_iter_2 (invalid output)
FAIL: kprobes_onthefly - otf_start_enabled_iter_2 (invalid output)
FAIL: kprobes_onthefly - otf_start_disabled_iter_3 (invalid output)
FAIL: kprobes_onthefly - otf_start_enabled_iter_3 (invalid output)
FAIL: kprobes_onthefly - otf_start_disabled_iter_4 (invalid output)
FAIL: kprobes_onthefly - otf_start_enabled_iter_4 (invalid output)
FAIL: kprobes_onthefly - otf_start_disabled_iter_5 (invalid output)
FAIL: kprobes_onthefly - otf_start_enabled_iter_5 (invalid output)
FAIL: kprobes_onthefly - otf_timer_100ms (invalid output)
FAIL: kprobes_onthefly - otf_timer_50ms (invalid output)
FAIL: kprobes_onthefly - otf_timer_10ms (invalid output)
Running ./systemtap.onthefly/uprobes_onthefly.exp ...
=== systemtap Summary ===
# of expected passes 54
# of unexpected failures 14
=======
Full test log attached.
(In reply to Martin Cermak from comment #9) > Created attachment 8053 [details] > test log > > Using kernel-3.10.0-220.el7.ppc64 and stap based on commit 772e206 I'm > getting following results for the hrtimer testcases: ... test failures reported ... Could you create a new bug with these failures? If the uprobes_onthefly.exp test didn't hang the system, this bug is fixed. Plus, this bug is filed against uprobes_onthefly.exp, and your failures are in kprobes_onthefly.exp (although your description above says it happened in the hrtimer testcase). If the kprobes_onthefly.exp problems were related to this bug, the system would have hung, you wouldn't have just gotten test failures. Yup, filed PR17831. Thank you for fixing this one. Verified. |