When running the testsuite in parallel mode, I'm fairly consistently seeing the following kernel BUG on 3.10.0-327.el7.ppc64:

========
Nov 11 14:48:47 kernel: BUG: soft lockup - CPU#15 stuck for 22s! [stapio:12395]
Nov 11 14:48:47 kernel: Modules linked in: stap_26acea073d2390b6b4b93e91ac0f9692_13862(OE) stap_0c9ef08fe0dae436a4433f8136bb310_13790(OE) stap_bc2a857b167e703bc87d6cae0b9ba9d_13731(OE) stap_b1cc2deb9fb8eb377283fff95e42020_12899(OE) stap_0bc821bb685c8518da835f4f413c3a9e_12842(OE) PROCFS_BUFFER8(OE) stap_6764e915e63f55da6e9c6bb630e1f83_12395(OE) stap_4d4f80217a831e70ed5d551c209bfdf_12296(OE) stap_1fee09fddaf3e217bc1be5bca0b707ca_21658(OE) binfmt_misc tun sg pseries_rng nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sd_mod crc_t10dif crct10dif_generic crct10dif_common ibmvscsi scsi_transport_srp ibmveth scsi_tgt dm_mirror dm_region_hash dm_log dm_mod [last unloaded: stap_f9cfa97efba72cc8292ebcc4094674e_11936]
Nov 11 14:48:47 kernel: CPU: 15 PID: 12395 Comm: stapio Tainted: G OE ------------ 3.10.0-327.el7.ppc64 #1
Nov 11 14:48:47 kernel: task: c0000005f04be910 ti: c0000005eaf6c000 task.ti: c0000005eaf6c000
Nov 11 14:48:47 kernel: NIP: d000000005dc3b80 LR: d000000005dc6e6c CTR: c0000000004b9440
Nov 11 14:48:47 kernel: REGS: c0000005eaf6f710 TRAP: 0901 Tainted: G OE ------------ (3.10.0-327.el7.ppc64)
Nov 11 14:48:47 kernel: MSR: 8000000000009032 <SF,EE,ME,IR,DR,RI> CR: 88004428 XER: 00000000
Nov 11 14:48:47 kernel: CFAR: d000000005dc3ba0 SOFTE: 1
GPR00: d000000005dc6e6c c0000005eaf6f990 d000000005dd6a58 d000000007230000
GPR04: 000000000000bcd7 00000000000001e2 f9b5fab84824369d ffffffffc2b2ae35
GPR08: 000000000001ffff d000000007844910 000000000003029b d000000005dca4b0
GPR12: c0000000004b9440 c000000007b38700
Nov 11 14:48:47 kernel: NIP [d000000005dc3b80] ._stp_map_set_ii+0x100/0x250 [stap_6764e915e63f55da6e9c6bb630e1f83_12395]
Nov 11 14:48:47 kernel: LR [d000000005dc6e6c] .probe_2893+0x11c/0x1b00 [stap_6764e915e63f55da6e9c6bb630e1f83_12395]
Nov 11 14:48:47 kernel: Call Trace:
Nov 11 14:48:47 kernel: [c0000005eaf6f990] [d000000005dc3bdc] ._stp_map_set_ii+0x15c/0x250 [stap_6764e915e63f55da6e9c6bb630e1f83_12395] (unreliable)
Nov 11 14:48:47 kernel: [c0000005eaf6fa30] [d000000005dc6e6c] .probe_2893+0x11c/0x1b00 [stap_6764e915e63f55da6e9c6bb630e1f83_12395]
Nov 11 14:48:47 kernel: [c0000005eaf6fb30] [d000000005dc674c] .enter_be_probe+0x14c/0x240 [stap_6764e915e63f55da6e9c6bb630e1f83_12395]
Nov 11 14:48:47 kernel: [c0000005eaf6fbd0] [d000000005dc9a8c] .systemtap_module_init+0x7fc/0x850 [stap_6764e915e63f55da6e9c6bb630e1f83_12395]
Nov 11 14:48:47 kernel: [c0000005eaf6fcd0] [d000000005dc9ccc] ._stp_ctl_write_cmd+0x1ec/0x644 [stap_6764e915e63f55da6e9c6bb630e1f83_12395]
Nov 11 14:48:47 kernel: [c0000005eaf6fd80] [c0000000002f089c] .SyS_write+0x14c/0x400
Nov 11 14:48:47 kernel: [c0000005eaf6fe30] [c00000000000a17c] system_call+0x38/0xb4
Nov 11 14:48:47 kernel: Instruction dump:
Nov 11 14:48:47 kernel: 7f9c4038 7b9c0020 3b9c000c 7b9c1f24 7d3ee02a 7f5ee214 2fa90000 419e0104
Nov 11 14:48:47 kernel: 3929fff0 60000000 60000000 60420000 <2fa90000> 419e0020 e9490020 7fbf5000
========

This typically happens once per testsuite run. I don't know yet which test (or set of tests) triggers this problem.
Here is a similar lockup BUG, but on the debug kernel (3.10.0-327.el7.ppc64.debug) and running the testsuite in non-parallel mode:

====
Nov 16 10:23:47 kernel: BUG: soft lockup - CPU#9 stuck for 23s! [stapio:24254]
Nov 16 10:23:47 kernel: Modules linked in: stap_7764bfaa23b11806ef1c60b256715ff_24254(OE) sg pseries_rng nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sd_mod crc_t10dif crct10dif_generic crct10dif_common ibmvscsi scsi_transport_srp ibmveth scsi_tgt dm_mirror dm_region_hash dm_log dm_mod [last unloaded: stap_1f80bbd166a09dc1edf99396614ca76_23951]
Nov 16 10:23:47 kernel: irq event stamp: 0
Nov 16 10:23:47 kernel: hardirqs last enabled at (0): [< (null)>] (null)
Nov 16 10:23:47 kernel: hardirqs last disabled at (0): [<c0000000000ce5dc>] .copy_process.part.22+0x5dc/0x1040
Nov 16 10:23:47 kernel: softirqs last enabled at (0): [<c0000000000ce5dc>] .copy_process.part.22+0x5dc/0x1040
Nov 16 10:23:47 kernel: softirqs last disabled at (0): [< (null)>] (null)
Nov 16 10:23:47 kernel: CPU: 9 PID: 24254 Comm: stapio Tainted: G OE ------------ 3.10.0-327.el7.ppc64.debug #1
Nov 16 10:23:47 kernel: task: c0000005ebe0ee70 ti: c0000005f0a80000 task.ti: c0000005f0a80000
Nov 16 10:23:47 kernel: NIP: d00000000aba4a78 LR: d00000000aba742c CTR: c00000000053e540
Nov 16 10:23:47 kernel: REGS: c0000005f0a83740 TRAP: 0901 Tainted: G OE ------------ (3.10.0-327.el7.ppc64.debug)
Nov 16 10:23:47 kernel: MSR: 8000000000009032 <SF,EE,ME,IR,DR,RI> CR: 84004422 XER: 00000007
Nov 16 10:23:47 kernel: CFAR: d00000000aba7428 SOFTE: 1
GPR00: d00000000aba742c c0000005f0a839c0 d00000000abba0f0 d00000000ce40000
GPR04: 000000000004c318 00000000000000e9 002db0ae77b1770f ffffffffc2b2ae35
GPR08: 000000000001ffff 0f670cfa7d56e501 c5ebf91ebfada51d 000000000fea18bf
GPR12: c00000000053e540 c000000007b35100
Nov 16 10:23:47 kernel: NIP [d00000000aba4a78] ._stp_map_set_ii+0xc8/0x180 [stap_7764bfaa23b11806ef1c60b256715ff_24254]
Nov 16 10:23:47 kernel: LR [d00000000aba742c] .probe_2893+0xfc/0x1a60 [stap_7764bfaa23b11806ef1c60b256715ff_24254]
Nov 16 10:23:47 kernel: Call Trace:
Nov 16 10:23:47 kernel: [c0000005f0a839c0] [d00000000aba1048] ._new_map_create+0x58/0xf0 [stap_7764bfaa23b11806ef1c60b256715ff_24254] (unreliable)
Nov 16 10:23:47 kernel: [c0000005f0a83a50] [d00000000aba742c] .probe_2893+0xfc/0x1a60 [stap_7764bfaa23b11806ef1c60b256715ff_24254]
Nov 16 10:23:47 kernel: [c0000005f0a83b20] [d00000000aba9acc] .enter_be_probe+0x1bc/0x360 [stap_7764bfaa23b11806ef1c60b256715ff_24254]
Nov 16 10:23:47 kernel: [c0000005f0a83bc0] [d00000000abaaef8] .systemtap_module_init+0xa18/0xaa0 [stap_7764bfaa23b11806ef1c60b256715ff_24254]
Nov 16 10:23:47 kernel: [c0000005f0a83cd0] [d00000000abab52c] ._stp_ctl_write_cmd+0x5ac/0x7c0 [stap_7764bfaa23b11806ef1c60b256715ff_24254]
Nov 16 10:23:47 kernel: [c0000005f0a83d80] [c00000000034cf6c] .SyS_write+0x14c/0x400
Nov 16 10:23:47 kernel: [c0000005f0a83e30] [c00000000000a188] system_call+0x38/0xb4
Nov 16 10:23:47 kernel: Instruction dump:
Nov 16 10:23:47 kernel: 7d494a78 7d4941d6 8103005c 55499b7e 7d2a5278 7d2a39d6 5524843e 7c844a78
Nov 16 10:23:47 kernel: 7c844038 78840020 3884000c 78841f24 <7d23202a> 7c832214 2fa90000 419e008c
====
I've tracked this down to the systemtap.base/optim_stats.exp testcase. I get the "soft lockup" bug on ppc64 every time I've run this test (admittedly on a very small sample).
(In reply to David Smith from comment #2)
> I've tracked this down to the systemtap.base/optim_stats.exp testcase. I get
> the "soft lockup" bug on ppc64 every time I've run this test (admittedly on
> a very small sample).

The optim_stats.exp testcase runs 2 scripts: systemtap.base/optim_stats1.stp and systemtap.base/optim_stats2.stp. The latter script causes the BUG:

====
# stap -g --suppress-time-limits ../src/testsuite/systemtap.base/optim_stats2.stp

Message from syslogd@ibm-p8-01-lp7 at Nov 18 13:59:04 ...
 kernel:BUG: soft lockup - CPU#9 stuck for 23s! [stapio:27761]
ERROR: division by 0 near operator '/' at ../src/testsuite/systemtap.base/optim_stats2.stp:48:28
IGNORE 300000000
IGNORE 300000000 29999850000000 0 199999
WARNING: Number of errors: 1, skipped probes: 0
WARNING: /usr/local/bin/staprun exited with status: 1
Pass 5: run failed. [man error::pass5]
====
I'll be inexact but terse: the testcase ensures that the aggregation operator '<<<' works faster for stats using only computationally simple operators (e.g. @count) than for stats using computationally complex operators like @variance. For a more verbose description, please refer to [1].

Currently we support 6 stat operators: @count, @sum, @min, @max, @avg, and @variance. The optimization in question relies on GCC optimizing the following inlined function based on its parameters:

=======
$ grep -A 33 __stp_stat_add runtime/stat-common.c
static inline void __stp_stat_add(Hist st, stat_data *sd, int64_t val,
				  int stat_op_count, int stat_op_sum,
				  int stat_op_min, int stat_op_max,
				  int stat_op_variance)
{
	int n;
	int delta = 0;

	sd->shift = st->bit_shift;
	sd->stat_ops = st->stat_ops;
	if (sd->count == 0) {
		sd->count = 1;
		sd->sum = sd->min = sd->max = val;
		sd->avg_s = val << sd->shift;
		sd->_M2 = 0;
	} else {
		if (stat_op_count)
			sd->count++;
		if (stat_op_sum)
			sd->sum += val;
		if (stat_op_min && (val > sd->max))
			sd->max = val;
		if (stat_op_max && (val < sd->min))
			sd->min = val;
		/*
		 * Below, we use Welford's online algorithm for computing variance.
		 * https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance
		 */
		if (stat_op_variance) {
			delta = (val << sd->shift) - sd->avg_s;
			sd->avg_s += _stp_div64(NULL, delta, sd->count);
			sd->_M2 += delta * ((val << sd->shift) - sd->avg_s);
			sd->variance_s = (sd->count < 2) ? -1
				: _stp_div64(NULL, sd->_M2, (sd->count - 1));
		}
	}
$
=======

For example, if @variance isn't being used with a given stat, stat_op_variance is set to 0, and GCC is expected to optimize the respective computations out. Looking at the above code snippet, it's easy to see that the effect of optimizing out the @variance computations is much more significant than the effect of optimizing out the other stat op computations. Of course, the effect of such optimizations is also architecture and compiler dependent.
The testcase tries to detect all the optimizations and confirm they are there. Detecting the optimizations for @count, @sum, @min, @max and @avg is relatively tricky: it's hard to distinguish their optimization effect from the noise. The test results are of low quality and the test generates lots of load. On the other hand, detecting and verifying the @variance optimization is relatively simple, so testing that makes pretty good sense.

I've been running the testcase just now in its original form, and it gives all expected passes on most of the RHEL 6 and 7 supported arches. But sometimes 'kernel:NMI watchdog: BUG: soft lockup' errors happen. And this was using the testsuite serial mode, which certainly gives better results than the parallel mode.

So, I propose to drop the first subtest (optim_stats1.stp) for the @count, @sum, @min, @max optimizations altogether, since it's "not so much fun for a lot of money", but to keep the second subtest (optim_stats2.stp) for the @variance optimization. Also, the high iteration counts in optim_stats2.stp can be lowered (the values were copied from optim_stats1.stp, but appear to be unnecessarily high). The following seems to help:

=======
$ git diff
diff --git a/testsuite/systemtap.base/optim_stats.exp b/testsuite/systemtap.base/optim_stats.exp
index e46de40..1955853 100644
--- a/testsuite/systemtap.base/optim_stats.exp
+++ b/testsuite/systemtap.base/optim_stats.exp
@@ -8,7 +8,7 @@ if {![installtest_p]} { return }
 
-for {set i 1} {$i <= 2} {incr i} {
+for {set i 2} {$i <= 2} {incr i} {
     foreach runtime [get_runtime_list] {
         if {$runtime != ""} {
             spawn stap --runtime=$runtime -g --suppress-time-limits $srcdir/$subdir/$test$i.stp
diff --git a/testsuite/systemtap.base/optim_stats2.stp b/testsuite/systemtap.base/optim_stats2.stp
index 53bbc69..65fe06d 100644
--- a/testsuite/systemtap.base/optim_stats2.stp
+++ b/testsuite/systemtap.base/optim_stats2.stp
@@ -2,9 +2,9 @@
  * Analogy to optim_stats1.stp, but for pmaps.
  * See optim_stats1.stp for comments.
  */
-@define RANDCNT %( 200000 %)
+@define RANDCNT %( 2000 %)
 @define RANDMAX %( 1000 %)
-@define ITERS %( 1500 %)
+@define ITERS %( 15 %)
 
 @define feed(agg, tagg) %(
$
=======

Thoughts?

------------------------
[1] https://sourceware.org/git/gitweb.cgi?p=systemtap.git;a=blob;f=testsuite/systemtap.base/optim_stats1.stp;h=2144b7bb210ee8f0c620487ac63fffba14e0d1bf;hb=HEAD
Would it be fair to summarize the prior comment by saying that these test cases deliberately eat CPU in order to measure the performance of those optimizations? So the "soft lockup" is only an indication of the test running a long time, not any sort of deep problem. One could, for example, reduce the loop iterations by a factor or two on slower machines. I don't know about the division-by-zero part.
(In reply to Martin Cermak from comment #4)
... stuff deleted ...
> Thoughts?

With the patch, I see no "soft lockup" BUGs on either the RHEL7 ppc64 system I've been using or on my rawhide-32 VM that also saw the BUG.

Note that on RHEL7 ppc64 I'm getting the divide-by-0 error, which optim_stats.exp doesn't expect, so you get no pass or fail when using the default linux runtime.
Guys, I am still running the testcase in a loop with various parameters so that I can see which parameters might improve its behaviour. About to finish.
Created attachment 9679 [details]
test data

Based on the attached test data, I've updated the RANDCNT value so that the testcase generates less CPU load while still keeping the test results reasonable on most arches. Commit 71aa65f2e9ee4f1d5b33c76dfb4df66600b81fcf.
Closing. Let's reopen in case of need.
I played around with this one a bit more, trying to fix the "division by zero" error. To fix it, I switched to using the monotonic clock function cpu_clock_us() instead of gettimeofday_us() on the linux runtime. I then started getting MAXNESTING errors with the dyninst runtime, so I switched the time() function to a macro. Here's the patch:

====
diff --git a/testsuite/systemtap.base/optim_stats1.stp b/testsuite/systemtap.base/optim_stats1.stp
index 4348ec2..884f753 100644
--- a/testsuite/systemtap.base/optim_stats1.stp
+++ b/testsuite/systemtap.base/optim_stats1.stp
@@ -4,17 +4,24 @@
 @define feed(agg, tagg)
 %(
-	t = time()
+	t = @time
 	foreach(k in randvals)
 		@agg <<< k
-	@tagg += time() - t
+	@tagg += @time - t
 %)
 
 global x, tx = 0, y, ty = 0
 global a, ta = 0, b, tb = 0
 global randvals[@RANDCNT]
 
-function time() { return gettimeofday_us() }
+@define time
+%(
+	%( runtime == "dyninst" %?
+		gettimeofday_us()
+	%:
+		cpu_clock_us(cpu())
+	%)
+%)
 
 probe begin
 {
diff --git a/testsuite/systemtap.base/optim_stats2.stp b/testsuite/systemtap.base/optim_stats2.stp
index 4e28bdd..4ca82a9 100644
--- a/testsuite/systemtap.base/optim_stats2.stp
+++ b/testsuite/systemtap.base/optim_stats2.stp
@@ -4,17 +4,24 @@
 @define feed(agg, tagg)
 %(
-	t = time()
+	t = @time
 	foreach(k in randvals)
 		@agg <<< k
-	@tagg += time() - t
+	@tagg += @time - t
 %)
 
 global x, tx = 0, y, ty = 0
 global a, ta = 0, b, tb = 0
 global randvals[@RANDCNT]
 
-function time() { return gettimeofday_us() }
+@define time
+%(
+	%( runtime == "dyninst" %?
+		gettimeofday_us()
+	%:
+		cpu_clock_us(cpu())
+	%)
+%)
 
 probe begin
 {
====

With these changes, I get no "division by zero" errors or MAXNESTING errors, but I do start to see a FAIL:

====
Running /root/src/testsuite/systemtap.base/optim_stats.exp ...
PASS: TEST1 (0, 0)
PASS: TEST2 (10, 45)
PASS: TEST1 dyninst (0, 2)
PASS: TEST2 dyninst (10, 38)
FAIL: TEST3 (0, -4)
PASS: TEST4 (10, 16)
PASS: TEST3 dyninst (0, 8)
PASS: TEST4 dyninst (10, 26)
====

Thoughts?
Created attachment 9681 [details]
test results

Hi David, the attached test results show the effect of the above patch on my test boxes. Actually, I have 12 more ppc64 test boxes not reported there, but they show similar results. No division by zero has been observed on any of them yet (although one of them in particular looks very close to yours). I wonder whether there is some difference in how we test.
(In reply to Martin Cermak from comment #11)
> Hi David, attached test results show the effect of the above patch on my
> test boxes. Actually, I have 12 more ppc64 test boxes not reported there,
> but showing similar results. No division by zero observed at any of them
> yet (although especially one of them looks very close to yours). I wonder
> whether there is some difference in how we test.

There is probably a difference in our source code. I'm running with a patch to the time initialization code that you don't have.

As part of my work debugging bug #20735, I updated/improved the __stp_get_freq() runtime C function (in runtime/time.c). Currently, what it does is get the cpu frequency in an arch-specific way. If there isn't arch-specific support for a particular architecture, it tries to estimate the cpu frequency. On ppc64, that estimated cpu frequency is used. While looking at that code, I decided to add some arch-specific ppc64 code that grabs the cpu frequency. Later, I found an arch-independent way of getting the cpu frequency on kernels that have CONFIG_CPU_FREQ. I've been running with that patch for about a month now, but just haven't checked it in yet.

I'd guess the more accurate cpu frequency returned by the new code is changing the values returned by the gettimeofday_us() tapset function.
Created attachment 9682 [details]
proposed __stp_init_time() patch

In case anyone is interested, here's the __stp_init_time() patch I've been testing.
Comment on attachment 9682 [details]
proposed __stp_init_time() patch

> -	write_seqlock(&time->lock);
> +	write_seqlock_irqsave(&time->lock, flags2);
> 	time->base_ns = ns;
> 	time->base_cycles = cycles;
> -	write_sequnlock(&time->lock);
> +	write_sequnlock_irqrestore(&time->lock, flags2);
> 
> 	local_irq_restore(flags);

Why do you need to save irq if it's already in a local_irq_save section?
(In reply to Josh Stone from comment #14)
> Comment on attachment 9682 [details]
> proposed __stp_init_time() patch
>
> > -	write_seqlock(&time->lock);
> > +	write_seqlock_irqsave(&time->lock, flags2);
> > 	time->base_ns = ns;
> > 	time->base_cycles = cycles;
> > -	write_sequnlock(&time->lock);
> > +	write_sequnlock_irqrestore(&time->lock, flags2);
> >
> > 	local_irq_restore(flags);
>
> Why do you need to save irq if it's already in a local_irq_save section?

Good question. Everywhere I can find in the kernel that uses *_irqsave() on 2 different locking items in the same function, that code uses *_irqsave() on both locking items.

Several months back I looked at the source of write_seqlock() vs. write_seqlock_irqsave(), and if I remember correctly the waiting method was different between the two. So, we're not using the 2nd write_seqlock_irqsave() call to disable interrupts again, but to wait in the most appropriate way once interrupts are disabled.
OK, thanks -- that deserves a brief comment at least.
David, will you check the test case improvement from comment #10 in as part of your time initialization work?
(In reply to Martin Cermak from comment #17)
> David, will you check the test case improvement from comment #10 in as part
> of your time initialization work?

I don't mind checking it in if you are comfortable with it.
Commit 0f7e138 adds those test case improvements.