Bug 20279 - in parallel testsuite mode, we're getting SIGSEGVs
Summary: in parallel testsuite mode, we're getting SIGSEGVs
Status: NEW
Alias: None
Product: systemtap
Classification: Unclassified
Component: runtime
Version: unspecified
Importance: P2 normal
Target Milestone: ---
Assignee: Unassigned
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-06-20 21:30 UTC by David Smith
Modified: 2016-06-21 17:32 UTC

See Also:
Host:
Target:
Build:
Last reconfirmed:


Description David Smith 2016-06-20 21:30:37 UTC
When running the testsuite in parallel mode, I see several of the following messages on the console:

[ 5619.908020] rm[61446]: unhandled signal 11 at 00003fffdc83ff90 nip 00003fffa8a51f38 lr 00003fffa8a34008 code 30001
[ 6611.824811] bz5274[54850]: unhandled signal 11 at 0000000000000000 nip 0000000000000000 lr 0000000000000000 code 30001

On RHEL7 ppc64, I see around 7 of these during a full testsuite run. I don't see these errors when the testsuite is run in non-parallel mode.

I'm not 100% sure it is related, but the testsuite also hung at the end of this run, waiting on a 'loop' process (from unprivileged_probes.exp) to be killed.
Comment 1 David Smith 2016-06-20 21:40:28 UTC
For reference, here are the 7 messages:

[ 1206.152228] times[1039]: unhandled signal 11 at ffffffffffffffff nip 00003fffa18147b4 lr 0000000010000700 code 30001
[ 1568.066789] times[18596]: unhandled signal 11 at ffffffff nip 0fd970d8 lr 1000050c code 30001
[ 2194.143278] times[9533]: unhandled signal 11 at ffffffffffffffff nip 00003fff8d1347b4 lr 0000000010000700 code 30001
[ 2409.046690] times[30608]: unhandled signal 11 at ffffffff nip 0fd970d8 lr 1000050c code 30001
[ 5619.908020] rm[61446]: unhandled signal 11 at 00003fffdc83ff90 nip 00003fffa8a51f38 lr 00003fffa8a34008 code 30001
[ 6611.824811] bz5274[54850]: unhandled signal 11 at 0000000000000000 nip 0000000000000000 lr 0000000000000000 code 30001
[ 7880.699568] stap[21140]: unhandled signal 11 at 0000000000000000 nip 0000000000000000 lr 00003fff7b2cd188 code 30001

So, this happened to several different executables: times, rm, bz5274, and stap. I'm also unsure why the kernel reported this on the console; I don't believe it normally does that.
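To tally which executables are affected across a longer console log, something like the following could help. It is only a sketch: the regular expression is inferred from the seven lines quoted above, and the function names (parse_unhandled, count_by_command) are made up for illustration. The address-width field is a rough hint only, based on the observation that some entries above print 8 hex digits and others 16.

```python
import re
from collections import Counter

# Pattern inferred from the console lines quoted in this bug; other kernels
# or architectures may format the message differently.
UNHANDLED = re.compile(
    r"\[\s*(?P<time>\d+\.\d+)\]\s+"
    r"(?P<comm>\S+)\[(?P<pid>\d+)\]: unhandled signal (?P<sig>\d+) "
    r"at (?P<addr>[0-9a-f]+) nip (?P<nip>[0-9a-f]+) "
    r"lr (?P<lr>[0-9a-f]+) code (?P<code>[0-9a-f]+)"
)


def parse_unhandled(lines):
    """Return one dict per 'unhandled signal' line; other lines are skipped."""
    records = []
    for line in lines:
        match = UNHANDLED.search(line)
        if match:
            record = match.groupdict()
            # Each hex digit is 4 bits, so 8 digits -> 32 and 16 digits -> 64.
            record["addr_bits"] = 4 * len(record["addr"])
            records.append(record)
    return records


def count_by_command(records):
    """Count how many messages each executable name produced."""
    return Counter(record["comm"] for record in records)
```

Feeding it the output of dmesg (for example, the lines of a saved console log) and printing count_by_command(...) gives a per-executable total that can be compared between parallel and non-parallel runs.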
Comment 2 Josh Stone 2016-06-21 00:10:52 UTC
FWIW, signal 11 is SIGSEGV.
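For anyone who wants to double-check the number rather than rely on memory, here is a small sketch using Python's standard signal module. The helper name signal_name is invented for this example, and the result applies to the host the script runs on.

```python
import signal


def signal_name(number):
    """Return the symbolic name the running platform assigns to a signal number."""
    return signal.Signals(number).name


if __name__ == "__main__":
    print(signal_name(11))
```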

I do seem to recall that the rlimit test triggers these on purpose.  I would not expect that to affect other tests running in parallel though.
Comment 3 David Smith 2016-06-21 13:37:50 UTC
(In reply to Josh Stone from comment #2)
> FWIW, signal 11 is SIGSEGV.

Sigh. That's what I get for relying on my memory.
 
> I do seem to recall that the rlimit test triggers these on purpose.  I would
> not expect that to affect other tests running in parallel though.

When I run the rlimit.exp test by itself, I don't see any of those messages on the console. That test will involve stap calling getrlimit/setrlimit, but that should only affect that particular stap pid (and its descendants), not other previous or future stap processes.
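The per-process scope of resource limits can be checked with a short sketch like the one below. It does not reproduce what rlimit.exp does: RLIMIT_CORE is picked here purely as a harmless stand-in (the limit the test actually touches is not stated above), and the helper names are hypothetical. A child process lowers its own soft limit, and the parent's limit is read before and after.

```python
import resource
import subprocess
import sys

# Script run in the child: lower the soft RLIMIT_CORE to 0 (always permitted,
# since it does not exceed the hard limit) and print the resulting soft limit.
CHILD_CODE = """
import resource
soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
resource.setrlimit(resource.RLIMIT_CORE, (0, hard))
print(resource.getrlimit(resource.RLIMIT_CORE)[0])
"""


def soft_limit_in_child():
    """Run the child script and return the soft limit it reports."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD_CODE],
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip())


def parent_soft_limit():
    """Return this process's current soft RLIMIT_CORE."""
    return resource.getrlimit(resource.RLIMIT_CORE)[0]


if __name__ == "__main__":
    before = parent_soft_limit()
    child = soft_limit_in_child()
    after = parent_soft_limit()
    print("parent before:", before, "child:", child, "parent after:", after)
```

The parent's value should be the same before and after the child runs, which matches the expectation that a setrlimit call in one stap process does not leak into unrelated processes.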

There is another test, bad-code.exp, that purposely sends a SIGSEGV, but when that test is run I don't see the message on the console.

I need to poke around in the kernel to see why it prints these messages, and why these SIGSEGVs are different from regular SIGSEGVs.
Comment 4 Josh Stone 2016-06-21 17:32:21 UTC
This exact "unhandled signal" message appears to be powerpc only, from _exception() in arch/powerpc/kernel/traps.c.  It looks like there are a few calls in there with SIGSEGV.