Bug 24884 - stapdyn crashes with a segmentation fault
Status: WAITING
Alias: None
Product: systemtap
Classification: Unclassified
Component: dyninst
Version: unspecified
Importance: P2 normal
Target Milestone: ---
Assignee: Stan Cox
 
Reported: 2019-08-05 16:42 UTC by Avi Kivity
Modified: 2020-06-19 15:02 UTC
CC List: 1 user

Last reconfirmed: 2019-12-10 00:00:00


Attachments
dyninst static marker test (919 bytes, application/x-xz-compressed-tar)
2019-12-10 21:29 UTC, Stan Cox

Description Avi Kivity 2019-08-05 16:42:59 UTC
Trying the following script:


#!/usr/bin/stap

# usage: task-histogram.stap process_name

global hist

probe process.mark("reactor_run_tasks_single_start") {
    ++hist[tid(), $arg1]
}

probe end {
    foreach ([tid, addr] in hist) {
        printf("%10d %8d 0x%x\n", hist[tid, addr], tid, addr)
    }
}


(trying to collect a histogram of tasks)

With this command line:

    stap --dyninst -x $(pgrep -x httpd) ./debug/task-histogram.stap

Crashes with

WARNING: /usr/bin/stapdyn exited with signal: 11 (Segmentation fault)


systemtap-4.1-1.fc30.x86_64
Comment 1 Frank Ch. Eigler 2019-08-23 22:20:59 UTC
Hi, Avi, sorry for not noticing this earlier.  Some questions to assist in the local reproduction of this problem:

- arch: x86-64?

- same script works in lkm (non-dyninst) mode?

- tried   stap -p4 --dyninst FOO.stp  ;   gdb --args stapdyn FOO.so
  so as to get a gdb backtrace at the crash site?

- what level of traffic is the httpd process absorbing during this time?
  (thus: how much thread / child-process changes?)

- tried targeting a program other than this httpd?
Comment 2 Avi Kivity 2019-11-12 13:08:23 UTC
Sorry for noticing _your_ comment so late. I retested with systemtap-4.1-2.fc30.x86_64, and it appears to work.

(x86_64, don't remember if I tried lkm, httpd has no forks/pthread_creates at all)
Comment 3 Avi Kivity 2019-11-12 13:11:24 UTC
The performance impact is horrendous, however: 5x slower (251k req/sec without the script, 47k with it). Does dyninst rewrite the entire program, or just the entry points to the probes?
Comment 4 Avi Kivity 2019-11-12 13:21:11 UTC
And now I get segmentation faults again.

#0  int_process::removeAllBreakpoints (this=0x55bf9c39eab8) at /usr/include/c++/9/bits/stl_tree.h:208
#1  0x00007f0c1737589f in linux_process::preTerminate (this=0x55bf9c39e820) at /usr/src/debug/dyninst-10.0.0-7.fc30.x86_64/dyninst-10.0.0/proccontrol/src/linux.C:1740
#2  0x00007f0c1733a4d1 in Dyninst::ProcControlAPI::ProcessSet::terminate (this=0x55bfb12f09e0) at /usr/src/debug/dyninst-10.0.0-7.fc30.x86_64/dyninst-10.0.0/proccontrol/src/procset.C:1644
#3  0x00007f0c172d95db in Dyninst::ProcControlAPI::Process::terminate (this=<optimized out>) at /usr/include/boost/smart_ptr/shared_ptr.hpp:732
#4  0x00007f0c17da7ab5 in PCProcess::terminateProcess (this=0x55bfa0f37cb0) at /usr/include/boost/smart_ptr/shared_ptr.hpp:732
#5  PCProcess::terminateProcess (this=0x55bfa0f37cb0) at /usr/src/debug/dyninst-10.0.0-7.fc30.x86_64/dyninst-10.0.0/dyninstAPI/src/dynProcess.C:1027
#6  0x00007f0c17db377c in PCProcess::attachProcess (progpath=..., pid=16581, analysisMode=BPatch_normalMode) at /usr/src/debug/dyninst-10.0.0-7.fc30.x86_64/dyninst-10.0.0/dyninstAPI/src/dynProcess.C:162
#7  0x00007f0c17cf269a in BPatch_process::BPatch_process(char const*, int, BPatch_hybridMode) () at /usr/src/debug/dyninst-10.0.0-7.fc30.x86_64/dyninst-10.0.0/dyninstAPI/src/BPatch_process.C:328
#8  0x00007f0c17ccf027 in BPatch::processAttach (this=<optimized out>, path=0x0, pid=16581, mode=BPatch_normalMode) at /usr/src/debug/dyninst-10.0.0-7.fc30.x86_64/dyninst-10.0.0/dyninstAPI/src/BPatch.C:1260
#9  0x000055bf9abb4519 in ?? ()
#10 0x000055bf9c362440 in ?? ()
#11 0x00007ffd51a63ec0 in ?? ()
#12 0x00000000000040c5 in probe_13397 ()
#13 0x0000000000000000 in ?? ()
Comment 5 Avi Kivity 2019-11-12 13:25:05 UTC
I think the trigger for the crash is re-attaching to a process after detaching from it.
Comment 6 Avi Kivity 2019-11-12 13:38:10 UTC
And the cause for the slowness is lock contention. With only two threads. Please please please add thread-local storage to the language.
Comment 7 Frank Ch. Eigler 2019-11-23 01:26:41 UTC
(In reply to Avi Kivity from comment #6)
> And the cause for the slowness is lock contention. With only two threads.
> Please please please add thread-local storage to the language.

It's more of a runtime issue than a language issue, but yeah.  Surely there are some optimization opportunities in what we emit for:

    probe process.mark("reactor_run_tasks_single_start") {
        ++hist[tid(), $arg1]
    }
Comment 8 Avi Kivity 2019-11-24 08:57:10 UTC
I imagine that if you notice that a key component is always tid() (except in an end probe), then you can rewrite the global map as a thread-local map, with extra magic for the end probe.

But it seems fragile: as soon as you violate one of the constraints even a tiny bit, it stops working, with no feedback to the user about what went wrong. And when it stops working, it's likely to have a huge impact on the running workload.
Comment 9 Avi Kivity 2019-12-04 10:31:56 UTC
Is there more information I can supply to help fix the segmentation fault?
Comment 10 Frank Ch. Eigler 2019-12-09 22:04:30 UTC
Stan might be able to help with the dyninst segv up in comment #4.
OTOH, there is a dyninst 10.1 build in stable updates, which would be worth retesting against.
Comment 11 Stan Cox 2019-12-10 21:29:26 UTC
Created attachment 12120 [details]
dyninst static marker test
Comment 12 Stan Cox 2019-12-10 21:31:36 UTC
I'll try with httpd; meanwhile a synthetic looping test using static markers seems to work fine:
 stap --dyninst -x $(pgrep -x tstgetline.x) ./tstgetline.stp
         3    29151 0x20a1260
with:
 dyninst-10.1.0-4.fc30.x86_64
 systemtap-4.2-1.fc30.x86_64
Comment 13 Stan Cox 2019-12-11 15:57:42 UTC
| I think the trigger for the crash is re-attaching to a process after detaching from it.

That sounds similar to bug 23513.
Comment 14 Avi Kivity 2019-12-11 16:27:30 UTC
The httpd in question is not Apache httpd. I can provide a binary (and source of course) if needed. Meanwhile I'm following the detach bug.
Comment 15 Avi Kivity 2019-12-11 16:29:03 UTC
And please^19, do make it possible to attach probes to tracepoints that are hit with very high frequency.
Comment 16 Stan Cox 2020-06-19 15:02:24 UTC
> I can provide a binary (and source of course)

Yes, please; that would be helpful.