This is the mail archive of the systemtap@sourceware.org mailing list for the systemtap project.



Re: Hashtable


OK, the first thing to do is run your script with one addition: add
'-DSTP_ALIBI' to the stap command line. This will still compile and run
the script, but when a probe is hit it will return immediately.

This will give you an idea of the overhead of kprobes itself.
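
For example, using the map-based command line from your mail (the script
body is unchanged, abbreviated here as '...'), the alibi run would look
roughly like:

       stap -DSTP_ALIBI -D MAXSKIPPED=0 -D MAXTRYLOCK=1000000 -D TRYLOCKDELAY=10 -g -e '...'

Whatever slowdown remains with that flag comes from kprobes dispatch
itself, not from the map or handler code.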

Another option would be to remove '-DSTP_ALIBI' from the command line
and add '-t' instead. Here's a description of what that option does:

       -t     Collect timing information on the number of times probe executes
              and average amount of time spent in each probe-point. Also shows
              the derivation for each probe-point.
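
So a timing run would be roughly (again with the script abbreviated as '...'):

       stap -t -D MAXSKIPPED=0 -D MAXTRYLOCK=1000000 -D TRYLOCKDELAY=10 -g -e '...'

The per-probe hit counts and average times should be reported when the
script shuts down.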

Once you've done that, we'll know more about what is going on.


On Thu, Jul 6, 2017 at 12:50 PM, Arkady <arkady.miasnikov@gmail.com> wrote:
> P.S.2
> Convenient links for copy&paste
>
> https://gist.githubusercontent.com/larytet/10ceddea609d2da17aa09558ed0e04bc/raw/05037d536e5edf0e2f5a45282c41b8fa46d1fd55/SystemTap_tests.sh
>
>
> https://gist.githubusercontent.com/larytet/fc147587e9dfecfe99ab6bac2ba4aaa0/raw/670e385cb76798b526de9f4265046cf576c42f4e/SystemTap_tests
>
> On Thu, Jul 6, 2017 at 8:45 PM, Arkady <arkady.miasnikov@gmail.com> wrote:
>> P.S. The performance is very sensitive to TRYLOCKDELAY, which is expected.
>>
>>
>> On Thu, Jul 6, 2017 at 8:25 PM, Arkady <arkady.miasnikov@gmail.com> wrote:
>>> On Thu, Jul 6, 2017 at 7:36 PM, David Smith <dsmith@redhat.com> wrote:
>>>> On Wed, Jul 5, 2017 at 11:46 AM, Arkady <arkady.miasnikov@gmail.com> wrote:
>>>>> Hi,
>>>>>
>>>>> I have a CPU bottleneck in some situations on heavily loaded servers.
>>>>>
>>>>> From the tests it appears that associative maps contribute a significant
>>>>> part of the overhead.
>>>>
>>>> ... stuff deleted ...
>>>>
>>>> Can you show us your script (or the associative map portion) that
>>>> illustrates the performance problem? Perhaps we can make some
>>>> suggestions.
>>>>
>>> My test is a tight loop:
>>>
>>> file=echo_file_`date +%s%N`; echo $file; echo > $file
>>> counter=1; end=$((SECONDS+10))
>>> while [ $SECONDS -lt $end ]; do
>>>     echo $counter >> $file
>>>     counter=$((counter+1))
>>> done
>>> tail -n 1 $file; rm -f $file
>>>
>>> I run a number of these, one per core.
>>>
>>>
>>> My stap script is something like this (8 probes):
>>>
>>> stap -g -e '
>>> %{
>>>   long long counter;
>>>   u8 shm[256];
>>>   static void* w_shm(void);
>>>   static void* w_shm() { memset(shm, 0, sizeof(shm)); return shm; }
>>> %}
>>> probe syscall.close        { %{ { counter++; w_shm(); } %} }
>>> probe syscall.close.return { %{ { counter++; w_shm(); } %} }
>>> probe syscall.open         { %{ { counter++; w_shm(); } %} }
>>> probe syscall.open.return  { %{ { counter++; w_shm(); } %} }
>>> probe syscall.dup2         { %{ { counter++; w_shm(); } %} }
>>> probe syscall.dup2.return  { %{ { counter++; w_shm(); } %} }
>>> probe syscall.read         { %{ { counter++; w_shm(); } %} }
>>> probe syscall.read.return  { %{ { counter++; w_shm(); } %} }
>>> probe end { %{ { printk("\n%lli\n", counter); } %} }'
>>>
>>>
>>> w_shm() simulates writes to the shared memory.
>>> The performance impact is ~15% for 4 cores.
>>>
>>> I am adding a map (global ar%):
>>>
>>> stap -D MAXSKIPPED=0 -D MAXTRYLOCK=1000000 -D TRYLOCKDELAY=10 -g -e '
>>> global ar%;
>>> function w_ar() { ar[tid()] = tid(); }
>>> %{
>>>   long long counter;
>>>   u8 shm[256];
>>>   static void* w_shm(void);
>>>   static void* w_shm() { memset(shm, 0, sizeof(shm)); return shm; }
>>> %}
>>> probe syscall.close        { w_ar(); %{ { counter++; w_shm(); } %} }
>>> probe syscall.close.return { w_ar(); %{ { counter++; w_shm(); } %} }
>>> probe syscall.open         { w_ar(); %{ { counter++; w_shm(); } %} }
>>> probe syscall.open.return  { w_ar(); %{ { counter++; w_shm(); } %} }
>>> probe syscall.dup2         { w_ar(); %{ { counter++; w_shm(); } %} }
>>> probe syscall.dup2.return  { w_ar(); %{ { counter++; w_shm(); } %} }
>>> probe syscall.read         { w_ar(); %{ { counter++; w_shm(); } %} }
>>> probe syscall.read.return  { w_ar(); %{ { counter++; w_shm(); } %} }
>>> probe end { %{ { printk("\n%lli\n", counter); } %} }'
>>>
>>> I am getting a 35% hit. The overhead grows with the number of cores.
>>>
>>> The scripts roughly reflect what I am doing in the actual code. I have
>>> 1-3 associative arrays per syscall type. For example, I keep separate
>>> arrays for probe syscall.read and probe syscall.write.
>>>
>>> I have ~30 probes - I/O, networking, thread life cycle.
>>>
>>>> (Also note that I've started a background personal task to reduce the
>>>> use of locks in systemtap. I don't have much to show for it yet.)
>>>>
>>>
>>> It looks like the performance of the probes does not scale well with
>>> the number of cores: the overhead increases as more cores are added.
>>> I suspect that the spin locks at the beginning of every probe are to blame.
>>>
>>>> --
>>>> David Smith
>>>> Principal Software Engineer
>>>> Red Hat
>>>
>>> Thank you, Arkady.



-- 
David Smith
Principal Software Engineer
Red Hat

