This is the mail archive of the mailing list for the systemtap project.

Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: stptracer-20060828 has released.

Martin Hunt wrote:
On Thu, 2006-08-31 at 16:30 +0800, Li Guanglei wrote:

The first thing I want to figure out about STPTracer is how it
performs compared with LKET. STPTracer uses an interface named gBTI
which could only print fix number integers into a pre-reserved relayfs
buffer, while LKET uses the _stp_printf with binary print support.
handlers, i.e.,

       probe syscall.getsid {}
cpu 0: loops=5000000, average=1523 ns

<3> probe using lket_getsid.stp:
cpu 0: loops=5000000, average=3079 ns

<4> probe using lkst_getsid.stp:
cpu 0: loops=5000000, average=2341 ns

LKET overhead = 3079 - 1523 = 1556 ns LKST overhead = 818 ns difference is 738 ns

Here you can see that gBTI is around 23% faster than _stp_printf().

I'd say LKET is almost twice as fast on your machine, factoring out probe overhead.

Looking at the source code, I can't say anything about stp_printf() vs
gbti.  Certainly stp_printf cannot be taking 738 ns to simply print 4

fact a long time ago I used oprofile to find out where _stp_printf()
cost cpu time, and it showed that a lot time is spent on _stp_vsnprintf().

Yeah, stp_printf() is just a wrapper to call stp_vsnprintf().

When I proposed and implemented stp_printf, I did a lot of benchmarks.
So I reran them with the current code vs a very simple, fast gbti-type
interface without the headers and timestamping.

function stp_binary_write(n:long, arg1:long, arg2:long, arg3:long, arg4:long) %{
        _stp_binary_write (THIS->n, THIS->arg1, THIS->arg2, THIS->arg3, THIS->arg4);
static void _stp_binary_write (int num, ...)
        va_list vargs;
        int i;
        int64_t *args;

        if (unlikely(num > MAXBINARYARGS))
                num = MAXBINARYARGS;

args = _stp_reserve_bytes(num * sizeof(int64_t));

        if (args != NULL) {
                va_start(vargs, num);
                for (i = 0; i < num; i++) {
                        args[i] = va_arg(vargs, int64_t);

So here are the results of my 2.6GHz P4.

SystemTap BENCH2        Tue Sep  5 10:52:37 EDT 2006
kernel: 2.6.17-1.2174_FC5smp i686
Fedora Core release 5 (Bordeaux)
dragon:  10:52:37 up 12 min,  3 users,  load average: 0.71, 0.54, 0.38
processors: 2 (1 physical) Intel(R) Pentium(R) 4 CPU 2.60GHz
MemTotal:      2074472 kB       MemFree:       1303368 kB
For comparison, function call overhead is 471 nsecs.
Times below are nanoseconds per probe and include kprobe overhead.
+--- S = Script, R = Runtime
|+-- * = Relayfs                Threads
|| NAME                             1       2
S : empty probe                   601     506
S : binary printf 4 integers (%8b)       1073     899
S : binary printf 4 integers (%4b)        997     767
S : binary printf 4 integers (%b)         953     745
S : _stp_binary_write 4 integers          988     779

So you can see a slight advantage for _stp_binary_write when you are writing
64-bit values. When the sizes are smaller, stp_printf has a slight advantage.


I also reran the testing using _stp_printf() and _stp_binary_write() to print 4 integers instead of using _lket_trace() and lkst_trace_n() to avoid the overhead of the leading integers such as timestamp. I still use the same application which will call getsid() repeatedly and the nanosecond for each loop is calculated by (total execution time of app/total loops)

Here are the scripts I used:

1.stp for testing _stp_binary_write():

function foo(a,b,c,d,e)
   _stp_binary_write(THIS->a, THIS->b, THIS->c, THIS->d, THIS->e);

probe syscall.getsid

probe begin {
      /* initial event */

2.stp for testing _stp_printf():

function foo(a,b,c,d,e)
   _stp_printf("%8b%8b%8b%8b", THIS->b, THIS->c, THIS->d, THIS->e);

probe syscall.getsid

Testing results:

The first data column is the nanosecond spent by a single loop of getsid() call, the second column is the time spent on data printing only for each probe hit.

  On a Power5 LPAR
   (LPAR, 1 dedicated Power5 CPU, SMP, 2G-memory,
   no probing                       391
   empty probe handler              1374
   binary printf 4 integers (%8b)   2063    689
   binary printf 4 integers (%4b)   1980    606
   binary printf 4 integers (%1b)   1895    521
   _stp_binary_write 4 integers     1717    343

On a ThinkPad (T43, P-M 2GHz CPU, 1G Memory, no probing 213 empty probe handler 446 binary printf 4 integers (%8b) 820 374 binary printf 4 integers (%4b) 747 301 binary printf 4 integers (%1b) 714 271 _stp_binary_write 4 integers 682 236

Both Power5(ppc64) and my thinkpad show that gBTI is almost twice as fast as _stp_printf(%8b) and although in P-M CPU is not so obvious as Power5.

Another interesting thing is that the Power5 will take ~983ns to execute an empty kprobe handler(1372-391) while the thinkpad only takes ~233ns, almost *FOUR* times as fast as Power5. I guess it's because of the kprobe-booster patch and it hasn't gone into ppc64 yet, Right?

- Guanglei

Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]