This is the mail archive of the
systemtap@sourceware.org
mailing list for the systemtap project.
Re: stptracer-20060828 has released.
- From: Li Guanglei <guanglei at cn dot ibm dot com>
- To: Martin Hunt <hunt at redhat dot com>
- Cc: Masami Hiramatsu <masami dot hiramatsu dot pt at hitachi dot com>, SystemTAP <systemtap at sources dot redhat dot com>, "Jose R. Santos" <jrs at us dot ibm dot com>, Ananth N Mavinakayanahalli <ananth at in dot ibm dot com>
- Date: Thu, 07 Sep 2006 18:05:05 +0800
- Subject: Re: stptracer-20060828 has released.
- Organization: IBM CSTL
- References: <44F58A51.6060806@hitachi.com> <44F69E3C.10308@cn.ibm.com> <1157469166.5579.14.camel@dragon>
Martin Hunt wrote:
On Thu, 2006-08-31 at 16:30 +0800, Li Guanglei wrote:
The first thing I want to figure out about STPTracer is how it
performs compared with LKET. STPTracer uses an interface named gBTI
which could only print fix number integers into a pre-reserved relayfs
buffer, while LKET uses the _stp_printf with binary print support.
handlers, i.e.,
probe syscall.getsid {}
cpu 0: loops=5000000, average=1523 ns
<3> probe using lket_getsid.stp:
cpu 0: loops=5000000, average=3079 ns
<4> probe using lkst_getsid.stp:
cpu 0: loops=5000000, average=2341 ns
LKET overhead = 3079 - 1523 = 1556 ns
LKST overhead = 818 ns
difference is 738 ns
Here you can see that gBTI is around 23% faster than _stp_printf().
I'd say LKET is almost twice as fast on your machine, factoring out
probe overhead.
Looking at the source code, I can't say anything about stp_printf() vs
gbti. Certainly stp_printf cannot be taking 738 ns to simply print 4
integers.
In
fact a long time ago I used oprofile to find out where _stp_printf()
cost cpu time, and it showed that a lot time is spent on _stp_vsnprintf().
Yeah, stp_printf() is just a wrapper to call stp_vsnprintf().
When I proposed and implemented stp_printf, I did a lot of benchmarks.
So I reran them with the current code vs a very simple, fast gbti-type
interface without the headers and timestamping.
function stp_binary_write(n:long, arg1:long, arg2:long, arg3:long, arg4:long) %{
_stp_binary_write (THIS->n, THIS->arg1, THIS->arg2, THIS->arg3, THIS->arg4);
%}
static void _stp_binary_write (int num, ...)
{
va_list vargs;
int i;
int64_t *args;
if (unlikely(num > MAXBINARYARGS))
num = MAXBINARYARGS;
args = _stp_reserve_bytes(num * sizeof(int64_t));
if (args != NULL) {
va_start(vargs, num);
for (i = 0; i < num; i++) {
args[i] = va_arg(vargs, int64_t);
}
va_end(vargs);
}
}
So here are the results of my 2.6GHz P4.
SystemTap BENCH2 Tue Sep 5 10:52:37 EDT 2006
kernel: 2.6.17-1.2174_FC5smp i686
Fedora Core release 5 (Bordeaux)
dragon: 10:52:37 up 12 min, 3 users, load average: 0.71, 0.54, 0.38
processors: 2 (1 physical) Intel(R) Pentium(R) 4 CPU 2.60GHz
MemTotal: 2074472 kB MemFree: 1303368 kB
----------------------------------------------------------------
For comparison, function call overhead is 471 nsecs.
Times below are nanoseconds per probe and include kprobe overhead.
----------------------------------------------------------------
+--- S = Script, R = Runtime
|+-- * = Relayfs Threads
|| NAME 1 2
S : empty probe 601 506
S : binary printf 4 integers (%8b) 1073 899
S : binary printf 4 integers (%4b) 997 767
S : binary printf 4 integers (%b) 953 745
S : _stp_binary_write 4 integers 988 779
So you can see a slight advantage for _stp_binary_write when you are writing
64-bit values. When the sizes are smaller, stp_printf has a slight advantage.
Martin
I also reran the testing using _stp_printf() and _stp_binary_write()
to print 4 integers instead of using _lket_trace() and lkst_trace_n()
to avoid the overhead of the leading integers such as timestamp. I
still use the same application which will call getsid() repeatedly and
the nanosecond for each loop is calculated by (total execution time of
app/total loops)
Here are the scripts I used:
1.stp for testing _stp_binary_write():
function foo(a,b,c,d,e)
%{
_stp_binary_write(THIS->a, THIS->b, THIS->c, THIS->d, THIS->e);
%}
probe syscall.getsid
{
foo(4,1,2,3,4)
}
probe begin {
/* initial event */
gbti_init()
}
2.stp for testing _stp_printf():
function foo(a,b,c,d,e)
%{
_stp_printf("%8b%8b%8b%8b", THIS->b, THIS->c, THIS->d, THIS->e);
%}
probe syscall.getsid
{
foo(4,1,2,3,4)
}
Testing results:
The first data column is the nanosecond spent by a single loop of
getsid() call, the second column is the time spent on data printing
only for each probe hit.
On a Power5 LPAR
(LPAR, 1 dedicated Power5 CPU, SMP, 2G-memory, 2.6.17.9)
no probing 391
empty probe handler 1374
binary printf 4 integers (%8b) 2063 689
binary printf 4 integers (%4b) 1980 606
binary printf 4 integers (%1b) 1895 521
_stp_binary_write 4 integers 1717 343
On a ThinkPad
(T43, P-M 2GHz CPU, 1G Memory, 2.6.17.6)
no probing 213
empty probe handler 446
binary printf 4 integers (%8b) 820 374
binary printf 4 integers (%4b) 747 301
binary printf 4 integers (%1b) 714 271
_stp_binary_write 4 integers 682 236
Both Power5(ppc64) and my thinkpad show that gBTI is almost twice as
fast as _stp_printf(%8b) and although in P-M CPU is not so obvious as
Power5.
Another interesting thing is that the Power5 will take ~983ns to
execute an empty kprobe handler(1372-391) while the thinkpad only
takes ~233ns, almost *FOUR* times as fast as Power5. I guess it's
because of the kprobe-booster patch and it hasn't gone into ppc64 yet,
Right?
- Guanglei