This is the mail archive of the systemtap@sources.redhat.com mailing list for the systemtap project.


Re: architecture paper draft

From: William Cohen <wcohen at redhat dot com>
To: "Chen, Brad" <brad dot chen at intel dot com>
Cc: "Frank Ch. Eigler" <fche at redhat dot com>, "Stephen C. Tweedie" <sct at redhat dot com>, systemtap at sources dot redhat dot com
Date: Fri, 11 Feb 2005 11:23:58 -0500
Subject: Re: architecture paper draft
References: <75EC4D5486CAC247B84AAAA6F96AA558043696B9@orsmsx402.amr.corp.intel.com>

Chen, Brad wrote:

Frank Ch. Eigler wrote:
In addition, this method may require that the kprobes handler not be
started from an interrupt context wrapped around the "int 3" trap
(x86).  Changing this might require extensive changes to kprobes, to
perhaps insert "simple" diversionary branches into the executable
image instead of traps.  Intel folks prefer this sort of approach for
performance reasons, but we may have come across an even better reason
for it.
Thank you for noting my earlier question about interrupt overhead.
I said I would do a little homework on interrupt overhead; here it is:
	Cycle delay by CPU     Branch    Trap
	1.6 GHz Pentium 4      149       1408
	AMD Athlon 1800        38        361
	1.6 GHz Pentium M      84        541
These numbers are from the kerninst team at the University of Wisconsin; I did not verify them myself. In general it looks like a trap is 7-10x more expensive than a branch. It appears to me that kprobes requires three traps, so that would make the overall impact 20-30x more expensive.
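The ratios quoted above follow from the cycle table. A minimal Python sketch of that arithmetic, using only the figures quoted in this message; the dictionary and function names are illustrative, and the three-trap multiplier is the assumption stated above rather than a measured value:

```python
# Cycle counts as quoted in the table above: (branch, trap).
CYCLES = {
    "1.6 GHz Pentium 4": (149, 1408),
    "AMD Athlon 1800": (38, 361),
    "1.6 GHz Pentium M": (84, 541),
}


def trap_to_branch_ratio(branch_cycles, trap_cycles, traps_per_probe=1):
    """Cost of a trap-based probe relative to a single branch."""
    return traps_per_probe * trap_cycles / branch_cycles


for cpu, (branch, trap) in CYCLES.items():
    one = trap_to_branch_ratio(branch, trap)
    three = trap_to_branch_ratio(branch, trap, traps_per_probe=3)
    print(f"{cpu}: {one:.1f}x for one trap, {three:.1f}x for three traps")
```

With these figures a single trap costs roughly 6.4x to 9.5x as much as a branch, in line with the 7-10x estimate; multiplying by three traps per probe gives roughly 19x to 29x.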

Do you have a pointer to the paper containing this information? 149 cycles of branch overhead sounds rather high for a processor, even if it has to flush pipelines. Does this include the code for saving and restoring the registers? Why are 2-3 traps required? For kernel instrumentation only one is required when a probe is executed.

It looks like the Pentium 4 does much worse at traps than the other processors, and the Pentium 4 example uses the processor that has the highest overhead. Here is the table redone, assuming 1% of clock cycles are used by the overhead of the mechanism and one trap per probe.


                          branch         trap
                          samples/sec    samples/sec
pentium m (1.6ghz)        190e3          44e3
athlon 1800 (1.53ghz)     183e3          28e3
pentium 4 (1.6ghz)        107e3          11e3

For example:
  Assume a 1.6GHz Pentium 4
  Branch overhead: 149 cycles
  Overhead for one trap: about 1400 cycles
  Kprobes requires 2-3 traps
  1% overhead => 16M cycles
  trap-based instrumentation: 5000 probes per second
  branch-based instrumentation: 94000 probes per second
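The budget arithmetic behind these estimates can be sketched in a few lines of Python. This is only an illustration under the stated assumptions (a 1% cycle budget and the quoted Pentium 4 cycle counts); the function and constant names are hypothetical, and the cost of any analysis code run by the probe is ignored:

```python
def probes_per_second(clock_hz, cycles_per_probe, overhead_fraction=0.01):
    """Probe rate that keeps instrumentation within the cycle budget."""
    budget_cycles = clock_hz * overhead_fraction  # cycles/sec spent on probes
    return budget_cycles / cycles_per_probe


CLOCK_HZ = 1.6e9      # 1.6 GHz Pentium 4
BRANCH_CYCLES = 149   # quoted branch overhead
TRAP_CYCLES = 1408    # quoted overhead of one trap

rate = probes_per_second(CLOCK_HZ, BRANCH_CYCLES)
print(f"branch:    {rate:,.0f} probes/sec")
for traps in (1, 2, 3):
    rate = probes_per_second(CLOCK_HZ, traps * TRAP_CYCLES)
    print(f"{traps} trap(s): {rate:,.0f} probes/sec")
```

For the Pentium 4 figures this gives about 107e3 probes per second for a branch and about 11e3 for a single trap; with two or three traps per probe the rate falls to roughly 5.7e3 and 3.8e3.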

For many tools, most time will be spent in analysis code and this issue is irrelevant. However, if you happen to be a performance guy, and you're trying to do something even moderately aggressive in terms of higher frequency or very low overhead, this might start to matter. If this also helps to simplify some of the interrupt management issues, that's great.

I note in passing that the SPARC implementation of DTrace is reported to use branches, and their x86 implementation uses traps.

Figuring out the length of an x86 instruction is a non-trivial task. Using the int3 on x86 avoids that pain.
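To illustrate why instruction length matters: int3 is a single byte (0xCC), so it fits over the first byte of any instruction, whereas a near jump with a 32-bit displacement (0xE9 rel32) occupies five bytes and may spill across several instructions, each of which has to be identified and relocated. A rough Python sketch, in which the instruction lengths are supplied by hand as an assumption, since computing them is exactly the hard part:

```python
INT3_SIZE = 1       # int3 is the single byte 0xCC
JMP_REL32_SIZE = 5  # 0xE9 opcode plus a 4-byte displacement


def instructions_overwritten(instr_lengths, patch_size):
    """Count the whole instructions a patch of patch_size bytes clobbers.

    instr_lengths: byte lengths of consecutive instructions starting at
    the probe point (hypothetical input; a real tool needs an x86
    decoder to obtain these).
    """
    covered = 0
    for count, length in enumerate(instr_lengths, start=1):
        covered += length
        if covered >= patch_size:
            return count
    raise ValueError("not enough instruction bytes for the patch")


# Hypothetical prologue: push %ebp (1 byte), mov %esp,%ebp (2 bytes),
# sub $0x18,%esp (3 bytes).
lengths = [1, 2, 3]
print(instructions_overwritten(lengths, INT3_SIZE))       # int3 touches one instruction
print(instructions_overwritten(lengths, JMP_REL32_SIZE))  # the jump spans all three
```

The one-byte trap never needs to know where the next instruction starts; the five-byte jump does, which is the pain the int3 approach avoids.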

-Will

