This is the mail archive of the systemtap@sources.redhat.com mailing list for the systemtap project.



Re: Experiences with kprobes


Baruch Even wrote:
Hello,

Just thought I'd share my current experience with kprobes; it might interest some of you.

I'm trying to improve the performance of the Linux TCP stack (as an end host, not a router); as such, I need to measure the current performance in order to find bottlenecks.

I had a first version where I simply wrapped the calls I needed with inline rdtsc calls and added other measurements (number of packets acked per ACK packet and such). This worked beautifully; I got some nice results and some pretty good improvements as well.
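For readers who haven't done this kind of instrumentation, a minimal sketch of the inline rdtsc wrapping described above could look like the following. This is not the actual patch; read_tsc(), TIMED_CALL() and the counter names are invented for illustration (x86, 2.6-era kernel C):

#include <linux/types.h>

static inline u64 read_tsc(void)
{
	u32 lo, hi;
	/* RDTSC places the 64-bit time-stamp counter in EDX:EAX */
	asm volatile("rdtsc" : "=a" (lo), "=d" (hi));
	return ((u64)hi << 32) | lo;
}

/* Per-site accumulators; a real patch would keep one pair per
 * measurement point (these names are made up for the example). */
static u64 ack_cycles, ack_samples;

#define TIMED_CALL(expr) do {			\
	u64 __start = read_tsc();		\
	(expr);					\
	ack_cycles  += read_tsc() - __start;	\
	ack_samples++;				\
} while (0)

Dividing ack_cycles by ack_samples then gives the average cost, in clocks, of the wrapped call.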

They say "if it ain't broken don't fix it", but if it's not broken it's no fun[0], so I tried to use kprobes as a way to get the measurement code out of my current code patches. The thinking was that it will be a lot easier to maintain the patches ready for LKML submission.

I also ported my code to 2.6.11 (since that's where kprobes is available; I was on 2.6.6 before, and there are no kprobes there[1]), and got abysmal performance. After a bit of digging, the overhead of the kprobes approach was the only possible culprit: where the old method gave a timing of about 3000 clocks on my machine[2], the new one gave at least 10000 with about 3 kprobes and 3 jprobes.
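To make the comparison concrete, here is a rough sketch of what one of those probes might look like with the 2.6.11 jprobes API: a jprobe that takes an entry timestamp on tcp_ack(). This is not the actual module; the choice of tcp_ack() as the target, the handler name, and the TCP_ACK_ADDR placeholder are assumptions for illustration only.

#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/kprobes.h>

/* tcp_ack() is static in net/ipv4/tcp_input.c, so a module has to supply
 * its address by hand (e.g. from System.map); 0 here is only a placeholder. */
#define TCP_ACK_ADDR 0x0UL

static u64 ack_enter_tsc;	/* matching exit timestamp omitted in this sketch */

/* Entry handler; a real jprobe handler should mirror tcp_ack()'s signature. */
static void jtcp_ack(void)
{
	u32 lo, hi;
	asm volatile("rdtsc" : "=a" (lo), "=d" (hi));
	ack_enter_tsc = ((u64)hi << 32) | lo;
	jprobe_return();	/* a jprobe handler must end with jprobe_return() */
}

static struct jprobe tcp_ack_jprobe = {
	.entry = (kprobe_opcode_t *)jtcp_ack,
};

static int __init tcp_probe_init(void)
{
	tcp_ack_jprobe.kp.addr = (kprobe_opcode_t *)TCP_ACK_ADDR;
	return register_jprobe(&tcp_ack_jprobe);
}

static void __exit tcp_probe_exit(void)
{
	unregister_jprobe(&tcp_ack_jprobe);
}

module_init(tcp_probe_init);
module_exit(tcp_probe_exit);
MODULE_LICENSE("GPL");

Each hit on such a probe costs an int3 trap, single-stepping of the displaced instruction, and (for a jprobe) a copied stack frame plus a second trap on return, which is where the extra clocks go.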

I ported kprobes to 2.6.6 and the same performance pattern appeared in the formerly working code, leaving only the conclusion that kprobes is not suitable for this kind of performance measurement under very high loads.

Hi,


I wrote some simple tests to check the overhead of kprobes and jprobes. I have run them on Athlon and Pentium III machines, but I haven't run them on a Pentium IV; it could be that the costs are higher there. Could you give them a try on your Pentium IV machine? The following URL has an attachment with software for measuring the overhead:

http://sources.redhat.com/ml/systemtap/current/msg00093.html
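For anyone who wants a quick and dirty number without the tarball, a self-contained module along these lines can estimate the per-hit cost. This is a generic sketch, not the attached software; dummy_target(), LOOPS and the handler names are invented here. It times a loop of calls to an empty function with and without an empty-handler kprobe registered on it, and divides the difference by the iteration count:

#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/kprobes.h>
#include <asm/div64.h>

#define LOOPS 100000

static u64 rdtsc64(void)
{
	u32 lo, hi;
	asm volatile("rdtsc" : "=a" (lo), "=d" (hi));
	return ((u64)hi << 32) | lo;
}

/* Empty function that exists only so a probe can be planted on it. */
static void __attribute__((noinline)) dummy_target(void)
{
	asm volatile("" ::: "memory");
}

static int empty_pre_handler(struct kprobe *p, struct pt_regs *regs)
{
	return 0;
}

static struct kprobe kp = {
	.pre_handler = empty_pre_handler,
};

static u64 timed_loop(void)
{
	u64 start = rdtsc64();
	int i;

	for (i = 0; i < LOOPS; i++)
		dummy_target();
	return rdtsc64() - start;
}

static int __init overhead_init(void)
{
	u64 plain, probed, delta;

	plain = timed_loop();			/* baseline, no probe */

	kp.addr = (kprobe_opcode_t *)dummy_target;
	if (register_kprobe(&kp) < 0)
		return -EINVAL;
	probed = timed_loop();			/* one kprobe hit per call */
	unregister_kprobe(&kp);

	delta = probed - plain;
	do_div(delta, LOOPS);			/* average extra clocks per hit */
	printk(KERN_INFO "approx kprobe overhead: %llu clocks/hit\n", delta);
	return 0;
}

static void __exit overhead_exit(void)
{
}

module_init(overhead_init);
module_exit(overhead_exit);
MODULE_LICENSE("GPL");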

Also, is an SMP kernel or preemption being used? The current locking mechanism in kprobes serializes multiple kprobes. Is it possible that some of the overhead is due to serialization of the probes?

Another thing to consider is using OProfile to get a better picture of where the overhead is. OProfile uses NMI interrupts and can collect samples showing where the processor is spending its time while handling the kprobe. The tarball in the archived email above can collect OProfile data. However, you will want a kernel that supports OProfile collecting data with the processor's performance monitoring unit (PMU): either an SMP kernel or a UP kernel with CONFIG_X86_LOCAL_APIC set. The fallback timer-interrupt mechanism used in other UP kernels won't gather enough samples to give a good picture of what is going on.

The specifics in my case are that the tests run over a dummynet network simulating a very high speed, long distance network (about 300ms RTT and 300Mbit/s bandwidth), so the packet rates are very high, with a BDP of about 8000 packets (i.e. lots of ACK packets to process).

What kind of rate are the probes firing at? n*8000 probe firings per second? Could the delay introduced by the probes be affecting behavior?
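As a rough back-of-envelope (assuming ~1500-byte packets, which the message does not state): 300 Mbit/s divided by 1500 bytes * 8 bits is about 25,000 data packets per second, so even with delayed ACKs the sender is processing on the order of 12,000-25,000 ACKs per second. With 3 kprobes and 3 jprobes on that path, that is roughly 75,000-150,000 probe hits per second; at around a thousand extra clocks per hit (the 3000 vs. at-least-10000 clock figures above, spread over six probes), that is on the order of 10^8 extra clocks per second, a noticeable slice of the CPU.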


-Will

Baruch

[0] As a grad student, at least part of the idea is to have fun :-)

Even if you are not a graduate student, the previous line holds. :)


