Bug 2062 - Return probes do not scale well on SMP box
Summary: Return probes do not scale well on SMP box
Status: RESOLVED FIXED
Alias: None
Product: systemtap
Classification: Unclassified
Component: kprobes
Version: unspecified
Importance: P2 normal
Target Milestone: ---
Assignee: Srinivasa DS
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2005-12-16 01:09 UTC by Anil S Keshavamurthy
Modified: 2008-09-08 11:28 UTC
4 users

See Also:
Host:
Target:
Build:
Last reconfirmed:


Attachments
Thoughts on finer-grained locking for return probes (1.35 KB, text/plain)
2006-04-18 18:09 UTC, Jim Keniston
Details
Finer-grained locking for kretprobes (3.49 KB, patch)
2006-06-30 23:35 UTC, Jim Keniston
Details | Diff
patch to 2.6.17.4/ppc64 (2.80 KB, patch)
2006-07-07 09:24 UTC, Li Guanglei
Details | Diff
testing data Jim's patch (12.21 KB, text/plain)
2006-07-07 09:26 UTC, Li Guanglei
Details
kprobes module - equivalent of 'probe syscall.getsid.return {}' (473 bytes, text/plain)
2006-07-07 23:55 UTC, Jim Keniston
Details
kretprobe scale spinlock (4.11 KB, patch)
2006-07-13 06:10 UTC, bibo,mao
Details | Diff
test case for multi-thread contention kretprobe (1.48 KB, application/octet-stream)
2006-07-13 06:21 UTC, bibo,mao
Details
Patch on 2.6.23-rc2 (3.18 KB, patch)
2007-08-09 13:27 UTC, Srinivasa DS
Details | Diff
Testcase to test above patch (193 bytes, text/plain)
2007-08-09 13:29 UTC, Srinivasa DS
Details

Description Anil S Keshavamurthy 2005-12-16 01:09:33 UTC
> I did a thorough code review of return probes and found that even though
> return probes are built on top of kprobes, which is highly parallel,
> return probes do *not* run in parallel :-(.
> 
> On an SMP box, when multiple target functions on different
> CPUs hit trampoline_probe_handler(),
> all those threads on different CPUs are serialized
> via the big spin_lock_irqsave(&kretprobe_lock, ...).

[Jim] Correct.

> Can we do some thing here to make return probes run in 
> parallel on SMP box? Any thoughts?
[Jim]
Yes, I think we can probably make it so kretprobe handlers can run in
parallel.  Each task is responsible for recycling its own
kretprobe_instances, so I don't think we have to hold a lock all
throughout trampoline_probe_handler().  No other task is going to sneak
in and recycle our task's kretprobe_instances.

One possibly tricky part is unregistering.  unregister_kretprobe()
assumes that every kretprobe_instance is either on the kretprobe's free
list or its used (in-use) list.  (If it's on the free list, it's
immediately freed.  Otherwise, its kretprobe pointer is set to NULL and
trampoline_probe_handler() frees it when the probed function returns.)
There are a couple of places where a kretprobe_instance is sort of in
transition, and doesn't really belong on either list.
1. While ri->rp->handler is being run, you can't clobber ri->rp, because
the handler may be using it.
2. In (TBD) user-space return probes, when we enter the probed function,
there's a moment where we've verified that there's a free
kretprobe_instance available to us, but we haven't yet copied and
replaced the return address in user space (not really an issue on ppc64
where the return address is in a register)... and we can't be
absolutely (100.0000000%) sure that those user-space accesses will
succeed.

Perhaps adding a state variable to struct kretprobe_instance would help
here.
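To make that concrete, here is a minimal sketch of what such a state field might
look like (the enum and field names are hypothetical, not actual kprobes code; the
existing fields are abbreviated):

/* Hypothetical sketch only -- not the real kprobes code. */
enum kretprobe_inst_state {
	KRI_FREE,	/* sitting on the kretprobe's free list */
	KRI_PENDING,	/* return address hijacked, probed function not yet returned */
	KRI_HANDLING,	/* ri->rp->handler is running; don't clobber ri->rp */
};

struct kretprobe_instance {
	struct hlist_node hlist;
	struct kretprobe *rp;
	kprobe_opcode_t *ret_addr;
	struct task_struct *task;
	enum kretprobe_inst_state state;	/* the proposed addition */
};

unregister_kretprobe() could then skip (or wait out) instances in KRI_HANDLING
instead of assuming every instance is on either the free list or the used list.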
Comment 1 Li Guanglei 2006-04-18 05:35:24 UTC
I did some testing to find the potential places in SystemTap/kprobes for
performance improvement.

One of the results of this testing concerns the spin locks used by return probes.

Testing environment:

 RHEL 4 U2
 Kernel: 2.6.16.5
 SystemTap code: CVS code dated 04-17
 gcc version 3.4.5 20051201 (Red Hat 3.4.5-2)
 ppc64, LPAR with 4 virtual CPUs (thus 8-way) and 2G memory

The data given by oprofile shows that the spin lock routines used by kretprobes
take up a lot of time.

As Anil said, kretprobe_lock serializes the return probes across multiple CPUs.

The tests show that the spin locks used by kretprobes are really too expensive.
For these tests, I used stpd-notReadRelayfs.patch to eliminate the influence
of data logging:

stpd-notReadRelayfs.patch:
--- ori/src/runtime/stpd/librelay.c	2006-04-08 17:59:36.000000000 -0400
+++ src/runtime/stpd/librelay.c	2006-04-13 15:22:31.000000000 -0400
@@ -330,6 +330,8 @@ static void *reader_thread(void *data)
 				strerror(errno));
 			rc = 0;
 		}
+		sleep(1);
+		continue;
 
 		rc = read(proc_file[cpu], &status[cpu].info,
 			  sizeof(struct buf_info));


*note*
addevent.syscall.entry & addevent.syscall.return are two event hooks defined in
LKET (Linux Kernel Event Trace tool). Basically these two event hooks log the
timestamp, pid, ppid, tid, cpuid, syscall name, and hook id at the entry & return of
all system calls.

=============== Testing of addevent.syscall.entry ===============
specjbb score: 2218

oprofile.all:
samples  %        app name                 symbol name
3709035  67.0087  java                     (no symbols)
187750    3.3920  lglzy                    ._stp_vsnprintf
87234     1.5760  vmlinux-2.6.16.5.debug   .strpbrk
53092     0.9592  libjvm.so               cacheAllocWithInitialization
51339     0.9275  libc-2.3.4.so            memset
49893     0.9014  vmlinux-2.6.16.5.debug   restore
48598     0.8780  libjvm.so                fastLocalMark
47751     0.8627  lglzy                    ._stp_print_flush

vmstat:
---io---  --system-- ----cpu----
bi   bo   in    cs   us sy id wa
0     2   35   465   82 17  1  0
0    50   37   463   80 18  3  0
0    86   27   434   80 19  1  0
0     0   10   410   80 20  0  0
0    26   18   416   81 19  0  0

=============== Testing of addevent.syscall.return ==============
specjbb score: 1673

oprofile.all:
samples  %         app name                 symbol name
2701635  49.1351   java                     (no symbols)
832730   15.1450   vmlinux-2.6.16.5.debug   .__spin_yield
382374    6.9543   vmlinux-2.6.16.5.debug   ._spin_lock_irqsave
127950    2.3270   lglzy                    ._stp_vsnprintf
61853     1.1249   vmlinux-2.6.16.5.debug   .strpbrk
61002     1.1095   vmlinux-2.6.16.5.debug   restore
57893     1.0529   vmlinux-2.6.16.5.debug   .trampoline_probe_handler

vmstat:
----io--- --system-- ----cpu----
bi    bo  in    cs   us sy id wa
0    16   26   406   59 41  0  0
0     8   29   425   57 43  0  0
0     0   17   399   57 43  0  0

==========================================================
You can see that the score of syscall.entry is 2218, while the score of
syscall.return is only 1673.

vmstat shows that only ~19% CPU is spent in the kernel for syscall.entry, while for
syscall.return ~42% CPU is spent in the kernel.

oprofile also shows that with the syscall.return probe, spin locks eat up
~21% of total CPU time, and only ~49% is spent on specjbb.

But for syscall.entry, ~67% of CPU time can be spent on specjbb, and the cost of the
spin locks is negligible.
Comment 2 Jim Keniston 2006-04-18 18:09:40 UTC
Created attachment 974 [details]
Thoughts on finer-grained locking for return probes

I don't see how we can avoid significant lock overhead, because kretprobe data
structures are much more dynamic than kprobe data structures: we're constantly
moving kretprobe_instances back and forth between the kretprobe's free list and
kretprobe_inst_table[].

I've been thinking about this issue, though.  Attached are my current thoughts
on finer-grained locking.
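
The attachment itself is not reproduced here, but the split discussed in this bug
amounts to replacing the single global kretprobe_lock with a lock per kretprobe
(protecting its free list) plus a lock per hash bucket of kretprobe_inst_table[].
A rough sketch, with illustrative names only:

/* Sketch of the locking split -- not the attachment's exact code. */
struct kretprobe {
	struct kprobe kp;
	kretprobe_handler_t handler;
	int maxactive;
	int nmissed;
	spinlock_t lock;			/* protects this kretprobe's free_instances */
	struct hlist_head free_instances;
	struct hlist_head used_instances;
};

/* One lock per hash bucket instead of one global kretprobe_lock. */
static spinlock_t kretprobe_table_locks[KPROBE_TABLE_SIZE];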
Comment 3 James Dickens 2006-04-18 19:32:51 UTC
Subject: Re:  Return probes do not scale well on SMP box

Why not just store data per cpu/core? There is no way that two return
probes can fire at the same time on the same cpu/core if you are in
the kernel and interrupts are disabled. Then you just need one lock
for when you report the data from that cpu's data structure. If a
return probe fires while someone else is accessing our data in the
cpu's structure, we should give up, note the missed probe fire,
and return; but this would be a very rare occurrence in a properly
written script.

James Dickens
uadmin.blogspot.com

Comment 4 Jim Keniston 2006-04-18 20:22:35 UTC
(In reply to comment #3)
> why not just store data per cpu/core... ?

Hi, James.  Could you please clarify which kretprobe-related data structures you
envision as being per-cpu?  Thanks.
Comment 5 Anil S Keshavamurthy 2006-04-18 20:38:16 UTC
> why not just store data per cpu/core
Storing it per CPU might not work here, since in some cases the function entry
might be hit on one CPU and, by the time that function returns, the task could
have migrated to a different CPU, so the return is executed from a different
CPU. We therefore have to do a global search for the return probe instance related
to that particular task, and hence per-cpu/core storage is not feasible.

However, if we store the return probe instances in the task structure itself (by
extending the task struct), then we can completely eliminate the locks for return
probes, and return probes can be made to run in parallel. We would have to convince
the Linux community to accept the extension of task_struct.
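A minimal sketch of what that extension might look like (the field name is
hypothetical; this is not an actual patch):

/* Hypothetical task_struct extension -- sketch only. */
struct task_struct {
	/* ... existing fields ... */
#ifdef CONFIG_KPROBES
	struct hlist_head kretprobe_instances;	/* this task's active instances */
#endif
};

On function return, the trampoline handler would walk only
current->kretprobe_instances, which no other task ever touches, so lookup and
recycling would need no lock.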

-Anil
Comment 6 Jim Keniston 2006-04-18 22:04:42 UTC
(In reply to comment #5)
> > why not just store data per cpu/core
> Storing it in per cpu might not work here as in some cases the function enter 
> might be called on one cpu and the by the time that function returns it could 
> have migrated to a different cpu's and hence return is executed from different 
> cpu. So we have to do global search for the return probe instance related to 
> that perticular task and hence per cpu/core is not feasible.

Correct.  I suppose we could enable the caller to promise that the probed
function won't be preempted, won't sleep, and won't recurse, in which case we
could allocate an array of per-cpu kretprobe_instances and avoid the locking
associated with the free list.  What that would look like in SystemTap, I don't
know.

And there's still the problem of the array of kretprobe_instances possibly
having to stick around after unregister_kretprobe() returns and the kretprobe
disappears.  Can we manage that without a lock?

> 
> However if we store the return probe instances in the task structure itself(by 
> extending the task struct)then we can completly eliminate the locks for return 
> probes and return probes can be made to run in parallel...

Well, if we could add an hlist_head to struct task_struct (How big an "if" is
that?), we could throw out kretprobe_inst_table[] and the locking around that. 
But do you also envision a per-task pool of kretprobe_instances?  If so, how
big?  If not, aren't we back to locking some sort of free list, at least for the
sleeping and/or recursive functions?

I suppose we could avoid preallocating the kretprobe instances at all, and just
allocate them as needed from GFP_ATOMIC space. :-}
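For what it's worth, that alternative would amount to something like this in the
entry handler (sketch only; the variable names follow the existing code, the error
handling is illustrative):

	/* Allocate an instance on demand instead of taking one off a locked free list. */
	ri = kmalloc(sizeof(*ri), GFP_ATOMIC);
	if (!ri) {
		rp->nmissed++;	/* treat allocation failure like a missed probe */
		return 0;
	}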

> 
> -Anil

Jim

Comment 7 Anil S Keshavamurthy 2006-04-18 23:48:28 UTC
> Correct.  I suppose we could enable the caller to promise that the probed
> function won't be preempted, won't sleep, and won't recurse, in which case we
> could allocate an array of per-cpu kretprobe_instances and avoid the locking
> associated with the free list.  What that would look like in SystemTap, I don't
> know.

Asking the probed function not to be preempted or sleep won't work, as we would be
left with no return probes on most of the kernel functions.


> And there's still the problem of the array of kretprobe_instances possibly
> having to stick around after unregister_kretprobe() returns and the kretprobe
> disappears.  Can we manage that without a lock?

We cannot avoid the lock if the return probe instance (ri) is managed either
globally or on a per-cpu queue, since many tasks will be running in parallel.

> > 
> > However if we store the return probe instances in the task structure itself (by
> > extending the task struct) then we can completely eliminate the locks for return
> > probes and return probes can be made to run in parallel...

> Well, if we could add an hlist_head to struct task_struct (How big an "if" is
> that?), we could throw out kretprobe_inst_table[] and the locking around that.
It is not a big deal, but some people might object to kprobes touching a
core kernel data structure like task_struct. I am not sure about the penalty of
this extra space in the task struct. We need to submit a patch and see what the
community has to say.

> But do you also envision a per-task pool of kretprobe_instances?  If so, how
> big?  If not, aren't we back to locking some sort of free list, at least for the
> sleeping and/or recursive functions?

Having to allocate a per-task pool would be very expensive, as every task that
gets created would have to allocate the memory and release it when the task
dies, whether or not the retprobe ever gets fired.

I guess as a first step it should be okay to have the lock in the kretprobe
prepare stage (since, other than moving the kretprobe instance from the free list
to the used list, we don't do much while holding the lock).

> I suppose we could avoid preallocating the kretprobe instances at all, and just
> allocate them as needed from GFP_ATOMIC space. :-}
This could be another alternative; we can try this one too.


-Anil
Comment 8 James Dickens 2006-04-19 03:54:58 UTC
Subject: Re:  Return probes do not scale well on SMP box

[snipped]
> It is not a big deal, but some people might object that kprobes is touching
> core kernel data structure like task struct. Not sure about the penalty of
> this extra space in task struct. We need to submit a patch and see what
> community has to say.
>
Perhaps it would be best to implement a new API that can be extended:
add one pointer (to a linked list) to the task struct that many
projects can use as needed, so it's not just for SystemTap but any
future project can use it as well. This may be an easier pill for the
kernel guys to swallow, knowing they won't have to field these special
requests coming in over time.

struct task_extensions {
	int signature;				/* each project would pick a signature to
						   locate its pointer in the future */
	void *private_project_data;		/* project-specific data */
	struct task_extensions *next_task_extension;	/* next extension in the chain */
};

> > But do you also envision a per-task pool of kretprobe_instances?  If so, how
> > big?  If not, aren't we back to locking some sort of free list, at least for the
> > sleeping and/or recursive functions?

There are 32-concurrent-thread machines available today; in less than
2 years, dual 64-concurrent-thread (128 thread) boxes will be available
on the marketplace. You don't want to redesign this thing in 4 years,
do you?

>
> Having to allocate per-task pool would be very expensive, as every task that
> gets created will have to allocat the memory and release it when the task is
> dead no matter the retprobe gets fired or not.

Using the above-mentioned shared private task struct, you can recycle
as tasks are retired.  We could even use kprobes to catch task
creation and destruction to create the task structs.



>
> I guess at the first step, it should be okay to have the lock in the kretprobe
> prepare stage(since other than moving the kretprobe instance from free list to
> used list, we don;t do much holding the lock)
>
> > I suppose we could avoid preallocating the kretprobe instances at all, and
> just
> > allocate them as needed from GFP_ATOMIC space. :-}
> This could be other alternative, we can try this one too.
>
>
> -Anil
Comment 9 Jim Keniston 2006-04-19 21:44:17 UTC
(In reply to comment #7)
> > Correct.  I suppose we could enable the caller to promise that the probed
> > function won't be preempted, won't sleep, and won't recurse, in which case we
> > could allocate an array of per-cpu kretprobe_instances and avoid the locking
> > associated with the free list.  ...
> 
> Asking probed function not to preempt or sleep won;t work as we will be left 
> with no return probes on most of the kernel functions.

That's not what I meant.  By default, we would NOT assume a "simple" function
(no preemption, no sleeping, no recursion), and by default we wouldn't use per-cpu
kretprobe_instances.  But if the caller DOES specify that it's a simple function
(e.g., by setting maxactive to -2 or some such), maybe we could do the per-cpu
trick.
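
To illustrate (the -2 convention and the field name below are hypothetical, just to
make the idea concrete):

/* Sketch only: per-cpu instances for functions the caller promises are "simple". */
#define KRETPROBE_SIMPLE	-2	/* no preemption, no sleeping, no recursion */

struct kretprobe {
	/* ... existing fields ... */
	struct kretprobe_instance *percpu_inst;	/* NR_CPUS entries, used only when
						   maxactive == KRETPROBE_SIMPLE */
};

The entry handler for such a kretprobe would simply use
rp->percpu_inst[smp_processor_id()] instead of taking anything off the locked
free list.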

> 
> 
> > And there's still the problem of the array of kretprobe_instances possibly
> > having to stick around after unregister_kretprobe() returns and the kretprobe
> > disappears.  Can we manage that without a lock?
> 
> We can not avoid the lock if return probe instance (ri) is managed either 
> globally or on per_cpu queue since may tasks will be running in parallel.

Well, I wasn't thinking of a per-cpu queue, but rather a NR_CPUS-element array
of kretprobe_instances for each kretprobe that probes a "simple" function.  But
again, if/when we implement a new  approach, we need to pay special attention to
unregister_kretprobe(); that's a likely source of trouble.

...
> 
> > Well, if we could add an hlist_head to struct task_struct (How big an "if" is
> > that?), we could throw out kretprobe_inst_table[] and the locking around that.
> It is not a big deal, but some people might object that kprobes is touching 
> core kernel data structure like task struct. Not sure about the penalty of 
> this extra space in task struct. We need to submit a patch and see what 
> community has to say.

First verifying, of course, that that would indeed help.  (I think it would, but
we gotta prototype and test it to be sure.)

> 
> > But do you also envision a per-task pool of kretprobe_instances?  If so, how
> > big?  If not, aren't we back to locking some sort of free list, at least for the
> > sleeping and/or recursive functions?
> 
> Having to allocate per-task pool would be very expensive, as every task that 
> gets created will have to allocat the memory and release it when the task is 
> dead no matter the retprobe gets fired or not.

I wasn't suggesting a per-task pool, but thought that maybe you were.

...

> -Anil

Jim

Comment 10 Frank Ch. Eigler 2006-04-19 21:56:55 UTC
If it were a matter only of accelerating *allocation* and *deallocation* of
kretprobe_instances, sure, per-cpu free_instances pools would be handy.  But the
code also needs to deal with *lookup*, which must search across CPUs and thus
needs a shared used_instances list.

Perhaps the single global kretprobe_lock could be replaced with per-kretprobe locks?
Comment 11 Jim Keniston 2006-04-19 23:53:11 UTC
(In reply to comment #8)
> [snipped]
> > It is not a big deal, but some people might object that kprobes is touching
> > core kernel data structure like task struct...
> >
> perhaps it would be best to implement a new api that can be extended
> add one pointer (to a linked list) to the task struct that many
> projects can use as needed, so its just not for systemtap but any
> futrue project can use it as well...

Interesting idea.  But in order for it to be useful for our particular purpose,
you'd have to be able to traverse the list without grabbing a lock.  That puts
some potentially troublesome restrictions on how and when the list can be modified.

> > But do you also envision a per-task pool of kretprobe_instances?  If so, how
> > > big?  If not, aren't we back to locking some sort of free list, at least for the
> > > sleeping and/or recursive functions?
> 
> there are  32 cocurrent thread machines availible today.. in less than
> 2 year dual  64 concurrent thread (128 thread) boxes will be availible
> on the market place you don't want to redesign this thing in 4 years
> do you?

What's your point here?

Thanks.
Jim
Comment 12 Jim Keniston 2006-04-20 00:22:38 UTC
(In reply to comment #10)
> If it were a matter only of accelerating *allocation* and *deallocation* of the
> kretprobe_instances, sure, per-cpu free_instances pools would be handy.  But the
> code also needs to deal with *lookup*, which needs to search cross-cpus thus a
> shared used_instances.

Insertion, lookup, and removal of a kretprobe_instance is always done by the
task associated with the kretprobe_instance.  So if each task had its own list
of active kretprobe_instances, these activities wouldn't require locking.

Even given the current arrangement (global hash table, hashed by task), we could
lock only the current task's hash bucket, as proposed in Comment #2, and have
fairly low contention.

> 
> Perhaps the single global kretprobe_lock could be replaced with per-kretprobe
> locks?

Yeah, Comment #2 proposes that.
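
To make the bucket-level idea concrete, a sketch (the lock-array name is
hypothetical and irq handling is omitted for brevity):

/* Sketch only: lock just the current task's hash bucket, not a global lock. */
static struct hlist_head kretprobe_inst_table[KPROBE_TABLE_SIZE];
static spinlock_t kretprobe_table_locks[KPROBE_TABLE_SIZE];

static void add_rp_inst(struct kretprobe_instance *ri)
{
	unsigned long hash = hash_ptr(ri->task, KPROBE_HASH_BITS);

	spin_lock(&kretprobe_table_locks[hash]);
	hlist_add_head(&ri->hlist, &kretprobe_inst_table[hash]);
	spin_unlock(&kretprobe_table_locks[hash]);
}

Two tasks contend only when they happen to hash to the same bucket, and each
kretprobe's free list would be protected by its own lock, as proposed in Comment #2.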
Comment 13 Jim Keniston 2006-06-30 23:35:00 UTC
Created attachment 1132 [details]
Finer-grained locking for kretprobes

Here's a mostly-untested, i386-only patch that I put together in my spare time
quite a while ago.  I meant to test it on an SMP system before posting it, but
I haven't found the time.  But I don't want the work to go to waste, so here it
is.  The patch is described in its preamble.
Comment 14 Li Guanglei 2006-07-07 09:24:06 UTC
Created attachment 1146 [details]
patch to 2.6.17.4/ppc64
Comment 15 Li Guanglei 2006-07-07 09:26:46 UTC
Created attachment 1147 [details]
testing data Jim's patch

Hi Jim,
  I manually applied your patch to 2.6.17.4/ppc64 and tested it on an 8-way ppc64
box. I use a multi-threaded app which creates 8 threads, and each thread
calls getsid() in a loop. The tests show that kretprobes still don't
scale well on an SMP box. Here are the results:

<1> without being probed by stap:

root:/home/root/mt-getsid> ./app
cpu 5: loops=5000000, average=5566 ns
cpu 4: loops=5000000, average=5600 ns
cpu 0: loops=5000000, average=5868 ns
cpu 1: loops=5000000, average=5884 ns
cpu 6: loops=5000000, average=5886 ns
cpu 7: loops=5000000, average=5893 ns
cpu 3: loops=5000000, average=5937 ns
cpu 2: loops=5000000, average=5938 ns
Total cpus: loops = 40000000, average = 5822 ns

<2> probed by 'stap -e "probe syscall.getsid {}" -bM'

root:/home/root/mt-getsid> ./app
cpu 5: loops=5000000, average=7685 ns
cpu 4: loops=5000000, average=7685 ns
cpu 3: loops=5000000, average=7686 ns
cpu 1: loops=5000000, average=7685 ns
cpu 2: loops=5000000, average=7689 ns
cpu 6: loops=5000000, average=7690 ns
cpu 7: loops=5000000, average=7690 ns
cpu 0: loops=5000000, average=7694 ns
Total cpus: loops = 40000000, average = 7688 ns

<3> probed by 'stap -e "probe syscall.getsid.return {}" -bM'

root:/home/root/mt-getsid> ./app
cpu 4: loops=5000000, average=19807 ns
cpu 7: loops=5000000, average=19869 ns
cpu 1: loops=5000000, average=24322 ns
cpu 0: loops=5000000, average=24326 ns
cpu 3: loops=5000000, average=27886 ns
cpu 2: loops=5000000, average=27991 ns
cpu 6: loops=5000000, average=28986 ns
cpu 5: loops=5000000, average=29034 ns
Total cpus: loops = 40000000, average = 25277 ns

I also attached the patch I used; I hope I didn't miss anything.

I have also used oprofile to sample the kretprobe run, and the sample data is
similar to the data in this attachment.
Comment 16 Jim Keniston 2006-07-07 17:46:54 UTC
(In reply to comment #14)
> Created an attachment (id=1146)
> patch to 2.6.17.4/ppc64
> 

This looks correct -- Thanks! -- except that it appears you've introduced
spaces, in some places, where there should be tabs.  Grep for "       " (7
consecutive spaces).
Comment 17 Jim Keniston 2006-07-07 23:32:40 UTC
(In reply to comment #15)
> Created an attachment (id=1147)
> testing data Jim's patch
> 
> Hi Jim,
>   I manually applied your patch to 2.6.17.4/ppc64, and tested on a 8-way ppc64
> box. I use a multi-thread app which will create 8 threads and each thread will
> call getsid() in a loop. The tests shows that the kretprobes still doesn't
> scale well on SMP box. Here is the results:
> 
> <1> without being probed by stap:
...
> Total cpus: loops = 40000000, average = 5822 ns
> 
> <2> probed by 'stap -e "probe syscall.getsid {}" -bM'
...
> Total cpus: loops = 40000000, average = 7688 ns
> 
> <3> probed by 'stap -e "probe syscall.getsid.return {}" -bM'
...
> Total cpus: loops = 40000000, average = 25277 ns
> 

This is troubling.  A kretprobe is more expensive than a kprobe, but it
shouldn't be that much more.  Given uniprocessor performance ratios, I'd expect
numbers in the range of 8,000-10,000 ns, not 25,000 -- if lock contention
weren't a major factor.

What numbers do you get when you run the syscall.getsid.return test with the
"old" version of kretprobes?

> I ever used oprofile to sample the kretprobe, and the sample data is similar
> with the data in this attachment.
> 

The high number for .__spin_yield() confirms that there's significant lock
contention, and the high number for ._spin_lock_irqsave() (compared to
._spin_lock()) suggests that the contention is on the hash-bucket locks
(kretprobe_table_locks[]) rather than the per-kretprobe lock.  This is a
surprise to me.

My version of stap doesn't appear to create any calls to spin_lock_irqsave() in
the handlers.  Could you please verify that that's true for your version?  Run
stap with -p3 and verify that there are no calls to spin_lock_irqsave()?

I did some experimentation and verified that hashing on current() provides a
reasonably good distribution of hash indexes.  That is, a bunch of tasks
launched together but running concurrently seem to hash to different buckets. 
(If you run them consecutively, they tend to re-use the same task_struct and
therefore the same hash bucket... but if they're not concurrent, there'd be no
contention, right?)

A good thing to try next is to factor SystemTap out of the experiment.  I'll
attach a .c file that's equivalent to "probe syscall.getsid.return {}".
Comment 18 Jim Keniston 2006-07-07 23:55:22 UTC
Created attachment 1149 [details]
kprobes module - equivalent of 'probe syscall.getsid.return {}'

Guanglei, could you please re-run your test with this module inserted, rather
than the SystemTap script running?  Thanks.
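
For reference, such a module looks roughly like this on a more recent kernel where
kprobes can resolve the symbol by name (this is a sketch, not the attachment itself):

/* Sketch: an empty return-probe handler on sys_getsid, analogous to
 * 'probe syscall.getsid.return {}'. */
#include <linux/module.h>
#include <linux/kprobes.h>

static int noop_ret_handler(struct kretprobe_instance *ri, struct pt_regs *regs)
{
	return 0;	/* do nothing; we only want to exercise the kretprobe path */
}

static struct kretprobe getsid_rp = {
	.handler	= noop_ret_handler,
	.kp.symbol_name	= "sys_getsid",
	.maxactive	= 16,
};

static int __init getsid_rp_init(void)
{
	return register_kretprobe(&getsid_rp);
}

static void __exit getsid_rp_exit(void)
{
	unregister_kretprobe(&getsid_rp);
}

module_init(getsid_rp_init);
module_exit(getsid_rp_exit);
MODULE_LICENSE("GPL");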
Comment 19 Li Guanglei 2006-07-08 05:14:08 UTC
(In reply to comment #18)
> 
> This looks correct -- Thanks! -- except that it appears you've introduced
> spaces, in some places, where there should be tabs.  Grep for "       " (7
> consecutive spaces).

Sorry, that's because I used the mouse to copy/paste your code.

(In reply to comment #18)
> Created an attachment (id=1149)
> kprobes module - equivalent of 'probe syscall.getsid.return {}'
> 
> Guanglei, could you please re-run your test with this module inserted, rather
> than the SystemTap script running?  Thanks.

I tested using stap and the module generated from getsid.c separately. I ran
each test about 4 times; here are the results:

====== 2.6.17.3/ppc64/8-way, without Jim's patch ================

no probe
  Total cpus: loops = 40000000, average = 6202 ns

kretprobe using stap:
  Total cpus: loops = 40000000, average = 43702 ns

kretprobe using getsid.c:
  Total cpus: loops = 40000000, average = 36456 ns


======= 2.6.17.4/ppc64/8-way, with Jim's patch ===================

kretprobe using stap:
  Total cpus: loops = 40000000, average = 26621 ns

kretprobe using getsid.c:
  Total cpus: loops = 40000000, average = 24975 ns

==================================================================

The tests show that the patch helps performance (~30%
improvement), but kretprobes are still expensive.
Comment 20 bibo,mao 2006-07-11 10:45:12 UTC
There possibly exists a recursive deadlock in kretprobes: recycle_rp_inst() is
called with ri's hash bucket locked, and it may walk the kfree(ri) path
if the kfree function is itself probed by a kretprobe.

When the kfree kretprobe is hit, it will call hash_rp_inst(), and hence a
recursive deadlock will occur. This problem also exists in the baseline kernel
version.
Comment 21 Ananth Mavinakayanahalli 2006-07-11 11:14:53 UTC
Bibo, This issue was addressed in the mainline
(ref:http://sources.redhat.com/bugzilla/show_bug.cgi?id=2162)
Comment 22 bibo,mao 2006-07-12 04:34:58 UTC
Maybe I did not make that clear: kfree is called in two places, one by the
free_rp_inst() function and the other by the recycle_rp_inst() function.
http://sources.redhat.com/bugzilla/show_bug.cgi?id=2162 is about the issue in
free_rp_inst(), but recycle_rp_inst() also calls kfree, and
recycle_rp_inst() is called with kretprobe_lock held.
Comment 23 bibo,mao 2006-07-13 06:10:47 UTC
Created attachment 1156 [details]
kretprobe scale spinlock
Comment 24 bibo,mao 2006-07-13 06:12:18 UTC
I refreshed the patch in the attachment and tested it on IA64. The test results
show that it improves kretprobe performance under multi-threaded contention
greatly when there is a separate spinlock for each hash list head. There are
mainly two points in this patch:
1) Change the hash method: in general the task_struct pointer is page-size aligned,
so the hash is changed to hash_ptr((void*)((unsigned long)tsk >> PAGE_SHIFT), bits).

2) Add the kretprobe instance to an empty list if ri->rp == 0, and free ri outside
the spinlock's scope.

The test case in the attachment is for IA64; it needs modification for other
architectures.
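
To spell out point 1 (sketch; KPROBE_HASH_BITS is the bucket-count exponent kprobes
already uses):

	/* Before: task_struct pointers tend to be page aligned, so the low
	 * PAGE_SHIFT bits of tsk are always zero. */
	hash = hash_ptr(tsk, KPROBE_HASH_BITS);

	/* After (this patch): shift the constant-zero bits away first so the
	 * varying part of the pointer feeds the hash and tasks spread better
	 * across buckets (and thus across the per-bucket locks). */
	hash = hash_ptr((void *)((unsigned long)tsk >> PAGE_SHIFT), KPROBE_HASH_BITS);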
Comment 25 bibo,mao 2006-07-13 06:21:17 UTC
Created attachment 1157 [details]
test case for multi-thread contention kretprobe

This test case measures kretprobe performance using the getpid() system call.
Generally the program would link against the TLS libc, where the pid is cached in
the thread info, so later calls would not issue the sys_getpid system call.
So I compile getpid.c with the --static option; it would be good if there is a
better method.
Comment 26 Jim Keniston 2006-07-13 19:44:01 UTC
(In reply to comment #19)

> ====== 2.6.17.3/ppc64/8-way, without Jim's patch ================
> 
> no probe
>   Total cpus: loops = 40000000, average = 6202 ns
> 
> kretprobe using stap:
>   Total cpus: loops = 40000000, average = 43702 ns
> 
> kretprobe using getsid.c:
>   Total cpus: loops = 40000000, average = 36456 ns
> 
> 
> ======= 2.6.17.4/ppc64/8-way, with Jim's patch ===================
> 
> kretprobe using stap:
>   Total cpus: loops = 40000000, average = 26621 ns
> 
> kretprobe using getsid.c:
>   Total cpus: loops = 40000000, average = 24975 ns

This is actually pretty much what I'd expect to see.  All the CPUs are hitting
the same probepoint repeatedly and piling up on the same per-kretprobe lock. 
But the performance gain is reasonably good, and even better when stap's
involved.  The stap-generated handler takes longer than the empty kprobes
handler, so we get more benefit from not holding a lock while the handler runs.

We should see more benefit with multiple kretprobes -- e.g.
probe syscall.*.return { ... }

Keep the comments and improvements coming.  Thanks.
Comment 27 bibo,mao 2006-07-14 08:12:52 UTC
Jim,
   would you like to put together one clean patch against a recent kernel version?
I will test it on the x86 and IA64 platforms; basically this method gives a good
performance improvement for kretprobes.
Comment 28 Jim Keniston 2006-11-16 01:08:20 UTC
This has been sitting on the shelf, mostly due to my inability to find time to
work on it.  It appears that we have working versions for x86, ia64, and ppc64.
Ports to other platforms should be just as easy.   We see some performance
improvements, but we need to verify that improvements are as significant as
expected for cases like
probe kernel.syscall.*.return { ... }

Anybody who has time and inclination to pick this up, please feel free.
Comment 29 Ananth Mavinakayanahalli 2007-02-21 08:36:49 UTC
Jim,
I've spoken to Srinivasa and he has agreed to take a look at this bug.
Reassigning to him.

Ananth
Comment 30 Srinivasa DS 2007-02-27 14:37:10 UTC
I have applied the patch to the latest kernel with some modifications. I am running
tests on it.
The testcase attached in comment #25 by bibo is working fine, but I am having a
problem with the syscall.*.return {....} stap script.
I am getting soft-lockup messages. Debugging further .....

Thanks 
 Srinivasa DS
Comment 31 Srinivasa DS 2007-08-09 13:27:52 UTC
Created attachment 1954 [details]
Patch on 2.6.23-rc2

The attached patch implements Jim's idea of fine-grained locking.
I have done some testing on this patch using a "probe syscall.*.return"
SystemTap script (I am attaching it here).
I will ask Jim, Ananth, Prasanna, and Anil to review this patch and update the bug.
Comment 32 Srinivasa DS 2007-08-09 13:29:36 UTC
Created attachment 1955 [details]
Testcase to test above patch
Comment 33 Srinivasa DS 2008-09-08 11:28:43 UTC
The patch has been proposed on LKML and is now part of the upstream kernel, so I am
closing the bug.

Thanks
 Srinivasa DS