> I did a through code review for return probes and found that even though > return probes are built on top of kprobes which is highly parallel, > return probes does *not* run in parallel :-(. > > On an SMP box, when multiple target function on different > cpu's tickles into trampline_probe_handler(), > all those thread on different cpus are serialized > via big spin_lock_irqsave(&kretprobe_lock, ...). [Jim] Correct. > Can we do some thing here to make return probes run in > parallel on SMP box? Any thoughts? [Jim] Yes, I think we can probably make it so kretprobe handlers can run in parallel. Each task is responsible for recycling its own kretprobe_instances, so I don't think we have to hold a lock all throughout trampoline_probe_handler(). No other task is going to sneak in and recycle our task's kretprobe_instances. One possibly tricky part is unregistering. unregister_kretprobe() assumes that every kretprobe_instance is either on the kretprobe's free list or its used (in-use) list. (If it's on the free list, it's immediately freed. Otherwise, its kretprobe pointer is set to NULL and trampoline_probe_handler() frees it when the probed function returns.) There are a couple of places where a kretprobe_instance is sort of in transition, and doesn't really belong on either list. 1. While ri->rp->handler is being run, you can't clobber ri->rp, because the handler may be using it. 2. In (TBD) user-space return probes, when we enter the probed function, there's a moment where we've verified that there's a free kretprobe_instance available to us, but we haven't yet copied and replaced the return address in user space (not really an issue on ppc64 where the return addtress is in a register)... and we can't be absolutely (100.0000000%) sure that those user-space accesses will succeed. Perhaps adding a state variable to struct kretprobe_instance would help here.
I did some testings and tried to find the potential places of SystemTap/kprobes for performance improvement. One of the results shown by these testings is about spin locks used by return probe. Testing environment: RHEL 4 U2 Kernel: 2.6.16.5 SystemTap codes: cvs coded dated 04-17 gcc version 3.4.5 20051201 (Red Hat 3.4.5-2) ppc64, LPAR with 4 virtual CPUs (thus 8-way) and 2G memory The data given by oprofile shows that spin locks routines used by kretprobe take up a lot of time. As said by Anil, kretprobe_lock make the return probe serialized on multiple CPUs. The testings show that the spin locks used by kretprobe is really too expensive. For these testings, I use stpd-notReadRelayfs.patch to eliminate the influence of data logging: stpd-notReadRelayfs.patch: --- ori/src/runtime/stpd/librelay.c 2006-04-08 17:59:36.000000000 -0400 +++ src/runtime/stpd/librelay.c 2006-04-13 15:22:31.000000000 -0400 @@ -330,6 +330,8 @@ static void *reader_thread(void *data) strerror(errno)); rc = 0; } + sleep(1); + continue; rc = read(proc_file[cpu], &status[cpu].info, sizeof(struct buf_info)); *note* addevent.syscall.entry & addevent.syscall.return are two event hooks defined in LKET(Linux Kernel Event Trace tool). Basically these two event hooks will log timestamp, pid, ppid, tid, cpuid, syscallname, hookid for the entry & return of all system calls. =============== Testing of addevent.syscall.entry =============== specjbb score: 2218 oprofile.all: samples % app name symbol name 3709035 67.0087 java (no symbols) 187750 3.3920 lglzy ._stp_vsnprintf 87234 1.5760 vmlinux-2.6.16.5.debug .strpbrk 53092 0.9592 libjvm.so cacheAllocWithInitialization 51339 0.9275 libc-2.3.4.so memset 49893 0.9014 vmlinux-2.6.16.5.debug restore 48598 0.8780 libjvm.so fastLocalMark 47751 0.8627 lglzy ._stp_print_flush vmstat: ---io--- --system-- ----cpu---- bi bo in cs us sy id wa 0 2 35 465 82 17 1 0 0 50 37 463 80 18 3 0 0 86 27 434 80 19 1 0 0 0 10 410 80 20 0 0 0 26 18 416 81 19 0 0 =============== Testing of addevent.syscall.return ============== specjbb score: 1673 oprofile.all: samples % app name symbol name 2701635 49.1351 java (no symbols) 832730 15.1450 vmlinux-2.6.16.5.debug .__spin_yield 382374 6.9543 vmlinux-2.6.16.5.debug ._spin_lock_irqsave 127950 2.3270 lglzy ._stp_vsnprintf 61853 1.1249 vmlinux-2.6.16.5.debug .strpbrk 61002 1.1095 vmlinux-2.6.16.5.debug restore 57893 1.0529 vmlinux-2.6.16.5.debug .trampoline_probe_handler vmstat: ----io--- --system-- ----cpu---- bi bo in cs us sy id wa 0 16 26 406 59 41 0 0 0 8 29 425 57 43 0 0 0 0 17 399 57 43 0 0 ========================================================== You can see that the score of syscall.entry is 2218, while the score of syscall.return is only 1673. vmstat shows that only ~19%CPU is in kernel for syscall.entry, while for syscall.return ~42%CPU is spent in kernel. The oprofile also shows that for syscall.return probe spin locks will eat up ~21% total CPU time, and only ~49% is spent on specjbb. But for syscall.entry, ~67% CPU can be spent on specjbb, and the cost of spin locks are negligible.
Created attachment 974 [details] Thoughts on finer-grained locking for return probes I don't see how we can avoid significant lock overhead, because kretprobe data structures are much more dynamic that kprobe data structures: we're constantly moving kretprobe_instances back and forth between the kretprobe's free list and kretprobe_inst_table[]. I've been thinking about this issue, though. Attached are my current thoughts on finer-grained locking.
Subject: Re: Return probes does not scale well on SMP box On 18 Apr 2006 18:09:40 -0000, jkenisto at us dot ibm dot com <sourceware-bugzilla@sourceware.org> wrote: > > ------- Additional Comments From jkenisto at us dot ibm dot com 2006-04-18 18:09 ------- > Created an attachment (id=974) > --> (http://sourceware.org/bugzilla/attachment.cgi?id=974&action=view) > Thoughts on finer-grained locking for return probes > > I don't see how we can avoid significant lock overhead, because kretprobe data > structures are much more dynamic that kprobe data structures: we're constantly > moving kretprobe_instances back and forth between the kretprobe's free list and > kretprobe_inst_table[]. > > I've been thinking about this issue, though. Attached are my current thoughts > on finer-grained locking. > why not just store data per cpu/core, there is no way that two exit probes can fire at the same time on the same cpu/core if you are in the kernel and inturupts are disabled. Then you just need one lock for when you report the data from that cpus data structure. If a return probe fires while someone else is accessing are data in the cpu's structure we should give up and note the missed the probe fire and return but this would be a very rare occurance in a properly written script. James Dickens uadmin.blogspot.com > -- > > > http://sourceware.org/bugzilla/show_bug.cgi?id=2062 > > ------- You are receiving this mail because: ------- > You are the assignee for the bug, or are watching the assignee. >
(In reply to comment #3) > why not just store data per cpu/core... ? Hi, James. Could you please clarify which kretprobe-related data structures you envision as being per-cpu? Thanks.
> why not just store data per cpu/core Storing it in per cpu might not work here as in some cases the function enter might be called on one cpu and the by the time that function returns it could have migrated to a different cpu's and hence return is executed from different cpu. So we have to do global search for the return probe instance related to that perticular task and hence per cpu/core is not feasible. However if we store the return probe instances in the task structure itself(by extending the task struct)then we can completly eliminate the locks for return probes and return probes can be made to run in parallel. We have to convince the linux community for extension of task struct. -Anil
(In reply to comment #5) > > why not just store data per cpu/core > Storing it in per cpu might not work here as in some cases the function enter > might be called on one cpu and the by the time that function returns it could > have migrated to a different cpu's and hence return is executed from different > cpu. So we have to do global search for the return probe instance related to > that perticular task and hence per cpu/core is not feasible. Correct. I suppose we could enable the caller to promise that the probed function won't be preempted, won't sleep, and won't recurse, in which case we could allocate an array of per-cpu kretprobe_instances and avoid the locking associated with the free list. What that would look like in SystemTap, I don't know. And there's still the problem of the array of kretprobe_instances possibly having to stick around after unregister_kretprobe() returns and the kretprobe disappears. Can we manage that without a lock? > > However if we store the return probe instances in the task structure itself(by > extending the task struct)then we can completly eliminate the locks for return > probes and return probes can be made to run in parallel... Well, if we could add an hlist_head to struct task_struct (How big an "if" is that?), we could throw out kretprobe_inst_table[] and the locking around that. But do you also envision a per-task pool of kretprobe_instances? If so, how big? If not, aren't we back to locking some sort of free list, at least for the sleeping and/or recursive functions? I suppose we could avoid preallocating the kretprobe instances at all, and just allocate them as needed from GFP_ATOMIC space. :-} > > -Anil Jim
> Correct. I suppose we could enable the caller to promise that the probed > function won't be preempted, won't sleep, and won't recurse, in which case we > could allocate an array of per-cpu kretprobe_instances and avoid the locking > associated with the free list. What that would look like in SystemTap, I don't > know. Asking probed function not to preempt or sleep won;t work as we will be left with no return probes on most of the kernel functions. > And there's still the problem of the array of kretprobe_instances possibly > having to stick around after unregister_kretprobe() returns and the kretprobe > disappears. Can we manage that without a lock? We can not avoid the lock if return probe instance (ri) is managed either globally or on per_cpu queue since may tasks will be running in parallel. > > > > However if we store the return probe instances in the task structure itself (by > > extending the task struct)then we can completly eliminate the locks for return > > probes and return probes can be made to run in parallel... > Well, if we could add an hlist_head to struct task_struct (How big an "if" is > that?), we could throw out kretprobe_inst_table[] and the locking around that. It is not a big deal, but some people might object that kprobes is touching core kernel data structure like task struct. Not sure about the penalty of this extra space in task struct. We need to submit a patch and see what community has to say. > But do you also envision a per-task pool of kretprobe_instances? If so, how > big? If not, aren't we back to locking some sort of free list, at least for the > sleeping and/or recursive functions? Having to allocate per-task pool would be very expensive, as every task that gets created will have to allocat the memory and release it when the task is dead no matter the retprobe gets fired or not. I guess at the first step, it should be okay to have the lock in the kretprobe prepare stage(since other than moving the kretprobe instance from free list to used list, we don;t do much holding the lock) > I suppose we could avoid preallocating the kretprobe instances at all, and just > allocate them as needed from GFP_ATOMIC space. :-} This could be other alternative, we can try this one too. -Anil
Subject: Re: Return probes does not scale well on SMP box [snipped] > It is not a big deal, but some people might object that kprobes is touching > core kernel data structure like task struct. Not sure about the penalty of > this extra space in task struct. We need to submit a patch and see what > community has to say. > perhaps it would be best to implement a new api that can be extended add one pointer (to a linked list) to the task struct that many projects can use as needed, so its just not for systemtap but any futrue project can use it as well. this may be an easier pill for the kernel guys to swallow knowing they won't have to have these special requests comming in over time. struct task_extensions int signature; /* each project would pick a signature to locate its pointer in the future */ sturct * private_project_data struct *next_task_extension > But do you also envision a per-task pool of kretprobe_instances? If so, how > > big? If not, aren't we back to locking some sort of free list, at least for > the > > sleeping and/or recursive functions? there are 32 cocurrent thread machines availible today.. in less than 2 year dual 64 concurrent thread (128 thread) boxes will be availible on the market place you don't want to redesign this thing in 4 years do you? > > Having to allocate per-task pool would be very expensive, as every task that > gets created will have to allocat the memory and release it when the task is > dead no matter the retprobe gets fired or not. using the above mentioned shared private task struct you can recycle as tasks are retired. we could even use kprobes to catch task creation and destroys to create the task structs. > > I guess at the first step, it should be okay to have the lock in the kretprobe > prepare stage(since other than moving the kretprobe instance from free list to > used list, we don;t do much holding the lock) > > > I suppose we could avoid preallocating the kretprobe instances at all, and > just > > allocate them as needed from GFP_ATOMIC space. :-} > This could be other alternative, we can try this one too. > > > -Anil > > > -- > > > http://sourceware.org/bugzilla/show_bug.cgi?id=2062 > > ------- You are receiving this mail because: ------- > You are the assignee for the bug, or are watching the assignee. >
(In reply to comment #7) > > Correct. I suppose we could enable the caller to promise that the probed > > function won't be preempted, won't sleep, and won't recurse, in which case we > > could allocate an array of per-cpu kretprobe_instances and avoid the locking > > associated with the free list. ... > > Asking probed function not to preempt or sleep won;t work as we will be left > with no return probes on most of the kernel functions. That's not what I meant. By default, we would NOT assume a "simple" function (no preemption, no sleeping, not recursion) By default, we wouldn't use per-cpu kretprobe_instances. But if the caller DOES specify that it's a simple function (e.g., by setting maxactive to -2 or some such), maybe we could do the per-cpu trick. > > > > And there's still the problem of the array of kretprobe_instances possibly > > having to stick around after unregister_kretprobe() returns and the kretprobe > > disappears. Can we manage that without a lock? > > We can not avoid the lock if return probe instance (ri) is managed either > globally or on per_cpu queue since may tasks will be running in parallel. Well, I wasn't thinking of a per-cpu queue, but rather a NR_CPUS-element array of kretprobe_instances for each kretprobe that probes a "simple" function. But again, if/when we implement a new approach, we need to pay special attention to unregister_kretprobe(); that's a likely source of trouble. ... > > > Well, if we could add an hlist_head to struct task_struct (How big an "if" is > > that?), we could throw out kretprobe_inst_table[] and the locking around > that. > It is not a big deal, but some people might object that kprobes is touching > core kernel data structure like task struct. Not sure about the penalty of > this extra space in task struct. We need to submit a patch and see what > community has to say. First verifying, of course, that that would indeed help. (I think it would, but we gotta prototype and test it to be sure.) > > > But do you also envision a per-task pool of kretprobe_instances? If so, how > > big? If not, aren't we back to locking some sort of free list, at least for > the > > sleeping and/or recursive functions? > > Having to allocate per-task pool would be very expensive, as every task that > gets created will have to allocat the memory and release it when the task is > dead no matter the retprobe gets fired or not. I wasn't suggesting a per-task pool, but thought that maybe you were. ... > -Anil Jim
If it were a matter only of accelerating *allocation* and *deallocation* of the kretprobe_instances, sure, per-cpu free_instances pools would be handy. But the code also needs to deal with *lookup*, which needs to search cross-cpus thus a shared used_instances. Perhaps the single global kretprobe_lock could be replaced with per-kretprobe locks?
(In reply to comment #8) > [snipped] > > It is not a big deal, but some people might object that kprobes is touching > > core kernel data structure like task struct... > > > perhaps it would be best to implement a new api that can be extended > add one pointer (to a linked list) to the task struct that many > projects can use as needed, so its just not for systemtap but any > futrue project can use it as well... Interesting idea. But in order for it to be useful for our particular purpose, you'd have to be able to traverse the list without grabbing a lock. That puts some potentially troublesome restrictions on how and when the list can be modified. > > But do you also envision a per-task pool of kretprobe_instances? If so, how > > > big? If not, aren't we back to locking some sort of free list, at least for > > the > > > sleeping and/or recursive functions? > > there are 32 cocurrent thread machines availible today.. in less than > 2 year dual 64 concurrent thread (128 thread) boxes will be availible > on the market place you don't want to redesign this thing in 4 years > do you? What's your point here? Thanks. Jim
(In reply to comment #10) > If it were a matter only of accelerating *allocation* and *deallocation* of the > kretprobe_instances, sure, per-cpu free_instances pools would be handy. But the > code also needs to deal with *lookup*, which needs to search cross-cpus thus a > shared used_instances. Insertion, lookup, and removal of a kretprobe_instance is always done by the task associated with the kretprobe_instance. So if each task had its own list of active kretprobe_instances, these activities wouldn't require locking. Even given the current arrangement (global hash table, hashed by task), we could lock only the current task's hash bucket, as proposed in Comment #2, and have fairly low contention. > > Perhaps the single global kretprobe_lock could be replaced with per-kretprobe locks? Yeah, Comment #2 proposes that.
Created attachment 1132 [details] Finer-grained locking for kretprobes Here's a mostly-untested, i386-only patch that I put together in my spare time quite a while ago. I meant to test it on an SMP system before posting it, but I haven't found the time. But I don't want the work to go to waste, so here it is. The patch is described in its preamble.
Created attachment 1146 [details] patch to 2.6.17.4/ppc64
Created attachment 1147 [details] testing data Jim's patch Hi Jim, I manually applied your patch to 2.6.17.4/ppc64, and tested on a 8-way ppc64 box. I use a multi-thread app which will create 8 threads and each thread will call getsid() in a loop. The tests shows that the kretprobes still doesn't scale well on SMP box. Here is the results: <1> without being probed by stap: root:/home/root/mt-getsid> ./app cpu 5: loops=5000000, average=5566 ns cpu 4: loops=5000000, average=5600 ns cpu 0: loops=5000000, average=5868 ns cpu 1: loops=5000000, average=5884 ns cpu 6: loops=5000000, average=5886 ns cpu 7: loops=5000000, average=5893 ns cpu 3: loops=5000000, average=5937 ns cpu 2: loops=5000000, average=5938 ns Total cpus: loops = 40000000, average = 5822 ns <2> probed by 'stap -e "probe syscall.getsid {}" -bM' root:/home/root/mt-getsid> ./app cpu 5: loops=5000000, average=7685 ns cpu 4: loops=5000000, average=7685 ns cpu 3: loops=5000000, average=7686 ns cpu 1: loops=5000000, average=7685 ns cpu 2: loops=5000000, average=7689 ns cpu 6: loops=5000000, average=7690 ns cpu 7: loops=5000000, average=7690 ns cpu 0: loops=5000000, average=7694 ns Total cpus: loops = 40000000, average = 7688 ns <3> probed by 'stap -e "probe syscall.getsid.return {}" -bM' root:/home/root/mt-getsid> ./app cpu 4: loops=5000000, average=19807 ns cpu 7: loops=5000000, average=19869 ns cpu 1: loops=5000000, average=24322 ns cpu 0: loops=5000000, average=24326 ns cpu 3: loops=5000000, average=27886 ns cpu 2: loops=5000000, average=27991 ns cpu 6: loops=5000000, average=28986 ns cpu 5: loops=5000000, average=29034 ns Total cpus: loops = 40000000, average = 25277 ns I also attached the patch I used, hope that I didn't miss something. I ever used oprofile to sample the kretprobe, and the sample data is similar with the data in this attachment.
(In reply to comment #14) > Created an attachment (id=1146) > patch to 2.6.17.4/ppc64 > This looks correct -- Thanks! -- except that it appears you've introduced spaces, in some places, where there should be tabs. Grep for " " (7 consecutive spaces).
(In reply to comment #15) > Created an attachment (id=1147) > testing data Jim's patch > > Hi Jim, > I manually applied your patch to 2.6.17.4/ppc64, and tested on a 8-way ppc64 > box. I use a multi-thread app which will create 8 threads and each thread will > call getsid() in a loop. The tests shows that the kretprobes still doesn't > scale well on SMP box. Here is the results: > > <1> without being probed by stap: ... > Total cpus: loops = 40000000, average = 5822 ns > > <2> probed by 'stap -e "probe syscall.getsid {}" -bM' ... > Total cpus: loops = 40000000, average = 7688 ns > > <3> probed by 'stap -e "probe syscall.getsid.return {}" -bM' ... > Total cpus: loops = 40000000, average = 25277 ns > This is troubling. A kretprobe is more expensive than a kprobe, but it shouldn't be that much more. Given uniprocessor performance ratios, I'd expect numbers in the range of 8,000-10,000 ns, not 25,000 -- if lock contention weren't a major factor. What numbers do you get when you run the syscall.getsid.return test with the "old" version of kretprobes? > I ever used oprofile to sample the kretprobe, and the sample data is similar > with the data in this attachment. > The high number for .__spin_yield() confirms that there's significant lock contention, and the high number for ._spin_lock_irqsave() (compared to ._spin_lock()) suggests that the contention is on the hash-bucket locks (kretprobe_table_locks[]) rather than the per-kretprobe lock. This is a surprise to me. My version of stap doesn't appear to create any calls to spin_lock_irqsave() in the handlers. Could you please verify that that's true for your version? Run stap with -p3 and verify that there are no calls to spin_lock_irqsave()? I did some experimentation and verified that hashing on current() provides a reasonably good distribution of hash indexes. That is, a bunch of tasks launched together but running concurrently seem to hash to different buckets. (If you run them consecutively, they tend to re-use the same task_struct and therefore the same hash bucket... but if they're not concurrent, there'd be no contention, right?) A good thing to try next is to factor SystemTap out of the experiment. I'll attach a .c file that's equivalent to "probe syscall.getsid.return {}".
Created attachment 1149 [details] kprobes module - equivalent of 'probe syscall.getsid.return {}' Guanglei, could you please re-run your test with this module inserted, rather than the SystemTap script running? Thanks.
(In reply to comment #18) > > This looks correct -- Thanks! -- except that it appears you've introduced > spaces, in some places, where there should be tabs. Grep for " " (7 > consecutive spaces). sorry, because I used mouse to copy/paste your codes (In reply to comment #18) > Created an attachment (id=1149) > kprobes module - equivalent of 'probe syscall.getsid.return {}' > > Guanglei, could you please re-run your test with this module inserted, rather > than the SystemTap script running? Thanks. I tested using stap and the module generated from getsid.c separately. I ran each test about 4 times, here are the results: ====== 2.6.17.3/ppc64/8-way, without Jim's patch ================ no probe Total cpus: loops = 40000000, average = 6202 ns kretprobe using stap: Total cpus: loops = 40000000, average = 43702 ns kretprobe using getsid.c: Total cpus: loops = 40000000, average = 36456 ns ======= 2.6.17.4/ppc64/8-way, with Jim's patch =================== kretprobe using stap: Total cpus: loops = 40000000, average = 26621 ns kretprobe using getsid.c: Total cpus: loops = 40000000, average = 24975 ns ================================================================== The tests shows that the patch could help the performance improvement( ~30% improvement), but kretprobe is still expensive.
there possibly exists recursive deadlock about kretprobe, recycle_rp_inst() is called with ri's hash-bucket locked, it maybe walks kfree(ri) function patch, if kfree funciton is probed also by kretprobe. And when kfree kretprobe is hit, it will call hash_rp_inst() function, hence recursive deadlock will occur. This problem also exists on baseline kernel version.
Bibo, This issue was addressed in the mainline (ref:http://sources.redhat.com/bugzilla/show_bug.cgi?id=2162)
Maybe I did not clarify that, kfree is called in two places, one is by free_rp_inst() function, the other is by recycle_rp_inst() function, http://sources.redhat.com/bugzilla/show_bug.cgi?id=2162 is the issue about kfree_rp_inst(), but recycle_rp_inst also calls kfree function and recycle_rp_inst is called with kretprobe_lock held.
Created attachment 1156 [details] kretprobe scale spinlock
I refreshed the patch in the attachment, and test it in IA64, test result shows that it improved multi-kprobe performance greatly if there are multi spinlock for each hash listhead. There are mainly two points about this patch: 1) change hash method, in general task_struct point is page size alignment, here changed as hash_ptr((void*)((unsigned long)tsk >> PAGE_SHIFT), bits) 2) add kretprobe instance to empty list if ri->rp==0, and free ri outof spinlock scope. The test case in the attachment is for IA64, it need modification for other architecture.
Created attachment 1157 [details] test case for multi-thread contention kretprobe This patch tests kretprobe performance by getpid() system call, generally it will link tls-libc library where pid is cache in thread info and will generate sys_getpid system call for later times. So I compile getpid.c with --static option, it will be good if there is better method.
(In reply to comment #19) > ====== 2.6.17.3/ppc64/8-way, without Jim's patch ================ > > no probe > Total cpus: loops = 40000000, average = 6202 ns > > kretprobe using stap: > Total cpus: loops = 40000000, average = 43702 ns > > kretprobe using getsid.c: > Total cpus: loops = 40000000, average = 36456 ns > > > ======= 2.6.17.4/ppc64/8-way, with Jim's patch =================== > > kretprobe using stap: > Total cpus: loops = 40000000, average = 26621 ns > > kretprobe using getsid.c: > Total cpus: loops = 40000000, average = 24975 ns This is actually pretty much what I'd expect to see. All the CPUs are hitting the same probepoint repeatedly and piling up on the same per-kretprobe lock. But the performance gain is reasonably good, and even better when stap's involved. The stap-generated handler takes longer than the empty kprobes handler, so we get more benefit from not holding a lock while the handler runs. We should see more benefit with multiple kretprobes -- e.g. probe syscall.*.return { ... } Keep the comments and improvements coming. Thanks.
Jim, would you like to arrange one clean patch for recent kernel version? I will test it on x86 and IA64 platform, basically this method has good performance improvement for kretprobe.
This has been sitting on the shelf, mostly due to my inability to find time to work on it. It appears that we have working versions for x86, ia64, and ppc64. Ports to other platforms should be just as easy. We see some performance improvements, but we need to verify that improvements are as significant as expected for cases like probe kernel.syscall.*.return { ... } Anybody who has time and inclination to pick this up, please feel free.
Jim, I've spoken to Srinivasa and he has agreed to take a look at this bug. Reassigning to him. Ananth
I have applied the patch on latest kernel with some modifications. Iam running tests on it. Testcase attached in comment#25 by bibo is working fine and but Iam having problem in syscall.*.return{....} stapscript. Iam getting soft-lockup messages. Debugging further ..... Thanks Srinivasa DS
Created attachment 1954 [details] Patch on 2.6.23-rc2 Attached patch implements Jim's idea on fine-grain locking. I have done some testing on this patch using "probe syscall.*.return" systemtap scripts(Iam attaching it here). I will ask Jim, Ananth, Prasanna, Anil to review this patch and update the bug.
Created attachment 1955 [details] Testcase to test above patch
Patch has been proposed in lkml and it is part of the upstream kernel. So closing the bug. Thanks Srinivasa DS