Bug 1234 - kprobes crashes kernel on x86_64 SMP
Summary: kprobes crashes kernel on x86_64 SMP
Status: RESOLVED FIXED
Alias: None
Product: systemtap
Classification: Unclassified
Component: kprobes
Version: unspecified
Importance: P1 critical
Target Milestone: ---
Assignee: Jim Keniston
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2005-08-24 03:11 UTC by Martin Hunt
Modified: 2021-06-18 07:10 UTC
CC List: 5 users

See Also:
Host:
Target:
Build:
Last reconfirmed:


Attachments
Test module and load program (1.39 KB, application/x-gzip)
2005-08-24 03:14 UTC, Martin Hunt
Patch to fix #1234 on x86_64 (297 bytes, patch)
2005-08-27 10:36 UTC, Jim Keniston
Patch against RHEL4 U2 to fix #1234 on i386 and x86_64 (376 bytes, patch)
2005-09-01 18:16 UTC, Jim Keniston

Description Martin Hunt 2005-08-24 03:11:58 UTC
I have a module that simply puts a kprobe on sys_getuid() and another program
that calls getuid() in a tight loop.  Removing the module while the test program
is running will usually result in an immediate kernel crash on my dual-processor
64-bit Xeon system running RHEL4U2 (2.6.9-15.ELsmp). On FC4-smp kernels I have
tried, removing the module with the kprobe crashes the test program, but not the
kernel.

x86_64 up kernels and i686 kernels (both up and smp) seem OK. However I don't
have a real i686 smp system, only a hyperthreaded cpu.
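
For illustration, the user-space half of such a reproducer could look like the sketch below. It is not the attached test program: the helper name count_getuid_calls is hypothetical, and the only real API used is getuid() from <unistd.h>. Each call enters the kernel through the getuid system call, which is what makes a probe on sys_getuid() fire at a very high rate.

```c
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not taken from the attachment): call getuid()
 * `iterations` times back to back and return how many calls were made.
 * A reproducer would run this while the probing module is being
 * loaded and removed.
 */
unsigned long count_getuid_calls(unsigned long iterations)
{
    unsigned long calls = 0;
    volatile uid_t last = 0;  /* volatile keeps the compiler from dropping the calls */

    for (unsigned long i = 0; i < iterations; i++) {
        last = getuid();      /* one system call per iteration */
        calls++;
    }
    (void)last;
    return calls;
}
```

The higher the call rate, the more likely one CPU is inside the breakpoint trap at the moment another CPU unregisters the probe.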
Comment 1 Martin Hunt 2005-08-24 03:14:14 UTC
Created attachment 612
Test module and load program

Unpack the tar file and type "make". Then, as root, run "./please_crash_me".
Comment 2 Frank Ch. Eigler 2005-08-24 16:29:02 UTC
Does the equivalent systemtap script have the same effect?  Systemtap's
lifecycle control logic is intended to prevent concurrent kprobe execution and
removal.

global n
probe kernel.function("sys_getuid") { n++ }
Comment 3 Ananth N Mavinakayanahalli 2005-08-24 17:12:08 UTC
FWIW, I tried the C tarball modules on a 2proc ppc64 lpar without issues.
Comment 4 Martin Hunt 2005-08-25 08:12:22 UTC
(In reply to comment #2)
> Does the equivalent systemtap script have the same effect?  Systemtap's
> lifecycle control logic is intended to prevent concurrent kprobe execution and
> removal.

I verified it also crashes.

I have never understood your argument that you can prevent concurrent kprobe
execution and removal from outside kprobes.  The generated code only prevents
the core logic from being executed. There is still a window of vulnerability
after the kprobe is triggered and before it checks the atomic session state and
determines it should not be executing.  

Comment 5 Anil S Keshavamurthy 2005-08-25 18:27:19 UTC
I ran the test on an IA64 system running RHEL4 U2 (2.6.9-16.EL) and did not see
any problem. The test FINISHED OK.
Comment 6 Jim Keniston 2005-08-25 22:40:15 UTC
Crashes very promptly on a dual-CPU AMD-64 (elm3b30) running RHEL4 U2.

Doesn't crash on my Pentium M uniprocessor.
Comment 7 Jim Keniston 2005-08-27 10:36:21 UTC
Created attachment 625
Patch to fix #1234 on x86_64

Here's a patch that appears to fix the problem on x86_64.  With this
patch, the please_crash_me script runs to completion, producing the
expected output in /var/log/messages:

Aug 27 02:14:57 elm3b30 kernel: kprobe registered
Aug 27 02:14:59 elm3b30 kernel: sys_getuid() called 2207424 times.
Aug 27 02:14:59 elm3b30 kernel: kprobe registered
Aug 27 02:15:01 elm3b30 kernel: sys_getuid() called 2156331 times.
...
Aug 27 02:18:18 elm3b30 kernel: kprobe registered
Aug 27 02:18:20 elm3b30 kernel: sys_getuid() called 2171258 times.

This patch is intended for vanilla v2.6.13-rc5-mm1, but should
apply (perhaps with a bit of an offset) to RHEL4 U2 as well.

When a kprobe gets unregistered between when we hit the probepoint
and when we go looking for the associated kprobe object, we need to
let the CPU continue as if the probepoint hadn't been hit.  We were
trying to do that, but neglecting to set the IP back to the beginning
of the probed instruction.
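
The idea in the paragraph above can be modelled in user space. The following is a toy sketch, not the attached patch: struct toy_regs, toy_resume_if_probe_gone() and the byte array standing in for kernel text are all hypothetical. It relies only on the fact that the x86 breakpoint instruction int3 (opcode 0xCC) is one byte long, so the saved IP points one byte past the probepoint when the trap is taken.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_BREAKPOINT_INSN 0xCC  /* x86 int3 opcode, one byte long */

/* Hypothetical stand-in for the saved register state; ip is an offset into text[]. */
struct toy_regs {
    size_t ip;
};

/*
 * Called when a breakpoint trap fired but no kprobe object was found
 * for the trapping address.  If the breakpoint byte is no longer there,
 * another CPU removed the probe after we hit it: rewind ip to the start
 * of the (now restored) original instruction and report the trap as
 * handled (1).  Otherwise leave ip alone and report "not ours" (0).
 */
int toy_resume_if_probe_gone(const uint8_t *text, size_t text_len,
                             struct toy_regs *regs)
{
    if (regs->ip < 1 || regs->ip > text_len)
        return 0;                       /* ip cannot follow a byte of text[] */

    size_t addr = regs->ip - 1;         /* address where the trap was taken */

    if (text[addr] != TOY_BREAKPOINT_INSN) {
        regs->ip = addr;                /* re-execute the original instruction */
        return 1;
    }
    return 0;
}
```

Without the rewind, execution would resume one byte into the original instruction, which is consistent with the crashes reported on i386 and x86_64.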
									       
					     
We haven't been able to reproduce this bug on the other architectures.
Theoretically, though, I think every architecture's version of
kprobe_handler() needs to be fixed in this same way.  Being in the
middle of a 4-day weekend, I don't have time to do those patches and
the associated testing.  Perhaps Ananth, Prasanna, Kevin, and/or Anil
would like to take a crack at it.  If not, I'm in the phone book.

Given very bad luck, I think you could see this bug on any
multiprocessor.  Your chance of hitting it increases with the frequency
of the probe hits.  This bug crashed elm3b90 when it was running RHEL4
U2, but when it ran v2.6.13-rc5-mm1, the bug consistently just caused
an oops or two and killed the itest process.
Comment 8 Jim Keniston 2005-08-27 10:38:41 UTC
Changing resolution to FIXED.
Comment 9 Ananth N Mavinakayanahalli 2005-08-27 14:37:28 UTC
Added Anil, Prasanna and myself to the cc list.

Anil, Prasanna,

Please review Jim's fix and test it on ia64 and ia32 respectively. I'll give it
a spin on ppc64 on Monday.

Thanks,
Ananth
Comment 10 Vara Prasad 2005-08-29 04:49:15 UTC
Adding Hien to the cc list so he can follow up on the fix when he is back
from vacation.
Comment 11 Anil S Keshavamurthy 2005-08-29 16:54:36 UTC
(In reply to comment #9)
> Added Anil, Prasanna and myself to the cc list.
> Anil, Prasanna,
> Please review Jim's fix and test it on ia64 and ia32 respectively. I'll give it
> a spin on ppc64 on Monday.

The above patch looks fine to me, and the same fix is needed for ia32 as well.
IA64 does not need it: the IP is not incremented when the break instruction is
encountered, so there is nothing for IA64 to correct.

As mentioned earlier, IA64 has no problems and the test runs fine.

Does ppc64 need the above fix?
Comment 12 Ananth N Mavinakayanahalli 2005-08-29 19:20:40 UTC
I just verified that, as with IA64, we don't need this on PPC64 either. PPC64
doesn't advance the instruction pointer to the next instruction.
Comment 13 Martin Hunt 2005-08-29 21:36:51 UTC
I verified the patch fixes the problem for me.
(RHEL4U2 dual processor x86_64)
Comment 14 Prasanna S Panchamukhi 2005-08-30 09:49:34 UTC
Jim,

I verified the patch on an i386 4-way SMP box; it fixes the problem.

Thanks
prasanna
Comment 15 Jim Keniston 2005-09-01 18:16:08 UTC
Created attachment 639
Patch against RHEL4 U2 to fix #1234 on i386 and x86_64

As noted above, the problem doesn't exist for ia64 or ppc64.  Dave Miller
confirmed that it's not a problem for sparc64.  That leaves i386 and x86_64,
for which I've created and tested patches for RHEL4 U2 (attached) and v2.6.13.
The v2.6.13 patch has been accepted into the -mm tree.