I've run stap -ve 'global pids; probe syscall.* {pids[pid()]<<<1}' on 2.6.29-0.31.rc1.git2.fc11.i686 and got below double fault. --- PANIC: double fault, gdt at c34d3000 [255 bytes] double fault, tss at c3619d80 eip = c06e4b9d, esp = f1a48f98 eax = f1a490d0, ebx = f8c5b1f0, ecx = 0000007b, edx = 00000000 esi = c0537f0e, edi = 00000000 --- I could reproduce this fault and it always happened the 2nd time of staprun. So, how to reproduce is; 1. run stap -ve 'global pids; probe syscall.* {pids[pid()]<<<1}' 2. stop it (if you see a privilege error, don't care or remove stap module by staprun -d) 3. run above command again. And "c06e4b8f T do_page_fault". c06e4b8f <do_page_fault>: c06e4b8f: 55 push %ebp c06e4b90: 89 e5 mov %esp,%ebp c06e4b92: 57 push %edi c06e4b93: 89 d7 mov %edx,%edi c06e4b95: 56 push %esi c06e4b96: 53 push %ebx c06e4b97: 81 ec 24 01 00 00 sub $0x124,%esp c06e4b9d: 89 85 d0 fe ff ff mov %eax,-0x130(%ebp)
I disabled unregister_kprobes() and tested, but same double fault happened.
This might be related to synchronize_sched() changes. Here, the syscall2.ko is the pre-compiled script of 'global pids; probe syscall.* {pids[pid()]<<<1}' *without* batch unregister. $ time staprun -d syscall2.ko real 0m0.005s user 0m0.000s sys 0m0.003s This machine has 4 cpu and syscall2.ko has 318 probes. this means when removing syscall2.ko, kprobes may call synchronize_sched() 318 times. Usually, it needs more than 100ms because it doesn't use unregister_kprobes().
Here, I checked unloading time on 2.6.28-mm1 on 2CPU machine. $ time staprun -d syscall2.ko real 0m0.340s user 0m0.000s sys 0m0.014s As you can see, real took over 340ms because synchronize_sched() waits scheduler synchronization.
even if I use a vanilla 2.6.28-rc1 kernel(use same kconfig), double fault occurred... $ time staprun -d syscall2.ko real 0m0.006s user 0m0.000s sys 0m0.003s
My deeper investigation showed that this bug isn't related to synchronize_sched()... I've gotten below result of 'staprun -d' on 2.6.29-rc1 but it caused double fault. real 0m0.629s user 0m0.000s sys 0m0.025s
Hmm, I could narrowed down this bug occurs on the 2.6.29-rc1 kernel with CONFIG_HIGHMEM4G=y, but not occur with CONFIG_HIGHMEM64G=y(and CONFIG_X86_PAE=y, of course). It is needed to check other configurations, e.g. - CONFIG_NOHIGHMEM=y case - before 2.6.28 kernel.
(In reply to comment #6) > It is needed to check other configurations, e.g. > - CONFIG_NOHIGHMEM=y case > - before 2.6.28 kernel. Unfortunately, this happened on 2.6.28 too.
(In reply to comment #7) > (In reply to comment #6) > > It is needed to check other configurations, e.g. > > - CONFIG_NOHIGHMEM=y case This config didn't cause kernel panic. So, that the problem is CONFIG_HIGHMEM4G.
Created attachment 3671 [details] First try log of git-bisect (failed) I tried to narrow it down by git-bisect. Here is the log. Unfortunately, the double fault still happened after reverting 7c88db0cb589df980acfb2f73c3595a0653004ec commit. So I'm trying bisect again...
Created attachment 3672 [details] Second try log of git-bisect (succeeded) OK, I succeeded to narrow down the bug. According to this bisect log, this bug was caused by below commit; http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=db64fe02258f1507e13fe5212a989922323685ce ... that's a big patch.
I've run stap -ve 'probe syscall.* {}' on that kernel, and it also caused kernel crash.
(In reply to comment #11) > I've run stap -ve 'probe syscall.* {}' on that kernel, and it also caused kernel > crash. When I ran a module which just register/unregister 317 kprobes(based on syscall.*), it also crashed kernel. So this bug is kprobes bug or kernel bug, not systemtap.
(In reply to comment #12) > When I ran a module which just register/unregister 317 kprobes(based on > syscall.*), it also crashed kernel. So this bug is kprobes bug or kernel bug, > not systemtap. If I commented out kprobes related functions, it didn't cause kernel crash.
Created attachment 3682 [details] kconfig for reproducing this bug This bug still occur on 2.6.29-rc2 with attached kconfig.
kprobes uses vmap/vunmap only in text_poke(). So this function should trigger crash. When I tested to change vmap/vunmap to vm_map_ram/vm_unmap_ram, the kernel didn't crash.
(In reply to comment #14) > This bug still occur on 2.6.29-rc2 with attached kconfig. > Oops, this was my test failure(forgot to update test program). sorry for confusion, this bug seems not to be reproducable on 2.6.29-rc2.
(In reply to comment #16) > Oops, this was my test failure(forgot to update test program). > sorry for confusion, this bug seems not to be reproducable on 2.6.29-rc2. Hmm, from the result of deeper (bisect) investigation, this issue hasn't fixed yet on 2.6.29-rc2. Sorry about that. I'll post a bugfix patch. however, I still can't find what causes double fault.
Created attachment 3689 [details] [PATCH] text_poke side bugfix This patch fixes this issue. However, this patch might not fix the root cause of this issue ultimately, because I haven't find that the mechanism of this issue yet. This patch just modifies text_poke() to work as in old days.
If this is a problem with vmap and its interaction with text_poke(), I guess other users of text_poke will also run into problems. Maybe its good to bring this issue to Mathieu's notice? Maybe its time to move this issue to lkml.
(In reply to comment #19) > If this is a problem with vmap and its interaction with text_poke(), I guess > other users of text_poke will also run into problems. Maybe its good to bring > this issue to Mathieu's notice? Actually, other users of text_poke() are "alternative" functions which changes spinlock code and so on. Since they have done while booting kernel, kernel might be not affected so much. anyway, I agree that this issue is worth reporting to Mathieu and Nick Piggin. > Maybe its time to move this issue to lkml. Sure. Thanks,
Hmm, I've found one cause of this bug helped by kdump & crash (thanks!) I found that repeating page faults caused by combination of kprobe_fault_handler and pte-fixup which is a special routine for x86 kernel with CONFIG_HIGHMEM4G. Double fault and memory corruptions were collaterally caused by stack overflow. f0e11e80: f0e11e90 trace_hardirqs_off_caller+24 kpfh kp+12480 f0e11e90: f0e11ef4 kpfh kp+12480 do_page_fault f0e11ea0: f0e11ef4 /error_code+119/ {kpfh 0000000e f0e11eb0: [selinux_inode_security] kp+12480 [selinux_inode_security] f0e11ef4 f0e11ec0: kp+12480 0000007b 0000007b 000000d8 f0e11ed0: ffffffff kpfh(%eip) 00000060 00010082} <Nth page fault occurred on fault handler...> f0e11ee0: kprobe_fault_handler+167 0000000e kpfh kpfh f0e11ef0: 00000000 [selinux_inode_security] do_page_fault+123 [selinux_inode_se curity] f0e11f00: f4949434 f00729c0 f4949400 00000000 f0e11f10: 00000000 00000000 00000000 00000000 f0e11f20: 00000000 00000000 00000000 00000000 f0e11f30: 00000000 00000000 00000000 00000000 f0e11f40: 00000000 00000000 00000000 00000000 f0e11f50: 00000000 00000000 00000000 00000000 f0e11f60: 00000000 00000000 00000000 00000000 f0e11f70: 00000000 00000000 00000000 00000000 ... f0e17ce0: trace_hardirqs_off_caller+24 kpfh kp+12480 f0e17d50 f0e17cf0: kpfh kp+12480 do_page_fault f0e17d50 f0e17d00: /error_code+119/ {kpfh 0000000e f0e17e90 f0e17d10: kp+12480 f0e17e90 f0e17d50 kp+12480 f0e17d20: 012d007b 0000007b [skbuff_fclone_cache] ffffffff f0e17d30: kpfh(%eip) 00000060 00010082}(regs) <2nd page fault occurred on fault handler due to pte-fixup!> kprobe_fault_handler+167 f0e17d40: 0000000e kp+12480 kph 00000000 f0e17d50: f0e17e88 do_page_fault+123 f0e17e90 f4949434 f0e17d60: f00729c0 f4949400 ---<below 256bytes are previous data on stack>--- 000005a8 00000000 f0e17d70: 00000001 00000003 ac46cfa5 [TCP] f0e17d80: 00000246 f0e17dc0 00000246 f0e17db0 f0e17d90: 00000246 release_sock+40 f0e17da8 00000002 f0e17da0: 00000000 f4949464 f00729c0 00000002 f0e17db0: 00000000 f4949464 f00729c0 f0e17dc4 f0e17dc0: local_bh_enable_ip+8 f0e17dd0 00000001 00000000 f0e17dd0: rcu_lock_map f00729c0 [skbuff_fclone_cache] [skbuff_fclone_cache] f0e17de0: 00000246 f4949464 f0e17e64 f0e17e10 f0e17df0: 00000246 f4949464 f0e17e70 f0e17e20 f0e17e00: 00000246 00000001 00000002 00000000 f0e17e10: 00000000 might_fault+67 00000246 00000000 f0e17e20: 00000246 0000000e b9df0980 0000000e f0e17e30: b9df0980 f0e17e70 f0e17e4c set_fd_set+41 f0e17e40: 00000004 00000001 0000000e f0e17f84 f0e17e50: core_sys_select+418 b9df0980 b9df0990 f0e17e60 f0e17e60: 00002748 00000000 ---- f0e17e78 trace_hardirqs_off_caller+24 f0e17e70: kp+12480 f0e17f7c f0e17ee0 kp+12480(%ebx) f0e17e80: f0e17f7c(%esi) do_page_fault(%edi) f0e17ee0(%ebp) /error_code+119/ f0e17e90: {kp+12480 kph f0e17f7c f0e17f7c f0e17ea0: cbf6d5e0 f0e17ee0 kp+12480 0000007b f0e17eb0: 0000007b 000000d8 ffffffff(orig_ax) kph(%eip) f0e17ec0: 00000060 00010086(flags)}(regs) <page fault occurred, due to pte fixup!> ---<this is normal kprobe handler call>--- /kprobe_exceptions_notify+467/ f0e17f74 f0e17ed0: do_sync_write+171 kprobe_exceptions_nb ffffffff 00000000 f0e17ee0: f0e17f00 notifier_call_chain+43 f0e17f44 00000002 f0e17ef0: trace_die_notifier die_chain+36 f0e17f44 00000002 f0e17f00: f0e17f2c __atomic_notifier_call_chain+57 ffffffff 00000000 f0e17f10: 00000002 00000001 00000000 __atomic_notifier_call_chain f0e17f20: f0e17f7c 00000000 b7b28ff4 f0e17f3c f0e17f30: atomic_notifier_call_chain+12 ffffffff 00000000 f0e17f58 f0e17f40: notify_die+45 f0e17f7c kallsyms_token_index+5776 00000000 f0e17f50: 00000003 00000005 f0e17f74 do_int3+31 f0e17f60: 00000000 00000003 00000005 00000000 f0e17f70: 00000008 f0e17000 int3+44 00000000 f0e17f80: 0002498e 0114d071 00000008 b7b28ff4 f0e17f90: f0e17000 000000af 0000007b 4000007b f0e17fa0: 000000d8 ffffffff sys_rt_sigprocmask+1 00000060 f0e17fb0: 00000283 sysenter_do_call+18 00000000 bfa29638 f0e17fc0: bfa296b8 00000008 b7b28ff4 bfa29618 f0e17fd0: 000000af 0000007b 0000007b _text f0e17fe0: 000000af b7fb9424 00000073 00000246 f0e17ff0: bfa2957c 0000007b 00000000 00000000
Created attachment 3692 [details] test module for this issue BTW, this plain-c code can cause memory corruption. Even if we disables fault handler(please test it), kernel causes a crash.
Here are my test hw environments (the issue happened on both with uploaded kconfig) - AMD Athlon(tm) 64 X2 Dual Core Processor 5200+ (2core) - 4GB RAM - Fedora 10 - Intel(R) Xeon(R) CPU E5410 @ 2.33GHz (4core) - 2GB RAM - Fedora 10
Masami, I tested your module on a 4 proc x86-32 xeon box and without the text_poke patch, I hit the kprobe fault handler recursive failure issue: fault occurred on kprobes at c042a7fa(@c04413cb:14) fault occurred on kprobes at c042a7fa(@c04413cb:14) fault occurred on kprobes at c042a7fa(@c04413cb:14) fault occurred on kprobes at c042a7fa(@c04413cb:14) With the text_poke patch too, I see a hard hang. Haven't investigated yet what the issue could be. So yes, there is a potential vm issue with HIGHMEM4G and a kprobe fault handler issue on x86.
(In reply to comment #24) > Masami, > I tested your module on a 4 proc x86-32 xeon box and without the text_poke > patch, I hit the kprobe fault handler recursive failure issue: > fault occurred on kprobes at c042a7fa(@c04413cb:14) > fault occurred on kprobes at c042a7fa(@c04413cb:14) > fault occurred on kprobes at c042a7fa(@c04413cb:14) > fault occurred on kprobes at c042a7fa(@c04413cb:14) So did your kernel hang after that? > With the text_poke patch too, I see a hard hang. Haven't investigated yet what > the issue could be. So yes, there is a potential vm issue with HIGHMEM4G and a > kprobe fault handler issue on x86. Hmm, so my patch could not help your case. Thanks!
Subject: Re: syscall.* probe causes kernel panic(double fault) on rawhide kernel/i686 > So did your kernel hang after that? Yes > > With the text_poke patch too, I see a hard hang. Haven't investigated yet what > > the issue could be. So yes, there is a potential vm issue with HIGHMEM4G and a > > kprobe fault handler issue on x86. > > Hmm, so my patch could not help your case. Actually, it runs for some time before hanging. I am not sure what the precise issue is, but the system does go unresponsive. Ananth
(In reply to comment #26) > Actually, it runs for some time before hanging. I am not sure what the > precise issue is, but the system does go unresponsive. Could you attach your kconfig if possible? I'd like to reproduce it with my patch.
Sorry for the false alarm Masami. Looks like the machine on which I tested the patch earlier had some HDD issues that may have played into the failure. I now tested with and without your patch on a 4 processor P3 with 4G RAM with HIGHMEM4G set and the problem doesn't occur with your patch applied. I'll try to test it on another machine tomorrow just to make sure the problem is indeed fixed with your patch. Thanks for your work!
Hmm.. the other box also has some hardware issues. However, in the failure case, I was able to get more info on this box: BUG: unable to handle kernel <1>BUG: Bad page map in process gdm-rh-security pte:d6c162e8 pmd:16c16067 addr:008b2cc6 vm_flags:08000075 anon_vma:(null) mapping:ef4c3b3c index:c vma->vm_ops->fault: filemap_fault+0x0/0x34e vma->vm_file->f_op->mmap: generic_file_mmap+0x0/0x3b Pid: 4742, comm: gdm-rh-security Not tainted 2.6.29-rc3 #1 Call Trace: [<c0461dc8>] print_bad_pte+0x165/0x176 [<c046319c>] handle_mm_fault+0x22b/0x56f [<c063508d>] do_page_fault+0x275/0x542 [<c0634e18>] do_page_fault+0x0/0x542 [<c0633792>] error_code+0x72/0x78 PANIC: double fault, gdt at c961f000 [255 bytes] double fault, tss at c9623180 eip = c063463d, esp = d6800000 eax = d6800028, ebx = d6c0fd68, ecx = d6800028, edx = 0000000e esi = f1fdbece, edi = 00000002 PANIC: double fault, gdt at c9604000 [255 bytes] double fault, tss at c9608180 eip = c0633725, esp = de800000 eax = 000101fa, ebx = 00010082, ecx = c04eefb0, edx = 0001037a esi = c04eefb0, edi = 00000009 paging request<0>PANIC: double fault, gdt at c960d000 [255 bytes] double fault, tss at c9611180 eip = c04eefb0, esp = ec000004 eax = 000101fa, ebx = 00010082, ecx = c04eefb0, edx = 0001037a esi = c04eefb0, edi = 00000000 at 402f542c IP: [<c04f1fd1>] rb_next+0x18/0x34 *pde = 00000000 Oops: 0000 [#1] SMP
Created attachment 3703 [details] Config on the test machine Config from the test system. The m/c did have hardware issues, but this is the config used on the machine.
Here is the test result on the latest kernel(2.6.29-rc3) -UP means smp kernel & maxcpus=1. -test1 means kprobes syscall probe module without fault handler -test2 means kprobes syscall probe module with fault handler Without my patch: UP - test1 passed 128 times load/unload. - test2 failed at 69th SMP - test1 failed at 7th - test2 failed at 4th With my patch: UP - both tests passed 128 times load/unload. SMP - both tests passed 128 times load/unload. So, the bug may hide in the gap between vunmap and vm_unmap_ram.
Hmm, calling vm_unmap_aliases() right after vunmap() in text_poke() could also solve this bug.
Subject: Re: syscall.* probe causes kernel panic(double fault) on rawhide kernel/i686 On Tue, Feb 03, 2009 at 04:28:54PM -0000, mhiramat at redhat dot com wrote: > Hmm, calling vm_unmap_aliases() right after vunmap() in text_poke() could also > solve this bug. One other way of fixing (errr.. working around) the problem is to revert back to our earlier mechanism that did not use text_poke(), atleast until the actual problem gets fixed.
Actually, my suggestion above won't work since kernel text is now write protected :-(
Created attachment 3713 [details] call vm_unmap_aliases() after vunmap() in text_poke() Masami, You were right. Looking at Nick's patch as well as the usage of vm_unmap_aliases(), it does seem as though just a vunmap() will leave stale vmaps around for lazy flushing and that could be a problem. I have tested the patch on a system that used to crash earlier and it now works without problems.
It seems to me that an urgent patch to LKML for 2.6.28 and 2.6.29-rc is in order.
Hi Ananth, (In reply to comment #35) > Created an attachment (id=3713) > call vm_unmap_aliases() after vunmap() in text_poke() This patch is good for me too. Acked-by: Masami Hiramatsu <mhiramat@redhat.com> Thanks!
Maybe I spoke too early... Even though the above patch improves the situation (no double fault due to vmap) and tests run longer, I do see a kprobe fault handler bug presisting: registering...registered unregistering...fault occu kprobes at c042a7fa(@c04413cb:14) <4>fault occurred on kprobes at c042a7fa(@c04413cb:14) fault occurred on kprobes at c042a7fa(@c04413cb:14) fault occurred on kprobes at c042a7fa(@c04413cb:14) fault occurred on kprobes at c042a7fa(@c04413cb:14) fault occurred on kprobes at c042a7fa(@c04413cb:14) fault occurred on kprobes at c042a7fa(@c04413cb:14) fault occurred on kprobes at c042a7fa(@c04413cb:14) fault occurred on kprobes at c042a7fa(@c04413cb:14) ... The machine then goes unresponsive.
On a kernel with Masami's vm_(un)map_ram() and running tesmod in a loop, I hit the following which could be the cause of the fault(?): kernel BUG at mm/vmalloc.c:843! invalid opcode: 0000 [#1] SMP last sysfs file: /sys/devices/pci0000:05/0000:05:07.0/host0/target0:0:0/0:0:0:0/vendor Modules linked in: genkprobe(+) autofs4 dm_mirror dm_region_hash dm_log dm_multipath dm_mod sbs sbshc battery ac lp joydev sr_mod ide_cd_mod sg floppy parport_pc i2c_piix4 button tg3 cdrom serio_raw parport e1000 ibmasm pcspkr i2c_core libphy e100 mii usb_storage aic7xxx mptspi mptscsih mptbase scsi_transport_spi sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd [last unloaded: genkprobe] Pid: 10470, comm: insmod Not tainted (2.6.29-rc3 #2) eserver xSeries 235 -[86714AX]- EIP: 0060:[<c0469cd4>] EFLAGS: 00210246 CPU: 1 EIP is at vm_unmap_ram+0x85/0x181 EAX: 00000000 EBX: fe39a000 ECX: 00000000 EDX: 00000001 ESI: 00000000 EDI: 00000001 EBP: 00000001 ESP: f664aef8 DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 Process insmod (pid: 10470, ti=f664a000 task=f38a7200 task.ti=f664a000) Stack: 00000002 f664af30 fe39acba c0632166 f664af2f c04becb9 c10097c0 c10097e0 00000000 f89f3ba0 0069bff4 00000000 c0632919 cc9f0071 c0633a54 00000000 00000099 00000000 0069bff4 f89f0071 f89f00a4 f89f6480 c040112c 00000000 Call Trace: [<c0632166>] text_poke+0xe4/0x113 [<c04becb9>] sys_msgrcv+0x0/0x4d [<c0632919>] arch_arm_kprobe+0x19/0x1b [<c0633a54>] register_kprobe+0x265/0x2ad [<f89f0071>] init_module+0x0/0x5f [genkprobe] [<f89f00a4>] init_module+0x33/0x5f [genkprobe] fault occurred on kprobes at c042a68f(@f89f0010:14)
I think this is not only vm issue, but also fault handling issue.
I'd like to suggest a systemtap-side workaround for this bug. Execute below code before registering kprobes. -- dummy = alloc_vm_area(PAGE_SIZE); free_vm_area(dummy); -- Alloc_vm_area() calls vmalloc_sync_all() internally, which synchronizes all process PGDs to the kernel PGD. Thus, no more on-demand PTE fixup occurs on the kprobe handlers. I've checked this worked fine on the 2.6.29-rc3 kernel.
> dummy = alloc_vm_area(PAGE_SIZE); > free_vm_area(dummy); Workaround tested and committed, 482472c
Created attachment 3721 [details] [PATCH] x86/kprobe : call kprobe_fault_handler after gpd fixup This patch solves this issue as far as I tested. Please test it!
Verified that the patch above does fix the double fault problem without the need for the vm fixes.
The kernel patch was merged to linus tree. 9be260a646bf76fa418ee519afa10196b3164681