#! stap global called probe kernel.function("activate_task") { called++; } probe timer.jiffies(100) { exit(); } Linux localhost.localdomain 2.6.12-1.1447_FC4 #1 Fri Aug 26 20:29:51 EDT 2005 i686 athlon i386 GNU/Linux cvs up executed on Oct 28, running the latest elfutils components [root@localhost ~]# stap -V SystemTap translator/driver (version 0.4.2 built 2005-10-28) Copyright (C) 2005 Red Hat, Inc. and others This is free software; see the source for copying conditions. [root@localhost ~]# oops screen shot availible, but only shows a small part of the oops this is not a duplicate of bug 1594 because 1594 requires all 3 probes to produce the bug, this script does it with only two probes and uses different probes. email if i can provide you with any more details
Prasanna, can you take a look at this bug.
I see a hang/reboot just after inserting probes on do_fork on 2.6.13 kernel. Could you please check if your problem exists on 2.6.13 FC kernel. My test m/c hangs/reboots after inserting the probes. This happens consistently on my i386 uniprocessor test m/c . Details are as below [root@wks126728wss ~]# uname -a Linux wks126728wss.in.ibm.com 2.6.13-1.1532_FC4 #1 Thu Oct 20 01:30:08 EDT 2005 i686 i686 i386 GNU/Linux [root@wks126728wss ~]# rpm -qa |grep kernel kernel-devel-2.6.13-1.1532_FC4 kernel-2.6.11-1.1369_FC4 kernel-devel-2.6.11-1.1369_FC4 kernel-debuginfo-2.6.13-1.1532_FC4 kernel-2.6.13-1.1532_FC4 [root@wks126728wss ~]# rpm -qa |grep elfutils elfutils-libelf-devel-0.116-0.1 elfutils-0.116-0.1 elfutils-devel-0.116-0.1 elfutils-libelf-0.116-0.1 elfutils-libs-0.116-0.1 elfutils-debuginfo-0.116-0.1 [root@wks126728wss ~]# stap -V SystemTap translator/driver (version 0.4.2 built 2005-11-09) Copyright (C) 2005 Red Hat, Inc. and others This is free software; see the source for copying conditions. [root@wks126728wss ~]# cat /home/prasanna/examp/test3.stp probe kernel.function("do_fork") { log("e") } [root@wks126728wss examp]# stap -v -g test3.stp Created temporary directory "/tmp/stapUAG0nh" Searched '/usr/local/share/systemtap/tapset/2.6.13-1.1532_FC4/*.stp', match coun t 0 Searched '/usr/local/share/systemtap/tapset/2.6.13/*.stp', match count 0 Searched '/usr/local/share/systemtap/tapset/2.6/*.stp', match count 0 Searched '/usr/local/share/systemtap/tapset/*.stp', match count 9 Pass 1: parsed user script and 9 library script(s). parsed 'do_fork' -> func 'do_fork' pattern 'kernel' matches module 'kernel' focused on module 'kernel' = [c0100000-c05358d8, bias 0] pattern 'do_fork' matches function 'do_fork' selected function do_fork querying prologue-end of function 'do_fork' pattern 'kernel' matches module 'kernel' Pass 2: analyzed user script. 1 probe(s), 6 function(s), 0 global(s). Running grep " [tT] " /proc/kallsyms | sort -k 1,8 -s -o /tmp/stapUAG0nh/symbols .sorted Pass 3: translated to C into "/tmp/stapUAG0nh/stap_2286.c" Running make -C "/lib/modules/2.6.13-1.1532_FC4/build" M="/tmp/stapUAG0nh" modul es make: Entering directory `/usr/src/kernels/2.6.13-1.1532_FC4-i686' CC [M] /tmp/stapUAG0nh/stap_2286.o Building modules, stage 2. MODPOST CC /tmp/stapUAG0nh/stap_2286.mod.o LD [M] /tmp/stapUAG0nh/stap_2286.ko make: Leaving directory `/usr/src/kernels/2.6.13-1.1532_FC4-i686' Pass 4: compiled into "stap_2286.ko" Running sudo /usr/local/libexec/systemtap/stpd -r /tmp/stapUAG0nh/stap_2286.ko e e the system hangs/reboots just after couple of messages. I see a double fault message sometimes. I will checkout on some other test machines.
This may be a candidate for the translator blacklist. Please advise if so, and nominate any additional related functions with the same problem.
Seems that calling prink's from within the activate_task() probe handler causes the hang on i386 smp box. I wrote kernel module that inserts probes on activate_task() without calling any prink's from the handler, it seems to work fine. Below is the example module that works fine. One solution would be to prevent calling printks from probe handler on activate_task() routine. Is is possible for the transilator to avoid calling printks from activate_task() probe handlers? #include <linux/kernel.h> #include <linux/module.h> #include <linux/kprobes.h> #include <linux/sched.h> struct kprobe kp; unsigned long count; int handler_pre(struct kprobe *p, struct pt_regs *regs) { count++; return 0; } void handler_post(struct kprobe *p, struct pt_regs *regs, unsigned long flags){ } int handler_fault(struct kprobe *p, struct pt_regs *regs, int trapnr) { printk("fault_handler:p->addr=0x%p, eflags=0x%lx\n", p->addr, regs->eflags); return 0; } int init_module(void) { kp.pre_handler=handler_pre; kp.post_handler=handler_post; kp.fault_handler=handler_fault; kp.addr = (kprobe_opcode_t *)0xc0113e20; register_kprobe(&kp); printk("kprobe registered\n"); return 0; } void cleanup_module(void) { unregister_kprobe(&kp); printk("kprobe unregistered count = %d\n", count); } MODULE_LICENSE("GPL");
Subject: Re: probe on activate_task causes oops On 11 Nov 2005 13:04:35 -0000, prasanna at in dot ibm dot com <sourceware-bugzilla@sourceware.org> wrote: > > ------- Additional Comments From prasanna at in dot ibm dot com 2005-11-11 13:04 ------- > Seems that calling prink's from within the activate_task() probe handler causes > the hang on i386 smp box. I wrote kernel module that inserts probes on > activate_task() without calling any prink's from the handler, it seems to work > fine. Below is the example module that works fine. One solution would be to > prevent calling printks from probe handler on activate_task() routine. > Is is possible for the transilator to avoid calling printks from activate_task() > probe handlers? > how about solve the problem rather than write a quick hack that will come back to bite you hundreds, if not thousands of times, in the future? the problem is printk or the way it does it job. It needs to be written so that it does it job correctly no matter the state the system is in. One idea would be to have it send its payload back to userland for later printing you would be closer to a real solution. Every function/task in systemtap that you call needs to be safe if you ever wish to have a stable system while using systemtap. James Dickens uadmin.blogspot.com [snipped] > http://sourceware.org/bugzilla/show_bug.cgi?id=1776 > > ------- You are receiving this mail because: ------- > You reported the bug, or are watching the reporter. >
(In reply to comment #4) > Seems that calling prink's from within the activate_task() probe handler causes > the hang on i386 smp box. I wrote kernel module that inserts probes on > activate_task() without calling any prink's from the handler, it seems to work > fine. Below is the example module that works fine. One solution would be to > prevent calling printks from probe handler on activate_task() routine. > Is is possible for the transilator to avoid calling printks from activate_task() > probe handlers? Prasanna, I think the solution you are proposing is really not solving the problem. You are only attacking the symptom rather than solving the root cause of the problem. Another that is missing is, we dont really know what is that prink doing that is causing the problem, i think we should get to this before we do this kind of work around.
Subject: Re: probe on activate_task causes oops [snipped] > > fine. Below is the example module that works fine. One solution would be to > > prevent calling printks from probe handler on activate_task() routine. > > Is is possible for the transilator to avoid calling printks from > activate_task() > > probe handlers? > > Prasanna, > I think the solution you are proposing is really not solving the problem. You > are only attacking the symptom rather than solving the root cause of the > problem. Another that is missing is, we dont really know what is that prink > doing that is causing the problem, i think we should get to this before we do > this kind of work around. > > -- > see http://sourceware.org/bugzilla/show_bug.cgi?id=1837 for possible problems in printk. [snipped]
Prasanna: we already know about printk problems (bug #1564). The question is why the first piece of submitted bug report code, which does not call printk, has a problem. It is possible that the translator-emitted timer.jiffies management code is suffering problems analogous to the printk one. Someone who can reproduce this crash should try it without the timer.jiffies probe, or one with a much longer interval. If the crash occurs at timer probe triggering, then please reassign this back to me.
Frank, I did some more investigation and tried several options. Below is the trace I get when I run the given script with and without the timer.jiffies(). Both of them cause the double faults and a stack trace which show that kernel panice due to acquiring already acquired runqueue lock. Not sure what might be the reason for the double fault. ibm-prt4aouz6u1.in.ibm.com login: double fault, gdt at c04ea000 [255 bytes] Kernel panic - not syncing: kernel/sched.c:357: spin_lock(kernel/sched.c:c04eac40) already locked by kernel/sched.c/357. (Not tainted) [<c0127ba8>] panic+0x45/0x1b4 [<c01207fa>] wake_up_process+0x0/0x10 [<c01519d7>] autoremove_wake_function+0x15/0x37 [<c012186b>] __wake_up_common+0x39/0x59 [<c012191d>] __wake_up+0x92/0x252 [<c0128c93>] call_console_drivers+0x7e/0x149 [<c01298c3>] release_console_sem+0x283/0x41c [<c01292b0>] vprintk+0x401/0x70e [<c0128eab>] printk+0x1b/0x1f [<c010dc16>] doublefault_fn+0x36/0xf0 <0>Kernel panic - not syncing: kernel/sched.c:3063: spin_lock(kernel/printk.c:c0458d00) already locked by kernel/sched.c/3063. (Not tain)[<c0127ba8>] panic+0x45/0x1b4 [<c0271f16>] vgacon_dummy+0x0/0xa [<c0121add>] __wake_up_locked+0x0/0x21 [<c01298c3>] release_console_sem+0x283/0x41c [<c01292b0>] vprintk+0x401/0x70e Also I wrote a simple kernel module to insert probes only on activate_task() with the handlers printk ing some registers without timer.jiffies, Here also it panics while single stepping, but does not double fault. wks126320wss.in.ibm.com login: kprobe registered pre_handler: p->addr=0xc0120467, eflags=0x97 Kernel panic - not syncing: kernel/sched.c:3063: spin_lock(kernel/printk.c:c0458d00) already locked by kernel/sched.c/3063. (Not tainted) [<c0127ba8>] panic+0x45/0x1b4 [<c0121add>] __wake_up_locked+0x0/0x21 [<c01298c3>] release_console_sem+0x283/0x41c [<c01292b0>] vprintk+0x401/0x70e Not sure what code added by transilator is causing the double fault even for this simple script even without having any printks. probe kernel.function("activate_task") { } -Prasanna
Created attachment 1427 [details] Test script using stap printf for logging Attached is a slightly modified version of the original script; this version uses stap's printf function to log the number of calls. Both this version and the original version seem to run fine using stap v0.5.10 on an SMP powerpc system. (Also a version that invokes log("a") from the activate_task handler.) Consensus seems to be that: - There's nothing inherently unsafe about probing activate_task() with kprobes, so it's not a kprobes bug. - Calling printk in certain contexts is not as safe as it should be. (activate_task appears to be one such context. I'm opening a separate bug about that; that's not the subject of this bug.) stap no longer emits printk calls, so it's not a stap issue. Kprobes users, as always, are advised to consider possible deadlock situations when coding handlers; so printk's apparent non-safety in said contexts is a risk assumed by bare-knuckles kprobes users. - stap has evolved so much since we last looked at this bug, there's a good chance this bug got fixed somewhere along the way. Prasanna, you were able to reproduce this on i386 way back when. Could you try it again (the stap script -- no printks) and close this bug if you don't see the problem?
Jim, RHEL5-beta2 does not have exported symbol activate_task. I executed the modified script (stap script without printk) mentioned in the message #10 to probe on __activate_task() and it seems to work fine on i386 uni as well as SMP box. Thanks Prasanna