I've experimented with systemtap from RHEL5rc1 and crashed the node:

[root@dhcp17-85 ~]# crash /usr/lib/debug/lib/modules/2.6.18-8.1.1.el5/vmlinux vmcore.rhel5-8.1.1
crash> log
Linux version 2.6.18-8.1.1.el5 (brewbuilder@hs20-bc1-7.build.redhat.com) (gcc version 4.1.1 20070105 (Red Hat 4.1.1-52)) #1 SMP Mon Feb 26 20:38:02 EST 2007
...
Error creating systemtap /proc entries.
BUG: unable to handle kernel paging request at virtual address d0d8da47
 printing eip:
d0d8da47
*pde = 0637c067
Oops: 0000 [#1]
SMP
last sysfs file: /module/uhci_hcd/sections/.text
Modules linked in: autofs4 hidp rfcomm l2cap bluetooth sunrpc ipv6 video sbs
 i2c_ec button battery asus_acpi ac lp snd_ens1371 gameport snd_rawmidi
 snd_ac97_codec snd_ac97_bus snd_seq_dummy snd_seq_oss snd_seq_midi_event
 snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_pcm sg snd_timer floppy
 snd soundcore snd_page_alloc e1000 i2c_piix4 parport_pc i2c_core parport
 pcspkr ide_cd cdrom serio_raw dm_snapshot dm_zero dm_mirror dm_mod mptspi
 mptscsih mptbase scsi_transport_spi sd_mod scsi_mod ext3 jbd ehci_hcd
 ohci_hcd uhci_hcd
CPU:    0
EIP:    0060:[<d0d8da47>]    Not tainted VLI
EFLAGS: 00010286   (2.6.18-8.1.1.el5 #1)
EIP is at 0xd0d8da47
eax: 00000000   ebx: 00000100   ecx: c0717fd4   edx: c0717000
esi: c0784e00   edi: d0d8da47   ebp: 00000000   esp: c0717fd0
ds: 007b   es: 007b   ss: 0068
Process swapper (pid: 0, ti=c0717000 task=c0660bc0 task.ti=c06d3000)
Stack: c042cc09 c0717fd4 c0717fd4 00000001 c06c9b08 0000000a c04281cf c06d3f94
       c06d3000 00000046 c0612b83 c0406461
Call Trace:
 [<c042cc09>] run_timer_softirq+0xfb/0x151
 [<c04281cf>] __do_softirq+0x5a/0xbb
 [<c0406461>] do_softirq+0x52/0x9d
 [<c04049bf>] apic_timer_interrupt+0x1f/0x24
 [<c0402b98>] default_idle+0x0/0x59
 [<c0402bc9>] default_idle+0x31/0x59
 [<c0402c90>] cpu_idle+0x9f/0xb9
 [<c06d8798>] start_kernel+0x380/0x387
 =======================
Code:  Bad EIP value.
EIP: [<d0d8da47>] 0xd0d8da47 SS:ESP 0068:c0717fd0

crash> bt -at
PID: 0      TASK: c0660bc0  CPU: 0   COMMAND: "swapper"
              START: crash_kexec at c0442bde
  [c0717f30] die at c04054b3
  [c0717f60] do_page_fault at c05fd66f
  [c0717f90] do_page_fault at c05fd285
  [c0717f98] error_code at c0404a71
  [c0717fd0] run_timer_softirq at c042cc09
  [c0717fe8] __do_softirq at c04281cf
  [c0717ffc] do_softirq at c0406461
--- <soft IRQ> ---
              START: do_softirq at c040640f
  [c06d3fa0] apic_timer_interrupt at c04049bf
  [c06d3fa8] default_idle at c0402b98
  [c06d3fcc] default_idle at c0402bc9
  [c06d3fd8] cpu_idle at c0402c90
  [c06d3fe0] start_kernel at c06d8798

PID: 24084  TASK: ce27c550  CPU: 1   COMMAND: "insmod"
              START: crash_nmi_callback at c0418f55
  [ca6dfdec] do_flush_tlb_all at c0415a96
  [ca6dfe00] smp_call_function at c041591f
  [ca6dfe20] do_flush_tlb_all at c0415a96
  [ca6dfe24] do_nmi at c0405815
  [ca6dfe2c] __wake_up at c041d20d
  [ca6dfe48] nmi_stack_correct at c0404b16
  [ca6dfe60] do_flush_tlb_all at c0415a96
  [ca6dfe74] smp_call_function at c041591f
  [ca6dfe80] do_flush_tlb_all at c0415a96
  [ca6dfe9c] do_flush_tlb_all at c0415a96
  [ca6dfea4] on_each_cpu at c0427b46
  [ca6dfeb8] flush_tlb_all at c04159d7
  [ca6dfec0] __remove_vm_area at c046023e
  [ca6dfecc] remove_vm_area at c0460263
  [ca6dfed4] __vunmap at c04602a8
  [ca6dfee4] sys_init_module at c043d133
  [ca6dfeec] do_sync_read at c046a5df
  [ca6dffb8] syscall_call at c0403eff

VvS> The "Error creating systemtap /proc entries" message points to a failure during module initialization. CPU 1 then frees the module's resources while CPU 0 runs the systemtap timer function, which has not been deleted, and that leads to the crash.
Created attachment 1606: _stp_transport_init() cleanup
Patch against systemtap-20070310. Adds the missing rollback procedures to the _stp_transport_init() function.
Created attachment 1607: _stp_register_procfs() cleanup
Patch against systemtap-20070310. Fixes the rollback procedures in the _stp_register_procfs() function.
Created attachment 1608: _stp_init_time() cleanup
Patch against systemtap-20070310. Adds the missing rollback to the _stp_init_time() function.
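For readers unfamiliar with the failure mode, here is a minimal userspace sketch of the rollback pattern these patches are about. It is not the systemtap code: every function and flag name below is hypothetical, and the "timer" and "procfs" steps are simulated with booleans. The point is only that when a later init step fails, every earlier step (in particular the armed timer) must be undone before the init function returns an error.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-ins for resources a module acquires during init. */
static bool buffers_allocated;
static bool timer_armed;
static bool procfs_registered;

static int  alloc_buffers(void) { buffers_allocated = true; return 0; }
static void free_buffers(void)  { buffers_allocated = false; }

static int  start_timer(void)   { timer_armed = true; return 0; }
/* In a real kernel module this step would be a call such as del_timer_sync(). */
static void stop_timer(void)    { timer_armed = false; }

/* 'fail' simulates the "Error creating systemtap /proc entries" case. */
static int register_procfs(bool fail)
{
    if (fail)
        return -1;
    procfs_registered = true;
    return 0;
}

/* Hypothetical init routine: unwinds in reverse order on any failure. */
int sketch_transport_init(bool fail_procfs)
{
    int rc;

    rc = alloc_buffers();
    if (rc)
        return rc;

    rc = start_timer();
    if (rc)
        goto err_buffers;

    rc = register_procfs(fail_procfs);
    if (rc)
        goto err_timer;

    return 0;

err_timer:
    stop_timer();      /* without this, the timer fires after the memory is gone */
err_buffers:
    free_buffers();
    return rc;
}
```

Calling sketch_transport_init(true) returns a nonzero value and leaves no timer armed, which is the property the crash above shows was missing.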
Mike Mason, Frank Ch. Eigler: I've added you to the cc: list because I believe this may be of interest to you: http://article.gmane.org/gmane.linux.systemtap/5102
testing the patch
Unfortunately, I'm still seeing timer-related crashes. This time it is on RHEL5 x86-64 SMP: a page fault during run_timer_softirq(), as usual. I'll continue to dig in this direction.
One problem in the transport/procfs code is a race condition on the creation/deletion of the "/proc/systemtap" directory proper, should multiple probes be starting/stopping at the same time.
The patch was patched and committed. The /proc/systemtap/ race is gone, though with a drawback of its own (potential /proc name collision if one runs stap -m "net" .... )
(In reply to comment #8)
> The patch was patched and committed.
> The /proc/systemtap/ race is gone, though with a drawback of its own (potential
> /proc name collision if one runs stap -m "net" .... )
>

I've checked in a rewrite of the transport where I am using pids instead of module names, so that collision can't happen.
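To illustrate why keying the entry on a pid avoids the collision, here is a small hypothetical sketch. The path layouts and the "systemtap_" prefix are assumptions made for illustration only, not the naming the committed transport actually uses.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical: entry named after the module, directly under /proc.
 * A module called "net" would then clash with the existing /proc/net. */
int name_by_module(char *buf, size_t len, const char *module)
{
    return snprintf(buf, len, "/proc/%s", module);
}

/* Hypothetical: entry named after a process ID instead.
 * The user-chosen module name no longer appears in the path. */
int name_by_pid(char *buf, size_t len, long pid)
{
    return snprintf(buf, len, "/proc/systemtap_%ld", pid);
}
```

With the module-name scheme, stap -m "net" would map to /proc/net. With the pid scheme, the path depends only on the pid, so the name the user picks cannot collide with an existing /proc entry.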
patches committed