Bug 4179 - systemtap module init cleanup
Summary: systemtap module init cleanup
Status: RESOLVED FIXED
Alias: None
Product: systemtap
Classification: Unclassified
Component: runtime
Version: unspecified
Importance: P2 normal
Target Milestone: ---
Assignee: Frank Ch. Eigler
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2007-03-12 08:56 UTC by Vasily Averin
Modified: 2007-03-15 19:13 UTC
CC List: 2 users

See Also:
Host:
Target:
Build:
Last reconfirmed:


Attachments
_stp_transport_init() cleanup (515 bytes, patch)
2007-03-12 09:02 UTC, Vasily Averin
_stp_register_procfs() cleanup (556 bytes, patch)
2007-03-12 09:05 UTC, Vasily Averin
_stp_init_time() cleanup (164 bytes, patch)
2007-03-12 09:08 UTC, Vasily Averin

Description Vasily Averin 2007-03-12 08:56:42 UTC
I've experimented with systemtap from RHEL5rc1 and crashed the node:

[root@dhcp17-85 ~]# crash /usr/lib/debug/lib/modules/2.6.18-8.1.1.el5/vmlinux
vmcore.rhel5-8.1.1
crash> log
Linux version 2.6.18-8.1.1.el5 (brewbuilder@hs20-bc1-7.build.redhat.com) (gcc version 4.1.1 20070105 (Red Hat 4.1.1-52)) #1 SMP Mon Feb 26 20:38:02 EST 2007
...
Error creating systemtap /proc entries.
BUG: unable to handle kernel paging request at virtual address d0d8da47
 printing eip:
d0d8da47
*pde = 0637c067
Oops: 0000 [#1]
SMP
last sysfs file: /module/uhci_hcd/sections/.text
Modules linked in: autofs4 hidp rfcomm l2cap bluetooth sunrpc ipv6 video sbs i2c_ec button battery asus_acpi ac lp snd_ens1371 gameport snd_rawmidi snd_ac97_codec snd_ac97_bus snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_pcm sg snd_timer floppy snd soundcore snd_page_alloc e1000 i2c_piix4 parport_pc i2c_core parport pcspkr ide_cd cdrom serio_raw dm_snapshot dm_zero dm_mirror dm_mod mptspi mptscsih mptbase scsi_transport_spi sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd
CPU:    0
EIP:    0060:[<d0d8da47>]    Not tainted VLI
EFLAGS: 00010286   (2.6.18-8.1.1.el5 #1)
EIP is at 0xd0d8da47
eax: 00000000   ebx: 00000100   ecx: c0717fd4   edx: c0717000
esi: c0784e00   edi: d0d8da47   ebp: 00000000   esp: c0717fd0
ds: 007b   es: 007b   ss: 0068
Process swapper (pid: 0, ti=c0717000 task=c0660bc0 task.ti=c06d3000)
Stack: c042cc09 c0717fd4 c0717fd4 00000001 c06c9b08 0000000a c04281cf c06d3f94
       c06d3000 00000046 c0612b83 c0406461
Call Trace:
 [<c042cc09>] run_timer_softirq+0xfb/0x151
 [<c04281cf>] __do_softirq+0x5a/0xbb
 [<c0406461>] do_softirq+0x52/0x9d
 [<c04049bf>] apic_timer_interrupt+0x1f/0x24
 [<c0402b98>] default_idle+0x0/0x59
 [<c0402bc9>] default_idle+0x31/0x59
 [<c0402c90>] cpu_idle+0x9f/0xb9
 [<c06d8798>] start_kernel+0x380/0x387
 =======================
Code:  Bad EIP value.
EIP: [<d0d8da47>] 0xd0d8da47 SS:ESP 0068:c0717fd0

crash> bt -at
PID: 0      TASK: c0660bc0  CPU: 0   COMMAND: "swapper"
      START: crash_kexec at c0442bde
  [c0717f30] die at c04054b3
  [c0717f60] do_page_fault at c05fd66f
  [c0717f90] do_page_fault at c05fd285
  [c0717f98] error_code at c0404a71
  [c0717fd0] run_timer_softirq at c042cc09
  [c0717fe8] __do_softirq at c04281cf
  [c0717ffc] do_softirq at c0406461
--- <soft IRQ> ---
      START: do_softirq at c040640f
  [c06d3fa0] apic_timer_interrupt at c04049bf
  [c06d3fa8] default_idle at c0402b98
  [c06d3fcc] default_idle at c0402bc9
  [c06d3fd8] cpu_idle at c0402c90
  [c06d3fe0] start_kernel at c06d8798

PID: 24084  TASK: ce27c550  CPU: 1   COMMAND: "insmod"
      START: crash_nmi_callback at c0418f55
  [ca6dfdec] do_flush_tlb_all at c0415a96
  [ca6dfe00] smp_call_function at c041591f
  [ca6dfe20] do_flush_tlb_all at c0415a96
  [ca6dfe24] do_nmi at c0405815
  [ca6dfe2c] __wake_up at c041d20d
  [ca6dfe48] nmi_stack_correct at c0404b16
  [ca6dfe60] do_flush_tlb_all at c0415a96
  [ca6dfe74] smp_call_function at c041591f
  [ca6dfe80] do_flush_tlb_all at c0415a96
  [ca6dfe9c] do_flush_tlb_all at c0415a96
  [ca6dfea4] on_each_cpu at c0427b46
  [ca6dfeb8] flush_tlb_all at c04159d7
  [ca6dfec0] __remove_vm_area at c046023e
  [ca6dfecc] remove_vm_area at c0460263
  [ca6dfed4] __vunmap at c04602a8
  [ca6dfee4] sys_init_module at c043d133
  [ca6dfeec] do_sync_read at c046a5df
  [ca6dffb8] syscall_call at c0403eff

VvS>
The "Error creating systemtap /proc entries" message points to a failure during
module initialization. CPU 1 then frees the resources while CPU 0 runs the
systemtap timer function, which was never deleted, and that leads to the crash.
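
To illustrate the failure mode, here is a minimal userspace sketch of the goto-based rollback that an init path like this needs. It is not the systemtap runtime code: struct fake_module, the helper names, and the order of the steps are all hypothetical, and the flags only stand in for a pending kernel timer, the transport buffers and the /proc entries.

```c
/* Userspace sketch of goto-based rollback in a module init path.
 * Every name here is hypothetical; this is not the systemtap runtime API. */
#include <assert.h>
#include <stdbool.h>

struct fake_module {
    bool timer_armed;   /* stands in for a pending kernel timer */
    bool transport_up;  /* stands in for the transport buffers */
    bool procfs_up;     /* stands in for the /proc entries */
};

static int start_timer(struct fake_module *m) { m->timer_armed = true; return 0; }
static void stop_timer(struct fake_module *m) { m->timer_armed = false; }

static int open_transport(struct fake_module *m) { m->transport_up = true; return 0; }
static void close_transport(struct fake_module *m) { m->transport_up = false; }

/* The caller can force this step to fail, as in the reported crash. */
static int register_procfs(struct fake_module *m, bool fail)
{
    if (fail)
        return -1;
    m->procfs_up = true;
    return 0;
}

/* Each step that succeeded is undone in reverse order on failure, so a
 * failed init never returns with the timer still armed. */
int module_init_sketch(struct fake_module *m, bool procfs_fails)
{
    int rc;

    rc = start_timer(m);
    if (rc)
        goto out;
    rc = open_transport(m);
    if (rc)
        goto err_timer;
    rc = register_procfs(m, procfs_fails);
    if (rc)
        goto err_transport;
    return 0;

err_transport:
    close_transport(m);
err_timer:
    stop_timer(m);
out:
    return rc;
}

/* Small self-check covering the failing and the successful path. */
void rollback_self_check(void)
{
    struct fake_module bad = { false, false, false };
    struct fake_module good = { false, false, false };

    assert(module_init_sketch(&bad, true) != 0);
    assert(!bad.timer_armed && !bad.transport_up && !bad.procfs_up);

    assert(module_init_sketch(&good, false) == 0);
    assert(good.timer_armed && good.transport_up && good.procfs_up);
}
```

In a real kernel module the stop_timer() step would have to wait for a handler that may already be running on another CPU before the module memory is freed; the sketch only models the ordering of the rollback.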
Comment 1 Vasily Averin 2007-03-12 09:02:47 UTC
Created attachment 1606
_stp_transport_init() cleanup

patch against systemtap-20070310
adds the missing rollback procedures for the _stp_transport_init() function
Comment 2 Vasily Averin 2007-03-12 09:05:34 UTC
Created attachment 1607
_stp_register_procfs() cleanup

patch against systemtap-20070310

fixes the rollback procedures in the _stp_register_procfs() function
Comment 3 Vasily Averin 2007-03-12 09:08:28 UTC
Created attachment 1608
_stp_init_time() cleanup

patch against systemtap-20070310

adds the missing rollback for the _stp_init_time() function
Comment 4 Vasily Averin 2007-03-12 09:24:32 UTC
Mike Mason, Frank Ch. Eigler

I've added you to the cc: list because I believe this may be of interest to you:
http://article.gmane.org/gmane.linux.systemtap/5102
Comment 5 Frank Ch. Eigler 2007-03-12 12:07:23 UTC
testing the patch
Comment 6 Frank Ch. Eigler 2007-03-12 16:13:43 UTC
Unfortunately, I'm still seeing timer-related crashes.
This time on RHEL5 x86-64 SMP: a paging fault during run_timer_softirq(), as usual.
I'll continue to dig in this direction.
Comment 7 Frank Ch. Eigler 2007-03-12 16:22:43 UTC
One problem in the transport/procfs code is a race condition on the
creation/deletion of the "/proc/systemtap" directory proper, should
multiple probes be starting/stopping at the same time.
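
One conventional way to make a shared directory safe against concurrent start/stop is to reference-count it under a lock. The userspace sketch below illustrates only that idea and assumes a single owner of the directory state; the names (shared_dir_get, shared_dir_put, dir_users) are hypothetical and the boolean merely stands in for the "/proc/systemtap" entry. Comment 8 indicates the committed change took a different route.

```c
/* Userspace sketch: a shared directory that is created by the first user
 * and removed by the last one. All names are hypothetical. */
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t dir_lock = PTHREAD_MUTEX_INITIALIZER;
static int dir_users;     /* how many probes currently use the directory */
static bool dir_exists;   /* stands in for the shared /proc directory */

/* Called when a probe starts: create the directory only for the first user. */
void shared_dir_get(void)
{
    pthread_mutex_lock(&dir_lock);
    if (dir_users == 0)
        dir_exists = true;
    dir_users++;
    pthread_mutex_unlock(&dir_lock);
}

/* Called when a probe stops: remove the directory only after the last user. */
void shared_dir_put(void)
{
    pthread_mutex_lock(&dir_lock);
    assert(dir_users > 0);
    if (--dir_users == 0)
        dir_exists = false;
    pthread_mutex_unlock(&dir_lock);
}

/* Read the state under the same lock that protects updates. */
bool shared_dir_present(void)
{
    bool present;

    pthread_mutex_lock(&dir_lock);
    present = dir_exists;
    pthread_mutex_unlock(&dir_lock);
    return present;
}

/* Two overlapping users: the directory must survive the first put. */
void shared_dir_self_check(void)
{
    shared_dir_get();
    shared_dir_get();
    shared_dir_put();
    assert(shared_dir_present());
    shared_dir_put();
    assert(!shared_dir_present());
}
```

Because the check of the counter and the create/remove step happen under one lock, a stopping probe cannot remove the directory between another probe's existence check and its use.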
Comment 8 Frank Ch. Eigler 2007-03-12 18:29:57 UTC
The patch was patched and committed.
The /proc/systemtap/ race is gone, though with a drawback of its own (potential
/proc name collision if one runs stap -m "net" .... )
Comment 9 Martin Hunt 2007-03-14 18:30:58 UTC
(In reply to comment #8)
> The patch was patched and committed.
> The /proc/systemtap/ race is gone, though with a drawback of its own (potential
> /proc name collision if one runs stap -m "net" .... )
> 

I've checked in a rewrite of the transport where I am using pids instead of
module names, so that collision can't happen.  
Comment 10 Frank Ch. Eigler 2007-03-15 19:13:26 UTC
patches committed