Bug 10651 - very rare BUG_ON kernel/timer.c:619 due to runtime/time.c
Summary: very rare BUG_ON kernel/timer.c:619 due to runtime/time.c
Status: RESOLVED FIXED
Alias: None
Product: systemtap
Classification: Unclassified
Component: runtime (show other bugs)
Version: unspecified
: P2 normal
Target Milestone: ---
Assignee: Unassigned
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2009-09-17 02:28 UTC by Frank Ch. Eigler
Modified: 2010-11-27 23:54 UTC (History)
0 users

See Also:
Host:
Target:
Build:
Last reconfirmed:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Frank Ch. Eigler 2009-09-17 02:28:34 UTC
Something is calling mod_timer with a timer->function==NULL.

It appears as if the _stp_kill_time function is needlessly racy
(amongst the stp_timer_reregister flag, which should probably be
an atomic_t), and the del_timer_sync()'s.  It wouldn't hurt to
plop a synchronize_sched() in there too before the free_percpu
goo.

This was on an i686 rawhide 4cpu VM, 2.6.31-12.fc12.i686 kernel,
according to the systemtap.log, last attempted tes was bz5274.exp,
and here's the BUG message itself:

------------[ cut here ]------------
kernel BUG at kernel/timer.c:619!
invalid opcode: 0000 [#2] SMP 
last sysfs file: /sys/module/virtio_pci/sections/__mcount_loc
Modules linked in: stap_d91e5e64f72dff7c4c94d049e84aaaab_10830 uprobes
netconsole iptable_nat nf_nat nfsd exportfs configfs nfs lockd fscache nfs_acl
auth_rpcgss sunrpc ipv6 dm_multipath uinput ppdev parport_pc parport
virtio_balloon 8139too 8139cp mii i2c_piix4 i2c_core floppy virtio_pci [last
unloaded: stap_f28f0139d183e5474e8c3292a39d7c78_2115]

Pid: 10, comm: ksoftirqd/2 Tainted: G      D W  (2.6.31-12.fc12.i686 #1) 
EIP: 0060:[<c04503fc>] EFLAGS: 00010246 CPU: 2
EIP is at __mod_timer+0x35/0x110
EAX: 00000000 EBX: c3909f7c ECX: 00000000 EDX: 00000000
ESI: c3909f40 EDI: 00001ecc EBP: f6cdde8c ESP: f6cdde6c
 DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
Process ksoftirqd/2 (pid: 10, ti=f6cdc000 task=f6ce0000 task.ti=f6cdc000)
Stack:
 00000000 00bdafbe c0605a00 42ad3970 42ad3970 00000246 c3909f40 00001ecc
<0> f6cdde9c c0450547 00000000 42ad3970 f6cdded8 f7d8c15b 9e2aa125 11641494
<0> 00000000 c04500d6 33171125 00000000 2e380e90 4ab18a28 33171125 42ad3970
Call Trace:
 [<c0605a00>] ? spin_bug+0x33/0xae
 [<c0450547>] ? mod_timer+0x2c/0x44
 [<f7d8c15b>] ? __stp_time_timer_callback+0xd4/0xed
[stap_d91e5e64f72dff7c4c94d049e84aaaab_10830]
 [<c04500d6>] ? run_timer_softirq+0x133/0x204
 [<c0450121>] ? run_timer_softirq+0x17e/0x204
 [<c04500d6>] ? run_timer_softirq+0x133/0x204
 [<f7d8c087>] ? __stp_time_timer_callback+0x0/0xed
[stap_d91e5e64f72dff7c4c94d049e84aaaab_10830]
 [<c044b4f5>] ? __do_softirq+0xc8/0x192
 [<c044b63e>] ? ksoftirqd+0x0/0xd6
 [<c044b608>] ? do_softirq+0x49/0x7f
 [<c044b696>] ? ksoftirqd+0x58/0xd6
 [<c045cef1>] ? kthread+0x7b/0x80
 [<c045ce76>] ? kthread+0x0/0x80
 [<c040463f>] ? kernel_thread_helper+0x7/0x10
Code: 1f 44 00 00 89 c3 8b 45 08 89 55 e4 88 4d e8 65 8b 15 14 00 00 00 89 55 f0
31 d2 89 45 e0 89 d8 e8 05 ff ff ff 83 7b 0c 00 75 04 <0f> 0b eb fe 8d 55 ec 89
d8 e8 59 ff ff ff 83 3b 00 89 c6 74 10 
EIP: [<c04503fc>] __mod_timer+0x35/0x110 SS:ESP 0068:f6cdde6c
---[ end trace 8eed33dc7b203ed4 ]---
Comment 1 Josh Stone 2009-09-18 20:06:36 UTC
(In reply to comment #0)
> Something is calling mod_timer with a timer->function==NULL.

This seems to indicate either memory corruption, or the memory was freed and
cleared by the next owner.  It seems like particularly poor timing for the
function pointer to have been valid enough to enter the handler but invalid at
the end of the handler.

> It appears as if the _stp_kill_time function is needlessly racy
> (amongst the stp_timer_reregister flag, which should probably be
> an atomic_t), and the del_timer_sync()'s.  It wouldn't hurt to
> plop a synchronize_sched() in there too before the free_percpu
> goo.

Can you get a crash dump?  I'd like to confirm that _stp_kill_time was actually
attempted, possibly by looking at the backtraces on other cpus and checking if
stp_time==NULL.

The promises of del_timer_sync when it returns are that the handler is not
active and the timer is not queued.  I think this actually makes the reregister
flag superfluous.  It should then be perfectly safe to free the memory, unless
the for_each_online_cpu somehow missed one of the timers...
Comment 2 Frank Ch. Eigler 2009-11-05 14:47:45 UTC
> > Something is calling mod_timer with a timer->function==NULL.
> 
> This seems to indicate either memory corruption, or the memory was freed and
> cleared by the next owner.

I wonder if this was another case of bug #10854.
Comment 3 Frank Ch. Eigler 2010-07-08 14:56:49 UTC
Not recently seen.
Comment 4 Frank Ch. Eigler 2010-10-14 20:47:17 UTC
http://sourceware.org/ml/systemtap/2010-q4/msg00034.html
appears to be a new sighting.
Comment 5 Frank Ch. Eigler 2010-11-27 23:54:44 UTC
commit 3fd1c490 switches to atomic_t flags to the timer callback functions
to defeat mod_timer() calls that might overlap a del_timer_sync.
commit f2b610b does likewise for the transport layer timers.