Something is calling mod_timer with a timer->function==NULL. It appears as if the _stp_kill_time function is needlessly racy (amongst the stp_timer_reregister flag, which should probably be an atomic_t), and the del_timer_sync()'s. It wouldn't hurt to plop a synchronize_sched() in there too before the free_percpu goo. This was on an i686 rawhide 4cpu VM, 2.6.31-12.fc12.i686 kernel, according to the systemtap.log, last attempted tes was bz5274.exp, and here's the BUG message itself: ------------[ cut here ]------------ kernel BUG at kernel/timer.c:619! invalid opcode: 0000 [#2] SMP last sysfs file: /sys/module/virtio_pci/sections/__mcount_loc Modules linked in: stap_d91e5e64f72dff7c4c94d049e84aaaab_10830 uprobes netconsole iptable_nat nf_nat nfsd exportfs configfs nfs lockd fscache nfs_acl auth_rpcgss sunrpc ipv6 dm_multipath uinput ppdev parport_pc parport virtio_balloon 8139too 8139cp mii i2c_piix4 i2c_core floppy virtio_pci [last unloaded: stap_f28f0139d183e5474e8c3292a39d7c78_2115] Pid: 10, comm: ksoftirqd/2 Tainted: G D W (2.6.31-12.fc12.i686 #1) EIP: 0060:[<c04503fc>] EFLAGS: 00010246 CPU: 2 EIP is at __mod_timer+0x35/0x110 EAX: 00000000 EBX: c3909f7c ECX: 00000000 EDX: 00000000 ESI: c3909f40 EDI: 00001ecc EBP: f6cdde8c ESP: f6cdde6c DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 Process ksoftirqd/2 (pid: 10, ti=f6cdc000 task=f6ce0000 task.ti=f6cdc000) Stack: 00000000 00bdafbe c0605a00 42ad3970 42ad3970 00000246 c3909f40 00001ecc <0> f6cdde9c c0450547 00000000 42ad3970 f6cdded8 f7d8c15b 9e2aa125 11641494 <0> 00000000 c04500d6 33171125 00000000 2e380e90 4ab18a28 33171125 42ad3970 Call Trace: [<c0605a00>] ? spin_bug+0x33/0xae [<c0450547>] ? mod_timer+0x2c/0x44 [<f7d8c15b>] ? __stp_time_timer_callback+0xd4/0xed [stap_d91e5e64f72dff7c4c94d049e84aaaab_10830] [<c04500d6>] ? run_timer_softirq+0x133/0x204 [<c0450121>] ? run_timer_softirq+0x17e/0x204 [<c04500d6>] ? run_timer_softirq+0x133/0x204 [<f7d8c087>] ? __stp_time_timer_callback+0x0/0xed [stap_d91e5e64f72dff7c4c94d049e84aaaab_10830] [<c044b4f5>] ? __do_softirq+0xc8/0x192 [<c044b63e>] ? ksoftirqd+0x0/0xd6 [<c044b608>] ? do_softirq+0x49/0x7f [<c044b696>] ? ksoftirqd+0x58/0xd6 [<c045cef1>] ? kthread+0x7b/0x80 [<c045ce76>] ? kthread+0x0/0x80 [<c040463f>] ? kernel_thread_helper+0x7/0x10 Code: 1f 44 00 00 89 c3 8b 45 08 89 55 e4 88 4d e8 65 8b 15 14 00 00 00 89 55 f0 31 d2 89 45 e0 89 d8 e8 05 ff ff ff 83 7b 0c 00 75 04 <0f> 0b eb fe 8d 55 ec 89 d8 e8 59 ff ff ff 83 3b 00 89 c6 74 10 EIP: [<c04503fc>] __mod_timer+0x35/0x110 SS:ESP 0068:f6cdde6c ---[ end trace 8eed33dc7b203ed4 ]---
(In reply to comment #0) > Something is calling mod_timer with a timer->function==NULL. This seems to indicate either memory corruption, or the memory was freed and cleared by the next owner. It seems like particularly poor timing for the function pointer to have been valid enough to enter the handler but invalid at the end of the handler. > It appears as if the _stp_kill_time function is needlessly racy > (amongst the stp_timer_reregister flag, which should probably be > an atomic_t), and the del_timer_sync()'s. It wouldn't hurt to > plop a synchronize_sched() in there too before the free_percpu > goo. Can you get a crash dump? I'd like to confirm that _stp_kill_time was actually attempted, possibly by looking at the backtraces on other cpus and checking if stp_time==NULL. The promises of del_timer_sync when it returns are that the handler is not active and the timer is not queued. I think this actually makes the reregister flag superfluous. It should then be perfectly safe to free the memory, unless the for_each_online_cpu somehow missed one of the timers...
> > Something is calling mod_timer with a timer->function==NULL. > > This seems to indicate either memory corruption, or the memory was freed and > cleared by the next owner. I wonder if this was another case of bug #10854.
Not recently seen.
http://sourceware.org/ml/systemtap/2010-q4/msg00034.html appears to be a new sighting.
commit 3fd1c490 switches to atomic_t flags to the timer callback functions to defeat mod_timer() calls that might overlap a del_timer_sync. commit f2b610b does likewise for the transport layer timers.