3215 – del_timer_sync causes softlockups

Status:	RESOLVED FIXED

Alias:	None

Product:	systemtap
Classification:	Unclassified
Component:	runtime
Version:	unspecified

Importance:	P1 normal
Target Milestone:	---
Assignee:	Josh Stone

URL:
Keywords:

Depends on:
Blocks:

Reported:	2006-09-15 16:43 UTC by Martin Hunt
Modified:	2006-09-26 17:25 UTC
CC List:	0 users

See Also:
Host:
Target:
Build:
Last reconfirmed:

Description Martin Hunt 2006-09-15 16:43:52 UTC

Need to investigate further.

To reproduce, rmmod a systemtap module on an SMP system. There is a good chance
this will cause a crash, with a soft lockup trace like the following:

Sep 14 09:52:23 dragon kernel:  <c044a98a> softlockup_tick+0xad/0xc4  <c042d858>
update_process_times+0x39/0x5c
Sep 14 09:52:23 dragon kernel:  <c0418af3> smp_apic_timer_interrupt+0x5a/0x63 
<c040490f> apic_timer_interrupt+0x1f/0x24
Sep 14 09:52:23 dragon kernel:  <c042d550> lock_timer_base+0x27/0x2f  <c042d569>
try_to_del_timer_sync+0x11/0x4a
Sep 14 09:52:23 dragon kernel:  <c042d5ac> del_timer_sync+0xa/0x10  <f8e62ff7>
_stp_kill_time+0x21/0x41 [stap_2872]
Sep 14 09:52:23 dragon kernel:  <f8e63065> _stp_cleanup_and_exit+0x4e/0x62
[stap_2872]  <c04465b4> stop_machine_run+0x2e/0x34
Sep 14 09:52:23 dragon kernel:  <f8e63086> _stp_transport_close+0xd/0x5f
[stap_2872]  <c043eb8b> sys_delete_module+0x192/0x1bb
Sep 14 09:52:23 dragon kernel:  <c045be81> do_munmap+0x196/0x1af  <c0403e3f>
syscall_call+0x7/0xb

Culprit is in runtime/time.c (_stp_kill_time). I've been successfully running
this rewrite of that function, but it is an ugly hack.

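	/* Workaround: poll with del_timer() instead of calling del_timer_sync(). */
	for_each_online_cpu(cpu) {
		stp_time_t *time = &per_cpu(stp_time, cpu);
		int retries = 0;
		/* del_timer() returns 0 if the timer was not pending; retry, but give up eventually. */
		while (!del_timer(&time->timer)) {
			retries++;
			if (retries > 1024) {
				printk("Exceeded retry count in _stp_kill_time\n");
				break;
			}
		}
	}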

Need to clean up and understand this better. See also the possibly related bug
http://sources.redhat.com/bugzilla/show_bug.cgi?id=2989
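
For illustration only, here is a minimal userspace sketch of the bounded-retry pattern used in the workaround. It is not the runtime code. struct mock_timer, mock_del_timer() and kill_timer_bounded() are hypothetical names invented for this sketch. The mock follows the del_timer() return convention (1 if a pending timer was deactivated, 0 otherwise). It also assumes that a running handler eventually re-arms the timer, which is an assumption about why the retry loop succeeds in practice.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for a kernel timer; not a kernel type. */
struct mock_timer {
	int pending;        /* 1 while the timer is queued to fire */
	int handler_ticks;  /* polls left before the running handler re-arms it */
};

/*
 * Models the del_timer() return convention: 1 if a pending timer was
 * deactivated, 0 if the timer was not pending. Each failed call also
 * advances the simulated handler by one tick.
 */
static int mock_del_timer(struct mock_timer *t)
{
	assert(t != NULL);
	if (t->pending) {
		t->pending = 0;
		return 1;
	}
	if (t->handler_ticks > 0 && --t->handler_ticks == 0)
		t->pending = 1;   /* handler finished and re-armed the timer */
	return 0;
}

/*
 * Bounded retry in the style of the workaround: returns the number of
 * failed attempts before success, or -1 if the retry limit was exceeded.
 */
static int kill_timer_bounded(struct mock_timer *t, int max_retries)
{
	int retries = 0;

	while (!mock_del_timer(t)) {
		retries++;
		if (retries > max_retries)
			return -1;
	}
	return retries;
}
```

In this model, a timer that never becomes pending again (pending and handler_ticks both 0) makes kill_timer_bounded() return -1 rather than spin forever. That is the one property the retry limit buys over an unbounded wait.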

Comment 1 Martin Hunt 2006-09-26 17:25:32 UTC

I've checked in some changes to how the timers are initialized and deleted. I
also made the percpu allocations dynamic so we can run multiple systemtap
modules safely. This seems to have fixed all the timer-related problems I was
seeing, including this one.