Bug 14630

Summary: intermittent crash in lookup_bad_addr
Product: systemtap Reporter: Frank Ch. Eigler <fche>
Component: runtimeAssignee: Unassigned <systemtap>
Status: RESOLVED FIXED    
Severity: normal    
Priority: P2    
Version: unspecified   
Target Milestone: ---   
Host: Target:
Build: Last reconfirmed:

Description Frank Ch. Eigler 2012-09-26 17:44:23 UTC
On a x86-64 rhel6 vm, the following was observed during a make installcheck against git stap:

stap_fda8d28487545b13195688c63e0e9bc_13886: systemtap: 2.0/0.152, base: ffffffffa0705000, memory: 57data/29text/20ctx/2058net/34alloc kb, probes: 28
stap_ad67448ddd1c1c3d62320c81d6c3737_14107: systemtap: 2.0/0.152, base: ffffffffa0722000, memory: 57data/29text/20ctx/2058net/34alloc kb, probes: 28
stap_92468660093a64382aac155c65fae81c_14360: systemtap: 2.0/0.152, base: ffffffffa03ef000, memory: 78data/59text/20ctx/2058net/178alloc kb, probes: 28
BUG: sleeping function called from invalid context at arch/x86/mm/fault.c:1112
in_atomic(): 0, irqs_disabled(): 1, pid: 14360, name: stapio
INFO: lockdep is turned off.
irq event stamp: 0
hardirqs last  enabled at (0): [<(null)>] (null)
hardirqs last disabled at (0): [<ffffffff8106cf44>] copy_process+0x604/0x15d0
softirqs last  enabled at (0): [<ffffffff8106cf44>] copy_process+0x604/0x15d0
softirqs last disabled at (0): [<(null)>] (null)
Pid: 14360, comm: stapio Not tainted 2.6.32-279.5.2.el6.x86_64.debug #1
Call Trace:
 [<ffffffff810ab580>] ? print_irqtrace_events+0xd0/0xe0
 [<ffffffff810582b7>] ? __might_sleep+0xf7/0x130
 [<ffffffff810444a4>] ? __do_page_fault+0x114/0x4e0
 [<ffffffff8118522a>] ? mem_cgroup_bad_page_check+0x1a/0x30
 [<ffffffff8113b57f>] ? get_page_from_freelist+0x35f/0x8e0
 [<ffffffff81534c9e>] ? do_page_fault+0x3e/0xa0
 [<ffffffff81531dd5>] ? page_fault+0x25/0x30
 [<ffffffffa03f01cd>] ? lookup_bad_addr+0xcd/0xe0 [stap_92468660093a64382aac155c65fae81c_14360]
 [<ffffffffa03f24d0>] ? probe_4103+0x230/0x2b0 [stap_92468660093a64382aac155c65fae81c_14360]
 [<ffffffffa03f2405>] ? probe_4103+0x165/0x2b0 [stap_92468660093a64382aac155c65fae81c_14360]
 [<ffffffffa03fa0d8>] ? enter_real_tracepoint_probe_0+0x148/0x300 [stap_92468660093a64382aac155c65fae81c_14360]
 [<ffffffffa03ef1ee>] ? enter_tracepoint_probe_0+0xe/0x10 [stap_92468660093a64382aac155c65fae81c_14360]
 [<ffffffff8152e3de>] ? thread_return+0x709/0x7db
 [<ffffffff8152fb55>] ? schedule_hrtimeout_range+0x145/0x170
 [<ffffffffa03f0753>] ? _stp_ctl_poll_cmd+0x33/0x60 [stap_92468660093a64382aac155c65fae81c_14360]
 [<ffffffff811ac629>] ? poll_schedule_timeout+0x39/0x60
 [<ffffffff811acf19>] ? do_select+0x5a9/0x6d0
 [<ffffffff811ac970>] ? do_select+0x0/0x6d0
 [<ffffffff811ac720>] ? __pollwait+0x0/0xf0

This was apparently the systemtap.stress/conversions_trace.stp test case.
Comment 1 Frank Ch. Eigler 2012-09-26 17:50:25 UTC
With $SYSTEMTAP_SYNC, another run produced a similar crash:

stap_2e79445e6d8698522a9f86c95ee78935_4327: systemtap: 2.0/0.152, base: ffffffffa042c000, memory: 57data/29text/20ctx/2058net/34alloc kb, probes: 28
stap_67912742e5bc7aca566eb18b6434082a__5618: systemtap: 2.0/0.152, base: ffffffffa0449000, memory: 78data/59text/20ctx/2058net/178alloc kb, probes: 28
BUG: sleeping function called from invalid context at arch/x86/mm/fault.c:1112
in_atomic(): 0, irqs_disabled(): 1, pid: 5618, name: stapio
3 locks held by stapio/5618:
 #0:  (&rq->lock){-.-.-.}, at: [<ffffffff8152e355>] thread_return+0x680/0x7db
 #1:  (& global.s_hits_lock){......}, at: [<ffffffffa0449e75>] stp_lock_probe+0x55/0xc0 [stap_67912742e5bc7aca566eb18b6434082a__5618]
 #2:  (&mm->mmap_sem){++++++}, at: [<ffffffff81044489>] __do_page_fault+0xf9/0x4e0
irq event stamp: 102228
hardirqs last  enabled at (102227): [<ffffffff81531230>] _spin_unlock_irqrestore+0x40/0x80
hardirqs last disabled at (102228): [<ffffffff8153152f>] _spin_lock_irq+0x1f/0x80
softirqs last  enabled at (101522): [<ffffffff81077831>] __do_softirq+0x151/0x210
softirqs last disabled at (101261): [<ffffffff8100c30c>] call_softirq+0x1c/0x30
Pid: 5618, comm: stapio Not tainted 2.6.32-279.5.2.el6.x86_64.debug #1
Call Trace:
 [<ffffffff810ab580>] ? print_irqtrace_events+0xd0/0xe0
 [<ffffffff810582b7>] ? __might_sleep+0xf7/0x130
 [<ffffffff810444a4>] ? __do_page_fault+0x114/0x4e0
 [<ffffffff810aec3f>] ? check_irq_usage+0x9f/0xf0
 [<ffffffff810b07aa>] ? __lock_acquire+0xaca/0x1570
 [<ffffffff810396d8>] ? pvclock_clocksource_read+0x58/0xd0
 [<ffffffffa044a1cd>] ? lookup_bad_addr+0xcd/0xe0 [stap_67912742e5bc7aca566eb18b6434082a__5618]
 [<ffffffff810387cc>] ? kvm_clock_read+0x1c/0x20
 [<ffffffff81012da9>] ? sched_clock+0x9/0x10
 [<ffffffff8109e885>] ? sched_clock_local+0x25/0x90


where the instruction in question points to the _read_unlock() call.
I wonder if perhaps this is a lockdep problem.
Comment 2 Frank Ch. Eigler 2012-09-28 21:45:54 UTC
commit 8fee9cc