On rawhide kernel 3.16.0-0.rc0.git5.1.fc21.x86_64: > BUG: sleeping function called from invalid context at /usr/local/share/systemtap/runtime/linux/perf.c:262 > in_atomic(): 1, irqs_disabled(): 0, pid: 20706, name: towers.x > 1 lock held by towers.x/20706: > #0: (&uprobe->register_rwsem){++++++}, at: [<ffffffff811b1a9f>] uprobe_notify_resume+0x39f/0x9a0 > CPU: 0 PID: 20706 Comm: towers.x Tainted: G OE 3.16.0-0.rc0.git5.1.fc21.x86_64 #1 > Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 > 0000000000000000 000000004a950067 ffff880026543dc8 ffffffff817f6326 > 0000000000000000 ffff880026543df0 ffffffff810d0644 0000000000000001 > ffffffffa03f30c0 ffffffffa03f3028 ffff880026543e20 ffffffffa03e2e42 > Call Trace: > [<ffffffff817f6326>] dump_stack+0x4d/0x66 > [<ffffffff810d0644>] __might_sleep+0x184/0x240 > [<ffffffffa03e2e42>] _stp_perf_read.isra.60.part.61+0x32/0x80 [stap_3299c82377bb59db6a0484698cffdac_20704] > [<ffffffffa03eb1f7>] probe_2262+0x27/0x100 [stap_3299c82377bb59db6a0484698cffdac_20704] > [<ffffffffa03e7739>] stapiu_probe_prehandler+0x249/0x450 [stap_3299c82377bb59db6a0484698cffdac_20704] > [<ffffffff811b1ad0>] uprobe_notify_resume+0x3d0/0x9a0 > [<ffffffff81266574>] ? mntput+0x24/0x40 > [<ffffffff810fa87d>] ? trace_hardirqs_on_caller+0x15d/0x200 > [<ffffffff81019d30>] do_notify_resume+0x80/0x90 > [<ffffffff81800050>] paranoid_userspace+0x4b/0x5a There's a might_sleep() in _stp_perf_read. This is right before calling perf_event_read_value(), which calls mutex_lock() with its own might_sleep(). Probe handlers are never sleepable. At a minimum, we always have preempt_disable() as we grab the context structure. Uprobe handlers actually are called in a sleepable context, but we'll have to take those sleepy actions *before* any preempt or irqsave happens, and thus before we even have a context struct.
*** Bug 19955 has been marked as a duplicate of this bug. ***
Created attachment 9237 [details] rough patch Here's a rough patch that tries to fix this problem. The tricky part here is that we now read the perf counters before we've got a context structure. For storage, I'm using a per-cpu variable. With this patch, all the perf tests pass, except for the last subtest in perf.exp. In that subtest, it runs a script twice, the 2nd time swapping the order of declarations of the perf counters. When this is run with the new code, the subtest fails because the values returned are different between the 2 runs. I don't know what is going on there. There are several "FIXME" comments scattered throughout the patch that need reviewing.
(In reply to David Smith from comment #2) > For storage, I'm using a per-cpu variable. If a probe handler is re-entered, they may clobber each other. This won't happen for uprobes, but could for kprobes. (@perf appears to be accepted in a kprobe, but fails at runtime for me in a quick test. If this is not supposed to be supported, it needs to fail earlier.) But doing anything with per-cpu or smp_processor_id() outside of atomic context is risky. If the perf read did sleep, or you were otherwise preempted, you might wake up on another cpu altogether. (as raised in one of your FIXMEs)
Created attachment 9282 [details] better patch Here's the patch I'm testing now. This version puts the perf values on the stack, with an overridable maximum of 16 perf values per script. So far in testing I haven't found any problems.
A slightly modified version of the last patched was checked in as commit 285ca42. No problems were found in testing.