17055 – _stp_perf_read needs a sleepable context

Bug 17055 - _stp_perf_read needs a sleepable context

Summary: _stp_perf_read needs a sleepable context

Status:	RESOLVED FIXED

Alias:	None

Product:	systemtap
Classification:	Unclassified
Component:	translator (show other bugs)
Version:	unspecified

Importance:	P2 normal
Target Milestone:	---
Assignee:	David Smith

URL:
Keywords:

Duplicates (1):	19955 (view as bug list)
Depends on:
Blocks:

Reported:	2014-06-13 21:29 UTC by Josh Stone
Modified:	2016-05-23 15:27 UTC (History)
CC List:	2 users (show)

See Also:
Host:
Target:
Build:
Last reconfirmed:

Attachments
rough patch (3.83 KB, patch) 2016-05-05 14:16 UTC, David Smith	Details \| Diff
better patch (4.28 KB, patch) 2016-05-20 20:19 UTC, David Smith	Details \| Diff
View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Josh Stone 2014-06-13 21:29:41 UTC

On rawhide kernel 3.16.0-0.rc0.git5.1.fc21.x86_64:

> BUG: sleeping function called from invalid context at /usr/local/share/systemtap/runtime/linux/perf.c:262
> in_atomic(): 1, irqs_disabled(): 0, pid: 20706, name: towers.x
> 1 lock held by towers.x/20706:
>  #0:  (&uprobe->register_rwsem){++++++}, at: [<ffffffff811b1a9f>] uprobe_notify_resume+0x39f/0x9a0
> CPU: 0 PID: 20706 Comm: towers.x Tainted: G           OE 3.16.0-0.rc0.git5.1.fc21.x86_64 #1
> Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
>  0000000000000000 000000004a950067 ffff880026543dc8 ffffffff817f6326
>  0000000000000000 ffff880026543df0 ffffffff810d0644 0000000000000001
>  ffffffffa03f30c0 ffffffffa03f3028 ffff880026543e20 ffffffffa03e2e42
> Call Trace:
>  [<ffffffff817f6326>] dump_stack+0x4d/0x66
>  [<ffffffff810d0644>] __might_sleep+0x184/0x240
>  [<ffffffffa03e2e42>] _stp_perf_read.isra.60.part.61+0x32/0x80 [stap_3299c82377bb59db6a0484698cffdac_20704]
>  [<ffffffffa03eb1f7>] probe_2262+0x27/0x100 [stap_3299c82377bb59db6a0484698cffdac_20704]
>  [<ffffffffa03e7739>] stapiu_probe_prehandler+0x249/0x450 [stap_3299c82377bb59db6a0484698cffdac_20704]
>  [<ffffffff811b1ad0>] uprobe_notify_resume+0x3d0/0x9a0
>  [<ffffffff81266574>] ? mntput+0x24/0x40
>  [<ffffffff810fa87d>] ? trace_hardirqs_on_caller+0x15d/0x200
>  [<ffffffff81019d30>] do_notify_resume+0x80/0x90
>  [<ffffffff81800050>] paranoid_userspace+0x4b/0x5a

There's a might_sleep() in _stp_perf_read.  This is right before calling perf_event_read_value(), which calls mutex_lock() with its own might_sleep().

Probe handlers are never sleepable.  At a minimum, we always have preempt_disable() as we grab the context structure.  Uprobe handlers actually are called in a sleepable context, but we'll have to take those sleepy actions *before* any preempt or irqsave happens, and thus before we even have a context struct.

Comment 1 David Smith 2016-05-05 14:05:21 UTC

*** Bug 19955 has been marked as a duplicate of this bug. ***

Comment 2 David Smith 2016-05-05 14:16:42 UTC

Created attachment 9237 [details]
rough patch

Here's a rough patch that tries to fix this problem. The tricky part here is that we now read the perf counters before we've got a context structure. For storage, I'm using a per-cpu variable.

With this patch, all the perf tests pass, except for the last subtest in perf.exp. In that subtest, it runs a script twice, the 2nd time swapping the order of declarations of the perf counters. When this is run with the new code, the subtest fails because the values returned are different between the 2 runs. I don't know what is going on there.

There are several "FIXME" comments scattered throughout the patch that need reviewing.

Comment 3 Josh Stone 2016-05-05 17:20:01 UTC

(In reply to David Smith from comment #2)
> For storage, I'm using a per-cpu variable.

If a probe handler is re-entered, they may clobber each other.  This won't happen for uprobes, but could for kprobes.

(@perf appears to be accepted in a kprobe, but fails at runtime for me in a quick test.  If this is not supposed to be supported, it needs to fail earlier.)

But doing anything with per-cpu or smp_processor_id() outside of atomic context is risky.  If the perf read did sleep, or you were otherwise preempted, you might wake up on another cpu altogether.  (as raised in one of your FIXMEs)

Comment 4 David Smith 2016-05-20 20:19:08 UTC

Created attachment 9282 [details]
better patch

Here's the patch I'm testing now. This version puts the perf values on the stack, with an overridable maximum of 16 perf values per script.

So far in testing I haven't found any problems.

Comment 5 David Smith 2016-05-23 15:27:25 UTC

A slightly modified version of the last patched was checked in as commit 285ca42. No problems were found in testing.