Bug 18889 - lost ability to probe kernel module initializers
Summary: lost ability to probe kernel module initializers
Status: RESOLVED FIXED
Alias: None
Product: systemtap
Classification: Unclassified
Component: runtime (show other bugs)
Version: unspecified
: P2 normal
Target Milestone: ---
Assignee: Frank Ch. Eigler
URL:
Keywords:
: 19186 (view as bug list)
Depends on:
Blocks:
 
Reported: 2015-08-28 22:24 UTC by Frank Ch. Eigler
Modified: 2015-12-04 20:34 UTC (History)
2 users (show)

See Also:
Host:
Target:
Build:
Last reconfirmed:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Frank Ch. Eigler 2015-08-28 22:24:29 UTC
On recent kernels, a probe requested on module .init.* functions no longer receives hits.  This appears to be because the kernel MODULE_STATE_COMING notification is coming too early (before the module_sect_attrs are set), so we cannot relocate symbols against .init.*, and thus cannot place probes.  The bz6503.* test case was too permissive to let us discover this immediately.  The kernel change that broke stap has not yet been identified.
Comment 1 Frank Ch. Eigler 2015-09-02 15:57:53 UTC
commit 0caa975190e4d2b19e7b65a0b23a7700522aa5d7
commit 2278079efc01124dd509241f6c6eadbd6e19cb2a
commit 2e67c14dad1c661d2ce0b0ed218b371c1af218ba
commit 578f5f6f6792fc92d74539a927cd85de5bcbb4dd
Comment 2 David Smith 2015-11-18 19:06:30 UTC
*** Bug 19186 has been marked as a duplicate of this bug. ***
Comment 3 David Smith 2015-11-18 19:08:02 UTC
Unfortunately, the fix for this bug caused bug #19186 (getting a 'BUG: sleeping fujnction called from invalid context' error on rawhide). The kprobes registration function sleeps, and you can't sleep in a tracepoint handler. We'll have to fix this bug a different way.
Comment 4 David Smith 2015-11-18 19:26:52 UTC
Commit fb54a58b93f37aba51d0ecd5b8924e617b229523 reverts original commit 2e67c14dad1c661d2ce0b0ed218b371c1af218ba (the 3rd commit listed in comment #1).

Commit 04ffc2184ffedbb3daa44ffbc0cc2ac956d304be reverts original commit 2278079efc01124dd509241f6c6eadbd6e19cb2a (the 2nd commit listed in comment #1).

The other 2 commits in the original fix are fine, the first adds some debug messages and the last modifies some test cases.

The new fix in commit 24897818cdcf1994a52890ffe756226422df75f3 switches to using symbol_name+offset probing for kernel modules. This appears to be the only way to be able to probe module initializers in recent kernels.

Note that this work is partially based on a patch by James Bottomley from the following link:

<https://sourceware.org/ml/systemtap/2008-q3/msg00182.html>
Comment 5 David Smith 2015-12-01 18:54:35 UTC
Sigh. I've found a complication here. On 3.10.0-229.el7.x86_64 (RHEL 7.1), modules_out_of_tree.exp works correctly. On 3.10.0-327.el7.x86_64 (RHEL 7.2), modules_out_of_tree.exp fails:

====
Running ../../src/testsuite/systemtap.base/modules_out_of_tree.exp ...
PASS: modules_out_of_tree (compilation)
PASS: modules_out_of_tree (listing)
spawn stap -e probe module("/root/rhel7-64/testsuite/modules_out_of_tree/stap_modules_out_of_tree.ko").function("stap_modules_out_of_tree_init") { println("hit") } -c insmod /root/rhel7-64/testsuite/modules_out_of_tree/stap_modules_out_of_tree.ko && rmmod /root/rhel7-64/testsuite/modules_out_of_tree/stap_modules_out_of_tree.ko
FAIL: modules_out_of_tree (probing)
testcase ../../src/testsuite/systemtap.base/modules_out_of_tree.exp completed in 3 seconds
====

Basically the module init function isn't getting a probe hit.

What's odd here is that on this kernel, the new code I added shouldn't be getting used - it only kicks in at kernel version 3.16. I've tried switching to the new code for the failing kernel, but that causes the system to reset.

I need to do some more debugging here to figure out what is going on.
Comment 6 David Smith 2015-12-01 19:09:53 UTC
For reference's sake, here's the console output I get if I switch to the new code on 3.10.0-327.el7.x86_64:

====
[14388.754591] BUG: unable to handle kernel paging request at ffffffffa0493978
[14388.762378] IP: [<ffffffff810ee2f9>] module_kallsyms_lookup_name+0x39/0xc0
[14388.770066] PGD 194d067 PUD 194e063 PMD 4664c0067 PTE 800000085e077161
[14388.777395] Oops: 0003 [#1] SMP 
[14388.781016] Modules linked in: stap_b83341283b00405072ddbc8efa3988d_54172(OE) intel_powerclamp coretemp intel_rapl kvm_intel kvm crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd iTCO_wdt ipmi_ssif iTCO_vendor_support mei_me pcspkr sb_edac sg lpc_ich mei i2c_i801 edac_core mfd_core shpchp wmi ipmi_si ipmi_msghandler acpi_power_meter acpi_pad nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sr_mod cdrom sd_mod crc_t10dif crct10dif_generic mgag200 syscopyarea sysfillrect sysimgblt i2c_algo_bit drm_kms_helper ttm ixgbe crct10dif_pclmul ahci mdio crct10dif_common libahci ptp drm crc32c_intel pps_core libata i2c_core dca dm_mirror dm_region_hash dm_log dm_mod [last unloaded: stap_1f03fd0c37a653862845287dd5fe236_51524]
[14388.857926] CPU: 29 PID: 54172 Comm: stapio Tainted: G           OE  ------------   3.10.0-327.el7.x86_64 #1
[14388.868900] Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS GRNDSDP1.86B.0044.R00.1501191641 01/19/2015
[14388.880261] task: ffff880865385080 ti: ffff8808243ec000 task.ti: ffff8808243ec000
[14388.888616] RIP: 0010:[<ffffffff810ee2f9>]  [<ffffffff810ee2f9>] module_kallsyms_lookup_name+0x39/0xc0
[14388.899021] RSP: 0018:ffff8808243efce8  EFLAGS: 00010246
[14388.904951] RAX: ffffffffa0493978 RBX: ffffffffa0493978 RCX: 0000000000000000
[14388.912917] RDX: 000000000000003a RSI: 0000000000000000 RDI: ffffffffa0493960
[14388.920882] RBP: ffff8808243efd08 R08: 0000000000000001 R09: ffffffff8184d42c
[14388.928848] R10: ffffffff8184d42c R11: ffff8808243efb5e R12: ffffffffa0493960
[14388.936814] R13: ffffffffa0493960 R14: 00000000000118dd R15: 0000000000000007
[14388.944780] FS:  00007f566bac2740(0000) GS:ffff88046f5e0000(0000) knlGS:0000000000000000
[14388.953815] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[14388.960229] CR2: ffffffffa0493978 CR3: 0000000464aa1000 CR4: 00000000001407e0
[14388.968195] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[14388.976160] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[14388.984126] Stack:
[14388.986367]  0000000000000246 00000000c69f8649 00000000000118dd 00000000000d2e14
[14388.994666]  ffff8808243efdc0 ffffffff810eeeb8 696c5f6b72625f5f 615f74670074696d
[14389.002965]  6e660000636f6c6c 7473690000000000 61635f7473696c00 0036000064656c6c
[14389.011265] Call Trace:
[14389.013996]  [<ffffffff810eeeb8>] kallsyms_lookup_name+0x88/0xb0
[14389.020704]  [<ffffffff816416e5>] kprobe_addr+0x35/0x60
[14389.026536]  [<ffffffff8164326e>] register_kprobe+0x1e/0x5d0
[14389.032855]  [<ffffffffa048d5fb>] stapkp_register_probe+0x10b/0x180 [stap_b83341283b00405072ddbc8efa3988d_54172]
[14389.044218]  [<ffffffffa048fc01>] _stp_handle_start.constprop.29+0x281/0x5e0 [stap_b83341283b00405072ddbc8efa3988d_54172]
[14389.056455]  [<ffffffffa0490242>] _stp_ctl_write_cmd+0x2e2/0x570 [stap_b83341283b00405072ddbc8efa3988d_54172]
[14389.067526]  [<ffffffff811de5cd>] vfs_write+0xbd/0x1e0
[14389.073262]  [<ffffffff811df06f>] SyS_write+0x7f/0xe0
[14389.078903]  [<ffffffff81645909>] system_call_fastpath+0x16/0x1b
====
Comment 7 David Smith 2015-12-02 14:49:22 UTC
It appears that adding a probe with perf on the module initializer works on 3.10.0-327.el7.x86_64:

====
# perf probe --module=./stap_modules_out_of_tree.ko --add=stapmod_init_perf=stap_modules_out_of_tree_init
Added new event:
  probe:stapmod_init_perf (on stap_modules_out_of_tree_init in stap_modules_out_of_tree)

You can now use it in all perf tools, such as:

	perf record -e probe:stapmod_init_perf -aR sleep 1

# perf probe --list
  probe:stapmod_init_perf (on stap_modules_out_of_tree_init in stap_modules_out_
# perf record --event=probe:stapmod_init_perf -a insmod ./stap_modules_out_of_tree.ko ; rmmod stap_modules_out_of_tree
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.231 MB perf.data (1 samples) ]
# perf report --stdio
# To display the perf.data header info, please use --header/--header-only option
#
# Samples: 1  of event 'probe:stapmod_init_perf'
# Event count (approx.): 1
#
# Overhead  Command  Shared Object     Symbol                
# ........  .......  ................  ......................
#
   100.00%  insmod   [sysfillrect]     [k] 0x0000000000005071


#
# (For a higher level overview, try: perf report --sort comm,dso)
#
====

So, in theory anyway it ought to be possible for systemtap to do it.
Comment 8 David Smith 2015-12-02 21:03:42 UTC
OK, here's why I'm getting a crash when switching to the new code (mentioned in comment #6). The version of module_kallsyms_lookup_name() in 3.10.0-327.el7.x86_64 does this:

====
unsigned long module_kallsyms_lookup_name(const char *name)
{
	struct module *mod;
	char *colon;
	unsigned long ret = 0;

	/* Don't lock: we're in enough trouble already. */
	preempt_disable();
	if ((colon = strchr(name, ':')) != NULL) {
		*colon = '\0';
		if ((mod = find_module(name)) != NULL)
			ret = mod_find_symname(mod, colon+1);
		*colon = ':';
	}
	...
====

i.e. it actually modifies the memory at 'name'. In systemtap, that symbol name is in initialized data, so you can't write to that memory. In later versions of the kernel, module_kallsyms_lookup_name() is changed to no longer change the input string.

perf uses kstrdup() to make a copy of the symbol name, which has the side-effect that you can write to that memory.
Comment 9 David Smith 2015-12-02 21:16:32 UTC
It looks like the following kernel commit fixed module_kallsyms_lookup_name():

====
commit 4f6de4d51f4a3ab06a85e91e708cc89a513ef30c
Author: Mathias Krause <minipli@googlemail.com>
Date:   Tue Jul 2 15:35:11 2013 +0930

    module: don't modify argument of module_kallsyms_lookup_name()
    
    If we pass a pointer to a const string in the form "module:symbol"
    module_kallsyms_lookup_name() will try to split the string at the colon,
    i.e., will try to modify r/o data. That will, in fact, fail on a kernel
    with enabled CONFIG_DEBUG_RODATA.
    
    Avoid modifying the passed string in module_kallsyms_lookup_name(),
    modify find_module_all() instead to pass it the module name length.
====

This commit first appeared in kernel version 3.11.
Comment 10 David Smith 2015-12-04 20:34:02 UTC
Re-re-fixed in commit c788eb1. This commit switches kernels >= 3.10 (all RHEL7 kernels) to the new module refresh code and adds a workaround for the kernel bug mentioned in comment #6 and comment #8. We work around the bug by 2 methods:

1) For kernels that support kallsyms_on_each_symbol(), we use that to convert symbol_name+offset probes into address-based probes when the module is loaded. This should be faster as well (see bug #11514 for details).

2) For kernels that don't support kallsyms_on_each_symbol(), which are ppc64 kernels, we workaround the problem by making a copy of the symbol_name in read-write memory before calling register_kprobe().