While running sharedbuf.exp test with latest systemtap and 2.6.29.-rc7 kernel crashes at the time of loading module with unknown symbol [root@mjs22lp5 systemtap-latest-5Mar]# make installcheck RUNTESTFLAGS="sharedbuf.exp" if test \! -e /usr/local/bin/stap; then \ echo /usr/local/bin/stap doesn\'t exist, run make install; \ exit -1; \ fi; \ if test ./stap -nt /usr/local/bin/stap; then \ echo "/usr/local/bin/stap is not recent, run make install"; \ exit -1; \ fi; make -C testsuite installcheck RUNTESTFLAGS="sharedbuf.exp" make[1]: Entering directory `/home/mahesh/systemtap-latest-5Mar/testsuite' make check-DEJAGNU RUNTESTFLAGS="sharedbuf.exp --tool_opts \'install \'" make[2]: Entering directory `/home/mahesh/systemtap-latest-5Mar/testsuite' srcdir=`CDPATH="${ZSH_VERSION+.}:" && cd . && pwd`; export srcdir; \ EXPECT=expect; export EXPECT; \ runtest="env SYSTEMTAP_RUNTIME=/usr/local/share/systemtap/runtime SYSTEMTAP_TAPSET=/usr/local/share/systemtap/tapset LD_LIBRARY_PATH=/usr/local/lib/systemtap CRASH_LIBDIR=/usr/local/lib/systemtap PATH=/usr/local/bin:$PATH SYSTEMTAP_PATH=/usr/local/bin SYSTEMTAP_INCLUDES=/usr/local/include ./execrc runtest"; \ if /bin/sh -c "$runtest --version" > /dev/null 2>&1; then \ exit_status=0; l='systemtap'; for tool in $l; do \ if $runtest --tool $tool --tool_opts \'\' --srcdir $srcdir sharedbuf.exp --tool_opts \'install \'; \ then :; else exit_status=1; fi; \ done; \ else echo "WARNING: could not find \`runtest'" 1>&2; :;\ fi; \ exit $exit_status WARNING: Couldn't find the global config file. kernel location: /boot/vmlinux-2.6.29-rc7 kernel version: 2.6.29-rc7 systemtap location: /usr/local/bin/stap systemtap version: version 0.9/0.140 non-git sources Test Run By root on Fri Mar 6 12:12:42 2009 Native configuration is powerpc64-unknown-linux-gnu === systemtap tests === Schedule of variations: unix Running target unix Using /usr/share/dejagnu/baseboards/unix.exp as board description file for target. Using /usr/share/dejagnu/config/unix.exp as generic interface file for target. Using /home/mahesh/systemtap-latest-5Mar/testsuite/config/unix.exp as tool-and-target-specific interface file. Host: Linux mjs22lp5 2.6.29-rc7 #1 SMP Wed Mar 4 15:11:43 IST 2009 ppc64 ppc64 ppc64 GNU/Linux Snapshot: version 0.9/0.140 non-git sources Distro: Red Hat Enterprise Linux Server release 5.3 (Tikanga) Running /home/mahesh/systemtap-latest-5Mar/testsuite/systemtap.printf/sharedbuf.exp ... mjs22lp5 login: stap_b5c218f09f27482b6ff2b2633e6de861_352: no symbol version for stp_print_flush_test1 stap_b5c218f09f27482b6ff2b2633e6de861_352: Unknown symbol stp_print_flush_test1 Unable to handle kernel paging request for data at address 0xd000000003ee7408 Faulting instruction address: 0xc0000000000b1454 cpu 0x1: Vector: 300 (Data Access) at [c000000009e1b990] pc: c0000000000b1454: .load_module+0x1584/0x1708 lr: c0000000000b143c: .load_module+0x156c/0x1708 sp: c000000009e1bc10 msr: 8000000000009032 dar: d000000003ee7408 dsisr: 40000000 current = 0xc0000000424f4c90 paca = 0xc000000000b13680 pid = 3217, comm = staprun enter ? for help [c000000009e1bd90] c0000000000b1654 .SyS_init_module+0x7c/0x220 [c000000009e1be30] c00000000000855c syscall_exit+0x0/0x40 --- Exception: c00 (System Call) at 000004000018edd0 SP (fffffbfe560) is in userspace 1:mon> t [c000000009e1bd90] c0000000000b1654 .SyS_init_module+0x7c/0x220 [c000000009e1be30] c00000000000855c syscall_exit+0x0/0x40 --- Exception: c00 (System Call) at 000004000018edd0 SP (fffffbfe560) is in userspace 1:mon> e cpu 0x1: Vector: 300 (Data Access) at [c000000009e1b990] pc: c0000000000b1454: .load_module+0x1584/0x1708 lr: c0000000000b143c: .load_module+0x156c/0x1708 sp: c000000009e1bc10 msr: 8000000000009032 dar: d000000003ee7408 dsisr: 40000000 current = 0xc0000000424f4c90 paca = 0xc000000000b13680 pid = 3217, comm = staprun 1:mon> S msr = 8000000000001032 sprg0= 0000000000000000 pvr = 00000000003e0301 sprg1= c000000000b13680 dec = 00000000589dceec sprg2= c000000000b13680 sp = c000000009e1b3b0 sprg3= c000000000b13680 toc = c000000000a7d520 dar = d000000003ee7408 1:mon> r R00 = 0000000000000000 R16 = 0000000000000000 R01 = c000000009e1bc10 R17 = d000000003ee4910 R02 = c000000000a7d520 R18 = 000000000000000c R03 = 0000000000000001 R19 = d000000003e9c5e0 R04 = bbbbbbbbbbbbbbbb R20 = 00000000000000fb R05 = 0000000000000000 R21 = fffffffffffffffe R06 = c00000003b300620 R22 = d000000003e9b879 R07 = c00000003b300658 R23 = c000000000a33938 R08 = 000000000000005a R24 = 000000000000000b R09 = c000000000966d94 R25 = d000000003e70000 R10 = 0000000000000001 R26 = d000000003ee6080 R11 = c00000003b303520 R27 = d000000003e9ba20 R12 = c000000009e1b990 R28 = d000000003ee7210 R13 = c000000000b13680 R29 = d000000003ee7260 R14 = 00000000000000fb R30 = c000000000a017e8 R15 = d000000003e9c620 R31 = fffffffffffffffe pc = c0000000000b1454 .load_module+0x1584/0x1708 lr = c0000000000b143c .load_module+0x156c/0x1708 msr = 8000000000009032 cr = 44000882 ctr = 0000000000000000 xer = 0000000000000001 trap = 300 dar = d000000003ee7408 dsisr = 40000000 1:mon> di c0000000000b1654 c0000000000b1654 3800f000 li r0,-4096 c0000000000b1658 7c7f1b78 mr r31,r3 c0000000000b165c 7fa30040 cmpld cr7,r3,r0 c0000000000b1660 40fd000c ble+ cr7,c0000000000b166c # .SyS_init_module+0x94/0x220 c0000000000b1664 7f83e378 mr r3,r28 c0000000000b1668 4800015c b c0000000000b17c4 # .SyS_init_module+0x1ec/0x220 c0000000000b166c 7f83e378 mr r3,r28 c0000000000b1670 485a6121 bl c000000000657790 # .mutex_unlock+0x0/0x4 c0000000000b1674 60000000 nop c0000000000b1678 eb7e81c0 ld r27,-32320(r30) c0000000000b167c 38800001 li r4,1 c0000000000b1680 7fe5fb78 mr r5,r31 c0000000000b1684 7f63db78 mr r3,r27 c0000000000b1688 4bfe88f1 bl c000000000099f78 # .blocking_notifier_call_chain+0x0/0xc c0000000000b168c 60000000 nop c0000000000b1690 e87f0110 ld r3,272(r31) 1:mon> di c00000000000855c c00000000000855c f86101c8 std r3,456(r1) c000000000008560 782c0464 rldicr r12,r1,0,49 c000000000008564 e9010178 ld r8,376(r1) c000000000008568 710a0002 andi. r10,r8,2 c00000000000856c 41820620 beq c000000000008b8c # unrecov_restore+0x0/0xc c000000000008570 7d4000a6 mfmsr r10 c000000000008574 794a8042 rldicl r10,r10,48,1 c000000000008578 794a8000 rotldi r10,r10,16 c00000000000857c 7d410164 mtmsrd r10,1 c000000000008580 e92c0080 ld r9,128(r12) c000000000008584 3960fdfc li r11,-516 c000000000008588 71203d87 andi. r0,r9,15751 c00000000000858c 408200cc bne c000000000008658 # syscall_exit_work+0x0/0xc0 c000000000008590 7c235840 cmpld r3,r11 c000000000008594 e8a101a0 ld r5,416(r1) c000000000008598 40800070 bge c000000000008608 # syscall_error+0x0/0x10 1:mon> di c0000000000b1454 c0000000000b1454 e87c01f8 ld r3,504(r28) c0000000000b1458 4bffccb9 bl c0000000000ae110 # .percpu_modfree+0x0/0x1ac c0000000000b145c e86100e0 ld r3,224(r1) c0000000000b1460 48068749 bl c000000000119ba8 # .kfree+0x0/0x180 c0000000000b1464 60000000 nop c0000000000b1468 48000008 b c0000000000b1470 # .load_module+0x15a0/0x1708 c0000000000b146c 3be0fff8 li r31,-8 c0000000000b1470 7f23cb78 mr r3,r25 c0000000000b1474 7ffcfb78 mr r28,r31 c0000000000b1478 48056995 bl c000000000107e0c # .vfree+0x0/0x18 c0000000000b147c 60000000 nop c0000000000b1480 4800ccfd bl c0000000000be17c # .stop_machine_destroy+0x0/0x8c c0000000000b1484 60000000 nop c0000000000b1488 480000f4 b c0000000000b157c # .load_module+0x16ac/0x1708 c0000000000b148c e87e83b8 ld r3,-31816(r30) c0000000000b1490 7fa4eb78 mr r4,r29 1:mon> d d000000003ee7408 d000000003ee7408 **************** **************** | | 1:mon>
If some symbol is unresolved in an incoming module, none of its code is actually supposed to be executed. So the crash you see is probably not systemtap-generated code crashing, but the kernel doing it to itself. Perhaps you can reproduce it with some tiny hand-made module with an unresolved extern function reference.
If your CONFIG_MODVERSIONS=y, try it with "# CONFIG_MODVERSIONS is not set" again to see if there will be different result. Otherwise, seems new problem is introduced in new kernel.
Problem is reproducible even after disabling CONFIG_MODVERSIONS. $ cat /lib/modules/2.6.29-rc7-git2/build/.config|grep MODVERSION # CONFIG_MODVERSIONS is not set
This was reproduced on 2.6.29-rc7/i386, but not reproduced on 2.6.29-rc6/i386. So, that seems kernel bug between them. Maybe, need to do git-bisect? BUG: unable to handle kernel paging request at f9ac9218 IP: [<c045b3cb>] load_module+0x14d9/0x164a Oops: 0000 [#1] SMP DEBUG_PAGEALLOC last sysfs file: /sys/module/xt_state/sections/.text Modules linked in: stap_869395e8a8c1abc66e314b36f05020cd_202 sco bridge stp bnep l2cap bluetooth sunrpc ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 cpufreq_ondemand powernow_k8 dm_mirror dm_region_hash dm_log dm_multipath uinput snd_hda_codec_idt snd_hda_intel snd_hda_codec snd_hwdep snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore rtc_cmos pcspkr dcdbas snd_page_alloc k8temp rtc_core tg3 hwmon i2c_nforce2 rtc_lib i2c_core libphy ata_generic pata_acpi sata_nv [last unloaded: stap_987bd8a88d00f68d84ccb5d3c698cdcb_414] Pid: 9875, comm: staprun Not tainted (2.6.29-rc7 #1) OptiPlex 740 EIP: 0060:[<c045b3cb>] EFLAGS: 00210246 CPU: 0 EIP is at load_module+0x14d9/0x164a EAX: 00200286 EBX: f9ac90e0 ECX: 00000003 EDX: 00000001 ESI: f9ac90e0 EDI: fffffffe EBP: f28c7fa0 ESP: f28c7ea4 DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 Process staprun (pid: 9875, ti=f28c7000 task=f3523e40 task.ti=f28c7000) Stack: 0000002a f3523e40 f28c7f14 00200046 00023b59 00000162 f28c7f5c b95bb088 f9ab703c f9ab735c c08689c4 f3523e40 f3524314 f9a93000 f9ab6cf4 f9ab6b59 f2874aa0 f9ac195c 00000029 00000010 00000000 f9ac90e0 00000000 f28c7f1c Call Trace: [<c0452602>] ? mark_lock+0x1e/0x349 [<c06f3dff>] ? __mutex_lock_common+0x2d4/0x329 [<c06f3e82>] ? mutex_lock_interruptible_nested+0x2e/0x35 [<c045b658>] ? sys_init_module+0x41/0x18c [<c040846b>] ? sysenter_do_call+0x12/0x3f Code: 58 ff ff ff 8b 93 d4 00 00 00 89 d8 e8 84 0e fc ff 83 bd 5c ff ff ff 00 74 0b 8b 85 5c ff ff ff e8 ac e4 ff ff 8b b5 58 ff ff ff <8b> 86 38 01 00 00 e8 9b e4 ff ff 8b 85 44 ff ff ff e8 25 5f 04 EIP: [<c045b3cb>] load_module+0x14d9/0x164a SS:ESP 0068:f28c7ea4 ---[ end trace 2a7ec952fee814db ]--- Kernel panic - not syncing: Fatal exception
$ eu-addr2line -e vmlinux 0xc045b3cb /home/mhiramat/ksrc/linux-2.6/kernel/module.c:2298 2292 free_core: 2293 module_free(mod, mod->module_core); 2294 free_percpu: 2295 if (percpu) 2296 percpu_modfree(percpu); 2297 #if defined(CONFIG_MODULE_UNLOAD) && defined(CONFIG_SMP) 2298 percpu_modfree(mod->refptr); 2299 #endif 2300 free_mod: 2301 kfree(args); it seems mod->refptr is invalid. Perhaps, below commit caused this issue. http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=720eba31f47aeade8ec130ca7f4353223c49170f
Interesting... I don't know a whole lot about percpu_mod*, but looking at the use of mod->refptr, its just allocated and freed -- no other operations are performed on it, or so it seems. Its allocated in load_module() and freed either in load_module() itself on error, or in free_module(). However, free_module() is invoked either from delete_module() or if the module init function returns an error (much after load_module() has run). That leads one to wonder if percpu_modalloc() itself returned an invalid pointer? If so, why is it happening only in this case? Or there may be some kind of data corruption in struct module (refptr is the last element of the structure).
Further datapoint -- on a RHEL5.3 system running rc8, insmod just fails to load a module with an unknown symbol -- so, no crash. However, Fedora's module-init-tools is probably more tolerant?
> Further datapoint -- on a RHEL5.3 system running rc8, insmod just fails to load > a module with an unknown symbol -- so, no crash. However, Fedora's > module-init-tools is probably more tolerant? I see no reason to think that this is a userspace issue.
Subject: Re: 2.6.29-rc7 - kernel crash with sharedbuf.exp stap test script > I see no reason to think that this is a userspace issue. Sure, I don't suggest that could be the main issue however. We need someone with good knowledge of the module loading subsystem as well as the per_cpu allocators to take a closer look at the issue on hand.
From my investigation, this problem is just a kernel bug. mod pointer moved on mod->module_core after copying all sections. So, after freeing module_core, you must not access mod->(members), because it has been freed. I'll report and post a fix patch to LKML. Thanks,
The fix hit Linus' tree (commit 6e2b757).