On s390x, kernel 2.6.18-128.el5, stap hits the following assertion:

stap: offline.c:69: dwfl_offline_section_address: Assertion `mod->e_type == 1' failed.

when running the following tests:

testsuite/buildok/four.stp
testsuite/buildok/nfsd-all-probes.stp
buildok/scsi-all-probes.stp
buildok/scsi.stp
testsuite/buildok/seventeen.stp
testsuite/buildok/twentyfive.stp
I fixed the crash in elfutils git, for 0.141 when it comes (soonish). But this hits from a call to dwfl_module_address_section in dump_unwindsyms on the "kernel" (vmlinux) module. IMHO it should not be calling dwfl_module_address_section for any non-ET_REL module (i.e. non-.ko). If that's fixed it won't provoke the bug on the older elfutils libraries.
(In reply to comment #1)
> I fixed the crash in elfutils git, for 0.141 when it comes (soonish).
>
> But this hits from a call to dwfl_module_address_section in dump_unwindsyms on
> the "kernel" (vmlinux) module. IMHO it should not be calling
> dwfl_module_address_section for any non-ET_REL module (i.e. non-.ko). If that's
> fixed it won't provoke the bug on the older elfutils libraries.

You are right, that logic was trying to be way too clever. It should just see that dwfl_module_relocations() returned <= 1 relocation sections and then not call it. Fixed in:

commit eadbd95761af3c2815e1b36df5a7d18dd28112a4
Author: Mark Wielaard <mjw@redhat.com>
Date:   Wed Apr 22 12:53:39 2009 +0200

    Simplify section size logic.

    * translate.cxx (dump_unwindsyms): Just check that
    dwfl_module_relocations() return more than 1 relocation section bases
    before calling dwfl_module_address_section().

That works fine with 0.140, or with anything before your last elfutils commit. But with your latest fix applied you will hit the new assert in __libdwfl_relocate_value for mod->e_type == ET_REL.

The problem is the kernel module, which isn't ET_REL. When dwfl_module_address_section() is called on an ET_REL kernel module, it might end up going through resolve_symbol() for that module, which loops through the dwfl->modulelist and ends up trying to call __libdwfl_relocate_value on some kernel module value.

The test cases that fail in the above manner, on x86_64, with elfutils tip (19a8e4db) are precisely those of comment #1, on s390, so at least the failures are consistent per platform now.
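For illustration, the guard described in the commit message can be sketched as below. This is a minimal sketch, not the actual translate.cxx code: `struct mock_module` and `needs_section_lookup` are hypothetical names, and the `n_reloc_bases` field merely stands in for the value dwfl_module_relocations() would report (assumed here to be 0 for a non-relocatable module such as vmlinux, more than 1 for an ET_REL .ko, and -1 on error).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a libdwfl module; NOT the real Dwfl_Module. */
struct mock_module {
    const char *name;
    int n_reloc_bases;  /* assumed: what dwfl_module_relocations() reports */
};

/* Return true only when the module has more than one relocation base,
   i.e. when a per-section lookup such as dwfl_module_address_section()
   would make sense. Non-relocatable modules (0), single-base modules (1)
   and errors (-1) are all skipped. */
static bool needs_section_lookup(const struct mock_module *mod)
{
    assert(mod != NULL);
    return mod->n_reloc_bases > 1;
}
```

With this check, a caller would skip the section lookup for the "kernel" (vmlinux) module and only perform it for relocatable .ko modules.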
Unfortunately the workaround still does hit the elfutils (0.137) bug on s390x. But not because we call dwfl_module_relocations() on the kernel module itself. We only call it on an ET_REL kernel module, but as sketched in comment #2 libdwfl still calls __libdwfl_relocate_value() on the kernel module itself when we do that. Will need to think of some other workaround.
buildok/scheduler-all-probes.stp is a different failure:

semantic error: no match while resolving probe point kernel.function("__switch_to")

All others occur because they have a probe module.function().
The issue in comment #2 on x86_64 is fixed with the latest elfutils from git, specifically:

commit c65558baa0382d59398234c5a05debdc5a98eb1b
Author: Roland McGrath <roland@redhat.com>
Date:   Wed Apr 22 12:29:32 2009 -0700

    Fix relocation when symbols are resolved in non-ET_REL modules.

Haven't tested on s390x though.
Upgrading to elfutils 0.141 also solves this issue on s390x.

A workaround for older elfutils releases would be to rewrite the runtime to collect the kernel module segment sizes at module load time, instead of calling dwfl_offline_section_address and storing the section sizes during the translation phase. Some suggestions on how to do this, and why we are currently not doing this, are discussed in the following thread: http://sourceware.org/ml/systemtap/2009-q2/msg00324.html
*** Bug 11753 has been marked as a duplicate of this bug. ***
I've tested this with elfutils-0.152 on 2.6.18-348.el5 s390x, and I see no assertions when running the full buildok.exp testcase.