Bug 10088

Summary: on s390x, the translator gets an assertion
Product: systemtap Reporter: David Smith <dsmith>
Component: translator Assignee: Unassigned <systemtap>
Status: RESOLVED FIXED    
Severity: normal    
Priority: P2    
Version: unspecified   
Target Milestone: ---   
Host: Target:
Build: Last reconfirmed:

Description David Smith 2009-04-21 18:07:45 UTC
On s390x, kernel 2.6.18-128.el5, stap gets the following assertion:

  stap: offline.c:69: dwfl_offline_section_address: Assertion `mod->e_type == 1' failed.

when running the following tests:

testsuite/buildok/four.stp
testsuite/buildok/nfsd-all-probes.stp
testsuite/buildok/scsi-all-probes.stp
testsuite/buildok/scsi.stp
testsuite/buildok/seventeen.stp
testsuite/buildok/twentyfive.stp
Comment 1 Roland McGrath 2009-04-21 22:53:04 UTC
I fixed the crash in elfutils git, for 0.141 when it comes (soonish).

But this hits from a call to dwfl_module_address_section in dump_unwindsyms on
the "kernel" (vmlinux) module.  IMHO it should not be calling
dwfl_module_address_section for any non-ET_REL module (i.e. non-.ko).  If that's
fixed it won't provoke the bug on the older elfutils libraries.
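
For reference, the value 1 in the assertion is ET_REL in the ELF specification, so the assert fires for any module that is not a relocatable object. A minimal sketch of the kind of check meant here, assuming the caller already knows the module's e_type; struct sk_module, the SK_ constants and sk_wants_section_lookup() are hypothetical names for illustration and are not part of systemtap or elfutils:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* ELF e_type values as defined by the ELF specification. */
enum { SK_ET_REL = 1, SK_ET_EXEC = 2, SK_ET_DYN = 3 };

/* Hypothetical stand-in for the per-module data the translator sees. */
struct sk_module {
    const char *name;   /* e.g. "kernel" for vmlinux, or a .ko module name */
    uint16_t e_type;    /* e_type field from the module's ELF header */
};

/* Return true only for relocatable objects (.ko files), which are the
   only modules for which a per-section address lookup makes sense. */
bool sk_wants_section_lookup(const struct sk_module *mod)
{
    assert(mod != NULL);
    return mod->e_type == SK_ET_REL;
}
```

With a check like this, vmlinux (a non-relocatable type such as ET_EXEC) would be skipped and only .ko modules would reach the section lookup.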
Comment 2 Mark Wielaard 2009-04-22 10:58:56 UTC
(In reply to comment #1)
> I fixed the crash in elfutils git, for 0.141 when it comes (soonish).
> 
> But this hits from a call to dwfl_module_address_section in dump_unwindsyms on
> the "kernel" (vmlinux) module.  IMHO it should not be calling
> dwfl_module_address_section for any non-ET_REL module (i.e. non-.ko).  If that's
> fixed it won't provoke the bug on the older elfutils libraries.

You are right, that logic was trying to be way too clever. It should just see
that dwfl_module_relocations() returned <= 1 relocation sections and then not
call it. Fixed in:

commit eadbd95761af3c2815e1b36df5a7d18dd28112a4
Author: Mark Wielaard <mjw@redhat.com>
Date:   Wed Apr 22 12:53:39 2009 +0200

    Simplify section size logic.
    
    * translate.cxx (dump_unwindsyms): Just check that dwfl_module_relocations()
      return more than 1 relocation section bases before calling
      dwfl_module_address_section().

That works fine with 0.140 or before your last elfutils commit. But with your
latest fix applied you will hit the new assert in __libdwfl_relocate_value for
mod->e_type == ET_REL. The problem is the kernel module which isn't ET_REL. But
when dwfl_module_address_section() is called on an ET_REL kernel module it might
end up going through resolve_symbol() for that module, which loops through the
dwfl->modulelist and ends up trying to call __libdwfl_relocate_value on some
kernel module value.

The test cases that fail in the above manner, on x86_64, with elfutils tip
(19a8e4db) are precisely those of comment #1, on s390, so at least the failures
are consistent per platform now.
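
A minimal sketch of the guard the commit above describes, not the actual translate.cxx code. It assumes the count has already been obtained from dwfl_module_relocations(), which as far as I understand libdwfl returns 0 for ET_EXEC, 1 for ET_DYN, the number of relocation bases for ET_REL, and -1 on error. sk_should_query_sections() is a hypothetical name used only for illustration:

```c
#include <assert.h>
#include <stdbool.h>

/* n_reloc_bases: value assumed to come from dwfl_module_relocations().
   Only a module with more than one relocation base (an ET_REL .ko)
   should go on to dwfl_module_address_section(). */
bool sk_should_query_sections(int n_reloc_bases)
{
    /* -1 is the documented error return; anything lower is a caller bug. */
    assert(n_reloc_bases >= -1);
    return n_reloc_bases > 1;
}
```

This keeps vmlinux (0 or 1 bases) and error returns away from the section lookup. As noted above, it does not stop libdwfl from touching the kernel module internally while resolving symbols for a .ko.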
Comment 3 Mark Wielaard 2009-04-22 14:38:35 UTC
Unfortunately the workaround still does hit the elfutils (0.137) bug on s390x.
But not because we call dwfl_module_relocations() on the kernel module itself.
We only call it on an ET_REL kernel module, but as sketched in comment #2
libdwfl still calls __libdwfl_relocate_value() on the kernel module itself when
we do that. Will need to think of some other workaround.
Comment 4 Mark Wielaard 2009-04-22 15:04:10 UTC
buildok/scheduler-all-probes.stp is a different failure:
semantic error: no match while resolving probe point kernel.function("__switch_to")

All others occur because they have a probe module.function().
Comment 5 Mark Wielaard 2009-04-23 09:37:55 UTC
The issue in comment #2 on x86_64 is fixed with the latest elfutils from git,
specifically:

commit c65558baa0382d59398234c5a05debdc5a98eb1b
Author: Roland McGrath <roland@redhat.com>
Date:   Wed Apr 22 12:29:32 2009 -0700

    Fix relocation when symbols are resolved in non-ET_REL modules.

Haven't tested on s390x though.
Comment 6 Mark Wielaard 2009-04-30 20:40:06 UTC
Upgrading to elfutils 0.141 also solves this issue on s390x.

A workaround for older elfutils releases would be to rewrite the runtime to
collect the kernel module segment sizes at module load time instead of calling
dwfl_offline_section_address and storing the section sizes during the
translation phase. Some suggestions on how to do this, and why we are currently
not doing this, are discussed in the following thread:
http://sourceware.org/ml/systemtap/2009-q2/msg00324.html
Comment 7 Mark Wielaard 2010-06-24 22:02:51 UTC
*** Bug 11753 has been marked as a duplicate of this bug. ***
Comment 8 David Smith 2013-09-10 16:27:34 UTC
I've tested this with elfutils-0.152 on 2.6.18-348.el5 s390x, and I see no assertions when running the full buildok.exp testcase.