dwfl_module_addrdie fails for binaries built with clang++

Mark Wielaard mark@klomp.org
Tue May 9 12:40:00 GMT 2017


On Sat, 2017-05-06 at 13:30 +0200, Milian Wolff wrote:
> On Freitag, 5. Mai 2017 15:06:48 CEST Mark Wielaard wrote:
> > On Thu, 2017-05-04 at 18:05 +0200, Milian Wolff wrote:
> > > I noticed that elfutils fails to handle clang binaries when we want to
> > > find a DIE for a certain address. I.e. dwfl_module_addrdie returns
> > > nullptr, and eu-addr2line fails to resolve inlined frames.
> > >
> > > To reproduce this:
> > >[...]
> > >
> > > This also affects us in our perfparser. Not being able to find a cudie
> > > means not finding inlined frames nor file/line mappings, which is quite a
> > > set-back.
> > > 
> > > I have noticed that backward-cpp contains a (partial) work-around for
> > > this:
> > > 
> > > https://github.com/bombela/backward-cpp/blob/master/backward.hpp#L1216
> > 
> > O urgh how utterly broken (not backward-cpp, but the bogus DWARF clang
> > generates). As that comment says:
> > 
> >         // Sadly clang does not generate the section .debug_aranges, thus
> >         // dwfl_module_addrdie will fail early. Clang doesn't either set
> >         // the lowpc/highpc/range info for every compilation unit.
> >         //
> >         // So in order to save the world:
> >         // for every compilation unit, we will iterate over every single
> >         // DIEs. Normally functions should have a lowpc/highpc/range, which
> >         // we will use to infer the compilation unit.
> > 
> >         // note that this is probably badly inefficient.
> > 
> > And indeed having to scan through every CU to find a matching function
> > DIE is badly inefficient :{
> 
> But this code comment is relatively old. Are we sure it's really still the 
> case?

If you were able to replicate it then yes.

> > > Is this the right approach and also what the non-eu addr2line does? If so,
> > > can that be added upstream too, such that dwfl_module_addrdie can be
> > > relied on?
> > > 
> > > I've seen it on clang 3.6, 4 and 5. Neither passing -g3 nor
> > > -gdwarf-aranges helps.
> > 
> > Thanks for reporting this. I think this might be the same issue seen
> > here: https://sourceware.org/bugzilla/show_bug.cgi?id=21247
> > ... or at least it seems related. The function/address not found in that
> > case also comes from a CU generated by clang. It does have a lowpc and
> > ranges, but the lowpc looks bogus (zero) and the ranges don't seem to
> > cover the function in question. So it seems even worse than your example
> > where there are no lowpc/ranges. We cannot even trust them if they are
> > there. Sigh.
> 
> So the situation is different from the comment in backward-cpp...

Only in how the lowpc/ranges were broken. The core issue is that we
cannot rely on the lowpc/ranges (and aranges) being correct for a CU. We
assume the DWARF producer doesn't really feed us garbage, but apparently
clang does :{
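
To make the two lookup strategies concrete, here is a toy model in C.
It is only a sketch: the toy_func and toy_cu structs and both function
names are made up for illustration and are not libdw API. The first
lookup trusts the range the CU claims for itself (the cheap path). The
second ignores it and walks every function in every CU (the
backward-cpp style fallback).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy types, not libdw: each function covers
   [lowpc, highpc).  */
struct toy_func { const char *name; uint64_t lowpc, highpc; };

struct toy_cu
{
  const char *name;
  uint64_t lowpc, highpc;	/* Range claimed by the CU, 0/0 if missing.  */
  const struct toy_func *funcs;
  size_t nfuncs;
};

/* Cheap path: believe the range each CU advertises.  */
const struct toy_cu *
find_cu_trusting_ranges (const struct toy_cu *cus, size_t ncus,
			 uint64_t addr)
{
  assert (cus != NULL || ncus == 0);
  for (size_t i = 0; i < ncus; i++)
    if (addr >= cus[i].lowpc && addr < cus[i].highpc)
      return &cus[i];
  return NULL;
}

/* Expensive path: ignore the CU ranges and check every function of
   every CU.  Stores the matching function in *FN when FN is non-NULL.  */
const struct toy_cu *
find_cu_by_scanning (const struct toy_cu *cus, size_t ncus,
		     uint64_t addr, const struct toy_func **fn)
{
  assert (cus != NULL || ncus == 0);
  for (size_t i = 0; i < ncus; i++)
    for (size_t j = 0; j < cus[i].nfuncs; j++)
      if (addr >= cus[i].funcs[j].lowpc && addr < cus[i].funcs[j].highpc)
	{
	  if (fn != NULL)
	    *fn = &cus[i].funcs[j];
	  return &cus[i];
	}
  return NULL;
}
```

With a CU whose own range is 0/0 but which contains a function covering
the address, the first lookup returns NULL and the second finds the CU.
For an address covered by nothing, the second lookup has to visit every
function before it can give up.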

> > I have to think about how to handle this. We clearly need something that
> > just ignores the lowpc/highpc/ranges on CUs and parses every CU till the
> > function/address DIE is found to know which CU and line_table to use.
> > But that is so inefficient that I don't want to do that by default.
> 
> So, if this is really that bad - what are the binutils doing - does anyone 
> know?

They scan every CU just in case, which is terrible for performance. Just
compare binutils addr2line vs elfutils eu-addr2line on a large binary,
e.g. on my local machine (best of 3):

$ time eu-addr2line -e /usr/lib64/firefox/libxul.so 0x0157a892
/usr/src/debug/firefox-52.1.0/firefox-52.1.0esr/objdir/dom/bindings/ScrollViewChangeEventBinding.cpp:541

real	0m0.067s
user	0m0.050s
sys	0m0.017s

$ time addr2line -e /usr/lib64/firefox/libxul.so 0x0157a892
/usr/src/debug/firefox-52.1.0/firefox-52.1.0esr/objdir/dom/bindings/ScrollViewChangeEventBinding.cpp:541

real	0m25.984s
user	0m20.847s
sys	0m4.193s

So we definitely don't want to do what binutils does by default.

Note that the worst case is an address that doesn't match against any
function (e.g. what you might get if an unwind goes wrong). Currently
that is the cheapest case (not covered by any CU, so done). But if we
cannot rely on which addresses are covered by which CU then we have to
scan all of them just to make sure there really isn't a subroutine
description in there that does cover the address. I want to avoid doing
that "just in case", and do it only if we (or the user) know the DWARF
might come from a bad producer. So I am pondering whether we should add
something like -b, --bad as a command line argument for things like
eu-addr2line and eu-stack, to indicate that we need some workarounds for
bad DWARF. That would then call something like dwarf_force_aranges (),
which would set up an aranges table created by explicitly scanning all
CUs.
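
As a rough illustration of that idea, the sketch below builds such a
table once by scanning every CU and then answers lookups with a binary
search. It is a toy model under stated assumptions, not an
implementation of the proposed function: scan_func, scan_cu, arange,
build_forced_aranges and lookup_forced_aranges are all hypothetical
names, none of them libdw API, and function ranges are assumed not to
overlap.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical toy types, not libdw: each function covers
   [lowpc, highpc).  */
struct scan_func { uint64_t lowpc, highpc; };
struct scan_cu { const struct scan_func *funcs; size_t nfuncs; };
struct arange { uint64_t lowpc, highpc; size_t cu_index; };

static int
cmp_arange (const void *a, const void *b)
{
  const struct arange *x = a, *y = b;
  return (x->lowpc > y->lowpc) - (x->lowpc < y->lowpc);
}

/* One-time full scan: collect every function range from every CU and
   sort by lowpc.  Returns a malloc'd table (caller frees) or NULL when
   there is nothing to collect or allocation fails; *N gets the count.  */
struct arange *
build_forced_aranges (const struct scan_cu *cus, size_t ncus, size_t *n)
{
  assert (n != NULL);
  size_t total = 0;
  for (size_t i = 0; i < ncus; i++)
    total += cus[i].nfuncs;

  *n = 0;
  if (total == 0)
    return NULL;
  struct arange *tab = malloc (total * sizeof *tab);
  if (tab == NULL)
    return NULL;

  for (size_t i = 0; i < ncus; i++)
    for (size_t j = 0; j < cus[i].nfuncs; j++)
      tab[(*n)++] = (struct arange) { cus[i].funcs[j].lowpc,
				      cus[i].funcs[j].highpc, i };
  qsort (tab, *n, sizeof *tab, cmp_arange);
  return tab;
}

/* Binary search over the sorted table.  Returns the index of the CU
   covering ADDR, or -1 if no collected range covers it.  */
long
lookup_forced_aranges (const struct arange *tab, size_t n, uint64_t addr)
{
  size_t lo = 0, hi = n;
  while (lo < hi)
    {
      size_t mid = lo + (hi - lo) / 2;
      if (addr < tab[mid].lowpc)
	hi = mid;
      else if (addr >= tab[mid].highpc)
	lo = mid + 1;
      else
	return (long) tab[mid].cu_index;
    }
  return -1;
}
```

The full scan is paid once up front. After that, an address that
matches nothing is rejected in logarithmic time, instead of forcing
another pass over every CU.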

> Also, if it's really against all your expectations, shouldn't we report 
> this upstream at clang and ask for input there? I can't believe they knowingly 
> break their generated code in such a way. Rather, I believe it's either done 
> unknowingly, or there is some alternative way to interpret the data that we 
> are not aware of?

I think they are aware the DWARF they produce is broken. A quick search
finds lots of bug reports about it. The following two specifically seem
relevant for the above case: https://bugs.llvm.org/show_bug.cgi?id=13351
https://bugs.llvm.org/show_bug.cgi?id=30569 

Cheers,

Mark


