dwz-0.1 - DWARF compression tool

Jakub Jelinek jakub@redhat.com
Wed Apr 18 12:37:00 GMT 2012

On Wed, Apr 18, 2012 at 09:49:11AM +0200, Mike Dupont wrote:
> this is exciting, thanks for sharing.
> I wonder what amount of data is even the same between many libraries,

Of course there is a lot of DWARF duplication in between different
libraries, or binaries, or e.g. Linux kernel modules (which have the
added problem that they have relocations against the sections; we could
apply and remove the relocations against .debug_* sections (and do string
merging of .debug_str at the same time) there as a first step, but there
would still be relocations against the module .text/.data etc.).

The problem with that is that we'd need DWARF extensions to do the
duplication elimination in between different libraries/binaries.

I can think of two possible approaches:

1) indicate somehow that .debug_* sections live elsewhere, in a single
   (per package?) *.debug object, where all the .debug_* sections would be
   concatenated together and then just compress the debug info
   in that large object.  The main problem with that is that suddenly
   all places in the debug info that refer to .text/.data (and other
   allocated sections) addresses need to be augmented somehow to say
   which of the possibly many shared libraries or kernel modules or
   binaries they refer to.  That wouldn't be too hard; it could be
   done just by some attribute in each DW_TAG_*_unit saying what that CU
   refers to (if it uses any addresses anywhere), and other .debug_*
   sections that are solely referenced from .debug_info would be fine too.
   But e.g. .debug_aranges would need extensions...

2) or, alternatively, keep most of the debug info in the individual
   objects (shared libraries, binaries, kernel modules) and just for
   what dwz currently moves over into new DW_TAG_partial_unit CUs (assuming
   it doesn't contain any .text/.data references and only refers to
   DIEs inside of them or in other partial units that don't contain
   any .text/.data references) move those partial units to a .debug_info
   section in a separate file (and add some new .debug_* section that
   would hint the debug info consumers how to find the separate file
   (build-id, or filename, or combination of both, whatever)).
   If we support just one such separate file, we could just have
   DW_FORM_alt_sec_offset and DW_FORM_ref_alt_addr new forms, which
   would mean this is the corresponding .debug{_info,_line,_loc}
   section offset, but not inside of this file, but in the secondary
   file.  If we were to support more than one, we'd need to number them
   and add forms that start with a uleb128 index of the separate file,
   followed by the actual offset.  Still, a shorthand form for the
   first separate file might be handy, assuming that is what is done
   most of the time.
   With many possibly large binaries/libraries together there are major
   concerns about memory consumption though, so I think the tool would
   need to do it in steps - compress each file individually first
   (what the tool does right now) and for eligible partial units append
   them to a common separate file (and keep them in the original file
   too).  When the first pass over all files is done, merge duplicates
   within the common separate file which holds just the partial units.
   Second pass would then take the reduced common separate file and
   the compressed debug info from the first pass, and find duplicate
   partial units, switch references to them in their forms to the
   alt forms and remove the no longer needed partial units.
   Of course the separate common file would need to contain not
   just .debug_info and .debug_abbrev sections, but also some minimal
   .debug_line section (not containing actual line instructions, but
   dir/file tables).
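
To make the two-pass scheme in 2) a bit more concrete, here is a rough
Python sketch.  It is not how dwz works internally: partial units are
modelled as plain byte strings, a SHA-1 of the bytes stands in for a real
DIE-by-DIE comparison, and the names unit_key, first_pass and second_pass
are made up for illustration.  Keeping only units seen in two or more
files is also just an assumption of the sketch.

```python
import hashlib


def unit_key(unit_bytes):
    """Content hash standing in for a real structural DIE comparison."""
    return hashlib.sha1(unit_bytes).hexdigest()


def first_pass(files):
    """Build the common pool of eligible partial units.

    `files` maps a file name to a list of (unit_bytes, has_addr_refs)
    pairs.  Units that refer to .text/.data are not eligible.  Duplicates
    are merged as they arrive, so the pool holds one copy per distinct
    unit.  The input files are left untouched in this pass.
    """
    common = {}   # key -> unit bytes
    seen_in = {}  # key -> names of the files containing that unit
    for name, units in files.items():
        for data, has_addr_refs in units:
            if has_addr_refs:
                continue
            key = unit_key(data)
            common.setdefault(key, data)
            seen_in.setdefault(key, set()).add(name)
    # Assumption: only units shared by at least two files are worth moving.
    return {k: v for k, v in common.items() if len(seen_in[k]) > 1}


def second_pass(files, shared):
    """Drop shared partial units from each file and record alt references.

    Returns name -> (units kept in the file, keys now referenced in the
    common file through the hypothetical alt forms).
    """
    result = {}
    for name, units in files.items():
        kept, alt_refs = [], []
        for data, has_addr_refs in units:
            key = unit_key(data)
            if not has_addr_refs and key in shared:
                alt_refs.append(key)
            else:
                kept.append((data, has_addr_refs))
        result[name] = (kept, alt_refs)
    return result
```

Only the hashes and the deduplicated pool need to stay in memory across
files, which is the point of splitting the work into two passes.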

My preference would be 2).  What do you think?
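
For the multi-file variant of 2), the value of such a form could be laid
out as a uleb128 file index followed by the offset.  A minimal Python
sketch follows; encode_alt_ref and decode_alt_ref are hypothetical names,
and the 4-byte little-endian offset (32-bit DWARF on a little-endian
target) is an assumption, not part of any proposal text.

```python
def encode_uleb128(value):
    """Encode a non-negative integer as unsigned LEB128."""
    if value < 0:
        raise ValueError("uleb128 cannot encode negative values")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_uleb128(buf, pos=0):
    """Decode unsigned LEB128 at buf[pos]; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_alt_ref(file_index, offset, offset_size=4):
    """Hypothetical form value: uleb128 file index, then the offset."""
    return encode_uleb128(file_index) + offset.to_bytes(offset_size, "little")


def decode_alt_ref(buf, pos=0, offset_size=4):
    """Return (file index, offset, next position) for a value at buf[pos]."""
    file_index, pos = decode_uleb128(buf, pos)
    offset = int.from_bytes(buf[pos:pos + offset_size], "little")
    return file_index, offset, pos + offset_size
```

With a single separate file the index would be dropped entirely, which
is what the shorthand form amounts to.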

