This is the mail archive of the
binutils@sourceware.org
mailing list for the binutils project.
Re: dwz-0.1 - DWARF compression tool
- From: Jakub Jelinek <jakub at redhat dot com>
- To: Mike Dupont <jamesmikedupont at googlemail dot com>
- Cc: gcc at gcc dot gnu dot org, binutils at sourceware dot org, Roland McGrath <mcgrathr at google dot com>, Cary Coutant <ccoutant at google dot com>, Tom Tromey <tromey at redhat dot com>
- Date: Wed, 18 Apr 2012 14:26:45 +0200
- Subject: Re: dwz-0.1 - DWARF compression tool
- References: <20120418073657.GC16117@tyan-ft48-01.lab.bos.redhat.com> <CAF0qKV1TWHaxo9PmPGu-HxHenHw+SZ2Hfrtfg0BoFbg3r7J3ew@mail.gmail.com>
- Reply-to: Jakub Jelinek <jakub at redhat dot com>
On Wed, Apr 18, 2012 at 09:49:11AM +0200, Mike Dupont wrote:
> this is exciting, thanks for sharing.
>
> I wonder what amount of data is even the same between many libraries,
Of course there is a lot of DWARF duplication in between different
libraries, or binaries, or e.g. Linux kernel modules (which have the
added problem that they have relocations against the sections; we could
apply and remove the relocations against .debug_* sections (and do string
merging of .debug_str at the same time) there as a first step, but there would
still be relocations against the module .text/.data etc.).
The problem with that is that we'd need DWARF extensions to do the
duplication elimination in between different libraries/binaries.
I can think of two possible approaches:
1) indicate somehow that .debug_* sections live elsewhere, in a single
(per package?) *.debug object, where all the .debug_* sections would be
concatenated together and then just compress the debug info
in that large object. The main problem with that is that suddenly
all places in the debug info that refer to .text/.data (and other
allocated sections) addresses need to be augmented somehow to say
which of the possibly many shared libraries or kernel modules or
binaries they refer to. That wouldn't be too hard. It could be
done just by some attribute in each DW_TAG_*_unit saying what that CU
refers to (if it uses any addresses anywhere), and other .debug_*
sections that are solely referenced from .debug_info would be fine too.
But e.g. .debug_aranges would need extensions...
2) or, alternatively, keep most of the debug info in the individual
objects (shared libraries, binaries, kernel modules) and just for
what dwz currently moves over into new DW_TAG_partial_unit CUs (assuming
it doesn't contain any .text/.data references and only refers to
DIEs inside of them or in other partial units that don't contain
any .text/.data references) move those partial units to a .debug_info
section in a separate file (and add some new .debug_* section that
would hint the debug info consumers how to find the separate file:
build-id, or filename, or combination of both, whatever).
If we support just one such separate file, we could just have
DW_FORM_alt_sec_offset and DW_FORM_ref_alt_addr new forms, which
would mean this is the corresponding .debug{_info,_line,_loc}
section offset, but not inside of this file, but in the secondary
file. If we were to support more than one, we'd need to number them
and add forms that start with a uleb128 index of
the separate file followed by the actual offset. Still, a shorthand
form for the first separate file might be handy, assuming that
is what is done most of the time.
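To make the two encodings in 2) concrete, here is a minimal Python sketch.
It assumes 32-bit DWARF (4-byte little-endian offsets) and a hypothetical
multi-file form whose payload is a uleb128 file index followed by the
offset. The function names are made up for illustration; none of this is an
existing DWARF form or dwz code.

```python
import struct


def encode_uleb128(value):
    """Encode a non-negative integer as unsigned LEB128."""
    if value < 0:
        raise ValueError("uleb128 cannot encode negative values")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_uleb128(data, pos=0):
    """Decode unsigned LEB128 at data[pos:]; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_alt_offset(offset):
    """Shorthand form (assumed): offset into the single separate file."""
    return struct.pack("<I", offset)


def encode_indexed_alt_offset(file_index, offset):
    """Hypothetical multi-file form: uleb128 file index, then the offset."""
    return encode_uleb128(file_index) + struct.pack("<I", offset)


def decode_indexed_alt_offset(data, pos=0):
    """Return (file_index, offset, next_pos) for the multi-file form."""
    file_index, pos = decode_uleb128(data, pos)
    (offset,) = struct.unpack_from("<I", data, pos)
    return file_index, offset, pos + 4
```

For small file indexes the indexed form costs one extra byte per reference
compared with the shorthand, which is why the shorthand could still be
worth having when only one separate file is used.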
With many possibly large binaries/libraries together there are major
concerns about memory consumption though, so I think the tool would
need to do it in steps - compress each file individually first
(what the tool does right now) and for eligible partial units append
them to a common separate file (and keep them in the original file
too). When the first pass over all files is done, merge duplicates
within the common separate file which holds just the partial units.
Second pass would then take the reduced common separate file and
the compressed debug info from the first pass, and find duplicate
partial units, switch references to them in their forms to the
alt forms and remove the no longer needed partial units.
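The two passes can be sketched roughly as below. This is a toy Python model,
not how dwz works internally: a partial unit is just a byte string plus a
flag saying whether it references .text/.data addresses, the "common file"
is a dict from unit contents to an offset, and every eligible unit goes
into the common file. PartialUnit, build_common_file and rewrite_file are
hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PartialUnit:
    data: bytes            # stand-in for the unit's DIE bytes
    refs_addresses: bool   # True if it refers to .text/.data addresses


def eligible(unit):
    """Only units without address references may move to the common file."""
    return not unit.refs_addresses


def build_common_file(files):
    """Pass 1: append eligible units from every file, then merge duplicates.

    `files` maps a file name to its list of partial units and is left
    unchanged, so the units are still present in the original files.
    Returns {unit data: offset in the common file}.
    """
    appended = []
    for units in files.values():
        appended.extend(u.data for u in units if eligible(u))

    common = {}
    offset = 0
    for data in appended:
        if data not in common:        # duplicate units collapse to one copy
            common[data] = offset
            offset += len(data)
    return common


def rewrite_file(units, common):
    """Pass 2: drop units found in the common file, record alt offsets.

    Returns (units kept in the file, offsets to use in the alt forms).
    """
    kept = []
    alt_offsets = []
    for unit in units:
        if eligible(unit) and unit.data in common:
            alt_offsets.append(common[unit.data])
        else:
            kept.append(unit)
    return kept, alt_offsets
```

Pass 2 only needs the reduced common file and one input file at a time,
which is the point of splitting the work with respect to memory use.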
Of course the separate common file would need to contain not
just .debug_info and .debug_abbrev sections, but also some minimal
.debug_line section (not containing actual line instructions, but
dir/file tables).
My preference would be 2). What do you think?
Jakub