25449 – Factor out compilation units

Status:	NEW

Alias:	None

Product:	dwz
Classification:	Unclassified
Component:	default
Version:	unspecified

Importance:	P2 enhancement
Target Milestone:	---
Assignee:	Nobody

Reported:	2020-01-23 09:23 UTC by Tom de Vries
Modified:	2024-02-03 13:47 UTC
CC List:	2 users

Description Tom de Vries 2020-01-23 09:23:32 UTC

The DWARF standard contains "Appendix E -- DWARF Compression and Duplicate Elimination (informative)", which describes a technique for generating smaller debug information:
...
to break up the debug information of a compilation into separate normal and partial compilation units, each consisting of one or more sections. By arranging that a sufficiently similar partitioning occurs in other compilations, a suitable system linker can delete redundant groups of sections when combining object files.
...

DWZ implements this scheme, but with the approach (described in pre-link terms in the appendix) applied post-link.

It does this by:
- moving common DIEs into partial units (tagged with DW_TAG_partial_unit),
- generating DW_TAG_imported_unit/DW_AT_import to import the partial units into the compilation units which originally contained the DIEs, and
- referencing DIEs in partial units using DW_FORM_ref_addr, when referenced from the originally containing compilation units or other partial units.
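
The first two steps above can be sketched roughly as follows. This is a toy model and not dwz's actual data structures: each compilation unit is a dict mapping a DIE name to a content fingerprint, and `factor_out` and the unit label "PU" are made-up names. The DW_FORM_ref_addr rewriting of the third step is not modelled.

```python
from collections import defaultdict


def factor_out(cus):
    """Move DIEs duplicated across CUs into one partial unit.

    `cus` maps a CU name to a dict {die_name: fingerprint}. Hypothetical toy
    model: a real tool compares DIE trees, not strings.
    Returns (partial_unit, rewritten_cus).
    """
    # Record which CUs contain each (name, fingerprint) pair.
    owners = defaultdict(set)
    for cu_name, dies in cus.items():
        for die in dies.items():
            owners[die].add(cu_name)

    # DIEs present in more than one CU are moved to the partial unit.
    common = {die for die, used_by in owners.items() if len(used_by) > 1}
    partial_unit = {"tag": "DW_TAG_partial_unit", "dies": sorted(common)}

    rewritten = {}
    for cu_name, dies in cus.items():
        kept = {n: f for n, f in dies.items() if (n, f) not in common}
        moved = len(kept) != len(dies)
        rewritten[cu_name] = {
            "tag": "DW_TAG_compile_unit",
            "dies": kept,
            # A DW_TAG_imported_unit is only added to CUs that lost a DIE.
            "imports": ["PU"] if moved else [],
        }
    return partial_unit, rewritten
```

For two CUs that both describe the same `int` base type, the type ends up in the partial unit and both CUs gain an import of it.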

However, the appendix also has a section on "Use of DW_TAG_compile_unit versus DW_TAG_partial_unit":
...
A section group compilation unit that uses DW_TAG_compile_unit is like any other compilation unit, in that its contents are evaluated by consumers as though it were an ordinary compilation unit.

An #include directive appearing outside any other declarations is a good candidate to be represented using DW_TAG_compile_unit.

However, an #include appearing inside a C++ namespace declaration or a function, for example, is not a good candidate because the entities included are not necessarily file level entities.

<SNIP>

Consequently a compiler must use DW_TAG_partial_unit (instead of DW_TAG_compile_unit) in a section group whenever the section group contents are not necessarily globally visible.

This directs consumers to ignore that compilation unit when scanning top level declarations and definitions.

The DW_TAG_partial_unit compilation unit will be referenced from elsewhere and the referencing locations give the appropriate context for interpreting the partial compilation unit.
...

So, there also is an option to tag the created units with DW_TAG_compile_unit instead of DW_TAG_partial_unit, which means no requirement to create DW_TAG_imported_unit/DW_AT_import for such units, which means better compression.

The first C++ example in the appendix shows this situation, and states:
...
This example uses DW_TAG_compile_unit for the section group, implying that the contents of the compilation unit are globally visible (in accordance with C++ language rules). DW_TAG_partial_unit is not needed for the same reason.
...

So, as a first step we could do an optimization in DWZ to look at all the items that are selected to be moved into a PU (partial unit), and decide whether we can transform the PU into a CU (compilation unit) and drop the imports.

Comment 1 Tom de Vries 2020-01-23 10:55:55 UTC

(In reply to Tom de Vries from comment #0)
> So, as a first step we could do an optimization in DWZ to look at all the
> items that are selected to be moved into a PU, and decide whether we can
> transform the PU into a CU and drop the imports.

One of the requirements probably has to be that the items are from a single language.
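
A check for this requirement might look like the sketch below. The helper name `items_share_language` and the (name, language) pairs are hypothetical; in practice the language would presumably be read from the DW_AT_language attribute of each originating CU.

```python
def items_share_language(items):
    """Return True if all factored-out items come from CUs of one language.

    `items` is a list of (die_name, source_language) pairs. Both fields are
    illustrative stand-ins, not dwz data structures.
    """
    languages = {language for _name, language in items}
    return len(languages) == 1
```

An empty selection returns False here, on the assumption that there is nothing to promote in that case.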

> An #include directive appearing outside any other declarations is a good
> candidate to be represented using DW_TAG_compile_unit.
> 
> However, an #include appearing inside a C++ namespace declaration or a
> function, for example, is not a good candidate because the entities included
> are not necessarily file level entities.

The appendix suggests DIEs in a namespace are not good candidates, but I think what it is trying to say is that if we originally have a DIE in a namespace:
...
DIE2: compilation unit B
  DIE3: namespace bla
    DIE1
...
and do some factoring out like so:
...
DIE0: factored-out unit A
  DIE1
DIE2: compilation unit B
  DIE3: namespace bla
    DIE4: import DIE0
...
the factored-out unit cannot use DW_TAG_compile_unit, because DIE1 is not a globally visible entry.

However, dwz generates this type of partial unit:
...
DIE0: partial unit A
  DIE3: namespace bla
    DIE1
DIE2: compilation unit B
  DIE4: import DIE0
...
which basically works around this problem, and I don't see a reason here why unit A can't be a compilation unit.
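
The layout dwz generates amounts to replicating each DIE's enclosing scopes inside the factored-out unit, so that the import can sit at the top level of the importing unit. A minimal sketch, assuming DIEs are identified by their ancestor path (the function name `build_unit` and the nested-dict representation are made up for illustration):

```python
def build_unit(paths):
    """Build a factored-out unit that keeps each DIE's enclosing scopes.

    Each entry of `paths` lists the scopes from outermost to innermost,
    ending in the DIE itself, e.g. ["namespace bla", "DIE1"].
    DIEs that share a scope are merged under one copy of that scope.
    """
    unit = {}
    for path in paths:
        node = unit
        for name in path:
            # Reuse the scope if an earlier path already created it.
            node = node.setdefault(name, {})
    return unit
```

With this shape every top-level entry of the unit is a file-level entity again, which is the property the appendix asks for before DW_TAG_compile_unit may be used.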

Comment 2 Tom de Vries 2020-01-23 11:49:22 UTC

(In reply to Tom de Vries from comment #0)
> So, there also is an option to tag the created units with
> DW_TAG_compile_unit instead of DW_TAG_partial_unit, which means no
> requirement to create DW_TAG_imported_unit/DW_AT_import for such units,
> which means better compression.

The part about "no requirement to create DW_TAG_imported_unit/DW_AT_import for such units" is not entirely trivial.

In appendix E we find:
...
Use of DW_TAG_imported_unit

A DW_TAG_imported_unit debugging information entry has an DW_AT_import attribute referencing a DW_TAG_compile_unit or DW_TAG_partial_unit debugging information entry.

A DW_TAG_imported_unit debugging information entry refers to a DW_TAG_compile_unit or DW_TAG_partial_unit debugging information entry to specify that the DW_TAG_compile_unit or DW_TAG_partial_unit contents logically appear at the point of the DW_TAG_imported_unit entry.
...

So, it's possible to do an import of a compile unit.

But in the first C++ example in E.1, the import for the compilation unit is missing, while in the first Fortran example in E.1, the import for the partial unit is included.

Furthermore, at 3.1.1 Normal and Partial Compilation Unit Entries, we have:
...
A compilation unit entry owns debugging information entries that represent all or part of the declarations made in the corresponding compilation. In the case of a partial compilation unit, the containing scope of its owned declarations is indicated by imported unit entries in one or more other compilation unit entries that refer to that partial compilation unit.
...

Some explanation of when an import is used and when it is not appears in E.1, "C example":
...
The C++ example in this Section might appear to be equally valid as a C example. However, it is prudent to include a DW_TAG_imported_unit in the primary unit (see Figure 84) with an DW_AT_import attribute that refers to the proper unit in the section group.

The C rules for consistency of global (file scope) symbols across compilations are less strict than for C++; inclusion of the import unit attribute assures that the declarations of the proper section group are considered before declarations from other compilations.
...

So, the gist of this seems to be:
- factored out partial unit: needs import
- factored out compilation unit:
  - prudent to import from C compilation unit (but we can have, for instance, a
    command line option to not do this, and see what breaks)
  - not required from C++ compilation unit
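
The rules above could be written down as a small decision function. This is only a sketch of the summary, not dwz code: `needs_import` and the `prudent_c_import` flag (standing in for the command line option floated above) are hypothetical names, and treating languages other than C and C++ conservatively is my assumption, since the discussion does not cover them.

```python
def needs_import(unit_tag, importer_language, prudent_c_import=True):
    """Decide whether a unit needs a DW_TAG_imported_unit for a factored-out unit.

    unit_tag: tag of the factored-out unit.
    importer_language: language of the unit that used to own the DIEs.
    prudent_c_import: hypothetical switch; False drops the import for C too.
    """
    if unit_tag == "DW_TAG_partial_unit":
        # A partial unit only gets its context from the importing units.
        return True
    if unit_tag != "DW_TAG_compile_unit":
        raise ValueError(f"unexpected unit tag: {unit_tag}")
    if importer_language == "C":
        return prudent_c_import
    if importer_language == "C++":
        return False
    # Assumption: stay conservative for languages not discussed here.
    return True
```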

Comment 3 Tom de Vries 2020-01-23 13:07:11 UTC

(In reply to Tom de Vries from comment #2)
> So, the gist of this seems to be:
> - factored out partial unit: needs import
> - factored out compilation unit:
>   - prudent to import from C compilation unit (but we can have, for
>     instance, a command line option to not do this, and see what breaks)

Alternatively, we can try to prove that we don't need the import. If all the elements in the factored-out compilation unit are uniquely named in link scope, there's no confusion about which one is meant, and we don't need the import.
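
One way to read that uniqueness condition is sketched below. The function name and the flat list of link-scope names are hypothetical; how such a list would actually be collected from the DWARF is left open.

```python
from collections import Counter


def import_can_be_dropped(factored_names, link_scope_names):
    """Return True if every factored-out name is unique in link scope.

    factored_names: names of the top-level entities in the factored-out CU.
    link_scope_names: names of all file-scope entities across all units
    after factoring, including those of the factored-out CU itself.
    """
    counts = Counter(link_scope_names)
    # A count of 1 means the only entity with that name is the factored one.
    return all(counts[name] == 1 for name in factored_names)
```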

>   - not required from C++ compilation unit

Conversely, this may cause problems because there may be different DIEs with the same globally unique name which are not structurally equivalent. This happens for instance with member function templates, where a DIE in one CU representing a named struct can have extra members representing the various member function template instantiations in the CU, making the DIE potentially different from other DIEs representing the same named struct in other CUs. [ Which is why we want a --odr-mode=unify option. ]
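
To illustrate the problem, the sketch below flags names whose DIEs differ in member lists across CUs. It is a toy model: `odr_conflicts` and the (cu, name, members) tuples are made up, and it only detects the mismatch; it does not model what an --odr-mode=unify option would do.

```python
from collections import defaultdict


def odr_conflicts(structs):
    """Return the names that occur with more than one member set.

    `structs` is a list of (cu_name, struct_name, members) tuples, where
    `members` is an iterable of member names as seen in that CU.
    """
    shapes = defaultdict(set)
    for _cu, name, members in structs:
        shapes[name].add(frozenset(members))
    return sorted(name for name, variants in shapes.items() if len(variants) > 1)
```

A struct `S` that carries an extra member function template instantiation in one CU would be reported, while a struct that looks the same everywhere would not.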