Bug 33661 - "pretend language" feature is obsolete
Status: RESOLVED FIXED
Alias: None
Product: gdb
Classification: Unclassified
Component: symtab
Version: HEAD
Importance: P2 normal
Target Milestone: 18.1
Assignee: Tom Tromey
Blocks: 30728
Reported: 2025-11-24 07:47 UTC by Tom Tromey
Modified: 2026-01-23 16:14 UTC

Description Tom Tromey 2025-11-24 07:47:57 UTC
The DWARF reader has some "pretend language" code that tries
to find the language for a partial unit.  The idea is that the
partial unit won't have a language setting of its own,
so the importing CU's language should be used.

This approach is now wrong on two counts.

First, a PU's language cannot be set as dwarf2_per_cu::set_lang
explicitly disallows this.

Second, gdb.fortran/mixed-lang-stack.exp, when run
with dwz (or maybe dwz-5, I don't recall) will create
a file where C and C++ units import the same PU.
So, a consistent language cannot be assigned.

Now, the specific case of a C/C++ clash could be
resolved by assuming "C".

Another approach is to re-read all the units for each
language, though this could hugely impact the scanning time.
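
To make the clash concrete, here is a minimal C++ sketch. It is not gdb code: the Lang enum and language_for_partial_unit are hypothetical names, and the only policy it encodes is the "assume C for a C/C++ clash" idea from the description.

```cpp
#include <optional>
#include <set>

// Hypothetical stand-in for gdb's language enum; only what the sketch needs.
enum class Lang { C, Cplus, Fortran };

// Given the languages of every CU that imports one partial unit, return the
// single language that could be assigned to that PU, or std::nullopt when no
// consistent choice exists.
std::optional<Lang>
language_for_partial_unit (const std::set<Lang> &importers)
{
  if (importers.empty ())
    return std::nullopt;

  // All importers agree, so there is nothing to resolve.
  if (importers.size () == 1)
    return *importers.begin ();

  // The specific C/C++ clash: fall back to C (assumption from the text).
  if (importers == std::set<Lang> { Lang::C, Lang::Cplus })
    return Lang::C;

  // Any other mix has no obvious answer.
  return std::nullopt;
}
```

With this policy a PU imported from both C and C++ units gets C, while a mix such as C and Fortran still has no answer, which is the underlying problem.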
Comment 1 Simon Marchi 2025-11-25 03:37:29 UTC
(In reply to Tom Tromey from comment #0)
> Another approach is to re-read all the units for each
> language, though this could hugely impact the scanning time.

Huh, I thought that this is how it worked: that the partial unit was re-read in the context of each DW_AT_import that references it, as if it were really copy-pasted at each import location.

I see that in DWARF 5, a partial unit can have a DW_AT_language.  Do we ever use this one?
Comment 2 Tom Tromey 2025-11-25 22:50:48 UTC
(In reply to Simon Marchi from comment #1)
> (In reply to Tom Tromey from comment #0)
> > Another approach is to re-read all the units for each
> > language, though this could hugely impact the scanning time.
> 
> Huh, I thought that this is how it worked, that the partial unit was re-read
> in the context of each time it's DW_AT_import'ed, as if it was really copy
> pasted at each import location.

That's how DWARF seems to conceptualize it but this is very expensive
and gdb's implementation just assumes that each import is only at the
top level, since this is what dwz does and dwz is the only known producer
of this output.
 
> I see that in DWARF 5, a partial unit can have a DW_AT_language.  Do we ever
> use this one?

Yes, if the DIE has the attribute, it will be used.
See cutu_reader::prepare_one_comp_unit
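
The lookup order described here can be sketched roughly as follows. This is an illustration only, not the body of cutu_reader::prepare_one_comp_unit: the Lang enum and effective_language are hypothetical names, and falling back to a "minimal" language when nothing is known is an assumption of the sketch.

```cpp
#include <optional>

// Hypothetical stand-in for gdb's language enum.
enum class Lang { Minimal, C, Cplus };

// DIE_LANGUAGE is the value of DW_AT_language on the unit's top-level DIE,
// if the attribute is present.  PRETEND_LANGUAGE is the importing CU's
// language, if one is known.
Lang
effective_language (std::optional<Lang> die_language,
                    std::optional<Lang> pretend_language)
{
  // An explicit attribute on the DIE always wins.
  if (die_language.has_value ())
    return *die_language;

  // Otherwise inherit from the importer, or fall back to a minimal
  // language (assumption of this sketch).
  return pretend_language.value_or (Lang::Minimal);
}
```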
Comment 3 Simon Marchi 2025-11-26 04:10:57 UTC
(In reply to Tom Tromey from comment #2)
> (In reply to Simon Marchi from comment #1)
> > (In reply to Tom Tromey from comment #0)
> > > Another approach is to re-read all the units for each
> > > language, though this could hugely impact the scanning time.
> > 
> > Huh, I thought that this is how it worked, that the partial unit was re-read
> > in the context of each time it's DW_AT_import'ed, as if it was really copy
> > pasted at each import location.
> 
> That's how DWARF seems to conceptualize it but this is very expensive
> and gdb's implementation just assumes that each import is only at the
> top level, since this is what dwz does and dwz is the only known producer
> of this output.

But it would be just as expensive as if the de-duplication didn't happen, right?

>  
> > I see that in DWARF 5, a partial unit can have a DW_AT_language.  Do we ever
> > use this one?
> 
> Yes, if the DIE has the attribute, it will be used.
> See cutu_reader::prepare_one_comp_unit

I had another question while preparing the DW_IDX_* DWARF proposal.  I don't really see how an index is supposed to work with partial units.  Suppose that you have two CUs that you run through a compression tool like dwz.  The tool identifies a common sub-tree between the two CUs.  It moves that sub-tree to a partial unit and replaces the two instances in the CUs with some DW_TAG_imported_unit DIEs.  The sub-tree moved to the PU contains some names that should be indexed.  What should be in the (.debug_names) index?

Index entries must point to a specific CU, and to a specific DIE by giving the offset within that CU.  Should partial units be in the index CU list?  Partial units are actually called "partial compilation units" (as opposed to full compilation units), so based on the vocabulary... yes?

But even if an index entry pointed to a DIE in a PU, would it be useful to consumers?  I guess not, since the contents of the PU don't make sense on their own.

I was thinking that such an entry would also need to reference a full CU that imports the PU, so that the consumer knows how to reach that DIE with the right context.

It would be similar to how it works with foreign type units.  From DWARF 5:

> When an index entry refers to a foreign type unit, it may have attributes for both CU and (foreign) TU. For such entries, the CU attribute gives the consumer a reference to the CU that may be used to locate a split DWARF object file that
> contains the type unit.
Comment 4 Tom Tromey 2025-11-26 15:19:20 UTC
> But it would be just as expensive as if the de-duplication didn't happen,
> right?

I think it would have to be more expensive since (1) each new PU has
some fixed overhead and (2) a PU might not be minimal so excess reading
may be required.

Anyway this is an option if we want it.

> I had another question while preparing the DW_IDX_* DWARF proposal.  I don't
> really see how an index is supposed to work with partial units.  Suppose
> that you have two CUs that you run through a compression tool like dwz.  The
> tool identifies a common sub-tree between the two CUs.  It moves that
> sub-tree to a partial unit and replaces the two instances in the CUs with
> some DW_TAG_imported_unit DIEs.  The sub-tree moved to the PU contains some
> names that should be indexed.  What should be in the (.debug_names) index?

This was mentioned in that patch I linked to in the other bug.
Here's the text I'll send when that series is ready:

+@item
+Definitions in partial units are handled differently.  These most
+typically are seen in the output of @code{dwz}.
+
+In general, a DWARF partial unit cannot be read in isolation, but only
+by reading it in the context of some other unit that references it via
+@code{DW_TAG_imported_unit}.
+
+Therefore, an ordinary definition in a partial unit is attributed to
+one of the outermost containing units.  This is done by referencing
+this containing CU in the @code{DW_IDX_compile_unit} attribute.
+
+A further special case applies to @code{DW_TAG_inlined_subroutine}
+entries.  An inlined subroutine appearing in a partial unit may be
+inlined in all of the outermost compilation units that directly or
+indirectly include the partial unit.  Therefore, in this case,
+@value{GDBN} will emit a separate index entry for the entry, once for
+each such containing unit.
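
As a rough illustration of the attribution rule in that text, the sketch below computes which full CUs directly or indirectly import a partial unit, then picks the owners of an index entry. It is not the index writer's code: Unit, UnitMap, outermost_importers and index_entry_owners are hypothetical names, units are identified by plain strings, and choosing the first importer in sorted order for an ordinary definition is an arbitrary assumption (the text only says "one of").

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical model of a unit: whether it is partial, and which units it
// imports via DW_TAG_imported_unit.
struct Unit
{
  bool is_partial;
  std::vector<std::string> imports;
};

using UnitMap = std::map<std::string, Unit>;

// Collect NAME and everything reachable from it through imports.
static void
collect_reachable (const UnitMap &units, const std::string &name,
                   std::set<std::string> &seen)
{
  if (!seen.insert (name).second)
    return;  // Already visited; also guards against import cycles.
  auto it = units.find (name);
  if (it == units.end ())
    return;
  for (const std::string &imported : it->second.imports)
    collect_reachable (units, imported, seen);
}

// Names of the full (non-partial) CUs that directly or indirectly import PU.
std::set<std::string>
outermost_importers (const UnitMap &units, const std::string &pu)
{
  std::set<std::string> result;
  for (const auto &entry : units)
    {
      if (entry.second.is_partial)
        continue;
      std::set<std::string> reachable;
      collect_reachable (units, entry.first, reachable);
      if (reachable.count (pu) != 0)
        result.insert (entry.first);
    }
  return result;
}

// CUs to name in DW_IDX_compile_unit for a definition found in PU: one
// containing CU for an ordinary definition, every containing CU for an
// inlined subroutine.
std::set<std::string>
index_entry_owners (const UnitMap &units, const std::string &pu,
                    bool is_inlined_subroutine)
{
  std::set<std::string> all = outermost_importers (units, pu);
  if (is_inlined_subroutine || all.empty ())
    return all;
  return { *all.begin () };
}
```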
Comment 5 Tom Tromey 2025-11-26 15:59:47 UTC
Found the other bug I mentioned.
Comment 6 Tom Tromey 2025-11-26 16:06:21 UTC
My series for #30728 touches on this area a little.
There I reason that if a PU is shared across languages,
it most likely is semantically valid for both, and so
the discrepancy can be ignored.

While this is probably true in practice, note that
it's not really guaranteed.  For example an array
type could be shared by C and Ada, and if the lower
bound were omitted it would validly describe two
different types.  I consider this unlikely to happen,
though, given the practicalities of the DWARF output
(e.g., Ada emits encoded names only).
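
For illustration, the array case comes down to the language-dependent default for an omitted DW_AT_lower_bound (0 for C and C++, 1 for Ada and Fortran in DWARF). A minimal sketch, with Lang, default_lower_bound and element_count as hypothetical names rather than gdb functions:

```cpp
#include <cstdint>

// Hypothetical stand-in for gdb's language enum.
enum class Lang { C, Cplus, Ada, Fortran };

// Default lower bound of a subrange whose DW_AT_lower_bound is omitted.
std::int64_t
default_lower_bound (Lang lang)
{
  switch (lang)
    {
    case Lang::C:
    case Lang::Cplus:
      return 0;
    case Lang::Ada:
    case Lang::Fortran:
      return 1;
    }
  return 0;  // Not reached for the enumerators above.
}

// Number of elements in an array whose subrange DIE only carries an
// upper bound.
std::int64_t
element_count (Lang lang, std::int64_t upper_bound)
{
  return upper_bound - default_lower_bound (lang) + 1;
}
```

The same DIE with an upper bound of 9 and no lower bound describes 10 elements when read as C and 9 when read as Ada, so the shared PU would be syntactically identical but semantically different.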
Comment 8 Sourceware Commits 2026-01-23 16:13:26 UTC
The master branch has been updated by Tom Tromey <tromey@sourceware.org>:

https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;h=329a53a6d590e2e90f590c89473990040a86c8e0

commit 329a53a6d590e2e90f590c89473990040a86c8e0
Author: Tom Tromey <tom@tromey.com>
Date:   Sat Nov 22 11:03:57 2025 -0700

    Some cleanups to "pretend language" handling
    
    I noticed that the "pretend language" handling in the DWARF reader
    doesn't work as intended; the problem code in dwarf2_per_cu::set_lang
    is:
    
      if (unit_type () == DW_UT_partial)
        return;
    
    The issue here is that this subverts the very purpose of having a
    "pretend" language.
    
    Some background: when Jakub wrote dwz, we also added support for this
    style of DWARF compression to gdb.  Now, dwz only shares DIEs in a
    "top level" way -- i.e., at the time (and as far as I know, continuing
    to today), it would not emit a DW_TAG_imported_unit inside a
    namespace.  So, when implementing this we also implemented an
    optimization, namely that gdb would not re-read every imported unit a
    la '#include', but instead would make symtabs for each included unit
    (partial units didn't yet exist).
    
    However, an imported/partial unit might not have a language -- but a
    language is necessary for interpreting the DIEs.  This is where the
    "pretend" language comes from.  When reading a CU, any included
    partial units that do not have a language of their own will inherit
    that CU's language.
    
    This patch started by removing the DW_UT_partial check.  This of
    course caused assertion failures in some modes, as set_lang also
    asserts that the language cannot change.  But, it's possible for a CU
    to be prepared multiple times, and for different invocations to
    provide different languages.
    
    This is not a scenario we allowed for in the early days.  Nowadays,
    though, it seems to me that it's basically fine in practice, with the
    reason being that sharing DIEs that differ semantically but not
    syntactically across different languages is hard to achieve.
    
    We do see some cross-language sharing in a limited way -- "dwz
    -5" will emit inclusions from both C and C++ CUs for the
    gdb.fortran/mixed-lang-stack.exp test -- but note that this sharing is
    limited to things that are common between C and C++, like "float".
    
    Therefore this patch replaces the assertions in set_lang with some
    compare-exchanges.
    
    Finally I changed cutu_reader to use a std::optional for the pretend
    language.  I think this makes it more clear what is happening.  And,
    while doing this I found a spot in the cooked indexer where
    language_minimal was passed in, but where the importing CU's language
    should have been used.
    
    I regression tested this on x86-64 Fedora 40 using the default board,
    plus the cc-with-gdb-index, cc-with-debug-names, and cc-with-dwz-5
    boards.
    
    Bug: https://sourceware.org/bugzilla/show_bug.cgi?id=33661
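
For readers unfamiliar with the idiom, a compare-exchange in place of an assertion might look like the sketch below. This is not the code from the commit: PerCu, its Lang enum, the Unknown sentinel and the first-writer-wins policy are all assumptions made for illustration.

```cpp
#include <atomic>

// Hypothetical stand-in for gdb's language enum, with an "unset" sentinel.
enum class Lang { Unknown, Minimal, C, Cplus };

class PerCu
{
public:
  // Record LANG if no language has been recorded yet.  Returns true if the
  // stored language equals LANG afterwards, false if an earlier call already
  // stored a different one.  Unlike an assertion, a mismatch is tolerated.
  bool set_lang (Lang lang)
  {
    Lang expected = Lang::Unknown;
    if (m_lang.compare_exchange_strong (expected, lang))
      return true;
    // The exchange failed, so EXPECTED now holds the stored value.
    return expected == lang;
  }

  Lang lang () const { return m_lang.load (); }

private:
  std::atomic<Lang> m_lang { Lang::Unknown };
};
```

In this sketch, a unit prepared first from a C importer and later from a C++ importer simply keeps C, rather than aborting.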
Comment 9 Tom Tromey 2026-01-23 16:14:32 UTC
Fixed.