Bug 23710 - gdb is slow and memory hungry consuming debug generated with LTO by GCC
Summary: gdb is slow and memory hungry consuming debug generated with LTO by GCC
Status: NEW
Alias: None
Product: gdb
Classification: Unclassified
Component: gdb
Version: 8.2.1
Importance: P2 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-09-25 13:11 UTC by Richard Biener
Modified: 2023-01-19 02:25 UTC
CC List: 12 users

See Also:
Host:
Target:
Build:
Last reconfirmed:


Description Richard Biener 2018-09-25 13:11:35 UTC
"Mirror" of https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87432 where we try to improve the GCC side.  Input from gdb folks is needed here.

gdb takes ~10s to process an LTO-bootstrapped cc1 binary and another two seconds
when setting the first breakpoint.  It also has allocated 1.6GB of memory at
that point (compared to ~200MB for a non-LTO binary).
Comment 1 Richard Biener 2018-09-25 13:57:04 UTC
There's http://www.suse.de/~rguenther/cc1.xz you can look at.  It will eventually run into PR23712 if you poke around enough, but it survives plain gdb startup and 'start' for me.
Comment 2 Jan Hubicka 2019-01-20 08:42:51 UTC
I can re-confirm this with current trunk of gdb and current trunk of gcc. Things got even slower between gcc 8 and gcc 9. I have uploaded a fresh cc1plus binary to http://www.ucw.cz/~hubicka/cc1plus.gz

It takes a while to load the binary, and then it takes a while each time function names are output - e.g. when adding a breakpoint, looking at backtraces, etc.

Things are a lot worse on Firefox's libxul, which I can upload too (it also reproduces with the Red Hat Fedora package, which is now LTO-built).

LLDB seems to have no problems. It would be great to have this fixed, since it is quite a problem for the adoption of LTO.
Comment 3 Jan Hubicka 2019-01-23 20:43:11 UTC
I have spent some time looking into this.  Note that it also affects gdb-index generation, which can take well over 10 minutes on bigger binaries built with LTO (3 minutes and 1.4GB on cc1plus).

If I comment out the code in load_partial_dies that constructs the partial_die_info data structure, startup gets under control.  It seems that this code processes every DW_TAG_imported_unit and for each of them allocates the info data structure, which by itself is 104 bytes.

There are 21830 DW_TAG_imported_unit DIEs in the cc1plus binary built with LTO. We default to 128 partitions, so that is about 170 per partition.

For some reason 48623044 partial die infos are constructed during gdb-index generation, so it seems gdb is tripping over the same partial DIEs over and over again?
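As a sanity check on those numbers, a short sketch (the constants are the figures reported above; the derived quantities are back-of-the-envelope estimates, not measurements):

```cpp
#include <cstddef>
#include <cstdint>

// Figures reported in this comment.
constexpr std::uint64_t imported_units   = 21830;     // DW_TAG_imported_unit DIEs
constexpr std::uint64_t partitions       = 128;       // default LTO partition count
constexpr std::uint64_t partial_dies     = 48623044;  // partial_die_info constructions
constexpr std::size_t   partial_die_size = 104;       // bytes per partial_die_info

// Imports per partition: ~170, as stated above.
constexpr std::uint64_t imports_per_partition = imported_units / partitions;

// If every constructed partial_die_info stayed live, the raw allocation
// volume would be ~4.8 GiB -- well above the 1.4GB actually observed, so
// most of these allocations must be freed or reused.
constexpr std::uint64_t raw_alloc_mib
  = partial_dies * partial_die_size / (1024 * 1024);

// On average, each imported unit's DIEs would have to be visited ~2227
// times to account for the construction count, which supports the
// "same partial DIEs over and over" hypothesis.
constexpr std::uint64_t revisit_factor = partial_dies / imported_units;
```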
Comment 4 rguenther 2019-01-24 08:51:20 UTC
On Wed, 23 Jan 2019, hubicka at gcc dot gnu.org wrote:

> https://sourceware.org/bugzilla/show_bug.cgi?id=23710
> 
> --- Comment #3 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
> I have spent some time looking into this.  Note that it also affect gdb index
> generation that can take well over 10 minutes on bigger binaries built with
> LTO, 3 minutes and 1.4GB on cc1plus.
> 
> If I comment out code in load_partial_dies that construct partial_die_info
> data-structure the startup gets under control.  It seems that this is
> processing every DW_TAG_imported_unit and for each of it it allocates the info
> data-structure that by itself is 104 bytes.
> 
> There are 21830 DW_TAG_imported_unit in cc1plus binary built with LTO. We
> default to 128 partitions, so it is 170 per partition.
> 
> For some reason 48623044 partial die infos are constructed during gdb index, so
> it seems it is tripping same partial dies over and over again?

That sounds like a possible explanation.

Still it would be interesting to see why so many (well, 170 per partition
isn't _so_ many) TRANSLATION_UNIT_DECLs end up in the individual
LTRANS units (they'll be the ultimate origin of some decl).
Comment 5 Tom de Vries 2020-02-06 14:29:30 UTC
(In reply to Richard Biener from comment #0)
> gdb takes ~10s to process a LTO bootstrapped cc1 binary and another two
> seconds
> when setting the first breakpoint.  It also has allocated 1.6GB memory at
> that point (compared to ~200MB for a non-LTO binary).

There's a setting "maint set/show dwarf max-cache-age" which defaults to 5.

Using a higher setting, I get the following reduction in real execution time:
- 10    :  1.5%
- 100   : 12.5%
- 316   : 16.5%
- 1000  : 16.5%
- 10000 : 16.5%
- 100000: 15.5%

Note: I add the setting to the gdb command line using -iex to make sure it gets set _before_ loading the executable:
...
$ gdb -q -nw -nx -batch -iex "maint set dwarf max-cache-age $n" -ex "b do_rpo_vn" cc1
...

Conversely, disabling the cache by setting the value to 0 increases real execution time by 46%.
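The shape of those numbers is what one would expect from an age-based cache. A toy model (a simplification for illustration, not gdb's actual code: gdb ages loaded CUs as units are processed and frees those older than max-cache-age) shows why a small limit hurts binaries whose CUs reference each other repeatedly:

```cpp
#include <map>
#include <vector>

// Toy model of an age-based CU cache like the one controlled by
// "maint set dwarf max-cache-age".  After each access, every cached CU ages
// by one step and CUs older than max_age are evicted.  Returns how many
// times a CU's DIEs had to be (re)read.
int count_cu_loads (const std::vector<int> &accesses, int max_age)
{
  std::map<int, int> age;  // CU id -> steps since last use
  int loads = 0;
  for (int cu : accesses)
    {
      if (age.find (cu) == age.end ())
        ++loads;            // cache miss: the CU's DIEs must be read again
      age[cu] = 0;          // inserted or refreshed
      for (auto it = age.begin (); it != age.end ();)
        {
          if (++it->second > max_age)
            it = age.erase (it);  // aged out: the CU's DIEs are freed
          else
            ++it;
        }
    }
  return loads;
}
```

With an access pattern that alternates between two mutually referencing CUs, max_age 0 re-reads on every access while even a slightly larger limit reads each CU only once, consistent with the 46% slowdown measured when the cache is disabled.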
Comment 6 Tom de Vries 2020-02-26 12:54:42 UTC
(In reply to Richard Biener from comment #0)
> "Mirror" of https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87432 where we try
> to improve the GCC side.  Input from gdb folks is needed here.
> 
> gdb takes ~10s to process a LTO bootstrapped cc1 binary and another two
> seconds
> when setting the first breakpoint.  It also has allocated 1.6GB memory at
> that point (compared to ~200MB for a non-LTO binary).

The proposed patch at https://sourceware.org/ml/gdb-patches/2020-02/msg00974.html allows a speedup when manually specifying the source language before loading, which takes 17% off the execution time of loading and setting the first breakpoint.
Comment 7 Tom de Vries 2020-02-26 13:38:25 UTC
Measuring the two speed workarounds mentioned in comments 5 and 6, it seems they're complementary:

With loading main full symtab and default max-cache-age:
...
$ ../measure/time.sh ../gdb.sh cc1 -batch -ex "b do_rpo_vn" -iex "maint set dwarf max-cache-age 5"
Breakpoint 1 at 0xd40e30: do_rpo_vn. (2 locations)
maxmem: 1463412
real: 8.92
user: 8.52
system: 0.46
...

With skipping main full symtab and increased max-cache-age:
...
$ ../measure/time.sh ../gdb.sh -iex "set language c++" cc1 -batch -ex "b do_rpo_vn" -iex "maint set dwarf max-cache-age 316"
Breakpoint 1 at 0xd40e30: do_rpo_vn. (2 locations)
maxmem: 1066220
real: 6.34
user: 6.09
system: 0.31
...

That is: a reduction of user execution time by 28.5%.
Comment 8 Tom de Vries 2020-03-02 08:31:45 UTC
(In reply to Tom de Vries from comment #6)
> (In reply to Richard Biener from comment #0)
> > "Mirror" of https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87432 where we try
> > to improve the GCC side.  Input from gdb folks is needed here.
> > 
> > gdb takes ~10s to process a LTO bootstrapped cc1 binary and another two
> > seconds
> > when setting the first breakpoint.  It also has allocated 1.6GB memory at
> > that point (compared to ~200MB for a non-LTO binary).
> 
> The proposed patch at
> https://sourceware.org/ml/gdb-patches/2020-02/msg00974.html allows a speedup
> when manually specifying the source language before loading, which takes 17%
> off the execution time of loading and setting the first breakpoint.

That patch has been accepted.

I've submitted a RFC patch that automates this workaround: https://sourceware.org/ml/gdb-patches/2020-03/msg00009.html .
Comment 9 Tom de Vries 2020-03-02 11:26:23 UTC
The cc1 executable contains CUs importing other CUs (as opposed to PUs). By treating these imports as hints and ignoring them during symtab expansion, we can shave off another 8.3%.

Cumulatively, that gets us to user time reduction of 33.5%.

Tentative patch:
...
diff --git a/gdb/dwarf2/read.c b/gdb/dwarf2/read.c
index 07cee58c1f..a2a6889f73 100644
--- a/gdb/dwarf2/read.c
+++ b/gdb/dwarf2/read.c
@@ -7425,6 +7425,18 @@ process_psymtab_comp_unit (struct dwarf2_per_cu_data *this_cu,
 
   cutu_reader reader (this_cu, NULL, 0, false);
 
+  switch (reader.comp_unit_die->tag)
+    {
+    case DW_TAG_compile_unit:
+      this_cu->unit_type = DW_UT_compile;
+      break;
+    case DW_TAG_partial_unit:
+      this_cu->unit_type = DW_UT_partial;
+      break;
+    default:
+      abort ();
+    }
+
   if (reader.dummy_p)
     {
       /* Nothing.  */
@@ -9760,6 +9772,9 @@ process_imported_unit_die (struct die_info *die, struct dwarf2_cu *cu)
        = dwarf2_find_containing_comp_unit (sect_off, is_dwz,
                                            cu->per_cu->dwarf2_per_objfile);
 
+      if (per_cu->unit_type == DW_UT_compile)
+       return;
+
       /* If necessary, add it to the queue and load its DIEs.  */
       if (maybe_queue_comp_unit (cu, per_cu, cu->language))
        load_full_comp_unit (per_cu, false, cu->language);
diff --git a/gdb/dwarf2/read.h b/gdb/dwarf2/read.h
index 00652c2b45..58b80d4821 100644
--- a/gdb/dwarf2/read.h
+++ b/gdb/dwarf2/read.h
@@ -323,6 +323,8 @@ struct dwarf2_per_cu_data
      dummy CUs (a CU header, but nothing else).  */
   struct dwarf2_cu *cu;
 
+  enum dwarf_unit_type unit_type;
+
   /* The corresponding dwarf2_per_objfile.  */
   struct dwarf2_per_objfile *dwarf2_per_objfile;
 
...
Comment 10 Tom de Vries 2020-03-02 16:16:27 UTC
(In reply to Tom de Vries from comment #9)
> The cc1 executable contains CUs importing other CUs (as opposed to PUs). By
> treating these imports as hints and ignoring them during symtab
> expansion, we can shave off another 8.3%.
> 
> Cumulatively, that gets us to user time reduction of 33.5%.
> 
> Tentative patch:

Submitted: https://sourceware.org/ml/gdb-patches/2020-03/msg00026.html .
Comment 11 Tom de Vries 2020-03-08 11:09:59 UTC
Comparison, cc1 vs cc1.dwz (produced using dwz build from current master branch):
...
$ diff.sh cc1 cc1.dwz
.debug_info      red: 49.30%    97418416  49399513
.debug_abbrev    red: 42.04%     1699940    985372
.debug_str       red: 0%         6344030   6344030
total            red: 46.21%   105462386  56728915
...

lldb uses roughly the same amount of memory; that is, cc1.dwz uses 5.7% less:
...
$ time.sh lldb -batch cc1 -o "b do_rpo_vn"
(lldb) target create "cc1"
Current executable set to 'cc1' (x86_64).
(lldb) b do_rpo_vn
Breakpoint 1: 3 locations.
maxmem: 519116
real: 2.63
user: 4.21
system: 0.14
$ time.sh lldb -batch cc1.dwz -o "b do_rpo_vn"
(lldb) target create "cc1.dwz"
Current executable set to 'cc1.dwz' (x86_64).
(lldb) b do_rpo_vn
Breakpoint 1: 3 locations.
maxmem: 489596
real: 2.78
user: 4.01
system: 0.10
...

With gdb, the difference is a reduction of 51.9%:
...
$ time.sh gdb cc1 -batch -iex "set language c++" -iex "maint set dwarf max-cache-age 316" -ex "b do_rpo_vn"
Breakpoint 1 at 0xd40e30: do_rpo_vn. (2 locations)
maxmem: 999404
real: 7.03
user: 6.81
system: 0.25
$ time.sh gdb cc1.dwz -batch -iex "set language c++" -iex "maint set dwarf max-cache-age 316" -ex "b do_rpo_vn"
Breakpoint 1 at 0xd40e30: do_rpo_vn. (2 locations)
maxmem: 481152
real: 6.15
user: 6.09
system: 0.12
...
Comment 12 Jan Kratochvil 2020-03-08 13:14:01 UTC
(In reply to Tom de Vries from comment #11)
> $ time.sh lldb -batch cc1.dwz -o "b do_rpo_vn"

FYI, upstream LLDB does not yet support DWZ, so such a measurement is not valid.
There is an off-trunk patchset for it:
  git clone -b dwz git://git.jankratochvil.net/lldb
It works but it is still being refactored to get it accepted upstream.
Comment 13 Sourceware Commits 2020-03-17 07:56:45 UTC
The master branch has been updated by Tom de Vries <vries@sourceware.org>:

https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;h=589902954da0d1dd140b33e578954746c9bfc374

commit 589902954da0d1dd140b33e578954746c9bfc374
Author: Tom de Vries <tdevries@suse.de>
Date:   Tue Mar 17 08:56:36 2020 +0100

    [gdb] Skip imports of c++ CUs
    
    The DWARF standard appendix E.1 describes techniques that can be used for
    compression and deduplication: DIEs can be factored out into a new compilation
    unit, and referenced using DW_FORM_ref_addr.
    
    Such a new compilation unit can either use a DW_TAG_compile_unit or
    DW_TAG_partial_unit.  If a DW_TAG_compile_unit is used, its contents is
    evaluated by consumers as though it were an ordinary compilation unit.  If a
    DW_TAG_partial_unit is used, it's only considered by consumers in the context
    of a DW_TAG_imported_unit.
    
    An example of when DW_TAG_partial_unit is required is when the factored out
    DIEs are not top-level, f.i. because they were children of a namespace.  In
    such a case the corresponding DW_TAG_imported_unit will occur as child of the
    namespace.
    
    In the case of factoring out DIEs from c++ compilation units, we can factor
    out into a new DW_TAG_compile_unit, and no DW_TAG_imported_unit is required.
    
    This begs the question how to interpret a top-level DW_TAG_imported_unit of a
    c++ DW_TAG_compile_unit compilation unit.  The semantics of
    DW_TAG_imported_unit describe that the imported unit logically appears at the
    point of the DW_TAG_imported_unit entry.  But it's not clear what the effect
    should be in this case, since all the imported DIEs are already globally
    visible anyway, due to the use of DW_TAG_compile_unit.
    
    So, skip top-level imports of c++ DW_TAG_compile_unit compilation units in
    process_imported_unit_die.
    
    Using the cc1 binary from PR23710 comment 1 and setting a breakpoint on do_rpo_vn:
    ...
    $ gdb \
        -batch \
        -iex "maint set dwarf max-cache-age 316" \
        -iex "set language c++" \
        -ex "b do_rpo_vn" \
        cc1
    ...
    we get a 8.1% reduction in execution time, due to reducing the number of
    partial symtabs expanded into full symtabs from 212 to 175.
    
    Build and reg-tested on x86_64-linux.
    
    gdb/ChangeLog:
    
    2020-03-17  Tom de Vries  <tdevries@suse.de>
    
            PR gdb/23710
            * dwarf2/read.h (struct dwarf2_per_cu_data): Add unit_type and lang
            fields.
            * dwarf2/read.c (process_psymtab_comp_unit): Initialize unit_type and lang
            fields.
            (process_imported_unit_die): Skip import of c++ CUs.
Comment 14 Tom de Vries 2020-03-20 09:06:06 UTC
(In reply to Tom de Vries from comment #5)
> (In reply to Richard Biener from comment #0)
> > gdb takes ~10s to process a LTO bootstrapped cc1 binary and another two
> > seconds
> > when setting the first breakpoint.  It also has allocated 1.6GB memory at
> > that point (compared to ~200MB for a non-LTO binary).
> 
> There's a setting "maint set/show dwarf max-cache-age" which defaults to 5.
> 
> Using a higher setting, I get the following reduction in real execution time:
> - 10    :  1.5%
> - 100   : 12.5%
> - 316   : 16.5%
> - 1000  : 16.5%
> - 10000 : 16.5%
> - 100000: 15.5%
> 
> Note: adding the setting to the gdb command line using -iex to make sure it
> gets set _before_ loading the exec):
> ...
> $ gdb -q -nw -nx -batch -iex "maint set dwarf max-cache-age $n" -ex "b
> do_rpo_vn" cc1
> ...
> 
> Conversely, disabling the cache by setting the value to 0 causes a real
> execution time increase of 46%.

Filed PR25703 - "set dwarf max-cache-age default of 5 is slow for inter-CU-reference binaries".
Comment 15 Tom de Vries 2020-04-02 12:39:27 UTC
(In reply to Tom de Vries from comment #8)
> (In reply to Tom de Vries from comment #6)
> > (In reply to Richard Biener from comment #0)
> > > "Mirror" of https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87432 where we try
> > > to improve the GCC side.  Input from gdb folks is needed here.
> > > 
> > > gdb takes ~10s to process a LTO bootstrapped cc1 binary and another two
> > > seconds
> > > when setting the first breakpoint.  It also has allocated 1.6GB memory at
> > > that point (compared to ~200MB for a non-LTO binary).
> > 
> > The proposed patch at
> > https://sourceware.org/ml/gdb-patches/2020-02/msg00974.html allows a speedup
> > when manually specifying the source language before loading, which takes 17%
> > off the execution time of loading and setting the first breakpoint.
> 
> That patch has been accepted.
> 
> I've submitted a RFC patch that automates this workaround:
> https://sourceware.org/ml/gdb-patches/2020-03/msg00009.html .

Committed: https://sourceware.org/git/?p=binutils-gdb.git;a=commit;h=d3214198119c1a2f9a6a2b8fcc56d8c324e1a245
Comment 16 rdiezmail-binutils 2020-07-16 12:51:04 UTC
I have been developing a relatively small firmware for an ARM Cortex-M4F for some years, and I did not notice any slowdown up to and including GCC 8.3 and GDB 8.3.

In the past months, I upgraded my toolchain to GCC 9.3 and GDB 9.2, and then I started noticing a big slowdown of several seconds on the first "hbreak myfunction" command. I guess that is when GDB first loads the symbols.

The slowdown only happens with release builds compiled with LTO. With debug builds (with asserts and without LTO), GDB start-up is instantaneous.

These are the GDB RAM usage stats I collected:

Debug    firmware build: VSZ: 101 MiB, RSS  33 MiB.
Release  firmware build: VSZ: 430 MiB, RSS 362 MiB.

I do not understand why there is this difference, because it is exactly the same firmware. The LTO build is smaller and faster, but it has the same symbols (or fewer, because asserts etc. are, per "#ifndef NDEBUG", no longer there).

I hope the patches above fix this issue. But I would say that GDB's handling of LTO builds does not need a 30% speed increase, but more like a 10-fold improvement.
Comment 17 rdiezmail-binutils 2020-11-24 11:03:13 UTC
I did some more tests after GDB 10.1 was released. I am now using GCC 10.2.

GDB 10.1 is faster, but still pauses around 1.5 seconds. My firmware generates an LTO release build .elf file that weighs 10.4 MiB, and GDB uses a lot of RAM: VSZ 620 MiB, RSS 286 MiB.

The debug build .elf weighs 8.9 MiB, which is less even though it has more code because of all the asserts. GDB does not pause at all and uses less RAM: VSZ 372 MiB, RSS 38 MiB.

I tried setting "maintenance set worker-threads 1", because the default has changed with GDB 10.1, but it did not seem to make much of a difference, so it is probably an unrelated setting.
Comment 18 rdiezmail-binutils 2021-04-07 07:25:25 UTC
Could someone clarify the status of this issue? Is a fix going to land soon?

I would also like to know what the last version of GDB is without this issue, if any. And whether it would still work with newer GCC versions.

Yesterday, GDB used too much memory on my PC. It even managed to freeze my Linux system, to the point that not even the mouse cursor was moving. I had to pull the power cord. I don't have a swap file, but the OOM killer did not trigger.

I am using a toolchain similar to this one:

https://github.com/rdiez/JtagDue/blob/master/Toolchain/Makefile

But I am targeting an embedded ARM Cortex-M4 at the moment.

The toolchain component versions I was using were:

  BINUTILS_VERSION := 2.35.1
  GMP_VERSION := 6.2.1
  MPFR_VERSION := 4.1.0
  MPC_VERSION := 1.2.1
  GCC_VERSION := 10.2.0
  NEWLIB_VERSION := 4.0.0
  GDB_VERSION := 10.1

GDB version 10.1 was sluggish with LTO release builds, but because I normally use non-LTO debug builds, I have put up with this issue for a long time.

However, I recently upgraded these toolchain component versions and rebuilt the toolchain:

  BINUTILS_VERSION := 2.36.1
  NEWLIB_VERSION := 4.1.0

After the upgrade, GDB started using gigabytes of memory when programming the firmware.

The embedded firmware I am building has changed too, but it is very similar, so it is hard to say if any particular change, for example #including more files (which could bring in more debug symbols), has triggered the problem.
Comment 19 Tom de Vries 2021-05-31 12:14:50 UTC
(In reply to Tom de Vries from comment #7)
> Breakpoint 1 at 0xd40e30: do_rpo_vn. (2 locations)

Starting with commit 77f2120b200 "Don't drop static function bp locations w/o debug info", we have 3 locations, which is caused by PR26096.
Comment 20 Tom Tromey 2021-06-10 18:35:36 UTC
Tom --

See https://sourceware.org/pipermail/gdb-patches/2021-June/179765.html
I found out that if the test case from this patch is changed
to use DW_LANG_C, it will fail.  (Of course I thinko'd that message
and wrote C++, but the test already uses C++...)

I also don't understand why the DWARF reader check is specific to C++.
It seems like any import of a CU could be skipped, since that CU
will be scanned separately anyway.

Furthermore, the skipping should probably also be done in the psymtab
reader, not just the full reader.
Comment 21 Tom de Vries 2021-06-10 22:59:48 UTC
(In reply to Tom Tromey from comment #20)
> Tom --
> 
> See https://sourceware.org/pipermail/gdb-patches/2021-June/179765.html
> I found out that if the test case from this patch is changed
> to use DW_LANG_C, it will fail.  (Of course I thinko'd that message
> and wrote C++, but the test already uses C++...)
> 

You mean gdb.dwarf2/imported-unit-bp.exp? That test-case uses C, right?

> I also don't understand why the DWARF reader check is specific to C++.
> It seems like any import of a CU could be skipped, since that CU
> will be scanned separately anyway.
> 

I tried to explain that here ( https://sourceware.org/pipermail/gdb-patches/2021-June/179804.html ), and also in the commit log here ( https://sourceware.org/git/?p=binutils-gdb.git;a=commit;h=589902954da0d1dd140b33e578954746c9bfc374 ).

> Furthermore, the skipping should probably also be done in the psymtab
> reader, not just the full reader.

Maybe, not sure yet.
Comment 22 Tom Tromey 2021-06-11 15:02:51 UTC
(In reply to Tom de Vries from comment #21)
> (In reply to Tom Tromey from comment #20)
> > Tom --
> > 
> > See https://sourceware.org/pipermail/gdb-patches/2021-June/179765.html
> > I found out that if the test case from this patch is changed
> > to use DW_LANG_C, it will fail.  (Of course I thinko'd that message
> > and wrote C++, but the test already uses C++...)
> > 
> 
> You mean gdb.dwarf2/imported-unit-bp.exp? That test-case uses C, right?

Yeah, sorry.  I doubly confused myself, I guess.

If you convert that test to use DW_LANG_C_plus_plus, then run it, it will fail.
So it seems to me that this patch had some unintended consequence.
I haven't looked into why, and TBH it doesn't really make sense to me.

> > Furthermore, the skipping should probably also be done in the psymtab
> > reader, not just the full reader.
> 
> Maybe, not sure yet.

Normally the rule is that the psymtab reader and the full symtab reader
must agree.  Now, this case is a bit weird in that nothing really checks
whether a psymtab dependency is really read in.  Though, the above failure
seems to indicate that it may matter.
Comment 23 Tom de Vries 2021-06-22 13:24:06 UTC
(In reply to Tom Tromey from comment #22)
> (In reply to Tom de Vries from comment #21)
> > (In reply to Tom Tromey from comment #20)
> > > Tom --
> > > 
> > > See https://sourceware.org/pipermail/gdb-patches/2021-June/179765.html
> > > I found out that if the test case from this patch is changed
> > > to use DW_LANG_C, it will fail.  (Of course I thinko'd that message
> > > and wrote C++, but the test already uses C++...)
> > > 
> > 
> > You mean gdb.dwarf2/imported-unit-bp.exp? That test-case uses C, right?
> 
> Yeah,sorry.  I double confused myself I guess.
> 
> If you convert that test to use DW_LANG_C_plus_plus, then run it, it will
> fail.
> So it seems to me that this patch had some unintended consequence.
> I haven't looked into why, and TBH it doesn't really make sense to me.
> 
> > > Furthermore, the skipping should probably also be done in the psymtab
> > > reader, not just the full reader.
> > 
> > Maybe, not sure yet.
> 
> Normally the rule is that the psymtab reader and the full symtab reader
> must agree.  Now, this case is a bit weird in that nothing really checks
> whether a psymtab dependency is really read in.  Though, the above failure
> seems to indicate that it may matter.

Submitted patch to fix this: https://sourceware.org/pipermail/gdb-patches/2021-June/180229.html
Comment 24 Joseph Myers 2022-06-17 19:06:21 UTC
I've observed a case of GDB slowness on LTO code, still present with current GDB (testing here with GDB as of commit 2d9cf99d9a6c701de912d3e95ea3ffa134af4c62), that looks a bit different from the cases discussed here.

The customer test case has about 10 MB of text and about 1 GB of debug info in the main C++ application (there are also lots of shared libraries involved).  Using GDB to examine a core dump (with about 300 threads), either "info threads" or "thread apply all bt" is very slow on a binary built with LTO (maybe 10 times slower than on a non-LTO binary) and consumes much more memory.

For the LTO binary and core dump, GDB loads the debug info for many more compilation units than in the non-LTO case, resulting in many more DIEs being loaded, process_die being called many more times (a factor of about 10) and much more time being spent in it (a large proportion of execution time in the LTO case is spent in process_die and its children).

The key difference in the debug info in the LTO and non-LTO cases that causes this is references from the debug info for one CU to the debug info for another CU, as handled by follow_die_offset. In the non-LTO case these don't occur at all. In the LTO case, there are many such references - the greatest proportion are DW_TAG_subprogram, but also various others such as DW_TAG_namespace and DW_TAG_variable.

The key call is in follow_die_offset:

      /* If necessary, add it to the queue and load its DIEs.

         Even if maybe_queue_comp_unit doesn't require us to load the CU's DIEs,
         it doesn't mean they are currently loaded.  Since we require them
         to be loaded, we must check for ourselves.  */
      if (maybe_queue_comp_unit (cu, per_cu, per_objfile, cu->per_cu->lang)
          || per_objfile->get_cu (per_cu) == nullptr)
        load_full_comp_unit (per_cu, per_objfile, per_objfile->get_cu (per_cu),
                             false, cu->per_cu->lang);

This call to load_full_comp_unit gets executed 9960 times in the LTO case, but not at all in the non-LTO case. The other call to load_full_comp_unit that gets executed is the one from load_cu (201 times in the non-LTO case, 150 in the LTO case). So the DIEs from many more CUs are loaded in the LTO case. Then process_full_comp_unit calls process_die 2250 times in the LTO case but only 186 times in the non-LTO case (and that recurses down to process all the DIEs in the CU).

The underlying issue here looks like GDB's strategy of loading all the DIEs from any CU referenced by the debug info of a CU it is loading, rather than somehow, e.g., only selectively loading the DIEs it needs for the particular backtrace being printed.
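That amplification can be made concrete with a toy model (hypothetical structures, not gdb code): treat each inter-CU reference as forcing a full load of the referenced CU, transitively.

```cpp
#include <map>
#include <set>
#include <vector>

// Toy model of the eager strategy described above: expanding one CU fully
// loads every CU its debug info references, and so on transitively.
// `refs` maps a CU id to the CUs its DIEs reference across CU boundaries.
// Returns how many CUs end up fully loaded when expanding `start`.
int cus_loaded_eagerly (const std::map<int, std::vector<int>> &refs, int start)
{
  std::set<int> loaded;
  std::vector<int> queue{start};
  while (!queue.empty ())
    {
      int cu = queue.back ();
      queue.pop_back ();
      if (!loaded.insert (cu).second)
        continue;  // already loaded
      auto it = refs.find (cu);
      if (it != refs.end ())
        for (int dep : it->second)
          queue.push_back (dep);  // cross-CU reference forces a full load
    }
  return (int) loaded.size ();
}
```

In the non-LTO case `refs` is essentially empty and each expansion loads exactly one CU; with the dense cross-references LTO produces, one expansion can transitively pull in a large fraction of all CUs, which is the pattern behind the thousands of extra load_full_comp_unit calls reported above.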
Comment 25 Tom Tromey 2022-06-18 18:13:19 UTC
Changing the full reader to read DIEs on demand is almost doable.
The major problem I see here is that some code looks at a DIE's
parent -- but due to the nature of DWARF, this requires a more
full scan of the DIE tree (there are no parent links in the
.debug_info itself)

Longer term I would like to have gdb make a symtab directly
from the index, and then lazily instantiate symbol contents
when needed.  I think this would be much faster.
Comment 26 Richard Biener 2022-06-20 06:59:58 UTC
(In reply to Tom Tromey from comment #25)
> Changing the full reader to read DIEs on demand is almost doable.
> The major problem I see here is that some code looks at a DIE's
> parent -- but due to the nature of DWARF, this requires a more
> full scan of the DIE tree (there are no parent links in the
> .debug_info itself)

That's indeed an issue.  Still, even here, scanning for the parent
chain up to the CU header shouldn't be too expensive if you do not
materialize all other objects (maybe avoid redundant work by having
'placeholder' DIEs read in with just TAG and sibling info, not populating
any of their actual content, to avoid reading other DIEs).  mmap vs. read
might be another thing to consider here (I also wonder whether the actual
DWARF might be a good enough data format to work with for most parts
of a DIE, to avoid duplicating/exploding this already large data in memory).

> Longer term I would like to have gdb make a symtab directly
> from the index, and then lazily instantiate symbol contents
> when needed.  I think this would be much faster.
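The placeholder-DIE idea from comment 26 can be sketched as follows (hypothetical types for illustration; a real reader would recover the tree shape from abbrevs and sibling info rather than keep it fully materialized like this):

```cpp
#include <vector>

// Lightweight placeholder: just enough structure to answer parent queries,
// without decoding any DIE's attributes.
struct PlaceholderDie
{
  int tag;                               // DW_TAG_* value
  std::vector<PlaceholderDie> children;  // tree shape only
};

// .debug_info has no parent links, so finding a DIE's parent means
// re-scanning the tree from the CU root.  Returns the parent of `target`,
// or nullptr if `target` is the root (or not in the tree).
const PlaceholderDie *
find_parent (const PlaceholderDie &root, const PlaceholderDie *target)
{
  for (const PlaceholderDie &child : root.children)
    {
      if (&child == target)
        return &root;
      if (const PlaceholderDie *p = find_parent (child, target))
        return p;
    }
  return nullptr;
}
```

The scan is linear in the number of DIEs in the worst case, but since only tags and tree shape are touched, none of the unrelated DIEs' contents need to be materialized, which is the trade-off comment 26 suggests.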