Bug 30358 - bfd very slow (> 4 minutes) to link busybox with -Wl,--sort-section,alignment (regression in binutils-2.40) since af31506c31a59a6edbb13498d6075fa704b801cd
Summary: bfd very slow (> 4 minutes) to link busybox with -Wl,--sort-section,alignment...
Status: RESOLVED FIXED
Alias: None
Product: binutils
Classification: Unclassified
Component: ld (show other bugs)
Version: 2.41
: P2 normal
Target Milestone: ---
Assignee: Michael Matz
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2023-04-15 04:14 UTC by Sam James
Modified: 2023-05-06 12:34 UTC (History)
2 users (show)

See Also:
Host:
Target:
Build:
Last reconfirmed:


Attachments
ps faux output (930 bytes, text/plain)
2023-04-15 04:30 UTC, Sam James
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Sam James 2023-04-15 04:14:16 UTC
bfd seems to take > 4 minutes to link busybox-1.35.0 on my machine (AMD 3590X w/ an nvme disk, 64GB RAM), while lld takes 2 seconds.

It seems to peak at around 780MB RAM usage too.

The slow command is:
```
$ time x86_64-pc-linux-gnu-gcc -fuse-ld=bfd -O2 -pipe -march=native -fdiagnostics-color=always -frecord-gcc-switches -Wreturn-type -ggdb3 -fno-strict-aliasing -Werror=format-security -Werror=implicit-function-declaration -Wimplicit-int -Werror=int-conversion -Wformat -Wall -Wshadow -Wwrite-strings -Wundef -Wstrict-prototypes -Wunused -Wunused-parameter -Wunused-function -Wunused-value -Wmissing-prototypes -Wmissing-declarations -Wno-format-security -Wdeclaration-after-statement -Wold-style-definition -finline-limit=0 -fno-builtin-strlen -ffunction-sections -fdata-sections -fno-guess-branch-probability -funsigned-char -fno-unwind-tables -fno-asynchronous-unwind-tables -fno-builtin-printf -static -Wl,-O1 -Wl,--as-needed -Wl,--defsym=__gentoo_check_ldflags__=0 -Wl,-z,pack-relative-relocs -ggdb3 -o busybox_unstripped -Wl,--sort-common -Wl,--sort-section,alignment -Wl,--start-group applets/built-in.o archival/lib.a archival/libarchive/lib.a console-tools/lib.a coreutils/lib.a coreutils/libcoreutils/lib.a debianutils/lib.a klibc-utils/lib.a e2fsprogs/lib.a editors/lib.a findutils/lib.a init/lib.a libbb/lib.a libpwdgrp/lib.a loginutils/lib.a mailutils/lib.a miscutils/lib.a modutils/lib.a networking/lib.a networking/libiproute/lib.a networking/udhcp/lib.a printutils/lib.a procps/lib.a runit/lib.a selinux/lib.a shell/lib.a sysklogd/lib.a util-linux/lib.a util-linux/volume_id/lib.a archival/built-in.o archival/libarchive/built-in.o console-tools/built-in.o coreutils/built-in.o coreutils/libcoreutils/built-in.o debianutils/built-in.o klibc-utils/built-in.o e2fsprogs/built-in.o editors/built-in.o findutils/built-in.o init/built-in.o libbb/built-in.o libpwdgrp/built-in.o loginutils/built-in.o mailutils/built-in.o miscutils/built-in.o modutils/built-in.o networking/built-in.o networking/libiproute/built-in.o networking/udhcp/built-in.o printutils/built-in.o procps/built-in.o runit/built-in.o selinux/built-in.o shell/built-in.o sysklogd/built-in.o util-linux/built-in.o util-linux/volume_id/built-in.o -Wl,--end-group -Wl,--start-group -lcrypt -lm -lresolv -lrt -Wl,--end-group
real    4m22.598s
user    4m20.832s
sys     0m1.088s
```

```
$ gcc --version
gcc (Gentoo Hardened 13.0.1_pre20230409-r4 p9) 13.0.1 20230409 (experimental)
Copyright (C) 2023 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

$ ld --version
GNU ld (Gentoo 2.40 p4) 2.40.0
Copyright (C) 2023 Free Software Foundation, Inc.
This program is free software; you may redistribute it under the terms of
the GNU General Public License version 3 or (at your option) a later version.
This program has absolutely no warranty.
```
Comment 1 Sam James 2023-04-15 04:14:38 UTC
If I run 'perf record ...' (... = command above), I get this from 'perf report':

# Event count (approx.): 1111543334586
#
# Overhead  Command          Shared Object                                  Symbol
# ........  ...............  .............................................  .......................................................
#
    97.85%  ld.bfd           ld.bfd                                         [.] output_section_callback_sort
     0.57%  ld.bfd           libz.so.1.2.13                                 [.] inflate_fast
     0.19%  ld.bfd           libbfd-2.40.0.gentoo-sys-devel-binutils-mt.so  [.] sec_merge_hash_lookup.lto_priv.0
     0.06%  ld.bfd           libz.so.1.2.13                                 [.] adler32_z
     0.05%  ld.bfd           [zzstd]                                        [k] ZSTD_decompressSequences_bmi2.constprop.0
     0.05%  ld.bfd           [kernel.kallsyms]                              [k] copy_user_generic_string
     0.04%  ld.bfd           libbfd-2.40.0.gentoo-sys-devel-binutils-mt.so  [.] bfd_lookup_arch
     0.03%  ld.bfd           [kernel.kallsyms]                              [k] clear_page_rep
     0.03%  ld.bfd           [kernel.kallsyms]                              [k] stackleak_erase
     0.02%  ld.bfd           [kernel.kallsyms]                              [k] psi_account_irqtime
     0.02%  ld.bfd           libbfd-2.40.0.gentoo-sys-devel-binutils-mt.so  [.] _bfd_merged_section_offset.isra.0
     0.02%  ld.bfd           ld.bfd                                         [.] walk_wild.lto_priv.0
     0.02%  ld.bfd           [kernel.kallsyms]                              [k] update_cfs_group
     0.02%  ld.bfd           ld.bfd                                         [.] resolve_wild_sections
     0.02%  ld.bfd           [lz4_compress]                                 [k] LZ4_compress_fast_extState
     0.02%  ld.bfd           libz.so.1.2.13                                 [.] inflate_table
     0.02%  ld.bfd           [kernel.kallsyms]                              [k] asm_exc_page_fault
     0.01%  ld.bfd           [kernel.kallsyms]                              [k] memset
     0.01%  ld.bfd           libc.so.6                                      [.] 0x000000000015505d
     0.01%  ld.bfd           [kernel.kallsyms]                              [k] update_curr
     0.01%  ld.bfd           [kernel.kallsyms]                              [k] exit_to_user_mode_prepare
     0.01%  ld.bfd           ld.bfd                                         [.] lang_process
     0.01%  ld.bfd           libc.so.6                                      [.] malloc
     0.01%  ld.bfd           [kernel.kallsyms]                              [k] asm_sysvec_apic_timer_interrupt
     0.01%  ld.bfd           libbfd-2.40.0.gentoo-sys-devel-binutils-mt.so  [.] _bfd_elf_link_iterate_on_relocs
     0.01%  ld.bfd           libbfd-2.40.0.gentoo-sys-devel-binutils-mt.so  [.] bfd_hash_lookup
     0.01%  ld.bfd           libbfd-2.40.0.gentoo-sys-devel-binutils-mt.so  [.] _bfd_elf_merge_sections
     0.01%  ld.bfd           libbfd-2.40.0.gentoo-sys-devel-binutils-mt.so  [.] bfd_elf_final_link
     0.01%  ld.bfd           [kernel.kallsyms]                              [k] task_tick_fair
     0.01%  ld.bfd           [kernel.kallsyms]                              [k] __handle_mm_fault
     0.01%  ld.bfd           libz.so.1.2.13                                 [.] inflate
     0.01%  ld.bfd           libbfd-2.40.0.gentoo-sys-devel-binutils-mt.so  [.] _bfd_elf_make_section_from_shdr
     0.01%  ld.bfd           libbfd-2.40.0.gentoo-sys-devel-binutils-mt.so  [.] elf_link_read_relocs_from_section
[...]

(Tarball with files coming.)
Comment 2 Sam James 2023-04-15 04:23:31 UTC
(In reply to Sam James from comment #1)
> (Tarball with files coming.)

This contains the object files, command used, and perf.data: https://dev.gentoo.org/~sam/bugs/sourceware/30358/busybox-hangs.tar.xz
Comment 3 Sam James 2023-04-15 04:30:49 UTC
Created attachment 14829 [details]
ps faux output
Comment 4 Sam James 2023-04-15 04:38:11 UTC
To get the hang `-Wl,--sort-section,alignment` is sufficient:
```
x86_64-pc-linux-gnu-gcc -static -o busybox_unstripped -Wl,--sort-section,alignment -Wl,--start-group applets/built-in.o archival/lib.a archival/libarchive/lib.a console-tools/lib.a coreutils/lib.a coreutils/libcoreutils/lib.a debianutils/lib.a klibc-utils/lib.a e2fsprogs/lib.a editors/lib.a findutils/lib.a init/lib.a libbb/lib.a libpwdgrp/lib.a loginutils/lib.a mailutils/lib.a miscutils/lib.a modutils/lib.a networking/lib.a networking/libiproute/lib.a networking/udhcp/lib.a printutils/lib.a procps/lib.a runit/lib.a selinux/lib.a shell/lib.a sysklogd/lib.a util-linux/lib.a util-linux/volume_id/lib.a archival/built-in.o archival/libarchive/built-in.o console-tools/built-in.o coreutils/built-in.o coreutils/libcoreutils/built-in.o debianutils/built-in.o klibc-utils/built-in.o e2fsprogs/built-in.o editors/built-in.o findutils/built-in.o init/built-in.o libbb/built-in.o libpwdgrp/built-in.o loginutils/built-in.o mailutils/built-in.o miscutils/built-in.o modutils/built-in.o networking/built-in.o networking/libiproute/built-in.o networking/udhcp/built-in.o printutils/built-in.o procps/built-in.o runit/built-in.o selinux/built-in.o shell/built-in.o sysklogd/built-in.o util-linux/built-in.o util-linux/volume_id/built-in.o -Wl,--end-group -Wl,--start-group -lcrypt -lm -lresolv -lrt -Wl,--end-group -fuse-ld=bfd
```
Comment 5 Sam James 2023-04-15 04:46:50 UTC
(In reply to Sam James from comment #4)
> To get the hang `-Wl,--sort-section,alignment` is sufficient:
> ```

Happens with -Wl,--sort-section,name too.
Comment 6 Sam James 2023-04-15 04:49:45 UTC
If I attach gdb after it's been running for a while:
```
0x000055af4c8e93c1 in compare_section (bsec=<optimized out>, asec=<optimized out>, sort=<optimized out>) at ../bfd/bfd.h:1115
1115      return sec->name;
(gdb) bt
#0  0x000055af4c8e93c1 in compare_section (bsec=<optimized out>, asec=<optimized out>, sort=<optimized out>) at ../bfd/bfd.h:1115
#1  wild_sort (section=0x55af6efa7c58, file=0x55af6ef3fc30, sec=0x55af4ddd2d20, wild=0x55af4ddd28a0) at /usr/src/debug/sys-devel/binutils-2.40-r4/binutils-2.40/ld/ldlang.c:652
#2  output_section_callback_sort (ptr=0x55af4ddd28a0, sec=0x55af4ddd2d20, section=0x55af6efa7c58, file=0x55af6ef3fc30, output=<optimized out>)
    at /usr/src/debug/sys-devel/binutils-2.40-r4/binutils-2.40/ld/ldlang.c:685
#3  0x000055af4c8dd995 in walk_wild (s=0x55af4ddd28a0, callback=0x55af4c8e92f0 <output_section_callback_sort>, data=0x55af4ddd18f0) at /usr/src/debug/sys-devel/binutils-2.40-r4/binutils-2.40/ld/ldlang.c:979
#4  0x000055af4c8ddc3e in wild (target=<optimized out>, output=<optimized out>, s=0x55af4ddd28a0) at /usr/src/debug/sys-devel/binutils-2.40-r4/binutils-2.40/ld/ldlang.c:3052
#5  map_input_to_output_sections.isra.0 (s=0x55af4ddd28a0, os=os@entry=0x55af4ddd18f0, target=<optimized out>) at /usr/src/debug/sys-devel/binutils-2.40-r4/binutils-2.40/ld/ldlang.c:4076
#6  0x000055af4c8ddb81 in map_input_to_output_sections.isra.0 (s=0x55af4ddd18f0, os=0x0, target=<optimized out>) at /usr/src/debug/sys-devel/binutils-2.40-r4/binutils-2.40/ld/ldlang.c:4094
#7  0x000055af4c8dadb1 in lang_process () at /usr/src/debug/sys-devel/binutils-2.40-r4/binutils-2.40/ld/ldlang.c:8158
#8  0x000055af4c8e185a in main (argc=<optimized out>, argv=<optimized out>) at /usr/src/debug/sys-devel/binutils-2.40-r4/binutils-2.40/ld/ldmain.c:504
(gdb)
```
Comment 7 Sam James 2023-04-19 08:48:22 UTC
binutils-2.39 completes quickly, binutils-2.40 doesn't.
Comment 8 Sam James 2023-04-19 09:20:46 UTC
$ git bisect bad
af31506c31a59a6edbb13498d6075fa704b801cd is the first bad commit
commit af31506c31a59a6edbb13498d6075fa704b801cd
Author: Michael Matz <matz@suse.de>
Date:   Thu Nov 10 16:06:20 2022 +0100

    Only use wild_sort_fast

    there's no reason why the tree-based variant can't always be used
    when sorting is required, it merely needs to also support filename
    sorting and have a fast path for insertion at end (aka rightmost tree
    leaf).

    The filename sorting isn't tested anywhere and the only scripttempl
    that uses it is avr (for 'SORT(*)(.ctors)'), and I believe even there it
    was a mistake.  Either way, this adds a testcase for filename sorting as
    well.

    Then the non-BST based sorting can be simplified to only support
    the fast case of no sorting required at all (at the same time renaming
    the two variants to _sort and _nosort).

 ld/ldlang.c                          | 302 +++++++++++++++--------------------
 ld/ldlang.h                          |   3 +-
 ld/testsuite/ld-scripts/sort-file.d  |  18 +++
 ld/testsuite/ld-scripts/sort-file.t  |   6 +
 ld/testsuite/ld-scripts/sort-file1.s |   6 +
 ld/testsuite/ld-scripts/sort-file2.s |   6 +
 6 files changed, 163 insertions(+), 178 deletions(-)
 create mode 100644 ld/testsuite/ld-scripts/sort-file.d
 create mode 100644 ld/testsuite/ld-scripts/sort-file.t
 create mode 100644 ld/testsuite/ld-scripts/sort-file1.s
 create mode 100644 ld/testsuite/ld-scripts/sort-file2.s
Comment 9 Sam James 2023-04-19 09:26:45 UTC
The patch for bug 30367 (https://sourceware.org/pipermail/binutils/2023-April/127120.html) unfortunately doesn't help.
Comment 10 Michael Matz 2023-04-19 14:59:15 UTC
Yeah, that patch wouldn't help, it's in a different area.  Thanks for the
complete testcase.  I'm going to have a look here.
Comment 11 Michael Matz 2023-04-19 17:20:26 UTC
The old (insertion-sort) code only added something to the output section list
if the considered section wasn't already either discarded or linked (by being
part of a group for instance).  That is done by lang_add_section.
This output section list is the sorted container into which further insertions
are done (and hence its length is the one determining performance).

The binary search tree code always adds all candidates to the tree (which in our
unlucky case here mostly degrades to a linked list), and only then goes through
that tree to linearize it to a list which doesn't contain the discarded or
already linked input sections (group members).

So, the intermediate tree contains things that aren't ultimately going to be
output, blowing performance down the drain.  Otherwise the old insertion-sort
code and the now always used tree-based code are equivalent here.  But the example contains _many_ groups, and all of them have a .debug_macro section
(and only that!) and it exists in many input files, so the difference is
a search-chain length of about 1500 in the insertion sort case and about 106000 in the binary tree case, and that factor goes in quadraticly into performance
(as said, the search tree is degraded in the example).

So, the solution is obvious: don't add something to the tree if it won't be
added to the linearized list later.  That requires some refactoring.

A hacky un-refactored patch for the above is below (relative to master).
It restores performance.

diff --git a/ld/ldlang.c b/ld/ldlang.c
index 4bec280b9df..7e0a9db51e3 100644
--- a/ld/ldlang.c
+++ b/ld/ldlang.c
@@ -687,6 +687,19 @@ output_section_callback_sort (lang_wild_statement_type *ptr,
   if (unique_section_p (section, os))
     return;
 
+  /* Don't add sections to the tree when we already know that
+     lang_add_section won't do anything with it.  */
+  if (section->output_section != NULL)
+    {
+      if (!link_info.non_contiguous_regions)
+       return;
+
+      /* SECTION has already been handled in a special way
+        (eg. LINK_ONCE): skip it.  */
+      if (bfd_is_abs_section (section->output_section))
+       return;
+    }
+
   node = (lang_section_bst_type *) xmalloc (sizeof (lang_section_bst_type));
   node->left = 0;
   node->right = 0;
Comment 12 Sam James 2023-04-20 04:29:19 UTC
Not tested the patch yet, but Matz, I'd like to say thanks for the insightful explanation. Really appreciated!
Comment 13 Sourceware Commits 2023-04-27 14:59:11 UTC
The master branch has been updated by Michael Matz <matz@sourceware.org>:

https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;h=670c91c0c5eb3fb16d42fe5b2d48a33c64e3ec52

commit 670c91c0c5eb3fb16d42fe5b2d48a33c64e3ec52
Author: Michael Matz <matz@suse.de>
Date:   Tue Apr 25 17:10:05 2023 +0200

    Fix PR30358, performance with --sort-section
    
    since af31506c we only use the binary tree when section sorting is
    required.  While its unbalanced and hence can degrade to a linear list
    it should otherwise have been equivalent to the old code relying on
    insertion sort.  Unfortunately it was not.  The old code directly used
    lang_add_section to populate the sorted list, the new code first
    populates the tree and only then does lang_add_section on the sorted
    result.
    
    In the testcase we have very many linkonce section groups, and hence
    lang_add_section won't actually insert anything for most of them.  That
    limited the to-be-sorted list length previously.  The tree-sorting code
    OTOH first created a tree of all candidates sections, including those
    that wouldn't be inserted by lang_add_section, hence increasing the size
    of the sorting problem.  In the testcase the chain length went from
    about 1500 to 106000, and in the degenerated case (as in the testcase)
    that goes in quadratically.
    
    This splits out most of the early-out code from lang_add_section to its
    own function and uses the latter to avoid inserting into the tree.  This
    refactoring slightly changes the order of early-out tests (the ones
    based on section flags is now done last, and only in lang_add_section).
    The new function is not a pure predicate: it can give warnings and it
    might change output_section, like the old early-out code did.  I have
    also added a skip-warning case in the first discard case, whose
    non-existence seemed to have been an oversight.
    
            PR 30358
            * ldlang.c (wont_add_section_p): Split out from ...
            (lang_add_section): ... here.
            (output_section_callback_sort): Use wont_add_section_p to not
            always add sections to the sort tree.
Comment 14 Michael Matz 2023-04-27 15:04:23 UTC
Fixed in master.
Comment 15 Sam James 2023-04-28 06:20:27 UTC
Thank you! Do you think the hacky patch (not the one committed) would be good enough for us to backport downstream for 2.40?
Comment 16 Michael Matz 2023-05-02 12:17:52 UTC
Yes, the hacky patch is fine for you to use, it won't introduce further problems
(knock on wood! :) ).
Comment 17 Andreas K. Huettel 2023-05-06 12:33:16 UTC
(In reply to Michael Matz from comment #16)
> Yes, the hacky patch is fine for you to use, it won't introduce further
> problems
> (knock on wood! :) ).

Michael: I just added after some tests that patch to Gentoo testing (i.e. Gentoo binutils-2.40-r5). It'll get a lot of usage now. 

If nothing pops up over the next 2-3 weeks, I could bring the upstream 2.40 backports branch to the same level, if that's OK with you.