bfd seems to take > 4 minutes to link busybox-1.35.0 on my machine (AMD 3590X w/ an nvme disk, 64GB RAM), while lld takes 2 seconds. It seems to peak at around 780MB RAM usage too. The slow command is: ``` $ time x86_64-pc-linux-gnu-gcc -fuse-ld=bfd -O2 -pipe -march=native -fdiagnostics-color=always -frecord-gcc-switches -Wreturn-type -ggdb3 -fno-strict-aliasing -Werror=format-security -Werror=implicit-function-declaration -Wimplicit-int -Werror=int-conversion -Wformat -Wall -Wshadow -Wwrite-strings -Wundef -Wstrict-prototypes -Wunused -Wunused-parameter -Wunused-function -Wunused-value -Wmissing-prototypes -Wmissing-declarations -Wno-format-security -Wdeclaration-after-statement -Wold-style-definition -finline-limit=0 -fno-builtin-strlen -ffunction-sections -fdata-sections -fno-guess-branch-probability -funsigned-char -fno-unwind-tables -fno-asynchronous-unwind-tables -fno-builtin-printf -static -Wl,-O1 -Wl,--as-needed -Wl,--defsym=__gentoo_check_ldflags__=0 -Wl,-z,pack-relative-relocs -ggdb3 -o busybox_unstripped -Wl,--sort-common -Wl,--sort-section,alignment -Wl,--start-group applets/built-in.o archival/lib.a archival/libarchive/lib.a console-tools/lib.a coreutils/lib.a coreutils/libcoreutils/lib.a debianutils/lib.a klibc-utils/lib.a e2fsprogs/lib.a editors/lib.a findutils/lib.a init/lib.a libbb/lib.a libpwdgrp/lib.a loginutils/lib.a mailutils/lib.a miscutils/lib.a modutils/lib.a networking/lib.a networking/libiproute/lib.a networking/udhcp/lib.a printutils/lib.a procps/lib.a runit/lib.a selinux/lib.a shell/lib.a sysklogd/lib.a util-linux/lib.a util-linux/volume_id/lib.a archival/built-in.o archival/libarchive/built-in.o console-tools/built-in.o coreutils/built-in.o coreutils/libcoreutils/built-in.o debianutils/built-in.o klibc-utils/built-in.o e2fsprogs/built-in.o editors/built-in.o findutils/built-in.o init/built-in.o libbb/built-in.o libpwdgrp/built-in.o loginutils/built-in.o mailutils/built-in.o miscutils/built-in.o modutils/built-in.o networking/built-in.o networking/libiproute/built-in.o networking/udhcp/built-in.o printutils/built-in.o procps/built-in.o runit/built-in.o selinux/built-in.o shell/built-in.o sysklogd/built-in.o util-linux/built-in.o util-linux/volume_id/built-in.o -Wl,--end-group -Wl,--start-group -lcrypt -lm -lresolv -lrt -Wl,--end-group real 4m22.598s user 4m20.832s sys 0m1.088s ``` ``` $ gcc --version gcc (Gentoo Hardened 13.0.1_pre20230409-r4 p9) 13.0.1 20230409 (experimental) Copyright (C) 2023 Free Software Foundation, Inc. This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. $ ld --version GNU ld (Gentoo 2.40 p4) 2.40.0 Copyright (C) 2023 Free Software Foundation, Inc. This program is free software; you may redistribute it under the terms of the GNU General Public License version 3 or (at your option) a later version. This program has absolutely no warranty. ```
If I run 'perf record ...' (... = command above), I get this from 'perf report': # Event count (approx.): 1111543334586 # # Overhead Command Shared Object Symbol # ........ ............... ............................................. ....................................................... # 97.85% ld.bfd ld.bfd [.] output_section_callback_sort 0.57% ld.bfd libz.so.1.2.13 [.] inflate_fast 0.19% ld.bfd libbfd-2.40.0.gentoo-sys-devel-binutils-mt.so [.] sec_merge_hash_lookup.lto_priv.0 0.06% ld.bfd libz.so.1.2.13 [.] adler32_z 0.05% ld.bfd [zzstd] [k] ZSTD_decompressSequences_bmi2.constprop.0 0.05% ld.bfd [kernel.kallsyms] [k] copy_user_generic_string 0.04% ld.bfd libbfd-2.40.0.gentoo-sys-devel-binutils-mt.so [.] bfd_lookup_arch 0.03% ld.bfd [kernel.kallsyms] [k] clear_page_rep 0.03% ld.bfd [kernel.kallsyms] [k] stackleak_erase 0.02% ld.bfd [kernel.kallsyms] [k] psi_account_irqtime 0.02% ld.bfd libbfd-2.40.0.gentoo-sys-devel-binutils-mt.so [.] _bfd_merged_section_offset.isra.0 0.02% ld.bfd ld.bfd [.] walk_wild.lto_priv.0 0.02% ld.bfd [kernel.kallsyms] [k] update_cfs_group 0.02% ld.bfd ld.bfd [.] resolve_wild_sections 0.02% ld.bfd [lz4_compress] [k] LZ4_compress_fast_extState 0.02% ld.bfd libz.so.1.2.13 [.] inflate_table 0.02% ld.bfd [kernel.kallsyms] [k] asm_exc_page_fault 0.01% ld.bfd [kernel.kallsyms] [k] memset 0.01% ld.bfd libc.so.6 [.] 0x000000000015505d 0.01% ld.bfd [kernel.kallsyms] [k] update_curr 0.01% ld.bfd [kernel.kallsyms] [k] exit_to_user_mode_prepare 0.01% ld.bfd ld.bfd [.] lang_process 0.01% ld.bfd libc.so.6 [.] malloc 0.01% ld.bfd [kernel.kallsyms] [k] asm_sysvec_apic_timer_interrupt 0.01% ld.bfd libbfd-2.40.0.gentoo-sys-devel-binutils-mt.so [.] _bfd_elf_link_iterate_on_relocs 0.01% ld.bfd libbfd-2.40.0.gentoo-sys-devel-binutils-mt.so [.] bfd_hash_lookup 0.01% ld.bfd libbfd-2.40.0.gentoo-sys-devel-binutils-mt.so [.] _bfd_elf_merge_sections 0.01% ld.bfd libbfd-2.40.0.gentoo-sys-devel-binutils-mt.so [.] bfd_elf_final_link 0.01% ld.bfd [kernel.kallsyms] [k] task_tick_fair 0.01% ld.bfd [kernel.kallsyms] [k] __handle_mm_fault 0.01% ld.bfd libz.so.1.2.13 [.] inflate 0.01% ld.bfd libbfd-2.40.0.gentoo-sys-devel-binutils-mt.so [.] _bfd_elf_make_section_from_shdr 0.01% ld.bfd libbfd-2.40.0.gentoo-sys-devel-binutils-mt.so [.] elf_link_read_relocs_from_section [...] (Tarball with files coming.)
(In reply to Sam James from comment #1) > (Tarball with files coming.) This contains the object files, command used, and perf.data: https://dev.gentoo.org/~sam/bugs/sourceware/30358/busybox-hangs.tar.xz
Created attachment 14829 [details] ps faux output
To get the hang `-Wl,--sort-section,alignment` is sufficient: ``` x86_64-pc-linux-gnu-gcc -static -o busybox_unstripped -Wl,--sort-section,alignment -Wl,--start-group applets/built-in.o archival/lib.a archival/libarchive/lib.a console-tools/lib.a coreutils/lib.a coreutils/libcoreutils/lib.a debianutils/lib.a klibc-utils/lib.a e2fsprogs/lib.a editors/lib.a findutils/lib.a init/lib.a libbb/lib.a libpwdgrp/lib.a loginutils/lib.a mailutils/lib.a miscutils/lib.a modutils/lib.a networking/lib.a networking/libiproute/lib.a networking/udhcp/lib.a printutils/lib.a procps/lib.a runit/lib.a selinux/lib.a shell/lib.a sysklogd/lib.a util-linux/lib.a util-linux/volume_id/lib.a archival/built-in.o archival/libarchive/built-in.o console-tools/built-in.o coreutils/built-in.o coreutils/libcoreutils/built-in.o debianutils/built-in.o klibc-utils/built-in.o e2fsprogs/built-in.o editors/built-in.o findutils/built-in.o init/built-in.o libbb/built-in.o libpwdgrp/built-in.o loginutils/built-in.o mailutils/built-in.o miscutils/built-in.o modutils/built-in.o networking/built-in.o networking/libiproute/built-in.o networking/udhcp/built-in.o printutils/built-in.o procps/built-in.o runit/built-in.o selinux/built-in.o shell/built-in.o sysklogd/built-in.o util-linux/built-in.o util-linux/volume_id/built-in.o -Wl,--end-group -Wl,--start-group -lcrypt -lm -lresolv -lrt -Wl,--end-group -fuse-ld=bfd ```
(In reply to Sam James from comment #4) > To get the hang `-Wl,--sort-section,alignment` is sufficient: > ``` Happens with -Wl,--sort-section,name too.
If I attach gdb after it's been running for a while: ``` 0x000055af4c8e93c1 in compare_section (bsec=<optimized out>, asec=<optimized out>, sort=<optimized out>) at ../bfd/bfd.h:1115 1115 return sec->name; (gdb) bt #0 0x000055af4c8e93c1 in compare_section (bsec=<optimized out>, asec=<optimized out>, sort=<optimized out>) at ../bfd/bfd.h:1115 #1 wild_sort (section=0x55af6efa7c58, file=0x55af6ef3fc30, sec=0x55af4ddd2d20, wild=0x55af4ddd28a0) at /usr/src/debug/sys-devel/binutils-2.40-r4/binutils-2.40/ld/ldlang.c:652 #2 output_section_callback_sort (ptr=0x55af4ddd28a0, sec=0x55af4ddd2d20, section=0x55af6efa7c58, file=0x55af6ef3fc30, output=<optimized out>) at /usr/src/debug/sys-devel/binutils-2.40-r4/binutils-2.40/ld/ldlang.c:685 #3 0x000055af4c8dd995 in walk_wild (s=0x55af4ddd28a0, callback=0x55af4c8e92f0 <output_section_callback_sort>, data=0x55af4ddd18f0) at /usr/src/debug/sys-devel/binutils-2.40-r4/binutils-2.40/ld/ldlang.c:979 #4 0x000055af4c8ddc3e in wild (target=<optimized out>, output=<optimized out>, s=0x55af4ddd28a0) at /usr/src/debug/sys-devel/binutils-2.40-r4/binutils-2.40/ld/ldlang.c:3052 #5 map_input_to_output_sections.isra.0 (s=0x55af4ddd28a0, os=os@entry=0x55af4ddd18f0, target=<optimized out>) at /usr/src/debug/sys-devel/binutils-2.40-r4/binutils-2.40/ld/ldlang.c:4076 #6 0x000055af4c8ddb81 in map_input_to_output_sections.isra.0 (s=0x55af4ddd18f0, os=0x0, target=<optimized out>) at /usr/src/debug/sys-devel/binutils-2.40-r4/binutils-2.40/ld/ldlang.c:4094 #7 0x000055af4c8dadb1 in lang_process () at /usr/src/debug/sys-devel/binutils-2.40-r4/binutils-2.40/ld/ldlang.c:8158 #8 0x000055af4c8e185a in main (argc=<optimized out>, argv=<optimized out>) at /usr/src/debug/sys-devel/binutils-2.40-r4/binutils-2.40/ld/ldmain.c:504 (gdb) ```
binutils-2.39 completes quickly, binutils-2.40 doesn't.
$ git bisect bad af31506c31a59a6edbb13498d6075fa704b801cd is the first bad commit commit af31506c31a59a6edbb13498d6075fa704b801cd Author: Michael Matz <matz@suse.de> Date: Thu Nov 10 16:06:20 2022 +0100 Only use wild_sort_fast there's no reason why the tree-based variant can't always be used when sorting is required, it merely needs to also support filename sorting and have a fast path for insertion at end (aka rightmost tree leaf). The filename sorting isn't tested anywhere and the only scripttempl that uses it is avr (for 'SORT(*)(.ctors)'), and I believe even there it was a mistake. Either way, this adds a testcase for filename sorting as well. Then the non-BST based sorting can be simplified to only support the fast case of no sorting required at all (at the same time renaming the two variants to _sort and _nosort). ld/ldlang.c | 302 +++++++++++++++-------------------- ld/ldlang.h | 3 +- ld/testsuite/ld-scripts/sort-file.d | 18 +++ ld/testsuite/ld-scripts/sort-file.t | 6 + ld/testsuite/ld-scripts/sort-file1.s | 6 + ld/testsuite/ld-scripts/sort-file2.s | 6 + 6 files changed, 163 insertions(+), 178 deletions(-) create mode 100644 ld/testsuite/ld-scripts/sort-file.d create mode 100644 ld/testsuite/ld-scripts/sort-file.t create mode 100644 ld/testsuite/ld-scripts/sort-file1.s create mode 100644 ld/testsuite/ld-scripts/sort-file2.s
The patch for bug 30367 (https://sourceware.org/pipermail/binutils/2023-April/127120.html) unfortunately doesn't help.
Yeah, that patch wouldn't help, it's in a different area. Thanks for the complete testcase. I'm going to have a look here.
The old (insertion-sort) code only added something to the output section list if the considered section wasn't already either discarded or linked (by being part of a group for instance). That is done by lang_add_section. This output section list is the sorted container into which further insertions are done (and hence its length is the one determining performance). The binary search tree code always adds all candidates to the tree (which in our unlucky case here mostly degrades to a linked list), and only then goes through that tree to linearize it to a list which doesn't contain the discarded or already linked input sections (group members). So, the intermediate tree contains things that aren't ultimately going to be output, blowing performance down the drain. Otherwise the old insertion-sort code and the now always used tree-based code are equivalent here. But the example contains _many_ groups, and all of them have a .debug_macro section (and only that!) and it exists in many input files, so the difference is a search-chain length of about 1500 in the insertion sort case and about 106000 in the binary tree case, and that factor goes in quadraticly into performance (as said, the search tree is degraded in the example). So, the solution is obvious: don't add something to the tree if it won't be added to the linearized list later. That requires some refactoring. A hacky un-refactored patch for the above is below (relative to master). It restores performance. diff --git a/ld/ldlang.c b/ld/ldlang.c index 4bec280b9df..7e0a9db51e3 100644 --- a/ld/ldlang.c +++ b/ld/ldlang.c @@ -687,6 +687,19 @@ output_section_callback_sort (lang_wild_statement_type *ptr, if (unique_section_p (section, os)) return; + /* Don't add sections to the tree when we already know that + lang_add_section won't do anything with it. */ + if (section->output_section != NULL) + { + if (!link_info.non_contiguous_regions) + return; + + /* SECTION has already been handled in a special way + (eg. LINK_ONCE): skip it. */ + if (bfd_is_abs_section (section->output_section)) + return; + } + node = (lang_section_bst_type *) xmalloc (sizeof (lang_section_bst_type)); node->left = 0; node->right = 0;
Not tested the patch yet, but Matz, I'd like to say thanks for the insightful explanation. Really appreciated!
The master branch has been updated by Michael Matz <matz@sourceware.org>: https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;h=670c91c0c5eb3fb16d42fe5b2d48a33c64e3ec52 commit 670c91c0c5eb3fb16d42fe5b2d48a33c64e3ec52 Author: Michael Matz <matz@suse.de> Date: Tue Apr 25 17:10:05 2023 +0200 Fix PR30358, performance with --sort-section since af31506c we only use the binary tree when section sorting is required. While its unbalanced and hence can degrade to a linear list it should otherwise have been equivalent to the old code relying on insertion sort. Unfortunately it was not. The old code directly used lang_add_section to populate the sorted list, the new code first populates the tree and only then does lang_add_section on the sorted result. In the testcase we have very many linkonce section groups, and hence lang_add_section won't actually insert anything for most of them. That limited the to-be-sorted list length previously. The tree-sorting code OTOH first created a tree of all candidates sections, including those that wouldn't be inserted by lang_add_section, hence increasing the size of the sorting problem. In the testcase the chain length went from about 1500 to 106000, and in the degenerated case (as in the testcase) that goes in quadratically. This splits out most of the early-out code from lang_add_section to its own function and uses the latter to avoid inserting into the tree. This refactoring slightly changes the order of early-out tests (the ones based on section flags is now done last, and only in lang_add_section). The new function is not a pure predicate: it can give warnings and it might change output_section, like the old early-out code did. I have also added a skip-warning case in the first discard case, whose non-existence seemed to have been an oversight. PR 30358 * ldlang.c (wont_add_section_p): Split out from ... (lang_add_section): ... here. (output_section_callback_sort): Use wont_add_section_p to not always add sections to the sort tree.
Fixed in master.
Thank you! Do you think the hacky patch (not the one committed) would be good enough for us to backport downstream for 2.40?
Yes, the hacky patch is fine for you to use, it won't introduce further problems (knock on wood! :) ).
(In reply to Michael Matz from comment #16) > Yes, the hacky patch is fine for you to use, it won't introduce further > problems > (knock on wood! :) ). Michael: I just added after some tests that patch to Gentoo testing (i.e. Gentoo binutils-2.40-r5). It'll get a lot of usage now. If nothing pops up over the next 2-3 weeks, I could bring the upstream 2.40 backports branch to the same level, if that's OK with you.
https://gitweb.gentoo.org/fork/binutils-gdb.git/log/?h=gentoo/binutils-2.40