| Summary: | Performance penalty when linking chromium executable | | |
|---|---|---|---|
| Product: | binutils | Reporter: | Armin K. <krejzi> |
| Component: | ld | Assignee: | H.J. Lu <hjl.tools> |
| Status: | RESOLVED FIXED | | |
| Severity: | normal | CC: | amodra, hjl.tools, markus |
| Priority: | P2 | | |
| Version: | 2.26 | | |
| Target Milestone: | 2.27 | | |
| Host: | | Target: | |
| Build: | | Last reconfirmed: | |
| Attachments: | compiler and linker command line; Marked ps aux output | | |
Created attachment 8944 [details]
compiler and linker command line

When linking the chromium executable, the link process now takes an insane amount of time with binutils 2.26. Previously, with the binutils 2.25.x series, it took no longer than 5 minutes; with binutils 2.26 it takes more than 40 minutes of hogging the CPU to produce the same executable. The file with the compiler and linker invocation command lines is attached. I am not sure whether any of the parameters causes the issue.

This happens when using ld.bfd, right?

You could try ld.gold instead. It links chromium in a few seconds on my machine.

Also make sure you have enough RAM and your system is not swapping.

(In reply to Markus Trippelsdorf from comment #1)
> This happens when using ld.bfd, right?
>
> You could try ld.gold instead. It links chromium in a few seconds on my
> machine.

Yes, ld.bfd. I had a couple of issues when using ld.gold in the past. I could try it again, though.

(In reply to Markus Trippelsdorf from comment #2)
> Also make sure you have enough RAM and your system is not swapping.

RAM usage stays constant at 22% as reported by htop. I have 6 GB of RAM, and nowhere near 6 GB is occupied on my system.

Please:

1. Provide a small testcase to show the 8x slowdown. Or
2. Provide ALL linker inputs. Or
3. Show us where time is spent in the linker.

Created attachment 8945 [details]
Marked ps aux output
The attached file is ps aux output containing the clang++ and ld invocation command lines. Look for the "CPU time" field: at the moment of capture it read 30 minutes, which matched the real (wall-clock) time the ld process had been running, and the process kept running for at least 15 more minutes after the capture. Watching in htop, the CPU time field advanced in step with real time, so it is as good a metric as any.
> 3. Show us where time is spent in the linker.

Can you please "perf record" the linker invocation for a few minutes and then post the "perf report" output here?
Confirmed.

With gold:

    11.55s user 0.77s system 99% cpu 12.336 total

With ld.bfd (I hit ctrl-c after 3 minutes):

    Overhead  Command  Shared Object               Symbol
     72.53%   ld.bfd   libbfd-2.26.51.20160113.so  [.] elf_x86_64_size_dynamic_sections
      9.55%   ld.bfd   libc-2.22.90.so             [.] __GI__IO_un_link.part.1
      2.33%   ld.bfd   libc-2.22.90.so             [.] memcpy@@GLIBC_2.14
      2.08%   ld.bfd   libc-2.22.90.so             [.] __gconv_transform_utf8_internal
      1.81%   ld.bfd   libbfd-2.26.51.20160113.so  [.] bfd_elf_link_add_symbols
      1.66%   ld.bfd   ld.bfd                      [.] match_simple_wild
      1.34%   ld.bfd   ld.bfd                      [.] name_match
      0.87%   ld.bfd   libbfd-2.26.51.20160113.so  [.] bfd_hash_lookup
      0.78%   ld.bfd   ld.bfd                      [.] walk_wild_section_specs3_wild2

I would guess most of the time is spent in elf_x86_64_convert_load. It looks expensive in terms of memory, and it is quadratic in the number of sections (it is called per section and has a section loop internally). ld --no-keep-memory might help.

It is caused by commit 59cab532835904f368b0aa99267afba5fda5ded2:

    commit 59cab532835904f368b0aa99267afba5fda5ded2
    Author: H.J. Lu <hjl.tools@gmail.com>
    Date:   Wed Jun 24 10:13:55 2015 -0700

        Don't convert R_X86_64_GOTPCREL if it will overflow

        When converting "mov foo@GOTPCREL(%rip), %reg" to "lea foo(%rip), %reg"
        with a R_X86_64_PC32 relocation, it may overflow if the target section
        is more than 2GB away. This patch estimates the distance between the
        mov instruction and the target section. We convert R_X86_64_GOTPCREL
        to R_X86_64_PC32 only if their distance is less than 2GB.

        PR ld/18591
        * elf64-x86-64.c (elf_x86_64_convert_mov_to_lea): Don't convert
        R_X86_64_GOTPCREL to R_X86_64_PC32 if it will cause relocation
        overflow.
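For context, here is a minimal, hypothetical sketch of the mechanism the commit above describes: "mov foo@GOTPCREL(%rip), %reg" can only be rewritten to "lea foo(%rip), %reg" when the target stays within the +/-2GB reach of a R_X86_64_PC32 relocation, and the pre-fix code estimated that distance with a loop over all text sections, run once per input section. All names in the sketch (struct sect, estimate_distance, the toy layouts) are invented for illustration; the real logic lives in elf_x86_64_convert_mov_to_lea in elf64-x86-64.c.

```c
/* Standalone sketch (not the real BFD code) of the overflow guard and
   of why running it per input section is quadratic overall.  */
#include <stdint.h>
#include <stdio.h>

struct sect {
  uint64_t vma;   /* estimated virtual address of the section */
  uint64_t size;  /* section size in bytes */
};

/* Worst-case distance from `from` to any text section.  In the pre-fix
   linker an equivalent loop ran once per input section, which makes the
   overall cost quadratic in the number of sections.  */
static uint64_t
estimate_distance (const struct sect *text, size_t n, const struct sect *from)
{
  uint64_t max = 0;
  for (size_t i = 0; i < n; i++)
    {
      uint64_t hi = text[i].vma + text[i].size;
      uint64_t d = hi > from->vma ? hi - from->vma : from->vma - hi;
      if (d > max)
        max = d;
    }
  return max;
}

static void
check_layout (const char *name, const struct sect *text, size_t n)
{
  uint64_t dist = estimate_distance (text, n, &text[0]);
  printf ("%s: max distance %#llx -> %s\n", name,
          (unsigned long long) dist,
          dist < 0x80000000ULL           /* 2GB reach of PC32 */
            ? "convert mov->lea"
            : "keep GOTPCREL (PC32 would overflow)");
}

int
main (void)
{
  struct sect close_by[] = { { 0x400000, 0x1000 }, { 0x500000, 0x1000 } };
  struct sect far_away[] = { { 0x400000, 0x1000 }, { 0xC0400000, 0x1000 } };
  check_layout ("sections 1MB apart", close_by, 2);
  check_layout ("sections 3GB apart", far_away, 2);
  return 0;
}
```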
The master branch has been updated by H.J. Lu <hjl@sourceware.org>:

https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;h=4a539596f5d54d3116c5fdebd8be56998757288b

    commit 4a539596f5d54d3116c5fdebd8be56998757288b
    Author: H.J. Lu <hjl.tools@gmail.com>
    Date:   Tue Feb 2 08:14:43 2016 -0800

        Store estimated distances in compressed_size

        elf_x86_64_convert_load is very time consuming since it is called on
        each input section and has a loop over input text sections to estimate
        the branch distance. We can store the estimated distances in the
        compressed_size field of the output section, which is only used to
        decompress the compressed input section.

        Before the patch, linking clang 3.9 takes 52 seconds. After the patch,
        it only takes 2.5 seconds.

        PR ld/19542
        * elf64-x86-64.c (elf_x86_64_convert_load): Store the estimated
        distances in the compressed_size field of the output section.

Fixed on 2.27 so far.

As it's a significant regression, IMO it should make it to the branch.

Please provide timing numbers for linking the chromium executable with ld.bfd from the master branch.

Looks "good" now:

    gold:   12.57s user 0.86s system 99% cpu 13.442 total
    ld.bfd: 54.23s user 4.63s system 99% cpu 59.366 total

    Overhead  Command  Shared Object               Symbol
     31.52%   ld.bfd   libc-2.22.90.so             [.] _IO_un_link
      7.29%   ld.bfd   libc-2.22.90.so             [.] memcpy@@GLIBC_2.14
      7.03%   ld.bfd   libc-2.22.90.so             [.] __gconv_transform_utf8_internal
      5.42%   ld.bfd   libbfd-2.26.51.20160202.so  [.] bfd_elf_link_add_symbols
      4.65%   ld.bfd   ld.bfd                      [.] match_simple_wild
      4.31%   ld.bfd   ld.bfd                      [.] name_match
      3.03%   ld.bfd   libbfd-2.26.51.20160202.so  [.] bfd_hash_lookup
      2.88%   ld.bfd   ld.bfd                      [.] walk_wild_section_specs3_wild2

The binutils-2_26-branch branch has been updated by H.J. Lu <hjl@sourceware.org>:

https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;h=cc2d819b58fd5d60dfef34007662535f9e142c16

    commit cc2d819b58fd5d60dfef34007662535f9e142c16
    Author: H.J. Lu <hjl.tools@gmail.com>
    Date:   Tue Feb 2 08:14:43 2016 -0800

        Store estimated distances in compressed_size

        elf_x86_64_convert_load is very time consuming since it is called on
        each input section and has a loop over input text sections to estimate
        the branch distance. We can store the estimated distances in the
        compressed_size field of the output section, which is only used to
        decompress the compressed input section.

        Before the patch, linking clang 3.9 takes 52 seconds. After the patch,
        it only takes 2.5 seconds.

        Backport from master

        PR ld/19542
        * elf64-x86-64.c (elf_x86_64_convert_load): Store the estimated
        distances in the compressed_size field of the output section.

Fixed for 2.26.1.
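To close the loop, here is a minimal sketch of the memoization the two commits above describe, with the same caveat that every identifier is invented rather than taken from BFD: the distance estimate is computed once per output section and cached. In the real patch the cached value is parked in the output section's otherwise-idle compressed_size field, which presumably avoids growing the section structure for a cache needed only during this one pass.

```c
/* Standalone sketch of the fix: compute the distance estimate once per
   output section and reuse it on every later query.  `cached_distance`
   stands in for the reused `compressed_size` field.  */
#include <stdint.h>
#include <stdio.h>

struct out_sect {
  uint64_t vma;
  uint64_t size;
  uint64_t cached_distance;  /* 0 means "not computed yet" */
};

static uint64_t
get_distance (struct out_sect *all, size_t n, struct out_sect *s)
{
  if (s->cached_distance != 0)
    return s->cached_distance;      /* O(1) on every repeated query */

  uint64_t max = 1;                 /* nonzero so the cache reads as set */
  for (size_t i = 0; i < n; i++)
    {
      uint64_t hi = all[i].vma + all[i].size;
      uint64_t d = hi > s->vma ? hi - s->vma : s->vma - hi;
      if (d > max)
        max = d;
    }
  return s->cached_distance = max;  /* one O(n) walk per output section */
}

int
main (void)
{
  struct out_sect secs[] = {
    { 0x400000,   0x1000, 0 },
    { 0x80400000, 0x1000, 0 },
  };
  /* Many input sections map to few output sections, so the estimate is
     computed once per output section instead of once per input section,
     which is the 52s -> 2.5s difference quoted in the commit message.  */
  for (int q = 0; q < 4; q++)
    printf ("query %d: distance %#llx\n", q,
            (unsigned long long) get_distance (secs, 2, &secs[q % 2]));
  return 0;
}
```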