Bug 19542

Summary: Performance penalty when linking chromium executable
Product: binutils
Reporter: Armin K. <krejzi>
Component: ld
Assignee: H.J. Lu <hjl.tools>
Status: RESOLVED FIXED
Severity: normal
CC: amodra, hjl.tools, markus
Priority: P2
Version: 2.26
Target Milestone: 2.27
Host:
Target:
Build:
Last reconfirmed:
Attachments: compiler and linker command line; Marked ps aux output

Description Armin K. 2016-01-31 09:39:30 UTC
Created attachment 8944
compiler and linker command line

When linking the chromium executable, the link process now takes an insane amount of time with binutils-2.26.

Previously, with the binutils-2.25.x series, it took no longer than 5 minutes. With binutils 2.26, it takes more than 40 minutes of hogging the CPU to produce the same executable.

The file containing information about compiler and linker invocation is attached. I am not sure if any of the parameters causes the issue.
Comment 1 Markus Trippelsdorf 2016-01-31 09:51:34 UTC
This happens when using ld.bfd, right?

You could try ld.gold instead. It links chromium in a few seconds on my machine.
Comment 2 Markus Trippelsdorf 2016-01-31 10:08:00 UTC
Also make sure you have enough RAM and your system is not swapping.
Comment 3 Armin K. 2016-01-31 10:21:53 UTC
(In reply to Markus Trippelsdorf from comment #1)
> This happens when using ld.bfd, right?
> 
> You could try ld.gold instead. It links chromium in a few seconds on my
> machine.

Yes, ld.bfd. I had a couple of issues when using ld.gold in the past. I could try it again, though.
Comment 4 Armin K. 2016-01-31 10:22:31 UTC
(In reply to Markus Trippelsdorf from comment #2)
> Also make sure you have enough RAM and your system is not swapping.

RAM usage stays constant at 22% as reported by htop. I have 6 GB of RAM and nowhere near 6 GB is occupied on my system.
Comment 5 H.J. Lu 2016-01-31 14:38:49 UTC
Please

1. Provide a small testcase to show 8X slowdown.  Or
2. Provide ALL linker inputs.  Or
3. Show us where time is spent in linker.
Comment 6 Armin K. 2016-01-31 17:20:48 UTC
Created attachment 8945
Marked ps aux output

The attached file is the ps aux output containing the clang++ and ld invocation command lines. The "CPU time" field in the captured file shows 30 minutes, which matched the real time that the ld executable had been running. After I captured the output, ld kept running for at least 15 more minutes. I was watching in htop, and the time field was advancing in step with real time, so it's as good a metric as any other.
Comment 7 Markus Trippelsdorf 2016-01-31 17:39:37 UTC
> 3. Show us where time is spent in linker.

Can you please "perf record" the linker invocation for some minutes
and then post the "perf report" output here.
Comment 8 Markus Trippelsdorf 2016-02-01 17:36:40 UTC
Confirmed.

With gold:
11.55s user 0.77s system 99% cpu 12.336 total

With ld.bfd (I've hit ctrl-c after 3 minutes):

Overhead  Command  Shared Object               Symbol 
  72.53%  ld.bfd   libbfd-2.26.51.20160113.so  [.] elf_x86_64_size_dynamic_sections 
   9.55%  ld.bfd   libc-2.22.90.so             [.] __GI__IO_un_link.part.1
   2.33%  ld.bfd   libc-2.22.90.so             [.] memcpy@@GLIBC_2.14
   2.08%  ld.bfd   libc-2.22.90.so             [.] __gconv_transform_utf8_internal
   1.81%  ld.bfd   libbfd-2.26.51.20160113.so  [.] bfd_elf_link_add_symbols
   1.66%  ld.bfd   ld.bfd                      [.] match_simple_wild
   1.34%  ld.bfd   ld.bfd                      [.] name_match
   0.87%  ld.bfd   libbfd-2.26.51.20160113.so  [.] bfd_hash_lookup
   0.78%  ld.bfd   ld.bfd                      [.] walk_wild_section_specs3_wild2
Comment 9 Alan Modra 2016-02-02 03:24:04 UTC
I would guess most of the time is spent in elf_x86_64_convert_load.  It looks expensive in terms of memory, and is quadratic in the number of sections (it is called per section, and has a section loop internally).  ld --no-keep-memory might help.
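
The quadratic behaviour described in this comment can be illustrated with a toy model. This is a minimal sketch, not binutils code: the function name `convert_all_sections_naive` and the step counter are hypothetical, and each "section" is reduced to a size in bytes. It assumes only what the comment states, namely that the routine runs once per section and loops over the sections internally.

```python
def convert_all_sections_naive(section_sizes):
    """Toy model of a per-section pass that rescans every section.

    Hypothetical illustration only; it does not reproduce
    elf_x86_64_convert_load. Returns (offsets, inner_loop_steps).
    """
    steps = 0
    offsets = []
    for index in range(len(section_sizes)):           # one call per section
        offset = 0
        for other, size in enumerate(section_sizes):  # internal section loop
            steps += 1
            if other < index:
                offset += size                        # sizes of earlier sections
        offsets.append(offset)
    return offsets, steps


offsets, steps = convert_all_sections_naive([16, 32, 8, 64])
print(offsets, steps)
```

With n sections the inner loop body runs n * n times, so a link with many input sections pays far more than one with few, even though each individual step is cheap.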
Comment 10 H.J. Lu 2016-02-02 13:29:47 UTC
It is caused by

commit 59cab532835904f368b0aa99267afba5fda5ded2
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Wed Jun 24 10:13:55 2015 -0700

    Don't convert R_X86_64_GOTPCREL if it will overflow
    
    When converting "mov foo@GOTPCREL(%rip), %reg" to "lea foo(%rip), %reg"
    with R_X86_64_PC32 relocation, it may overflow if the target section
    is more than 2GB away.  This patch estimates distance between mov
    instruction and the target section.  We convert R_X86_64_GOTPCREL to
    R_X86_64_PC32 only if their distance is less than 2GB.
    
      PR ld/18591
      * elf64-x86-64.c (elf_x86_64_convert_mov_to_lea): Don't convert
      R_X86_64_GOTPCREL to R_X86_64_PC32 if it will cause relocation
      overflow.
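
The overflow condition in this commit message comes down to whether a PC-relative displacement fits in a signed 32-bit field. Below is a minimal sketch of that range check, assuming final addresses are already known; the linker itself only estimates the distance at this stage. The function name `pc32_displacement_fits` and the default addend of -4 are illustrative assumptions, not taken from elf64-x86-64.c.

```python
INT32_MIN = -(2 ** 31)
INT32_MAX = 2 ** 31 - 1


def pc32_displacement_fits(place, target, addend=-4):
    """Return True if S + A - P fits in a signed 32-bit field.

    place  (P): address of the relocated field (hypothetical input)
    target (S): address of the symbol being referenced
    addend (A): relocation addend; -4 is assumed here as a typical value
    """
    displacement = target + addend - place
    return INT32_MIN <= displacement <= INT32_MAX


# A nearby target fits; one just past the 2GB boundary does not.
print(pc32_displacement_fits(0x1000, 0x2000))
print(pc32_displacement_fits(0, 2 ** 31 + 4))
```

When the check fails, the safe choice is to leave the GOT-based load in place, which is what the commit describes.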
Comment 11 Sourceware Commits 2016-02-02 16:22:47 UTC
The master branch has been updated by H.J. Lu <hjl@sourceware.org>:

https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;h=4a539596f5d54d3116c5fdebd8be56998757288b

commit 4a539596f5d54d3116c5fdebd8be56998757288b
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Tue Feb 2 08:14:43 2016 -0800

    Store estimated distances in compressed_size
    
    elf_x86_64_convert_load is very time consuming since it is called on
    each input section and has a loop over input text sections to estimate
    the branch distance.  We can store the estimated distances in the
    compressed_size field of the output section, which is only used to
    decompress the compressed input section.
    
    Before the patch, linking clang 3.9 takes 52 seconds.  After the patch,
    it only takes 2.5 seconds.
    
    	PR ld/19542
    	* elf64-x86-64.c (elf_x86_64_convert_load): Store the estimated
    	distances in the compressed_size field of the output section.
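
The idea behind this fix, computing the estimates once and reusing them on every per-section call, can be sketched as follows. This is a toy model under stated assumptions: a plain Python list stands in for the compressed_size field that the patch actually uses, and the names `estimate_offsets_once` and `convert_all_sections_cached` are hypothetical.

```python
def estimate_offsets_once(section_sizes):
    """Single pass that records a running offset for each section."""
    steps = 0
    offsets = []
    running = 0
    for size in section_sizes:
        offsets.append(running)   # estimate stored for later reuse
        running += size
        steps += 1
    return offsets, steps


def convert_all_sections_cached(section_sizes):
    """Toy per-section pass that looks up stored estimates.

    Hypothetical illustration only. Returns (offsets, total_steps).
    """
    cached, steps = estimate_offsets_once(section_sizes)  # done once
    results = []
    for index in range(len(section_sizes)):               # one call per section
        results.append(cached[index])                     # constant-time lookup
        steps += 1
    return results, steps


offsets, steps = convert_all_sections_cached([16, 32, 8, 64])
print(offsets, steps)
```

The total work grows as 2 * n rather than n * n, which is consistent in direction with the speedup the commit message reports, though the sketch makes no claim about the actual timings.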
Comment 12 H.J. Lu 2016-02-02 16:23:49 UTC
Fixed on 2.27 so far.
Comment 13 Mike Frysinger 2016-02-02 18:13:39 UTC
as it's a significant regression, imo it should make it to the branch
Comment 14 H.J. Lu 2016-02-02 18:36:40 UTC
Please provide timing numbers for linking the chromium executable with ld.bfd
from the master branch.
Comment 15 Markus Trippelsdorf 2016-02-02 21:01:19 UTC
Looks "good" now:

gold:
 12.57s user 0.86s system 99% cpu 13.442 total

ld.bfd:
 54.23s user 4.63s system 99% cpu 59.366 total

Overhead  Command   Shared Object               Symbol
  31.52%  ld.bfd    libc-2.22.90.so             [.] _IO_un_link
   7.29%  ld.bfd    libc-2.22.90.so             [.] memcpy@@GLIBC_2.14
   7.03%  ld.bfd    libc-2.22.90.so             [.] __gconv_transform_utf8_internal
   5.42%  ld.bfd    libbfd-2.26.51.20160202.so  [.] bfd_elf_link_add_symbols
   4.65%  ld.bfd    ld.bfd                      [.] match_simple_wild
   4.31%  ld.bfd    ld.bfd                      [.] name_match
   3.03%  ld.bfd    libbfd-2.26.51.20160202.so  [.] bfd_hash_lookup
   2.88%  ld.bfd    ld.bfd                      [.] walk_wild_section_specs3_wild2
Comment 16 Sourceware Commits 2016-02-02 21:10:22 UTC
The binutils-2_26-branch branch has been updated by H.J. Lu <hjl@sourceware.org>:

https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;h=cc2d819b58fd5d60dfef34007662535f9e142c16

commit cc2d819b58fd5d60dfef34007662535f9e142c16
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Tue Feb 2 08:14:43 2016 -0800

    Store estimated distances in compressed_size
    
    elf_x86_64_convert_load is very time consuming since it is called on
    each input section and has a loop over input text sections to estimate
    the branch distance.  We can store the estimated distances in the
    compressed_size field of the output section, which is only used to
    decompress the compressed input section.
    
    Before the patch, linking clang 3.9 takes 52 seconds.  After the patch,
    it only takes 2.5 seconds.
    
    Backport from master
    
    	PR ld/19542
    	* elf64-x86-64.c (elf_x86_64_convert_load): Store the estimated
    	distances in the compressed_size field of the output section.
Comment 17 H.J. Lu 2016-02-02 21:11:19 UTC
Fixed for 2.26.1.