Slow linking for ARM
Titus von Boxberg
Sat Dec 25 09:06:00 GMT 2010
Bill, Bryan, all,
On 24.12.2010 at 23:04, Bill Pringlemeir wrote:
> Ok. Now you are saying 20-30%, I thought it was 20x as long
> previously, but maybe you said 20%?
For completeness, here is a summary of my observations:
I'm building for ARM, X86 and PPC, all Linux-ELF-glibc.
Originally, I used ld from binutils-2.20.
The compiler is gcc 4.5.1.
The host OSes are 32-bit-X86-Linux and 64-bit-Intel-MacOS.
The measurements were taken on MacOS, but with binutils-2.20
the impression is no different on Linux.
The software is C++, with modest to high template usage.
I have about 15 applications; all but two of them are
portable between the architectures. This is how I compare
the linking times for the architectures.
All times compared are "user" times.
The comparison was made with the software compiled and linked with -g.
Without -g, the absolute times and, in some cases, the
factor between ARM and PPC/X86 are reduced.
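A minimal sketch of how such "user" times could be collected, assuming a
Unix host with Python available. The helper name user_time_of and the
stand-in command are illustrative only and are not taken from the
measurements above; the real link command line would go in its place.

```python
import resource
import subprocess
import sys


def user_time_of(cmd):
    """Run cmd and return the user CPU seconds spent in it and its children."""
    before = resource.getrusage(resource.RUSAGE_CHILDREN).ru_utime
    subprocess.run(cmd, check=True)  # waits for the child, so its usage is accounted
    after = resource.getrusage(resource.RUSAGE_CHILDREN).ru_utime
    return after - before


# Stand-in command (assumption): replace with the actual link invocation.
demo_cmd = [sys.executable, "-c", "sum(range(10**6))"]
print(f"user time: {user_time_of(demo_cmd):.2f}s")
```

Running the same helper once per target architecture and dividing the ARM
result by the X86 or PPC result would give the factor discussed below.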
With ld-2.20 I observe:
- all software works.
- ld for ARM is always slower than for PPC/X86
- ld for PPC and X86 always takes roughly the same time.
- The user time ratio between ld for ARM and for PPC/X86 varies
between 2 and about 25. The factor of 4 given initially was just a rough
average. At first, I was only curious why ARM takes longer
than other archs.
- For most of my applications the factor is about 3-4.
- I have one kind of application where the factor
explodes to roughly 25 to 30 and the absolute ARM linking time is about 200s.
(This is the case when compiling and linking with -g.)
- with boost's asio lib http server example the factor is roughly two.
With gold-2.20 I observed:
- gold cannot produce executables for ARM.
On Bryan's suggestion, I gave the CVS-HEAD of binutils a try.
With ld-HEAD I observe:
- the linking time for the application that took 200s with ld-2.20
collapses to about 10-12s. This already solves my actual problem,
which is to always use -g for compiling/linking without
abusing the coffee maker when building for ARM.
Because I could not have imagined such a striking result,
it took a push from Bryan for me to try a more recent binutils.
- Still, linking this application with ld-HEAD for ARM is roughly
two times slower than linking with ld-2.20 for X86/PPC.
- I did not test ld-HEAD any further yet.
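As a rough cross-check, the figures quoted above can be combined with a
little arithmetic. This is only a sketch using the approximate numbers
from this mail (200s, a factor of 25 to 30, and 10-12s); the variable
names are illustrative and nothing here is a new measurement.

```python
# Approximate figures quoted above (user time, built with -g).
arm_ld_220 = 200.0           # seconds, ARM link with ld-2.20
factor_range = (25.0, 30.0)  # ARM vs. X86/PPC ratio with ld-2.20
arm_ld_head = (10.0, 12.0)   # seconds, ARM link with ld-HEAD

# X86/PPC link time implied by the ld-2.20 ratio (low, high).
baseline = (arm_ld_220 / factor_range[1], arm_ld_220 / factor_range[0])

# Speedup of ld-HEAD over ld-2.20 for the ARM link (low, high).
speedup = (arm_ld_220 / arm_ld_head[1], arm_ld_220 / arm_ld_head[0])

print(f"implied X86/PPC time: {baseline[0]:.1f}s to {baseline[1]:.1f}s")
print(f"ld-HEAD speedup on ARM: {speedup[0]:.1f}x to {speedup[1]:.1f}x")
# prints:
# implied X86/PPC time: 6.7s to 8.0s
# ld-HEAD speedup on ARM: 16.7x to 20.0x
```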
With gold-HEAD I observe:
- gold can now produce executables for ARM, PPC and X86.
- As *very* briefly tested, the programs execute correctly for
ARM and X86, but not for PowerPC.
- The linking time is reduced by a factor of roughly 4-5 compared to ld's
times (not taking into account the very long "linking time explosion" application).
So I can confirm what has been claimed for gold.
- The factor ARM<->PPC/X86 is still not 1 but quite small, roughly 1.2 to 1.4.
I make of this:
- ld's times are very dependent on whether -g is used or not.
- ld-2.20 has a major efficiency problem when linking
a certain kind of application for ARM (but not for PPC/X86).
- I still cannot tell what "certain" is; the only thing I see is that the program
with the long linking time uses more libraries than the others. I did not find a subset
of those libraries that triggers the problem for ARM.
- this problem of ld has more or less been solved in a more recent version of binutils.
- gold is really fast. Even with -g it's so fast that I right away forgot to compare
it to the linking times without -g.
- there must be some difference between ARM and other archs.
If you or Bryan or someone else is interested in more measurements, please let me know.
> From: "Titus von Boxberg" <email@example.com>
> Date: Tue, 21 Dec 2010 11:58:22 +0100
> Message-ID: <firstname.lastname@example.org>
>> Linking the same software is about 4 times slower for ARM than for
>> the other CPUs.
> Earlier, it was 4x. Is this due to gold/non-gold?
>> The resulting executable is biggest for PowerPC, so that cannot be
>> the reason; also the differences in size are not large enough to
>> explain the time figures.
> The PPC has 32 registers. The ARM makes all instructions conditional,
> the X86 has variable instruction size. If you are comparing the
> binaries of different architectures, it is not really fair. I have
> observed that compressing the binaries on different architectures will
> result in the same size files. If not, then something is possibly not
> being excluded by the linker. I.e., the PPC instruction set is not as
> dense as the ARM and x86, but they usually have the same amount of
> 'information' which is reflected in the original source code.
> There certainly could be architectural differences between the ARM and
> the other architectures. Different compile options might result in
> different link times as well. PPC has short and long jumps, GOT, etc.
> Sometimes compilers insert 'nops' to accommodate a long/short jump.
> Sometimes the linker has to move things around to make room. It does
> seem plausible that there is a technical reason why it takes 20%
> longer (but not 20x or 4x as long). I wouldn't say it is completely
> wasteful to investigate a 20% difference though. It could be a bug.
Thanks for the hints, I'll dig into that.
Regards and Merry Christmas.