Bug 23411

Summary: Different behavior when linking common symbol statically or to shared object
Product: binutils Reporter: H.J. Lu <hjl.tools>
Component: goldAssignee: Cary Coutant <ccoutant>
Status: REOPENED ---    
Severity: normal CC: amonakov, ian
Priority: P2    
Version: 2.32   
Target Milestone: ---   
Host: Target:
Build: Last reconfirmed:

Description H.J. Lu 2018-07-13 17:55:41 UTC
[hjl@gnu-cfl-1 gold-2]$ cat con.c
int globalInt = 7;
[hjl@gnu-cfl-1 gold-2]$ cat main.c
#include <stdio.h>

int globalInt;

int main()
{
   printf("globalInt is %d\n", globalInt);
   return 0;
}
[hjl@gnu-cfl-1 gold-2]$ make
gcc -fuse-ld=gold -fpic -c -o main.o main.c
gcc -fuse-ld=gold -fpic -c -o con.o con.c
gcc -fuse-ld=gold -shared -fpic -o libcon.so con.o
gcc -fuse-ld=gold -o app.shared main.o -L. -lcon -Wl,-R,.
gcc -fuse-ld=gold -o app.static main.o con.o
ar rv libcon.a con.o
r - con.o
gcc -fuse-ld=gold -o app.a main.o libcon.a
./app.shared
globalInt is 0
./app.static
globalInt is 7
./app.a
globalInt is 0
[hjl@gnu-cfl-1 gold-2]$ make clean
rm -f *.o libcon.so app.shared app.static
[hjl@gnu-cfl-1 gold-2]$ make CC=gcc
gcc -fpic -c -o main.o main.c
gcc -fpic -c -o con.o con.c
gcc -shared -fpic -o libcon.so con.o
gcc -o app.shared main.o -L. -lcon -Wl,-R,.
gcc -o app.static main.o con.o
ar rv libcon.a con.o
r - con.o
gcc -o app.a main.o libcon.a
./app.shared
globalInt is 7
./app.static
globalInt is 7
./app.a
globalInt is 7
[hjl@gnu-cfl-1 gold-2]$
Comment 1 Alexander Monakov 2018-07-13 19:43:19 UTC
I think the behavior exhibited by gold is expected given absence of --whole-archive, and it's ld.bfd that is doing something not mandated by the ELF spec here.
Comment 2 Cary Coutant 2018-07-13 20:07:56 UTC
For ANSI/ISO C and C++, gold is doing the right thing here.

Time for a history lesson...

In the old days of K&R C, a declaration like "int globalInt;" was considered a "tentative definition", and was implemented using the same linker mechanism as a Fortran common block (hence, the name "common"). Linkers in the K&R era had the following rules with respect to linking from archive libraries:

- undef in .o, common in archive member: upgrade the undef to a common, but do not include the archive member (unless it also provides another needed symbol definition).

- common in .o, common in archive member: don't include the archive member.

- common in .o, regular def in archive member: include the archive member.

By those rules, yes, the linker would have included con.o from the archive, and you'd see the value 7.

When ANSI C came along, the notion of "tentative definition" was dropped from the language. A declaration "int globalInt;" was considered the same as "int globalInt = 0;", with one exception -- in a nod to compatibility with existing source code, compilers were permitted to allow multiple definitions of the old tentative form, but were supposed to diagnose multiple definitions that did not agree, so your static linking example (main.o + con.o) should give an error.

This required a slight change in the rules of linking:

- undef in .o, common in archive member: include the archive member.

- common in .o, common in archive member: don't include the archive member.

- common in .o, regular def in archive member: don't include the archive member.

Gold implements these rules.

Some compilers chose to implement "int globalInt;" as a regular definition, and kept the use of common for K&R C only. At HP, we added a new SHN_HP_ANSI_COMMON so that we could distinguish between the two cases. Others simply continued to generate common symbols as always (technically in violation of the language standard).

If you change main.c to say "int globalInt = 0;", ld.bfd and gold will both give the same results as gold does in your example, except that the static case (main.o + con.o) will produce a multiple definition error (as it should). According to the language standard, the results should be the same with or without the "= 0".
Comment 3 H.J. Lu 2018-07-13 20:53:25 UTC
We can agree to disagree. I believe the new rule should be enforced
by compiler not to generate COMMON symbol, not by linker.
Comment 4 zenith432 2018-07-14 11:16:15 UTC
There's an ambiguity in ANSI C about this.  From C11
Section 6.9.2 External object definitions

1 If the declaration of an identifier for an object has file scope and an initializer, the declaration is an external definition for the identifier.

2 A declaration of an identifier for an object that has file scope without an initializer, and without a storage-class specifier or with the storage-class specifier static, constitutes a tentative definition. If a translation unit contains one or more tentative definitions for an identifier, and the translation unit contains no external definition for that identifier, then the behavior is exactly as if the translation unit contains a file scope declaration of that identifier, with the composite type as of the end of the translation unit, with an initializer equal to 0.

And then there's Annex J (Informative) Portability issues
J.5.11 Multiple external definitions

1 There may be more than one external definition for the identifier of an object, with or without the explicit use of the keyword extern; if the definitions disagree, or more than one is initialized, the behavior is undefined (6.9.2).

So the normative 6.9.2 decrees things to be the way described in Comment 2, but then J.5.11 which is only informative and is about portability issues talks about "definitions disagree, or more than one is initialized" - but according to 6.9.2 they're always initialized.  So when they wrote J.5.11 they still had in mind the concept of a tentative definition which is uninitialized.
You could argue there's no contradiction, but there's no reason to include the phrase "or more than one is initialized" unless the concept of tentative definition still has any impact.

Anyways, the choice of whether to use ANSI C or GNU extensions is given to the compiler - what business does the linker have enforcing ANSI C after a choice was made during compile-time to use extensions?  There are no ld options to choose whether to enforce ANSI C or not.  The linker is not even restricted to C.  Or C++ (which has ODR).  Fortran has common blocks.  ld supports Fortran, doesn't it?
Comment 5 Cary Coutant 2018-07-14 17:20:58 UTC
The point about Fortran is a fair one, and, I think, compelling. If a .o contains a named common block, and an archive contains a BLOCK DATA, the linker should include the BLOCK DATA.

Compiling with gcc -fno-common does provide a way to get the strict ANSI C behavior, and requires no extra linker support.
Comment 6 zenith432 2018-07-14 18:19:04 UTC
(In reply to Alexander Monakov from comment #1)
> I think the behavior exhibited by gold is expected given absence of
> --whole-archive, and it's ld.bfd that is doing something not mandated by the
> ELF spec here.

So the ELF spec does lay down an undecisive law for dynamic linking of common.  Couldn't find anything comparable for static linking of archives.

http://www.sco.com/developers/gabi/latest/ch4.symtab.html
section 4.20
When the dynamic linker encounters a reference to a symbol that resolves to a definition of type STT_COMMON, it may (but is not required to) change its symbol resolution rules as follows: instead of binding the reference to the first symbol found with the given name, the dynamic linker searches for the first symbol with that name with type other than STT_COMMON. If no such symbol is found, it looks for the STT_COMMON definition of that name that has the largest size.

In the prior section 4.19 there is something about weak symbols which have similar search conundrums
     When the link editor combines several relocatable object files, it does not allow multiple definitions of STB_GLOBAL symbols with the same name. On the other hand, if a defined global symbol exists, the appearance of a weak symbol with the same name will not cause an error. The link editor honors the global definition and ignores the weak ones. Similarly, if a common symbol exists (that is, a symbol whose st_shndx field holds SHN_COMMON), the appearance of a weak symbol with the same name will not cause an error. The link editor honors the common definition and ignores the weak ones.

    When the link editor searches archive libraries [see ``Archive File'' in Chapter 7], it extracts archive members that contain definitions of undefined global symbols. The member's definition may be either a global or a weak symbol. The link editor does not extract archive members to resolve undefined weak symbols. Unresolved weak symbols have a zero value.

============
The ambiguity I mentioned about ANSI C is resolved by noticing that Annex J.5 is a list of common externsions to C.  They are allowed by the STD for portability, but not necessarily conforming.

The GCC documentation for -fno-common states
=====
fno-common
In C code, this option controls the placement of global variables defined with-
out an initializer, known as tentative definitions in the C standard.  Tentative definitions are distinct from declarations of a variable with the extern keyword, which do not allocate storage.
Unix C compilers have traditionally allocated storage for uninitialized global
variables  in  a  common  block.   This  allows  the  linker  to  resolve  all  tentative definitions of the same variable in different compilation units to the same object, or to a non-tentative definition.  This is the behavior specified by ‘-fcommon’, and is the default for GCC on most targets.  On the other hand, this behavior is not required by ISO C, and on some targets may carry a speed or code size penalty on variable references.
The ‘-fno-common’ option specifies that the compiler should instead place unini-
tialized global variables in the data section of the object file.  This inhibits the merging of tentative definitions by the linker so you get a multiple-definition error if the same variable is defined in more than one compilation unit.  Compiling with ‘-fno-common’ is useful on targets for which it provides better performance, or if you wish to verify that the program will work on other systems that always treat uninitialized variable definitions this way.
=====
So the GCC doc claims that the common-based implementation of tentative definitions is simply "not required by ISO C", but the last sentence is a tacit admission that it is in fact non-conforming to ISO C - non-portable.
Comment 7 Alexander Monakov 2018-07-16 11:28:59 UTC
BFD's behavior becomes very problematic with LTO: archive index does not distinguish between normal and common definitions, and there's no explicit plugin API to check if a definition in IR is "common", so the linker has to round-trip via the plugin in a fairly convoluted way to emulate the non-LTO behavior :(

I think it would be prudent to understand *why* ld.bfd wants to override common definitions by unpacking archive members with strong definitions. I don't expect it's for Fortran - is the real reason stated anywhere?
Comment 8 Cary Coutant 2018-07-16 20:42:41 UTC
> BFD's behavior becomes very problematic with LTO: archive index does not
> distinguish between normal and common definitions, and there's no explicit
> plugin API to check if a definition in IR is "common", so the linker has to
> round-trip via the plugin in a fairly convoluted way to emulate the non-LTO
> behavior :(

Sorry, I don't understand your point. Why does the archive index need
to distinguish between normal and common definitions? I can understand
how it would help, but as long as common symbols are listed in the
archive index, the linker can take a closer look at the archive member
to decide whether or not to include the file.

And the plugin API does provide LDPK_COMMON. What's the "convoluted
round trip" you're referring to?

Hmm, maybe I do see what you're getting at... If the linker decides,
based on the archive index, to include an IR file, it asks the plugin
to provide the symbols for the "closer look", but the API doesn't
provide a way for the linker to then say, "never mind, I don't need
that object after all," in the case where the only symbol of interest
happens to be a common. Is that a problem we really need to worry
about? The only real issue with including an otherwise-unneeded object
is the potential violation of the language's namespace pollution rules
(see below).

> I think it would be prudent to understand *why* ld.bfd wants to override common
> definitions by unpacking archive members with strong definitions. I don't
> expect it's for Fortran - is the real reason stated anywhere?

Common blocks were originally for Fortran, yes, and it's clear that
the linker should include from an archive a BLOCK DATA subprogram that
initializes a common block used elsewhere in the program. The linker
support for this was leveraged for K&R C's tentative definitions, so
the Fortran behavior became the de facto behavior for C.

ANSI C's namespace pollution rules are responsible for the newer
behavior. The scenario we're defending against is when a system
library defines a symbol A that is introduced into the namespace only
if header <A.h> is included, and a program that does not include that
header defines its own symbol A for unrelated purposes. The K&R
behavior would have the library's definition of A override the user's
A if the user's A is a tentative definition. The only way to prevent
this is to avoid the use of K&R-style common for tentative
definitions, or (as on HP-UX) to introduce a second type of "ANSI
common".

The argument for switching back to the K&R rules is that (a) we
haven't implemented the "ANSI common" in the Gnu toolchain, (b) we
need to preserve the old behavior for both Fortran and K&R C code, and
(c) we can achieve strict ANSI compliance through the use of GCC's
-fno-common flag.
Comment 9 Alexander Monakov 2018-07-17 04:37:29 UTC
(In reply to Cary Coutant from comment #8)
>  Is that a problem we really need to worry
> about? The only real issue with including an otherwise-unneeded object
> is the potential violation of the language's namespace pollution rules
> (see below).

The reason it's being reported now is complications with plugin interaction. Recently, GCC strengthened internal consistency checks for LTO symbol resolution, and (combined with arguably poor choices for statuses reported for symbols coming from unused members) they fail when such unpacked-but-unused archive members get involved: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86490

I'm chiming in because I've been working on a third party (i.e. non-GCC, non-LLVM) linker plugin recently, and this ld.bfd-specific behavior was completely unexpected and cost some time. The ELF specification does not allow this, the other docs I consulted did not mention it, and the way ar index is built suggests common status is irrelevant for extraction (otherwise the index would include the common flag).

If the BFD behavior is really intended, can at least the documentation please be improved to explain that? And what would really help is a dedicated plugin API entrypoint to allow the linker ask the plugin, "does IR in this file provide a strong definition for any currently common symbol in this array: ["foo", "bar", ...]?"
Comment 10 zenith432 2018-07-17 08:53:02 UTC
(In reply to Alexander Monakov from comment #9)
> The ELF specification does not allow this...

Where do you read that it's disallowed?  Leaving something unspecified is not the same as disallowing it.

(In reply to Cary Coutant from comment #8)
> ... If the linker decides, based on the archive index, to include
> an IR file, it asks the plugin to provide the symbols for the
> "closer look", but the API doesn't provide a way for the linker to
> then say, "never mind, I don't need that object after all," in the case
> where the only symbol of interest happens to be a common.

gold already demonstrates how to do this...

Pluginobj::get_symbol_resolution_info
>  if (static_cast<size_t>(nsyms) > this->symbols_.size())
>    {
>      // We never decided to include this object. We mark all symbols as
>      // preempted.
>      gold_assert(this->symbols_.size() == 0);
>      for (int i = 0; i < nsyms; i++)
>        syms[i].resolution = LDPR_PREEMPTED_REG;
>      return version > 2 ? LDPS_NO_SYMS : LDPS_OK;
>    }

It's a bug in ld.bfd that it ends up setting multiple prevailing defs in this scenario.  The plugin api prohibits(*) setting multiple prevailing defs, and it's not necessary to do so in order to have lto1 purge unwanted symbol defs.

(*) https://gcc.gnu.org/wiki/whopr/driver, rules for the "All Symbols Read" event.