Bug 1811 - ELF linker loads member of archive for common symbol
Status: NEW
Alias: None
Product: binutils
Classification: Unclassified
Component: ld
Version: 2.17
Importance: P2 normal
Target Milestone: ---
Assignee: unassigned
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2005-11-04 21:18 UTC by H.J. Lu
Modified: 2005-11-06 04:58 UTC
CC List: 3 users

See Also:
Host:
Target: i686-pc-linux-gnu
Build:
Last reconfirmed:


Attachments
A testcase (353 bytes, application/octet-stream)
2005-11-04 21:19 UTC, H.J. Lu
A testcase (425 bytes, application/octet-stream)
2005-11-05 15:13 UTC, H.J. Lu

Description H.J. Lu 2005-11-04 21:18:13 UTC
Unlike the COFF linker, the ELF linker loads a member of an archive when that
member provides a definition for a common symbol.
Comment 1 H.J. Lu 2005-11-04 21:19:53 UTC
Created attachment 744
A testcase

I got

[hjl@gnu-d archive]$ make
gcc    -c -o x.o x.c
gcc    -c -o foo.o foo.c
ar rv libfoo.a foo.o
ar: creating libfoo.a
a - foo.o
ld -r -o bar.o x.o libfoo.a
libfoo.a(foo.o):(.bss+0x4): multiple definition of `bar'
x.o:(.bss+0x0): first defined here
make: *** [bar.o] Error 1
rm foo.o
Comment 2 H.J. Lu 2005-11-04 21:50:38 UTC
The change was introduced by

http://sourceware.org/ml/binutils/1999-12/msg00057.html
Comment 3 H.J. Lu 2005-11-04 21:59:41 UTC
According to the comment, the change was made to follow Solaris and HPUX.
Comment 4 Kean Johnston 2005-11-05 04:39:26 UTC
For the record, I will repost a private mail I had with H.J about this.

=========
Ok I've spent a while staring at that code and thinking about
this, and even though you say it is there on purpose, I believe
it is wrong. Here's why.

It is expected behaviour to be able to over-ride variables and
objects that appear in libraries, whether shared or archive
makes no difference. For example, libraries like dmalloc
depend on that behaviour. In this case, so do a lot of GNU
programs. Even though libc may provide getopt, the program
wants to provide its own getopt, especially for long option
handling.

An explicitly named object should always take precedence over
an object in a library. The only reason that is *not* happening
here is because GNU getopt declares optarg as:
  char *optarg;
thus making it a common symbol. If I was to change that to
  char *optarg = 0;
then it becomes a normal data symbol and it will be used
in preference to the one in the library, which is exactly
the intended behaviour. The only reason the link editor has
to pull in the object from the archive is if it provides some
*other* symbol that the program needs, and in that case you
would legitimately get a warning that the same symbol is
defined in two places. However, simply rejecting the explicitly
named object in favour of the object in the archive just because
the explicit object didn't initialize the variable breaks a
very fundamental UNIX paradigm.
========

I read the mail thread pointed to in #2, and Ian asked what SVR4/UnixWare
do. UnixWare treats it as I describe above. In fact the current GNU ld
is broken on that platform because of this. I spoke to the author of the
gABI and he maintains the Solaris linker is broken, and the UnixWare one
is correct. With no prompting he cited almost the exact same reasons I
outlined above. The problem is the gABI doesn't specify semantic interpretation
of COMMON symbols. In the gABI author's words, that was because the behaviour
was "older than ELF itself" and simply the way archives were meant to be
handled.
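
To make the distinction in the quoted mail concrete, here is a minimal C sketch of the two declaration styles. The variable and function names are made up for illustration (they deliberately avoid the real optarg), and whether the first form really becomes a common symbol depends on the compiler: GCC emits commons for tentative definitions under -fcommon, which was the default before GCC 10.

```c
#include <stddef.h>

/* Tentative definition: no initializer. With GCC's -fcommon this is
   emitted as a common symbol (nm shows it as "C"). */
char *my_optarg;

/* Initialized definition: emitted as an ordinary defined data symbol
   (nm shows "B" or "D"), even though the initial value is the same. */
char *my_optarg_init = 0;

/* Hypothetical helper: at run time the two are indistinguishable, since
   both start out as null pointers. The difference is visible only to
   the linker. */
int both_start_null(void)
{
    return my_optarg == NULL && my_optarg_init == NULL;
}
```

Running nm on the object file produced from this sketch is one way to check which symbol type a given compiler actually emits.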
Comment 5 Ian Lance Taylor 2005-11-05 05:07:38 UTC
Unfortunately it's too simple to allude to the historical handling of common
symbols.  In a.out linkers when a common symbol appears in an object, and the
symbol is defined in an object in an archive, then the object in the archive is
pulled into the link (actually this is somewhat target dependent--the SunOS
linker would pull in definitions which were in the .data section but not ones
which were in the .text section, assuming that a function could never merge with
a common symbol).

Moreover, if a common symbol appears in an object, and the symbol is a common
symbol in an object in an archive, then in an a.out linker the size of the
common symbol is changed, but the object is *not* pulled into the link.

This last behaviour is of course pretty crazy.  But in general it isn't
reasonable for the ELF ABI to claim that they just rely on historical behaviour
for the definition of common symbols, because in fact ELF common symbols do not
act like historical ones do.
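
The a.out rule for a common symbol meeting another common symbol in an archive member can be sketched as below. This is only a model of the behaviour described in this comment, not code from any linker: the struct, the function name, and the "larger size wins" rule are assumptions made for illustration.

```c
#include <stddef.h>

/* Hypothetical record for a common symbol already seen in the link. */
struct common_sym {
    const char *name;
    size_t size;
};

/* Model of the a.out behaviour: the common symbol's size may grow to the
   size requested by the archive member (assumed here: the larger size
   wins), but the member itself is not added to the link.
   Returns 1 if the member would be loaded, which in this model is never. */
int merge_archive_common(struct common_sym *in_link, size_t member_size)
{
    if (member_size > in_link->size)
        in_link->size = member_size;
    return 0; /* the archive member stays out of the link */
}
```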

That said, I was never all that happy with this change, and I think the
behaviour before the change was more coherent.  But, unfortunately, given the
way that system files and libraries are written, it is important that we be
compatible with system linkers.  You say the UnixWare linker acts differently. 
That suggests that we need to make this target dependent.  There is precedent for
this in the a.out linker, and the use of the common_skip_ar_symbols field in
struct bfd_link_info.
Comment 6 H.J. Lu 2005-11-05 15:13:36 UTC
Created attachment 746
A testcase

I think the Solaris linker behavior makes some sense. Kean, can you try
this testcase with your linker? I got

bash-3.00$ make
gcc    -c -o main.o main.c
gcc    -c -o define.o define.c
ar rv libtest.a define.o
ar: creating libtest.a
a - define.o
gcc -o main1 main.o libtest.a
gcc -o main2 main.o define.o
gcc -shared -o libtest.so define.o
gcc -o main3 main.o libtest.so -Wl,-rpath,.
./main1
3

./main2
3

./main3
3

It is very consistent.
Comment 7 Kean Johnston 2005-11-05 19:03:01 UTC
(In reply to comment #5)
> Unfortunately it's too simple to allude to the historical handling of common
> symbols.  In a.out linkers when a common symbol appears in an object, and the
> symbol is defined in an object in an archive, then the object in the archive is
> pulled into the link (actually this is somewhat target dependent--the SunOS
You sure about that?

I tried on OpenServer and UnixWare. On OSR5, I tried in both COFF and ELF
modes. In all three cases, the symbol was pulled from the object and NOT
the archive. The SunOS behaviour you described is a bit funky :)

> This last behaviour is of course pretty crazy.  But in general it isn't
> reasonable for the ELF ABI to claim that they just rely on historical behaviour
> for the definition of common symbols, because in fact ELF common symbols do not
> act like historical ones do.
That's a fair comment. I guess it depends on whose view of "historical" behaviour
you take. The author of the gABI is of course a UnixWare-head, so his notion of
"historical" may be a wee bit biased. But he has been with
AT&T/USL/Novell/SCO/Caldera/SCO-again for an awful long time, and is a mine of
historical info.

> That said, I was never all that happy with this change, and I think the
> behaviour before the change was more coherent.  But, unfortunately, given the
> way that system files and libraries are written, it is important that we be
> compatible with system linkers.  You say the UnixWare linker acts differently.
And OpenServer, for what that's worth (actually from a historical perspective,
it's worth a bit because it's a dual-ABI system supporting both SVR3.2 COFF and
SVR4 ELF).

The problem I have with the current implementation is this. Despite what looks
like rational behaviour with H.J's test cases (I'll respond to his comment
next), I don't think the test case proves anything except that the behaviour
*looks* rational. But in terms of everyday developer activities, it's not,
*especially* in the case where the symbol in the object is the same as the
symbol in a system library. The particular case that caused me to discover this
bug was trying to compile jwhois with a version of gcc that was newly modified
to use GNU ld (historically, on OSR5 and UnixWare, the native ld was used,
which is why I never saw this problem before). jwhois legitimately wants to use
its own getopt() library, to support the GNU style long options. I now get a
link failure because optopt is defined in both jwhois and libc.so. (It is worth
noting that libc.so is in fact a normal ar archive that has some number of
objects in it that are meant to be linked directly into the a.out, as well as a
copy of libc.so.1, which is what gets you the dynamic portion; a common trick.)
The libc.so has a member opt_data.o, which defines optopt, optind, optarg etc.
optopt is initialized to 0. In jwhois (and indeed anything that uses the GNU
getopt), optopt isn't initialized, it's just declared as 'char *optopt;'.
Forcing the symbol to come from libc.so simply because the one in there is a
normal data symbol and the one in getopt.o is a common symbol is wrong. The linker
needs no other symbols from opt_data.o, and is pulling it in only because of the
common/global thing.

Extend that to more common cases where I want to, for example, override malloc
for a debugging malloc library. If any portion of malloc had a data symbol (like
a mallopt structure or some such), I would be unable to override malloc() with
my spiffy new malloc-debugging library because GNU ld would be pulling in the
object from the library.

The above situation is made even worse when you are using libc.a instead of
libc.so, for static links.
 
> That suggests that we need to make this target dependent.  There is precedent for
> this in the a.out linker, and the use of the common_skip_ar_symbols field in
> struct bfd_link_info.

Of course I would be happy with making this behaviour optional, because that
would get around my immediate problem and I can go about using GNU ld to my
heart's content. But I think that people who think they need the current
behaviour are in for some nasty surprises, as described above. I tested this on
Solaris 10, and the native link editor does in fact behave the same way the GNU
one does, but that doesn't necessarily make either one correct.

Sorry for the rambling reply :)
Comment 8 Kean Johnston 2005-11-05 19:06:20 UTC
> I think the Solaris linker behavior makes some sense. Kean, can you try
> this testcase with your linker?
I agree that on the surface it makes sense, but it also has very specific broken
behaviour. See previous comment.

> ./main1
> 3
> 
> ./main2
> 3
> 
> ./main3
> 3
> 

I get:
./main1
0

./main2
3

./main3
3

On OSR5 in COFF mode (no main3 because there are no shared libraries):

./main1
0

./main2
3

The above is with the native tools, of course. Using gcc, I get the same results
you do because it's using the same ld you are.
Comment 9 Ian Lance Taylor 2005-11-06 04:58:08 UTC
Am I sure about the a.out behaviour?  Yes, I am.  When I refer to SunOS I do
mean SunOS 4, pre Solaris, which used the a.out object file format.

The strange behaviour of common symbols increasing size even without linking in
the object file was used to make stdin/stdout/stderr work in the traditional
a.out libc.  A linker which failed to implement it correctly could not link a
"hello, world" program.

AT&T went to COFF in SVR3, and they changed the behaviour of common symbols at
that time.  I've used SVR2, but I don't have a clear recollection of how the
linker worked.


I think that on Solaris we have to do what the native linker does.  Likewise on
UnixWare.  So if they have different behaviour, we have to have different
defaults.  It would of course be reasonable to provide a command line option to
control this.
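
The target-dependent default discussed in this thread can be pictured as a single decision function. The sketch below is a simplified model and not BFD or ld code: the enum names, the policy values, and the function are all hypothetical, and only the cases described in this bug are modelled. Treating a common-only archive member as "not loaded" follows the a.out description in comment 5 and is an assumption for ELF.

```c
#include <stdbool.h>

/* How a symbol appears, either in the link so far or in an archive member. */
enum sym_kind { SYM_UNDEFINED, SYM_COMMON, SYM_DEFINED };

/* Hypothetical policy switch; not a real ld option or BFD field. */
enum common_policy {
    PULL_DEFINITION_OVER_COMMON, /* Solaris ld and current GNU ld, as reported here */
    KEEP_COMMON_FROM_OBJECT      /* UnixWare and OpenServer ld, as reported here */
};

/* Decide whether an archive member is loaded because of one symbol. */
bool loads_archive_member(enum sym_kind in_link,
                          enum sym_kind in_member,
                          enum common_policy policy)
{
    /* Assumption: a member that has only a common (or undefined) copy of
       the symbol is never pulled in for that symbol's sake. */
    if (in_member != SYM_DEFINED)
        return false;

    switch (in_link) {
    case SYM_UNDEFINED:
        /* An unresolved reference always pulls in a defining member. */
        return true;
    case SYM_COMMON:
        /* The disputed case: common in an explicit object, real
           definition in the archive. */
        return policy == PULL_DEFINITION_OVER_COMMON;
    case SYM_DEFINED:
    default:
        /* Already defined by an explicit object: nothing to load. */
        return false;
    }
}
```

With the first policy, the bar.o testcase from comment 1 loads foo.o from libfoo.a; with the second, the common symbol from x.o is kept and the member is skipped unless some other symbol needs it.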