Bug 21532 - AArch64: Symbol address inconsistency across compilation units
Summary: AArch64: Symbol address inconsistency across compilation units
Status: RESOLVED FIXED
Alias: None
Product: binutils
Classification: Unclassified
Component: ld (show other bugs)
Version: 2.29
: P2 critical
Target Milestone: ---
Assignee: Jiong Wang
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-05-30 14:14 UTC by Nick Clifton
Modified: 2017-06-15 16:30 UTC (History)
2 users (show)

See Also:
Host:
Target:
Build:
Last reconfirmed:


Attachments
Demonstration of the bug (1.44 KB, application/gzip)
2017-05-30 14:14 UTC, Nick Clifton
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Nick Clifton 2017-05-30 14:14:46 UTC
Created attachment 10074 [details]
Demonstration of the bug

The unit test fails on AArch64 and passes on x86-64, s390x and ppc64.

A static const struct (APFloat::IEEEsingle) is defined in the shared library (libLLVM.so) and accessed both from within the library as well as the executable that links against the shared library.

On AArch64, taking the address of IEEEsingle yields different results for code in the executable and code in the shared library.

It looks like the linker creates a copy of IEEEsingle in the executable but then the GOT entry for IEEEsingle is not updated in libLLVM.so, causing any accesses to IEEEsingle in libLLVM.so to refer to the original definition (which in itself is not a big problem as all the data is constant).

However, this breaks the LLVM APFloat class because it uses the addresses of the various static const structs (IEEEsingle, IEEEdouble, IEEEquad, x87DoubleExtended, etc. which define the properties of the different floating point types) to dispatch on the floating point type in the member functions of APFloat. Thus it is crucial that any of those constants always have the same address no matter where they are referenced from.

APFloat is an arbitrary precision floating point implementation that is used in various places across the whole compiler, e.g. there are various codepaths involving APFloat that will lead to crashes on AArch64.

Running the attached test case should yield the following output on AArch64:

$ ./build.sh
Address of IEEEsingle (APFloatTest.cpp): 0x420208
Address of IEEEsingle (APFloat.cpp)    : 0x3ff78cf0a58

And the following output on x86-64:

$ ./build.sh
Address of IEEEsingle (APFloatTest.cpp): 0x7fbedc3cc9a0
Address of IEEEsingle (APFloat.cpp)    : 0x7fbedc3cc9a0

As you can see the addresses are not consistent on AArch64.
Comment 1 Nick Clifton 2017-06-05 15:06:03 UTC
Here is a reduced, C based, test case:

  % cat main.c

extern int   int_datum_1;
extern int   int_datum_2;
extern int   printf (const char *, ...);
extern void  call_lib (void);

int * ptr1 = & int_datum_1;

int main (void) 
{ 
  printf ("main: int data = %p %p\n", & int_datum_1, & int_datum_2);
  call_lib ();
  return 0;
}

  % cat lib.c

int int_datum_1 = 1;
int int_datum_2 = 1;
extern int printf (const char *, ...);

void call_lib (void) 
{ 
  printf ("lib : int data = %p %p\n", & int_datum_1, & int_datum_2); 
}

  % gcc -fPIC -c main.c lib.c
  % gcc -fPIC -Wl,-Bsymbolic,-z,defs,-soname,libfred.so -shared -o libl.so lib.o
  % gcc -fPIC -o test main.o libl.so
  % LD_LIBRARY_PATH=`pwd`:$LD_LIBRARY_PATH ./test
 
main: int data = 0x420038 0xffff93f40018
lib : int data = 0xffff93f4001c 0xffff93f40018

  Note how the address of int_datum_2 is consistent, but the address of
  int_datum_1 differs.  The difference is that the address of int_datum_1
  is used to initialise ptr1 in main.c.
Comment 2 Jiong Wang 2017-06-06 11:14:23 UTC
Hi Nick,

  Looks like AArch64 could avoid emiting COPY relocation if it's a relocation against writable section. We could adopt similar code from elf_x86_64_adjust_dynamic_symbol or ppc64_elf_adjust_dynamic_symbol.

> It looks like the linker creates a copy of IEEEsingle in the executable but
> then the GOT entry for IEEEsingle is not updated in libLLVM.so, causing any
> accesses to IEEEsingle in libLLVM.so to refer to the original definition

  Does this means this is a glibc bug as well? dynamic linker needs to redirect the reference in the .so to the copy as well?
Comment 3 Nick Clifton 2017-06-06 11:34:39 UTC
Hi Jiong,

  Thanks for taking a look at this.

>    Looks like AArch64 could avoid emiting COPY relocation if it's a relocation
> against writable section. We could adopt similar code from
> elf_x86_64_adjust_dynamic_symbol or ppc64_elf_adjust_dynamic_symbol.

Ah - maybe not.  The original testcase (in the BZ) shows the bug happening when
constant data is used rather than writeable data.  You can also extend the second
testcase I posted to use constant data and the same problem happens.


>> It looks like the linker creates a copy of IEEEsingle in the executable but
>> then the GOT entry for IEEEsingle is not updated in libLLVM.so, causing any
>> accesses to IEEEsingle in libLLVM.so to refer to the original definition
> 
>    Does this means this is a glibc bug as well? dynamic linker needs to redirect
> the reference in the .so to the copy as well?

This is a possibility, although I am not an expert on the dynamic linker.  I think
however that we need to verify that the static linker is producing the correct
dynamic relocations first.

Cheers
  Nick
Comment 4 Jiong Wang 2017-06-06 11:46:29 UTC
(In reply to Nick Clifton from comment #3)
> Hi Jiong,
> 
>   Thanks for taking a look at this.
> 
> >    Looks like AArch64 could avoid emiting COPY relocation if it's a relocation
> > against writable section. We could adopt similar code from
> > elf_x86_64_adjust_dynamic_symbol or ppc64_elf_adjust_dynamic_symbol.
> 
> Ah - maybe not.  The original testcase (in the BZ) shows the bug happening
> when
> constant data is used rather than writeable data.  You can also extend the
> second
> testcase I posted to use constant data and the same problem happens.
> 
> 
> >> It looks like the linker creates a copy of IEEEsingle in the executable but
> >> then the GOT entry for IEEEsingle is not updated in libLLVM.so, causing any
> >> accesses to IEEEsingle in libLLVM.so to refer to the original definition
> > 
> >    Does this means this is a glibc bug as well? dynamic linker needs to redirect
> > the reference in the .so to the copy as well?
> 
> This is a possibility, although I am not an expert on the dynamic linker.  I
> think
> however that we need to verify that the static linker is producing the
> correct
> dynamic relocations first.

Thanks, I will prepare a patch to fix the BFD static linker.
Comment 5 Jiong Wang 2017-06-07 15:21:19 UTC
(In reply to Nick Clifton from comment #3)
> Hi Jiong,
> 
>   Thanks for taking a look at this.
> 
> >    Looks like AArch64 could avoid emiting COPY relocation if it's a relocation
> > against writable section. We could adopt similar code from
> > elf_x86_64_adjust_dynamic_symbol or ppc64_elf_adjust_dynamic_symbol.
> 
> Ah - maybe not.  The original testcase (in the BZ) shows the bug happening
> when
> constant data is used rather than writeable data.  You can also extend the
> second
> testcase I posted to use constant data and the same problem happens.

Hi Nick,

It looks like those cases are relro sections, that the elf attribute is still writable, so COPY relocation still can be elimiated

I have posted the patch at https://sourceware.org/ml/binutils/2017-06/msg00070.html, both testcases on this PR passed.
Comment 6 Sourceware Commits 2017-06-08 08:44:50 UTC
The master branch has been updated by Jiong Wang <jiwang@sourceware.org>:

https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;h=bc327528fd2ccdc6c29ab6ae608085dddbad5cc8

commit bc327528fd2ccdc6c29ab6ae608085dddbad5cc8
Author: Jiong Wang <jiong.wang@arm.com>
Date:   Wed Jun 7 12:05:39 2017 +0100

    [AArch64] Allow COPY relocation elimination
    
      As discussed at the PR, this patch tries to avoid COPY relocation generation
    and propagate the original relocation into runtime if it was relocating on
    writable section.  The ELIMINATE_COPY_RELOCS has been set to true and it's
    underlying infrastructure has been improved so that the COPY reloc elimination
    at least working on absoluate relocations (ABS64) after this patch.
    
    bfd/
            PR ld/21532
            * elfnn-aarch64.c (ELIMINATE_COPY_RELOCS): Set to 1.
            (elfNN_aarch64_final_link_relocate): Also propagate relocations to
            runtime for copy relocation elimination cases.
            (alias_readonly_dynrelocs): New function.
            (elfNN_aarch64_adjust_dynamic_symbol): Keep the dynamic relocs instead
            of generating copy relocation if it is not against read-only sections.
            (elfNN_aarch64_check_relocs): Likewise.
    
    ld/
            * testsuite/ld-aarch64/copy-reloc-eliminate.d: New test.
            * testsuite/ld-aarch64/copy-reloc-exe-eliminate.s: New test source file.
            * testsuite/ld-aarch64/aarch64-elf.exp: Run new testcase.
Comment 7 Jiong Wang 2017-06-08 08:45:41 UTC
Mark as fixed.
Comment 8 Jiong Wang 2017-06-09 11:19:42 UTC
Temporarily reopen this bug as the fix was reverted because it will leak some PC-relative relocations in while PC-relative support on copy relocation elimination is going on as explained at the email.  So this fix itself is incomplete.

I will commit the fix later together PC-relative support after which is done and reviewed.

Sorry for the trouble.
Comment 9 Jiong Wang 2017-06-13 13:35:01 UTC
(In reply to Jiong Wang from comment #8)
> Temporarily reopen this bug as the fix was reverted because it will leak
> some PC-relative relocations in while PC-relative support on copy relocation
> elimination is going on as explained at the email.  So this fix itself is
> incomplete.


The PC-relative regression is looks like the following:

   unresolvable R_AARCH64_ADR_PREL_PG_HI21 relocation against symbol `_ZTIi@@CXXABI_1.3'


The failure was because the symbol "_ZTIi@@CXXABI_1.3" is referenced by both
absoluate relocation types R_AARCH64_ABS_64 and pc-relative relocation types
R_AARCH64_ADR_PREL_PG_HI21.  While The current elfNN_aarch64_check_relocs
implementation only allocate dynrelocs for absoluate relocation types 
R_AARCH64_NN, this caused elfNN_aarch64_adjust_dynamic_symbol missed the
reference of "_ZTIi@@CXXABI_1.3" by R_AARCH64_ADR_PREL_PG_HI21 which is a text
relocation that is read-only in which case we should keep copy relocation.

My understanding of the copy relocation elimination framwork in BFD linker
is we should always allocate dynrelocs for all those relocation types that are
possible to introduce copy relocations, and they are actually relocation types
that will be used to form address of external object in non-PIC executable.

So a complete fix should also fix the current elfNN_aarch64_check_relocs implementation to:

  * Also allocate dynrelocs for a few pc-relative relocation types which will
    be used to form object address under tiny, small memory model.
    
    As AArch64 dynamic linker does not support pc-relative relocations,
    we need to alway keep copy relocation for them later in
    adjust_dynamic_symbol if pc_count is not zero to avoid introducing
    runtime relocation for them.

  * For large memory model, AArch64 also use some MOVW_G* relocations to form
    object address under non-PIC mode.  We need to allocate dynrelocs for their
    reference on symbol as well. 

I am testing a fix and will post it after finishing a full test on binutils/gcc/g++ check plus some big project linking.
Comment 10 Sourceware Commits 2017-06-15 16:29:31 UTC
The master branch has been updated by Jiong Wang <jiwang@sourceware.org>:

https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;h=6353d82b8fa825c2143f41e84b0d5d4446c6e99a

commit 6353d82b8fa825c2143f41e84b0d5d4446c6e99a
Author: Jiong Wang <jiong.wang@arm.com>
Date:   Thu Jun 15 16:51:01 2017 +0100

    [AArch64] Allow COPY relocation elimination
    
      As discussed at the PR, this patch tries to avoid COPY relocation generation
    and propagate the original relocation into runtime if it was relocating on
    writable section.  The ELIMINATE_COPY_RELOCS has been set to true and it's
    underlying infrastructure has been improved so that the COPY reloc elimination
    at least working on absoluate relocations (ABS64) on AArch64.
    
      BFD linker copy relocation elimination framwork requires the backend to always
    allocate dynrelocs for all those relocation types that are possible to introduce
    copy relocations.  This is for adjust_dynamic_symbol hook to be able to get all
    symbol reference information.  Should one symbol is referenced by more than one
    relocations, if there is any of them needs copy relocation then linker should
    generate it.
    
    bfd/
    	PR ld/21532
    	* elfnn-aarch64.c (ELIMINATE_COPY_RELOCS): Set to 1.
    	(elfNN_aarch64_final_link_relocate): Also propagate relocations to
    	runtime for if there needs copy relocation elimination.
    	(need_copy_relocation_p): New function.  Return true for symbol with
    	pc-relative references and if it's against read-only sections.
    	(elfNN_aarch64_adjust_dynamic_symbol): Use need_copy_relocation_p.
    	(elfNN_aarch64_check_relocs): Allocate dynrelocs for relocation types
    	that are related with accessing external objects.
    	(elfNN_aarch64_gc_sweep_hook): Sync the relocation types with the change
    	in elfNN_aarch64_check_relocs.
    
    ld/
    	* testsuite/ld-aarch64/copy-reloc-exe-2.s: New test source file.
    	* testsuite/ld-aarch64/copy-reloc-2.d: New test.
    	* testsuite/ld-aarch64/copy-reloc-exe-eliminate.s: New test source file.
    	* testsuite/ld-aarch64/copy-reloc-eliminate.d: New test.
    	* testsuite/ld-aarch64/copy-reloc-so.s: Define new global objects.
    	* testsuite/ld-aarch64/aarch64-elf.exp: Run new tests.
Comment 11 Jiong Wang 2017-06-15 16:30:03 UTC
Mark as fixed.