23825 – Linker creates COPY relocs for reference to TLS symbols

Bug 23825 - Linker creates COPY relocs for reference to TLS symbols

Summary: Linker creates COPY relocs for reference to TLS symbols

Status:	ASSIGNED

Alias:	None

Product:	binutils
Classification:	Unclassified
Component:	ld (show other bugs)
Version:	2.31

Importance:	P2 normal
Target Milestone:	---
Assignee:	Jim Wilson

URL:
Keywords:

Depends on:
Blocks:

Reported:	2018-10-25 15:10 UTC by Andreas Schwab
Modified:	2019-11-19 19:41 UTC (History)
CC List:	3 users (show)

See Also:
Host:
Target:	riscv64--
Build:
Last reconfirmed:

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Andreas Schwab 2018-10-25 15:10:44 UTC

When linking elf/tst-tls12 from glibc the linker creates COPY relocs for the references to the TLS symbols a1 and a2, ending up overwriting the .preinit_array and .init_array contents.

$ objdump -t tst-tls12.o | grep 'a[12]'
0000000000000000         *UND*  0000000000000000 a1
0000000000000000         *UND*  0000000000000000 a2
$ objdump -R tst-tls12 | grep COPY
0000000000013d68 R_RISCV_COPY      a1
0000000000013d78 R_RISCV_COPY      a2
$ objdump -p tst-tls12 | grep INIT
  PREINIT_ARRAY        0x0000000000013d68
  PREINIT_ARRAYSZ      0x0000000000000008
  INIT_ARRAY           0x0000000000013d70
  INIT_ARRAYSZ         0x0000000000000008

Comment 1 Jim Wilson 2018-10-25 17:06:13 UTC

This is a feature of the RISC-V toolchain, which apparently isn't supported by any other toolchain, and which is known to be broken, but we don't yet know if it is a gcc, binutils, ld.so, or something else bug.  We have a compiler workaround that we don't want to deploy, which just disables this feature.
   https://github.com/riscv/riscv-gcc/pull/118
You could try using this if you have an immediate problem that needs fixing.

I have been trying to debug this problem in ld.so, which is where this appears to go wrong, but gdb isn't quite good enough yet.  That is one of the reasons why I've been spending time trying to get gdb working better.  This may remain broken for a while longer until I am able to figure out where exactly the problem is.

Have you run into a real problem that needs a fix, or is this just glibc testsuite cleanup?  I have a lot of problems to work on, and need to prioritize my efforts on the most important problems that remain.  FYI With my latest gcc and linux kernel patches, and with the math ulps file regenerated, I am down to 13 glibc failures on my unleashed board running fedora core 29.  I was hoping that was good enough for now.  Gdb is over a thousand testsuite failures, and I'd rather work on that.

Comment 2 Andreas Schwab 2018-10-29 14:53:52 UTC

It is unlikely that there are other uses of this construct out there.

Comment 3 Jim Wilson 2019-09-01 03:50:27 UTC

I got an internal bug report with a simplified testcase related to this, took another look, and found the problem.

hifiveu017:1097$ cat tmp.c
#include <stdio.h>
extern __thread int a;
int main (void) {printf ("a = %d\n", a); return 0; }
hifiveu017:1097$ cat tmp2.c
__thread int a = 10;
hifiveu017:1098$ gcc --shared -fpic -o libtmp.so tmp2.c
hifiveu017:1099$ gcc tmp.c libtmp.so -Wl,--rpath=`pwd`
hifiveu017:1100$ ./a.out
Segmentation fault (core dumped)
hifiveu017:1101$ 

Running objdump on a.out, we see

 15 .tdata        00000004  0000000000011df8  0000000000011df8  00000df8  2**2
                  ALLOC, THREAD_LOCAL
 16 .preinit_array 00000008  0000000000011df8  0000000000011df8  00000df8  2**0
                  CONTENTS, ALLOC, LOAD, DATA

So the problem here is that tdata has a size, and is alloc, but the next section preinit_array starts at the same address as tdata which is wrong.

I tracked this down a bit of code in ld/ldlang.c which does

            /* .tbss sections effectively have zero size.  */
            if (!IS_TBSS (os->bfd_section)
		|| bfd_link_relocatable (&link_info))
              dotdelta = TO_ADDR (os->bfd_section->size);
            else
              dotdelta = 0;
            dot += dotdelta;

This may be right for .tbss, but is not right for the RISC-V .tdata section.  The RISC-V .tdata section ended up this way because we have code in bfd/elfnn-riscv.c that does

      htab->sdyntdata =
        bfd_make_section_anyway_with_flags (dynobj, ".tdata.dyn",
                                            (SEC_ALLOC | SEC_THREAD_LOCAL
                                             | SEC_LINKER_CREATED));

So if there is a .tdata.dyn section and no .tdata section, .tdata.dyn gets merged into .tdata and .tdata ends up looking like a .tbss section and gets handled wrong by the linker.

There is also a second problem here, which is that a bss-like section can only work at the end of a segment, but there is no code in the linker script to ensure that .tdata.dyn ends up at the end of a segment.  It is is just mixed in with the other .tdata.* sections, with no guarantee that it is the last one.

I think the solution to this is to just claim that .tdata.dyn has contents, to ensure that space is allocated for it.  Normally, it is only used as a target for copy relocs, and hence has no contents, but this causes multiple problems.  This section will normally be small, so I don't think there is much performance loss from just pretending that it has contents.  I have a patch that does this, and it fixes my testcase, and elf/tst-tls12 from glibc.

Comment 4 Sourceware Commits 2019-09-01 04:25:59 UTC

The master branch has been updated by Jim Wilson <wilson@sourceware.org>:

https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;h=3e7bd7f24146f162565edf878840449f36a8d974

commit 3e7bd7f24146f162565edf878840449f36a8d974
Author: Jim Wilson <jimw@sifive.com>
Date:   Sat Aug 31 21:22:36 2019 -0700

    RISC-V: Fix linker problems with tls copy relocs.
    
    The linker doesn't allocate memory space for sections that are only SEC_ALLOC
    and SEC_THREAD_LOCAL.  See the IS_TBSS test in ld/ldlang.c.  So we need to
    pretend that .tdata.dyn sections have contents to get the right result.  It
    will be marked this way anyways if there is a .tdata section to merge with.
    
    	bfd/
    	PR 23825
    	* elfnn-riscv.c (riscv_elf_create_dynamic_sections): Add SEC_LOAD,
    	SEC_DATA, and SEC_HAS_CONTENTS to .tdata.dyn section.

Comment 5 Jim Wilson 2019-09-04 01:13:15 UTC

An alternative fix might be to rename the .tdata.dyn section to .tbss.dyn.  I haven't tested this yet, and haven't checked the psABI yet to see if this is OK.  But this should avoid the need to change the section flags, and avoids the need to add zero-filled data to the .tdata section for the TLS copy reloc targets.  I haven't closed this bug report yet because I wanted to try this.

Comment 6 Fangrui Song 2019-11-19 07:56:44 UTC

Drive-by comment coming from https://reviews.llvm.org/D70398 (I am a bit reluctant to let the change land. The design choice made in gcc seems fairly bad and I think there is still chance to fix it.)

Using the Local Exec TLS model for extern TLS variable accesses in position-dependent code (-fno-pic) is rather camp. 

A few architectures that I checked:

i386: R_386_TLS_IE
ARM: R_ARM_TLS_IE32
MIPS: R_MIPS_TLS_GOTTPREL
AArch64: R_AARCH64_TLSIE_LD64_GOTTPREL_LO12_NC
PowerPC32: R_PPC_GOT_TPREL16
PowerPC64: R_PPC64_GOT_TPREL16_LO_DS

It does seem that RISC-V is in its own camp. A copy relocated TLS variable wastes a part of the TLS block for every thread. If we used the Initial Exec TLS model for extern TLS variable accesses, we could teach the linker to relax Initial Exec to Local Exec.

Comment 7 Rich Felker 2019-11-19 13:42:32 UTC

Please, just teach GCC not to do this! RISC-V has proper PC-relative addressing and thus the GOT access for IE model is essentially free. There is no reason it should be doing a hack with LE model that imposes a need for copy relocations, doubles the size of shared-library TLS that gets referenced in the main program, and (worst of all!) makes the size of TLS arrays/structures into ABI due to the copy relocation.

Comment 8 Rich Felker 2019-11-19 13:44:42 UTC

To clarify what I mean about ABI and copy relocations, prior to RISC-V doing this, there was no arch where it was unsafe to have extensible (existing part's meaning remains same, new parts added to end) arrays or structs in shared library TLS. This kind of thing is very useful for dispatch arrays, etc. Copy relocations break that.

Comment 9 Jim Wilson 2019-11-19 19:41:25 UTC

This is being discussed in
   https://github.com/riscv/riscv-elf-psabi-doc/issues/122
which is the proper place to discuss RISC-V ABI issues.