Bug 28218 - ld.so: ifunc resolver calls a lazy PLT. When does it work?
Status: UNCONFIRMED
Alias: None
Product: glibc
Classification: Unclassified
Component: dynamic-link
Version: unspecified
Importance: P2 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
Reported: 2021-08-10 17:36 UTC by Fangrui Song
Modified: 2021-08-15 18:56 UTC



Description Fangrui Song 2021-08-10 17:36:59 UTC
If an ifunc resolver calls a function through a PLT entry with a lazy JUMP_SLOT, should it work?

My impression is that this does not necessarily need to work.

That said, the R_X86_64_IRELATIVE in .rela.plt is special.
PR ld/13302 patched GNU ld to place R_X86_64_IRELATIVE in .rela.plt after JUMP_SLOT.
This allows lazy PLT calls.

I think that if this case is to be supported, ld.so should be patched instead.
IRELATIVE relocations are eagerly resolved.
.rela.dyn is conceptually a better place for them, similar to the .plt.got GLOB_DAT optimization.
IRELATIVE relocations are not PLT relocations.

FreeBSD rtld resolves non-IRELATIVE relocations in all modules, then IRELATIVE relocations in all modules.
This allows flexibility on what can be used in ifunc resolvers, and
linkers don't need to place IRELATIVE in special places.
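
The two-pass ordering can be sketched as a small model. This is only an illustration, not FreeBSD rtld code: `enum reloc_kind`, `struct reloc`, `struct module`, and `two_pass_order` are hypothetical names, and each relocation is reduced to an id so the resulting order can be inspected.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model; names and types are illustrative, not rtld's. */
enum reloc_kind { REL_RELATIVE, REL_GLOB_DAT, REL_JUMP_SLOT, REL_IRELATIVE };

struct reloc { enum reloc_kind kind; int id; };
struct module { const struct reloc *relocs; size_t count; };

/* Writes the ids of all relocations into out[] in the order a two-pass
   loader would apply them, and returns how many were written. */
size_t two_pass_order(const struct module *mods, size_t nmods,
                      int *out, size_t cap)
{
    size_t n = 0;
    for (int pass = 0; pass < 2; pass++) {
        for (size_t m = 0; m < nmods; m++) {
            for (size_t i = 0; i < mods[m].count; i++) {
                const struct reloc *r = &mods[m].relocs[i];
                int is_irel = (r->kind == REL_IRELATIVE);
                /* Pass 0 takes everything except IRELATIVE; pass 1 takes
                   only IRELATIVE, i.e. the entries that run resolvers. */
                if (is_irel == (pass == 1)) {
                    assert(n < cap);
                    out[n++] = r->id;
                }
            }
        }
    }
    return n;
}
```

With an executable holding `{IRELATIVE 1, JUMP_SLOT 2}` and a DSO holding `{GLOB_DAT 3, IRELATIVE 4}`, the order comes out as 2, 3, 1, 4: every resolver runs only after the ordinary relocations of every module are done.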

cat > a.c <<eof
  #include <stdio.h>
  
  int a_impl() { return 42; }
  void *a_resolver() {
    puts("a_resolver");
    return (void *)a_impl;
  }
  int a() __attribute__((ifunc("a_resolver")));
  
  // a.o has an R_X86_64_64 referencing the STT_GNU_IFUNC symbol a; the linker turns it into a dynamic relocation in .rela.dyn
  int (*fptr_a)() = a;
  
  int main() { printf("%d\n", a()); }
eof

cc -fpie -c a.c
cc -fuse-ld=bfd -pie a.o -o a


% ./a
[1]    170657 segmentation fault  ./a
% readelf -Wr a

Relocation section '.rela.dyn' at offset 0x4b0 contains 9 entries:
    Offset             Info             Type               Symbol's Value  Symbol's Name + Addend
0000000000003de8  0000000000000008 R_X86_64_RELATIVE                         1150
0000000000003df0  0000000000000008 R_X86_64_RELATIVE                         1110
0000000000004038  0000000000000008 R_X86_64_RELATIVE                         4038
0000000000003fd8  0000000100000006 R_X86_64_GLOB_DAT      0000000000000000 _ITM_deregisterTMCloneTable + 0
0000000000003fe0  0000000400000006 R_X86_64_GLOB_DAT      0000000000000000 __libc_start_main@GLIBC_2.2.5 + 0
0000000000003fe8  0000000500000006 R_X86_64_GLOB_DAT      0000000000000000 __gmon_start__ + 0
0000000000003ff0  0000000600000006 R_X86_64_GLOB_DAT      0000000000000000 _ITM_registerTMCloneTable + 0
0000000000003ff8  0000000700000006 R_X86_64_GLOB_DAT      0000000000000000 __cxa_finalize@GLIBC_2.2.5 + 0
0000000000004040  0000000000000025 R_X86_64_IRELATIVE                        1160

Relocation section '.rela.plt' at offset 0x588 contains 3 entries:
    Offset             Info             Type               Symbol's Value  Symbol's Name + Addend
0000000000004018  0000000200000007 R_X86_64_JUMP_SLOT     0000000000000000 puts@GLIBC_2.2.5 + 0
0000000000004020  0000000300000007 R_X86_64_JUMP_SLOT     0000000000000000 printf@GLIBC_2.2.5 + 0
0000000000004028  0000000000000025 R_X86_64_IRELATIVE                        1160



`int (*fptr_a)() = a;` leads to an R_X86_64_IRELATIVE in .rela.dyn.
If lazy binding is used, R_X86_64_JUMP_SLOT(puts) has not been resolved yet when the R_X86_64_IRELATIVE in .rela.dyn is resolved, so the resolver's call to puts crashes the program.
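
A toy model of that ordering, assuming the loader walks .rela.dyn before .rela.plt and processes each table front to back. `struct rel_entry`, `process_table`, and the slot numbering are made up for illustration; the model only tracks whether a JUMP_SLOT has been processed before a resolver calls through it, and reports failure instead of crashing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum rel_type { T_JUMP_SLOT, T_IRELATIVE };

struct rel_entry {
    enum rel_type type;
    int slot;  /* T_JUMP_SLOT: the slot this entry prepares;
                  T_IRELATIVE: the slot its resolver calls through */
};

/* Processes one relocation table in order. Returns false as soon as an
   IRELATIVE resolver would call through a slot that is not ready yet. */
bool process_table(const struct rel_entry *tab, size_t n,
                   bool *slot_ready, size_t nslots)
{
    for (size_t i = 0; i < n; i++) {
        assert(tab[i].slot >= 0 && (size_t)tab[i].slot < nslots);
        if (tab[i].type == T_JUMP_SLOT)
            slot_ready[tab[i].slot] = true;
        else if (!slot_ready[tab[i].slot])
            return false;  /* the real program would crash here */
    }
    return true;
}
```

Mirroring the readelf output above (slot 0 for puts, slot 1 for printf): a .rela.dyn table holding only the IRELATIVE fails when processed first, while the .rela.plt table, where the IRELATIVE follows both JUMP_SLOT entries, succeeds.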

R_X86_64_IRELATIVE can be seen as an optimization.
R_X86_64_64 referencing an STT_GNU_IFUNC symbol is a different representation.
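
To make the two representations concrete, here is a rough sketch of how a loader could compute the stored value in each case. The helper names (`apply_irelative`, `apply_abs64`) and `struct symbol` are hypothetical, symbol lookup is assumed to have happened already, and addresses are modelled as `uintptr_t` (the integer/function-pointer casts rely on common compiler behaviour).

```c
#include <assert.h>
#include <stdint.h>

typedef int (*impl_fn)(void);
typedef impl_fn (*resolver_fn)(void);

int a_impl(void) { return 42; }
impl_fn a_resolver(void) { return a_impl; }

/* Hypothetical result of a symbol lookup. */
struct symbol {
    const char *name;
    int is_ifunc;     /* nonzero for STT_GNU_IFUNC */
    uintptr_t value;  /* for an ifunc symbol: address of the resolver */
};

/* IRELATIVE: no symbol involved; base + addend is the resolver address. */
uintptr_t apply_irelative(uintptr_t base, uintptr_t addend)
{
    resolver_fn resolve = (resolver_fn)(base + addend);
    assert(resolve != 0);
    return (uintptr_t)resolve();
}

/* R_X86_64_64 against a symbol: if the symbol is an ifunc, the resolver
   is called and its result is used in place of the symbol value. */
uintptr_t apply_abs64(const struct symbol *sym, uintptr_t addend)
{
    uintptr_t value = sym->value;
    if (sym->is_ifunc)
        value = (uintptr_t)((resolver_fn)value)();
    return value + addend;
}
```

Both paths end up storing the address of `a_impl`; the symbolic form just needs a lookup first, which is why IRELATIVE is only usable when the symbol is non-preemptible.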


---

cat > b.c <<eof
   #include <stdio.h>

   int b_impl() { return 42; }
   void *b_resolver() {
     puts("b resolver");
     return (void *)b_impl;
   }
   int b() __attribute__((ifunc("b_resolver")));

   int (*fptr_b)() = b;
eof
cc b.c -fpic -shared -o b.so

Make b.so a DT_NEEDED of the executable and see the crash again.
In this case, R_X86_64_64 is used instead of IRELATIVE.
(b is preemptible, so IRELATIVE cannot be used.)

https://sourceware.org/pipermail/libc-alpha/2021-August/129968.html says
"I don't believe this is the use case we want to support."
Comment 1 Alan Modra 2021-08-12 06:16:39 UTC
> FreeBSD rtld resolves non-IRELATIVE relocations in all modules, then
> IRELATIVE relocations in all modules.
Yes, this is the correct way to do things, and the only general way to support cross-module ifunc resolvers that themselves might need to be relocated.  We have known this for a long time.  See for example https://gcc.gnu.org/legacy-ml/gcc-patches/2009-07/msg01307.html
Comment 2 Florian Weimer 2021-08-12 07:36:34 UTC
(In reply to Alan Modra from comment #1)
> > FreeBSD rtld resolves non-IRELATIVE relocations in all modules, then
> > IRELATIVE relocations in all modules.
> Yes, this is the correct way to do things, and the only general way to
> support cross-module ifunc resolvers that themselves might need to be
> relocated.  We have known this for a long time.  See for example
> https://gcc.gnu.org/legacy-ml/gcc-patches/2009-07/msg01307.html

Cross-module IFUNC resolvers that have relocation dependencies need more deferral than just IRELATIVE relocations. Regular relocations which target an IFUNC symbol need to be deferred, and copy relocations as well. Even that does not completely address all cases.

My impression is that the glibc maintainers think that IFUNC resolvers must not have relocation dependencies. This could mean that they can only be implemented in assembler on some targets.

Initial comments on a proposal were positive: https://sourceware.org/legacy-ml/libc-alpha/2017-01/msg00468.html But I think during later discussions, this approach was rejected.
Comment 3 Fangrui Song 2021-08-15 01:32:13 UTC
(In reply to Florian Weimer from comment #2)
> (In reply to Alan Modra from comment #1)
> > > FreeBSD rtld resolves non-IRELATIVE relocations in all modules, then
> > > IRELATIVE relocations in all modules.
> > Yes, this is the correct way to do things, and the only general way to
> > support cross-module ifunc resolvers that themselves might need to be
> > relocated.  We have known this for a long time.  See for example
> > https://gcc.gnu.org/legacy-ml/gcc-patches/2009-07/msg01307.html
> 
> Cross-module IFUNC resolvers that have relocation dependencies need more
> deferral than just IRELATIVE relocations. Regular relocations which target
> an IFUNC symbol need to be deferred, and copy relocations as well. Even that
> does not completely address all cases.
> 
> My impression is that the glibc maintainers think that IFUNC resolvers must
> not have relocation dependencies. This could mean that they can only be
> implemented in assembler on some targets.
> 
> Initial comments on a proposal were positive:
> https://sourceware.org/legacy-ml/libc-alpha/2017-01/msg00468.html But I
> think during later discussions, this approach was rejected.

I have read https://libc-alpha.sourceware.narkive.com/EDssYfrx/ifunc-resolver-scheduling-bugs-21041-20019

I think using a dependency-based topological sort is probably over-engineering.
A two-pass approach (used by FreeBSD, mentioned by Szabolcs Nagy) works well.
Cross-DSO calls go through JUMP_SLOT relocations, and cross-DSO variable accesses go
through GLOB_DAT or absolute relocations. These relocations have already been set up in
the first pass of relocation resolving. Here is a somewhat complex case.

cat > ./a.c <<eof
  #include <stdio.h>

  int a_impl() { return 42; }
  void *a_resolver() {
    puts("a_resolver");
    return (void *)a_impl;
  }
  int a() __attribute__((ifunc("a_resolver")));

  // a.o has an R_X86_64_64 referencing the STT_GNU_IFUNC symbol a; the linker turns it into a dynamic relocation in .rela.dyn
  int (*fptr_a)() = a;
  int b(); extern int (*fptr_b)();
  int c(); extern int (*fptr_c)();

  int main() {
    printf("%d\n", a());
    printf("b: %p %p\n", fptr_b, b);
    printf("c: %p %p\n", fptr_c, c);
  }
eof
cat > ./b.c <<eof
  #include <stdio.h>
  void c();
  int b_impl() { return 43; }
  void *b_resolver() {
    puts("b_resolver");
    c(); // c is defined in c.so
    return (void *)b_impl;
  }
  int b() __attribute__((ifunc("b_resolver")));
  int (*fptr_b)() = b;
eof
cat > ./c.c <<eof
  #include <stdio.h>
  int c_impl() { return 44; }
  void *c_resolver() {
    puts("c_resolver");
    return (void *)c_impl;
  }
  int c() __attribute__((ifunc("c_resolver")));
  int (*fptr_c)() = c;
eof
cat > ./Makefile <<'eof'
a: a.c b.so c.so
	cc -pie -fpie a.c ./b.so ./c.so -l dl -o $@

b.so: b.c
	cc -shared -fpic b.c -o $@

c.so: c.c
	cc -shared -fpic c.c -o $@
eof

Note that in `b.so`, the ifunc resolver makes a PLT call into `c.so`.
In real-world applications, `c` may be a performance-critical function such as `memset`.

`./a` obviously crashes with glibc. `./a` works on FreeBSD.
```text
c_resolver
b_resolver
c_resolver
a_resolver
b_resolver
c_resolver
42
b: 0x80106e670 0x80106e670
c: 0x801072630 0x801072630
```

It also works when `a` is built with `-no-pie -fno-pic`, where there are copy relocations.
We can let `b.so` depend on `c.so` or swap the link order of `./b.so` and `./c.so`; it still works.

The approach also works with various examples where an ifunc resolver returns a
variable referencing an implementation.


> The main problem with the two-pass approach is that there is no place
> where it can cache symbol lookup results. If we assume that relocation
> performance is dominated by symbol lookups (which is reasonable IMHO),
> we are looking at a very substantial performance hit for the two-pass
> approach. [...]

We can handle just IRELATIVE and ignore R_X86_64_64 referencing STT_GNU_IFUNC.

* Check whether a module has IRELATIVE. If not, skip the second pass.
* If yes, handle IRELATIVE in the second pass.
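
A sketch of that shortcut, with hypothetical names (`struct dso`, `first_pass`, `second_pass`): the first pass records whether a module contains any IRELATIVE, and the second pass does not rescan modules where the flag is clear. Applying the relocations themselves is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum rkind { RK_OTHER, RK_IRELATIVE };

struct dso {
    const enum rkind *relocs;
    size_t count;
    bool has_irelative;  /* set by first_pass */
    bool scanned_twice;  /* demo bookkeeping: did second_pass scan it? */
};

/* First pass: everything except IRELATIVE would be applied here; the
   sketch only remembers whether any IRELATIVE was seen. */
void first_pass(struct dso *d)
{
    assert(d->relocs != NULL || d->count == 0);
    d->has_irelative = false;
    d->scanned_twice = false;
    for (size_t i = 0; i < d->count; i++)
        if (d->relocs[i] == RK_IRELATIVE)
            d->has_irelative = true;
}

/* Second pass: only modules flagged by first_pass are scanned again.
   Returns the number of IRELATIVE relocations that would be resolved. */
size_t second_pass(struct dso *dsos, size_t n)
{
    size_t resolved = 0;
    for (size_t m = 0; m < n; m++) {
        if (!dsos[m].has_irelative)
            continue;
        dsos[m].scanned_twice = true;
        for (size_t i = 0; i < dsos[m].count; i++)
            if (dsos[m].relocs[i] == RK_IRELATIVE)
                resolved++;
    }
    return resolved;
}
```

Modules without IRELATIVE pay only for one flag check in the second pass, so no symbol lookup has to be repeated for them.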
Comment 4 Fangrui Song 2021-08-15 18:56:44 UTC
Apologies. I think an init order based approach is needed.
FreeBSD rtld does use a similar approach.

* Resolve non-`COPY` non-`IRELATIVE` non-`STT_GNU_IFUNC` relocations in all objects. Record which categories of ifunc relocations have appeared.
* Resolve `COPY` relocations
* Prepare a list used to call init functions (if A depends on B, B is ordered before A)
* Resolve relocations in the init order
  + Resolve `IRELATIVE` relocations
  + Resolve other `.rela.dyn` relocations referencing `STT_GNU_IFUNC` (absolute relocations)
  + If `LD_BIND_NOW` or `DF_1_NOW` is in effect, resolve `JUMP_SLOT` relocations eagerly
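
The ordering step can be sketched with a post-order depth-first walk over the dependency graph. This is one simple way to put dependencies first and is not claimed to be what FreeBSD rtld or glibc actually does; `struct object`, `needed`, and `init_order` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DEPS 4

struct object {
    const char *name;
    struct object *needed[MAX_DEPS];  /* dependencies; unused entries are NULL */
    int visited;
};

/* Post-order depth-first walk: every dependency is appended to out[]
   before the object that needs it. Returns the new element count. */
size_t init_order(struct object *obj, struct object **out,
                  size_t n, size_t cap)
{
    if (obj->visited)
        return n;
    obj->visited = 1;
    for (size_t i = 0; i < MAX_DEPS && obj->needed[i] != NULL; i++)
        n = init_order(obj->needed[i], out, n, cap);
    assert(n < cap);
    out[n++] = obj;
    return n;
}
```

For a hypothetical graph where `a` needs `b.so`, `c.so`, and `d.so`, `b.so` needs `c.so`, and `c.so` needs `d.so`, the walk yields d.so, c.so, b.so, a. Resolving ifunc relocations in that order means a resolver only calls into objects whose own ifunc relocations are already done.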

FreeBSD rtld simply goes over the relocations in multiple passes.
In lazy binding mode, the first call through a `JUMP_SLOT` entry goes to the PLT trampoline, which calls the ifunc resolver.
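
A minimal model of that lazy path, assuming a slot that starts unbound and is filled in on first use. `struct plt_slot`, `call_through_plt`, and the `_demo` functions are illustrative stand-ins, not loader interfaces.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*target_fn)(void);
typedef target_fn (*ifunc_resolver_fn)(void);

struct plt_slot {
    target_fn target;            /* NULL until bound; stands in for the GOT entry */
    ifunc_resolver_fn resolver;  /* ifunc resolver found by symbol lookup */
};

int resolver_calls;  /* counts resolver invocations for the demo */

int b_impl_demo(void) { return 43; }

target_fn b_resolver_demo(void)
{
    resolver_calls++;
    return b_impl_demo;
}

/* Stands in for a call through the PLT: bind on first use, then jump
   straight to the cached target on later calls. */
int call_through_plt(struct plt_slot *slot)
{
    if (slot->target == NULL) {
        assert(slot->resolver != NULL);
        slot->target = slot->resolver();
    }
    return slot->target();
}
```

Calling `call_through_plt` twice on the same slot runs the resolver once; the second call goes straight to the cached target.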

For glibc, I think https://sourceware.org/git/?p=glibc.git;a=shortlog;h=refs/heads/fw/bug21041 is worth pursuing.


Below is an example involving the executable `a` and 3 DSOs b.so, c.so, d.so.
b.so's ifunc resolver calls c.
c.so's ifunc resolver calls d.

cat > ./a.c <<eof
#include <stdio.h>

int a_impl() { return 42; }
void *a_resolver() {
  puts("a_resolver");
  return (void *)a_impl;
}
int a() __attribute__((ifunc("a_resolver")));

int (*fptr_a)() = a;
int b(); extern int (*fptr_b)();
int c(); extern int (*fptr_c)();

int main() {
  printf("%d\n", a());
  b();
  printf("b: %p %p\n", fptr_b, b);
  printf("c: %p %p\n", fptr_c, c);
}
eof
cat > ./b.c <<eof
#include <stdio.h>

void c();
int b_impl() { return 42; }
void *b_resolver() {
  puts("b_resolver");
  c();
  return (void *)b_impl;
}
int b() __attribute__((ifunc("b_resolver")));
int (*fptr_b)() = b;
eof
cat > ./c.c <<eof
#include <stdio.h>

int d();
int c_impl() { return 42; }
void *c_resolver() {
  puts("c_resolver");
  d();
  return (void *)c_impl;
}
int c() __attribute__((ifunc("c_resolver")));
int (*fptr_c)() = c;
eof
cat > ./d.c <<eof
#include <stdio.h>

int d_impl() { return 42; }
void *d_resolver() {
  puts("d_resolver");
  return (void *)d_impl;
}
int d() __attribute__((ifunc("d_resolver")));
int (*fptr_d)() = d;
eof
cat > ./Makefile <<'eof'
a: a.c b.so c.so d.so
	${CC} -g -fpie a.c ./b.so ./c.so ./d.so -pie -ldl -o $@

.SUFFIXES: .so
.c.so:
	${CC} -g -fpic $< -shared -o $@
eof

(On Linux, use `bmake a` to build.)

I have annotated the FreeBSD output with comments.

```
% ./a
# resolve d.so in init order
d_resolver  # R_X86_64_64

# resolve c.so in init order
c_resolver  # R_X86_64_64
d_resolver  #   triggered lazy JUMP_SLOT

# resolve b.so in init order
b_resolver  # R_X86_64_64
c_resolver  #   triggered lazy JUMP_SLOT

# resolve a in init order
a_resolver  # R_X86_64_IRELATIVE
b_resolver  # R_X86_64_GLOB_DAT
c_resolver  # R_X86_64_GLOB_DAT

42

b_resolver  # lazy JUMP_SLOT

b: 0x80106f670 0x80106f670
c: 0x801073670 0x801073670

% LD_BIND_NOW=1 ./a
# resolve d.so in init order
d_resolver

# resolve c.so in init order
d_resolver  # eager JUMP_SLOT
c_resolver  # R_X86_64_64

# resolve b.so in init order
c_resolver  # eager JUMP_SLOT
b_resolver  # R_X86_64_64

# resolve a in init order
a_resolver  # R_X86_64_IRELATIVE
b_resolver  # eager JUMP_SLOT
b_resolver  # R_X86_64_GLOB_DAT
c_resolver  # R_X86_64_GLOB_DAT

42
b: 0x80106f670 0x80106f670
c: 0x801073670 0x801073670
```


When `a` is built with `-no-pie -fno-pic`, copy relocations and canonical PLT entries are used.
`R_X86_64_64` relocations in `b.so` and `c.so` are bound to canonical PLT entries, so there are fewer resolver calls.

```text
d_resolver
a_resolver
42
b_resolver
c_resolver
d_resolver
c: 0x201ca0 0x201ca0
b: 0x201c90 0x201c90
```