global pointer gets overwritten with dlopen(3) on RISC-V

Lukasz Stelmach l.stelmach@samsung.com
Fri May 12 14:21:09 GMT 2023


Hi,

We've encountered an issue of a program misbehaving due to its gp value
being overwritten. Let me present our setup and the exact sequence of
events.

We've got a program (the testee) written in C that we test with another
one (a testing harness, the tester) written in C++ with gtest. So far,
so good. To make testing and inspection of the testee's internal state
easier, the tester does not start the testee as a separate process but
loads it with dlopen(3) and calls the testee's main() function. The
testee's data structures get initialised, but main() exits (as desired)
due to some unmet requirements. This is fine. The code of
the testee remains usable and the tester starts calling it function by
function.
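
For illustration, the loading pattern described above might look roughly
like the following in C. This is a minimal sketch, not the actual harness
(which is C++ with gtest): the name run_testee_main and any path passed to
it are hypothetical, and it assumes the testee's main is visible to
dlsym(3).

```c
#include <assert.h>
#include <dlfcn.h>
#include <stdio.h>

/* Signature of the testee's entry point. */
typedef int (*main_fn)(int, char **);

/*
 * Load the object at `path` (hypothetical, e.g. "./testee") and call its
 * main(). Returns main()'s exit status, or -1 if loading or symbol lookup
 * fails. On success the handle is deliberately left open so the testee's
 * code and data stay mapped for later function-by-function calls.
 */
int run_testee_main(const char *path, int argc, char **argv)
{
    /* dlopen(NULL) would return a handle for the tester itself. */
    assert(path != NULL);

    void *handle = dlopen(path, RTLD_NOW);
    if (handle == NULL) {
        fprintf(stderr, "dlopen: %s\n", dlerror());
        return -1;
    }

    /* POSIX requires that a dlsym() result can be converted to a
       function pointer, even though ISO C leaves this undefined. */
    main_fn testee_main = (main_fn)dlsym(handle, "main");
    if (testee_main == NULL) {
        fprintf(stderr, "dlsym: %s\n", dlerror());
        dlclose(handle);
        return -1;
    }

    return testee_main(argc, argv);
}
```

After run_testee_main() returns, the tester can look up further testee
functions with dlsym(3) and call them one by one.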

Alas, this is the point where things go south. What is worse, they do so
in a semi-random fashion. We've seen several different behaviours; they
were consistent between runs, but sometimes changed after recompilation.
Long story short, both the tester and the testee were compiled and linked
with relaxed relocations turned on. Each chunk of code assumed a
different value of the gp register, of course.

What happens — step by step:

1. The tester starts and sets its gp value in _start (see sysdeps/riscv/start.S)

2. The tester loads the testee with dlopen(3)

3. The dlopen(3) call runs the testee's load_gp via preinit_array (see sysdeps/riscv/start.S)

4. The testee's code works fine, because the gp register holds the value
   loaded by the testee's load_gp.

5. The tester's code fails in many curious ways (e.g. stdio doesn't work,
   functions other than the intended ones get called because of
   overwritten GOT entries, etc.). Even in situations when the tester
   didn't fail until the end of its main(), it always caught SIGSEGV in
   __do_global_dtors_aux().

Our fix was to link the tester with the -mno-relax option set. And it
worked. However, it took us a few days to understand all the details, and
we think something needs to be done to avoid the confusing semi-random
failure mode, even though we recognise our use-case is somewhat unusual.

Possible general solutions:

1. Make -mno-relax the default for ld(1) (on Linux?). We have no
benchmarks whatsoever, but global variables aren't very popular in
application code these days and the gp register allows access to a
single memory page (4kB) only. No big deal really.

2. Make dlopen(3) (or any appropriate piece of code deep down in glibc)
recognise the situation where the gp has been set and may be overwritten,
and report an error. Neither overwriting the gp nor loading a binary
without setting it (e.g. by removing load_gp from preinit_array; why is it
there in the first place?) would give us working code.
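
The 4 kB figure in option 1 follows from the 12-bit signed immediate of
RISC-V load, store and addi instructions: a single gp-relative instruction
can reach offsets from -2048 to +2047 bytes around gp. A small sketch of
that arithmetic, with the hypothetical helper gp_reachable and made-up
example addresses:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Range of the 12-bit signed immediate in RISC-V I-type and S-type
   instructions. */
#define GP_OFFSET_MIN (-2048)
#define GP_OFFSET_MAX 2047

/* The reachable window is exactly 4096 bytes wide. */
static_assert(GP_OFFSET_MAX - GP_OFFSET_MIN + 1 == 4096,
              "gp window spans 4 KiB");

/* True if `addr` can be reached from `gp` with one gp-relative access. */
bool gp_reachable(uintptr_t gp, uintptr_t addr)
{
    /* Unsigned subtraction wraps; the cast yields the signed distance
       on two's-complement targets. */
    intptr_t offset = (intptr_t)(addr - gp);
    return offset >= GP_OFFSET_MIN && offset <= GP_OFFSET_MAX;
}
```

With gp at 0x12800, for example, addresses 0x12000 through 0x12fff are
reachable and 0x13000 is not.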

The above solutions aren't mutually exclusive and implementing both of
them seems like a good idea.

Are there any other ways to avoid misbehaviour when a process dlopens an
executable binary and calls its code?

Kind regards,
-- 
Łukasz Stelmach
Samsung R&D Institute Poland
Samsung Electronics