Bug 30186 - RTLD_DEEPBIND interacts badly with LD_PRELOAD
Summary: RTLD_DEEPBIND interacts badly with LD_PRELOAD
Status: UNCONFIRMED
Alias: None
Product: glibc
Classification: Unclassified
Component: dynamic-link (show other bugs)
Version: unspecified
: P2 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2023-03-01 11:33 UTC by Matthew Parkinson
Modified: 2023-03-22 16:57 UTC (History)
3 users (show)

See Also:
Host:
Target:
Build:
Last reconfirmed:


Attachments
Minimal example of issues with RTLD_DEEPBIND (956 bytes, application/x-gzip)
2023-03-01 11:33 UTC, Matthew Parkinson
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Matthew Parkinson 2023-03-01 11:33:00 UTC
Created attachment 14726 [details]
Minimal example of issues with RTLD_DEEPBIND

Loading a library with RTLD_DEEPBIND ignores LD_PRELOAD on any symbols in libc.

# Minimal example 

I have constructed a minimal example to illustrate the problem (also attached to the issue):

  https://github.com/mjp41/deepbindexample

This creates four things

  * `libdeepbind.so` - a library that exposes `expose`, which calls `free` with a bad value
  * `libpreload.so` - a library that provides a `free` that does nothing
  * `main_works` - an application that loads libdeepbind.so with just `RTLD_NOW`
  * `main_fails` - an application that loads libdeepbind.so with `RTLD_NOW | RTLD_DEEPBIND`.

The two main applications just call `expose` on the loaded library `libdeepbind.so`.

If we run
```
LD_PRELOAD=./libpreload.so ./main_works
```
then the program runs successfully: the bad `free` is handled by the preload and ignored.

If we run
```
LD_PRELOAD=./libpreload.so ./main_fails
```
Then we get a segmentation fault from the libc allocator.

This behaviour is not great for debugging and allocator replacement for any application using `RTLD_DEEPBIND`.

# Concrete examples

Previously, for allocator specific overrides, this was addressed with malloc hooks, e.g.

https://github.com/jemalloc/jemalloc/commit/4bb09830133ffa8b27a95bc3727558007722c152

The jemalloc fix will fail silently now as the hooks have been removed in glibc, and jemalloc didn't include `malloc.h` to receive the deprecation warning.

The current behaviour causes issues for Address Sanitizer:

   https://github.com/google/sanitizers/issues/611

and for applications such as PHP cannot override the allocator because of this behaviour:

   https://github.com/php/php-src/issues/10670

# Next steps

As a possible solution, I wonder if there should be an explicit preload scope that is applied first to everything include RTLD_DEEPBIND libraries.  However, I know very little about the loader, and cannot assess the compatibility issues that might cause.

I am raising this issue as requested by Adhemerval Zanella in response to my Twitter thread on the topic 
  https://twitter.com/ParkyMatthew/status/1630500641708683268

There is a previous bug raise on this issue: 
  https://sourceware.org/bugzilla/show_bug.cgi?id=25533
That is marked as WONTFIX. I am not convinced a shared understanding was reached on the topic in that discussion.  I hope my minimal example helps to make the problem clearer.
Comment 1 Adhemerval Zanella 2023-03-06 16:17:25 UTC
IMHO, the main issue is not clear how LD_PRELOAD should work with RTLD_DEEPBIND: should LD_PRELOAD act as another DT_NEEDED library linked against the binary (although with high priority as it was first in static linking order) or should it be handled differently in the search namespace (where RTLD_DEEPBIND always takes it in consideration).

As Florian has put on https://sourceware.org/bugzilla/show_bug.cgi?id=25533#c8 I don't think we can change LD_PRELOAD semantic without the potential breakage on a lot of RTLD_DEEPBIND usages.

One possibility is to maybe another env. var to specify a different preload mode or add another flag on dlopen; both which I am not really fond of. Florian might have another idea.
Comment 2 Matthew Parkinson 2023-03-08 20:41:29 UTC
So based on a suggestion from a colleague Paul Lietar, we have an alternative suggestion that is less invasive. 

Previously, the malloc hooks provided a layer of indirection which has now been removed.  We wonder if you could reintroduce the layer of indirection but only accessible to the loader.

If you make 
   
   extern void* malloc(size_t s)
   {
     return __libc_malloc(s);
   }

Now, if you prevent inlining of __libc_malloc, then you can expose __libc_malloc as something that can be overridden by LD_PRELOAD.  As an RTLD_DEEPBINDed library would call malloc, this would immediately jump to what ever symbol exists for __libc_malloc, when libc was loaded, and this would account for LD_PRELOAD.

I have an example that suggests this could work:

  https://github.com/mjp41/deepbindexample/tree/main/solution

Many allocators already provide symbols for __libc_malloc, e.g.:
 https://github.com/microsoft/mimalloc/blob/dd7348066fe40e8bf372fa4e9538910a5e24a75f/src/alloc-override.c#L273-L285

https://github.com/jemalloc/jemalloc/blob/09e4b38fb1f9a9b505e35ac13b8f99282990bc2c/src/jemalloc.c#L3143-L3175

So these would just start working with RTLD_DEEPBIND.

The obvious question is how much performance would be lost in adding the indirection.  It would not affect existing allocator overrides as they could override both, but for programs using the libc allocator, there would be a cost.

I am sure there are many complexities that I am missing, but I thought this is worth mentioning.
Comment 3 Adhemerval Zanella 2023-03-10 19:20:49 UTC
(In reply to Matthew Parkinson from comment #2)
> So based on a suggestion from a colleague Paul Lietar, we have an
> alternative suggestion that is less invasive. 
> 
> Previously, the malloc hooks provided a layer of indirection which has now
> been removed.  We wonder if you could reintroduce the layer of indirection
> but only accessible to the loader.
> 
> If you make 
>    
>    extern void* malloc(size_t s)
>    {
>      return __libc_malloc(s);
>    }
> 
> Now, if you prevent inlining of __libc_malloc, then you can expose
> __libc_malloc as something that can be overridden by LD_PRELOAD.  As an
> RTLD_DEEPBINDed library would call malloc, this would immediately jump to
> what ever symbol exists for __libc_malloc, when libc was loaded, and this
> would account for LD_PRELOAD.
> 
> I have an example that suggests this could work:
> 
>   https://github.com/mjp41/deepbindexample/tree/main/solution

This essentially adds another PLT indirection on *all* memory allocations, even for default binaries that do no dlopen or do not use RTLD_DEEPBIND.  I don't think this is a good tradeoff.

> 
> Many allocators already provide symbols for __libc_malloc, e.g.:
>  https://github.com/microsoft/mimalloc/blob/
> dd7348066fe40e8bf372fa4e9538910a5e24a75f/src/alloc-override.c#L273-L285
> 
> https://github.com/jemalloc/jemalloc/blob/
> 09e4b38fb1f9a9b505e35ac13b8f99282990bc2c/src/jemalloc.c#L3143-L3175
> 
> So these would just start working with RTLD_DEEPBIND.

__libc_malloc is a strong alias to malloc, so there is no need to override for shared libraries. Jemalloc states it is required to enable static linking, which I am not sure it is true any longer (as long the malloc implementation implements all requires functions).

> 
> The obvious question is how much performance would be lost in adding the
> indirection.  It would not affect existing allocator overrides as they could
> override both, but for programs using the libc allocator, there would be a
> cost.

I don't think the performance overhead is acceptable, and I am almost sure other maintainers will frown upon it as well.

I really think a tunable or an extra option on LD_PRELOAD would be better to instruct that RTLD_DEEPBIND should always consider LD_PRELOADS libraries on namespace resolution.  It has the advantage of being backportable (where a new symbol indirection would require another set of symbols).

> 
> I am sure there are many complexities that I am missing, but I thought this
> is worth mentioning.
Comment 4 Matthew Parkinson 2023-03-11 11:15:38 UTC
(In reply to Adhemerval Zanella from comment #3)

Thank you for spending your time on this issue.  

> This essentially adds another PLT indirection on *all* memory allocations,
> even for default binaries that do no dlopen or do not use RTLD_DEEPBIND.  I
> don't think this is a good tradeoff.
> 
> > The obvious question is how much performance would be lost in adding the
> > indirection.  It would not affect existing allocator overrides as they could
> > override both, but for programs using the libc allocator, there would be a
> > cost.
> 
> I don't think the performance overhead is acceptable, and I am almost sure
> other maintainers will frown upon it as well.

Thanks for the feedback. It is really useful to know.  I have modified the approach to reduce the cost, so I would really appreciate your opinion on this. 

I have put up a simple example here:
  https://github.com/mjp41/deepbindexample/tree/main/solutionopt

The basic idea is to have a library local global variable that specifies if a direct call should be performed.  With this approach the common case would have a single double PLTed call, and then all subsequent calls would be direct (and potentially inlinable by the compiler.  Here is the main part of the code that would be like a libc malloc implementation:

```
#include "stdio.h"
#include "stdbool.h"

__attribute__((visibility("hidden")))
bool direct_call = false;

__attribute__((visibility("hidden")))
void message_impl()
{
    puts("lib.c: message_impl");
}

extern void internal_message()
{
    direct_call = true;
    puts("lib.c: internal_message -> message_impl");
    message_impl();
}

extern void message()
{
    if (!direct_call)
    {
        puts("lib.c: message -> internal_message");
        internal_message();
        return;
    }
    puts("lib.c: message -> message_impl");
    message_impl();
}
```

The first call to `message` will call `internal_message` and then the implementation `message_impl`. The call to `internal_message` will set the library local variable `direct_call` and then any subsequent call to `message` can directly call `message_impl`.

I think this would have similar performance to the `malloc_hook` code, which tested if the hook had been overwritten, and if it had called it.

So this seems closer to acceptable performance, but still introduces a load, comparison and jump. One x64 it adds:

```
    1188:       80 3d a2 2e 00 00 00    cmpb   $0x0,0x2ea2(%rip)        # 4031 <direct_call>
    118f:       74 1f                   je     11b0 <message+0x30>
```

I am not familiar with all the changes to dlmalloc that are in ptmalloc, but in dlmalloc there is some initialisation checking code that could probably be combined with this test to improve performance.

> > 
> > Many allocators already provide symbols for __libc_malloc, e.g.:
> >  https://github.com/microsoft/mimalloc/blob/
> > dd7348066fe40e8bf372fa4e9538910a5e24a75f/src/alloc-override.c#L273-L285
> > 
> > https://github.com/jemalloc/jemalloc/blob/
> > 09e4b38fb1f9a9b505e35ac13b8f99282990bc2c/src/jemalloc.c#L3143-L3175
> > 
> > So these would just start working with RTLD_DEEPBIND.
> 
> __libc_malloc is a strong alias to malloc, so there is no need to override
> for shared libraries. Jemalloc states it is required to enable static
> linking, which I am not sure it is true any longer (as long the malloc
> implementation implements all requires functions).

Good to know, thank you.
 
> I really think a tunable or an extra option on LD_PRELOAD would be better to
> instruct that RTLD_DEEPBIND should always consider LD_PRELOADS libraries on
> namespace resolution.  

I think long term this would address a wider range of problems, and am completely in favour of this. I am concerned it might be a lot of work, so take a while to address?  Hence, why I am considering allocator specific solutions.

> It has the advantage of being backportable (where a
> new symbol indirection would require another set of symbols).

Just to clarify.  Are you saying patching old versions cannot introduce new symbols?  Would that be true if `__libc_malloc` and `malloc` were unaliased?  I.e. would repurposing the symbol be a valid backport?
Comment 5 Matthew Parkinson 2023-03-13 11:52:50 UTC
(In reply to Matthew Parkinson from comment #4)
> (In reply to Adhemerval Zanella from comment #3)
> I have put up a simple example here:
>   https://github.com/mjp41/deepbindexample/tree/main/solutionopt

I have updated the example based on further discussion with Paul Lietar.  The example now checks for the existence of a weak symbol:

```C
#include "stdio.h"
#include "stdbool.h"

__attribute__((visibility("hidden")))
void message_impl()
{
    puts("lib.c: message_impl");
}

__attribute__((weak))
extern void override_message();

extern void message()
{
    if (override_message != NULL)
    {
        puts("lib.c: message -> override_message");
        override_message();
        return;
    }
    puts("lib.c: message -> message_impl");
    message_impl();
}
```

This is simpler and doesn't require the global variable.  It adds a weak-linkage undefined symbol for each of the things that can be overridden.
Comment 6 Matthew Parkinson 2023-03-22 10:52:05 UTC
> > (In reply to Adhemerval Zanella from comment #3)

Do you have advice on how I can progress this?  I am happy to attempt to put together a patch for introducing the extra calls to intercept in LD_PRELOAD as outlined in Comment 5.  But I do not want to start that unless that is some likelihood it would be taken.

The new approach would introduce a couple of instructions on the fast path, but they can easily be speculated, so should incur minimal cost.
Comment 7 Adhemerval Zanella 2023-03-22 11:39:19 UTC
(In reply to Matthew Parkinson from comment #5)
> (In reply to Matthew Parkinson from comment #4)
> > (In reply to Adhemerval Zanella from comment #3)
> > I have put up a simple example here:
> >   https://github.com/mjp41/deepbindexample/tree/main/solutionopt
> 
> I have updated the example based on further discussion with Paul Lietar. 
> The example now checks for the existence of a weak symbol:
> 
> ```C
> #include "stdio.h"
> #include "stdbool.h"
> 
> __attribute__((visibility("hidden")))
> void message_impl()
> {
>     puts("lib.c: message_impl");
> }
> 
> __attribute__((weak))
> extern void override_message();
> 
> extern void message()
> {
>     if (override_message != NULL)
>     {
>         puts("lib.c: message -> override_message");
>         override_message();
>         return;
>     }
>     puts("lib.c: message -> message_impl");
>     message_impl();
> }
> ```
> 
> This is simpler and doesn't require the global variable.  It adds a
> weak-linkage undefined symbol for each of the things that can be overridden.

This might work, the hook requires a global load and branch but it should be 
a predictable branch and performance should be ok.  I would require to add
additional hooks on every malloc function, adding another configuration layer
which is an extra complexity for malloc interposition (which I am not very
found on).

You might ask what other maintainers think about this approach, I personally
still think we should properly add preload option to have a more generic
fix to fix symbol interpostion with RTLD_DEEPBIND for all possible symbols
and not a ad-hoc fix for malloc.
Comment 8 Matthew Parkinson 2023-03-22 11:48:49 UTC
> This might work, the hook requires a global load and branch but it should be 
> a predictable branch and performance should be ok.  I would require to add
> additional hooks on every malloc function, adding another configuration layer
> which is an extra complexity for malloc interposition (which I am not very
> found on).

I agree the complexity is not nice.  
 
> You might ask what other maintainers think about this approach,

Should I email libc-alpha to achieve this?

> I personally
> still think we should properly add preload option to have a more generic
> fix to fix symbol interpostion with RTLD_DEEPBIND for all possible symbols
> and not a ad-hoc fix for malloc.

This is not something I understand well enough to contribute to, so I am just trying to understand if there is something I can do, or if I need to wait for a solution.

Thanks
Comment 9 Adhemerval Zanella 2023-03-22 11:57:36 UTC
(In reply to Matthew Parkinson from comment #8)
> > This might work, the hook requires a global load and branch but it should be 
> > a predictable branch and performance should be ok.  I would require to add
> > additional hooks on every malloc function, adding another configuration layer
> > which is an extra complexity for malloc interposition (which I am not very
> > found on).
> 
> I agree the complexity is not nice.  
>  
> > You might ask what other maintainers think about this approach,
> 
> Should I email libc-alpha to achieve this?

Yes please, I would like to hear from other maintainers if this approach is sound and if I am not missing anything corner spot.

> 
> > I personally
> > still think we should properly add preload option to have a more generic
> > fix to fix symbol interpostion with RTLD_DEEPBIND for all possible symbols
> > and not a ad-hoc fix for malloc.
> 
> This is not something I understand well enough to contribute to, so I am
> just trying to understand if there is something I can do, or if I need to
> wait for a solution.
> 
> Thanks
Comment 10 Siddhesh Poyarekar 2023-03-22 15:45:39 UTC
We just removed hooks recently because they're a common exploit primitive, lets not go back in the direction of adding global function pointers again.

The core issue here is the ordering of RTLD_DEEPBIND vs LD_PRELOAD and IMO the fact that LD_PRELOAD doesn't win is something that should be fixed.  I think it always should win but there may be reasons to have RTLD_DEEPBIND win that I'm unaware of.  Lets discuss fixing that ordering issue instead on the libc-alpha mailing list.
Comment 11 Adhemerval Zanella 2023-03-22 16:25:09 UTC
(In reply to Siddhesh Poyarekar from comment #10)
> We just removed hooks recently because they're a common exploit primitive,
> lets not go back in the direction of adding global function pointers again.

A weak reference generates a GOTPCREL relocation, and with relro should no be subject to same issues of malloc hook (it is either reallocated at loading, or kept null during process execution).

> 
> The core issue here is the ordering of RTLD_DEEPBIND vs LD_PRELOAD and IMO
> the fact that LD_PRELOAD doesn't win is something that should be fixed.  I
> think it always should win but there may be reasons to have RTLD_DEEPBIND
> win that I'm unaware of.  Lets discuss fixing that ordering issue instead on
> the libc-alpha mailing list.
Comment 12 Adhemerval Zanella 2023-03-22 16:25:58 UTC
(In reply to Siddhesh Poyarekar from comment #10)
> We just removed hooks recently because they're a common exploit primitive,
> lets not go back in the direction of adding global function pointers again.

A weak reference generates a GOTPCREL relocation, and with relro should not be subject to same issues of malloc hook (it is either reallocated at loading, or kept null during process execution).

> 
> The core issue here is the ordering of RTLD_DEEPBIND vs LD_PRELOAD and IMO
> the fact that LD_PRELOAD doesn't win is something that should be fixed.  I
> think it always should win but there may be reasons to have RTLD_DEEPBIND
> win that I'm unaware of.  Lets discuss fixing that ordering issue instead on
> the libc-alpha mailing list.
Comment 13 Siddhesh Poyarekar 2023-03-22 16:57:35 UTC
If the malloc overriding approach is what we want to focus on, I'd strongly prefer a tunables based feature that influences relocation; I've advocated for that for a while now.  

For example, a glibc.malloc.impl=jemalloc would give preference to libjemalloc.so when resolving the malloc suite of symbols (perhaps resolve early even with lazy binding), assuming that it has been already loaded either through DT_NEEDED or with LD_PRELOAD.  Maybe it would be useful to make this independent of LD_PRELOAD by searching and loading libjemalloc.so whenever this tunable is set, essentially obsoleting the LD_PRELOAD method of overriding the allocator.

That will take care of all the current issues with overriding implementations:

- No need for LD_PRELOAD anymore
- Ignored for setuid binaries (unless we allow this with systemwide tunables in future)
- No additional indirection
- relro protection
- No need to alter malloc implementations to adjust to this new method

What do you think?