pthread_key_create and pthread_setspecific use the malloc of the respective namespace. The memory is freed in __nptl_deallocate_tsd, again using the malloc of the namespace. These can be different if the base namespace creates a thread, another namespace calls pthread_setspecific, and then the thread exits (by returning from the start function, so via the base namespace). After some heroic patching, this is where the example in <https://sourceware.org/ml/libc-help/2019-06/msg00026.html> fails, and that part will be difficult to address.
Created attachment 11888 [details] Baojun Wang's reproducer from the mailing list thread
Is it fair to say that registered destructors must run in the same namespace they were registered under? We need to record the namespace at the point of registration, and use it again when the destructor is run. Running the destructor in the namespace that was present at registration ensures that the correct free (and any other calls) is used.
For code outside of ld.so, the relevant namespace is always implied by the code address, so I don't think that's the problem here. We probably cannot use malloc for allocating the internal data structures. If we keep using it, we need to call through a wrapper function in GLRO (or ld.so in general), to reach into the right namespace. The user-supplied destructor function is not necessarily in the same namespace as the internal allocation, either.
(In reply to Florian Weimer from comment #3)
> We probably cannot use malloc for allocating the internal data structures.
> If we keep using it, we need to call through a wrapper function in GLRO (or
> ld.so in general), to reach into the right namespace.

My opinion is that we may be able to take a hybrid approach to the problem.

I believe that for certain parts of rtld we can and should use an internal allocator that is isolated from all the allocators in the namespaces. For the purpose of consistency and accountability this allocator cannot be interposed. These allocations are almost always related to book-keeping tasks.

For other parts of rtld, where rtld acts on *behalf* of another API request, it seems logical to try to use the allocator from the namespace, tag the allocation with namespace information, and then release it back using the same namespace.

Back to the question at hand. If the user-supplied destructor calls malloc/free, then as long as we run it in the original namespace, the objects will be correctly allocated and released in the right place. If the user-supplied destructor calls a shared-rtld function (directly or indirectly) which returns a pointer that the destructor must free, then the underlying shared-rtld functions must be of the kind I describe above as "acts on *behalf* of another API" and so must also come from the enclosing namespace.

One way to solve this is:
* Consider the space used by pthread_setspecific to be an "implementation detail" and use a new rtld_calloc() function to allocate it from rtld.
* Subsequent frees in __nptl_deallocate_tsd use rtld_free().

So we invert the problem and move the allocations to the shared rtld namespace, rather than recording where they came from in order to free them correctly (which is possibly costly).
(In reply to Carlos O'Donell from comment #4)
> Back to the question at hand. If the user supplied destructor calls
> malloc/free then as long as we run it in the original namespace the objects
> will be correctly allocated and released in the right place.

Note that there is no concept of “running code in a namespace”. The code *is* the namespace (at least until we gain support for an FDPIC architecture, where things could be different).

> If the user
> supplied destructor calls a shared-rtld function (directly or indirectly)
> which returns a pointer that the destructor must free, then the underlying
> shared-rtld functions must be of the kind I describe above as "acts on
> *behalf* of another API" and so must also come from the enclosing namespace.

True, but I think we should just not add such a function that uses the generic free. There can only be one rtld, so performing tasks in rtld in the non-base namespace will always be problematic. See what we have to do with dlerror to make things work properly.

> One way to solve this is:
> * Consider the space used by pthread_setspecific to be an "implementation
>   detail" and use a new rtld_calloc() function to allocate it from rtld.
> * Subsequent frees in __nptl_deallocate_tsd use rtld_free().

Another approach would introduce per-link-map thread destructors. Which one is better is hard to tell. There are likely interactions between dlmopen and threads that are much harder to solve anyway.
Sorry for hijacking the thread. I guess the issue might be related to ld-linux.so's malloc. I know ld-linux.so itself can call the *malloc* family (including free, realloc, ...) even as a freestanding object, but it also exports the *malloc* family as WEAK symbols:

$ readelf -s /lib/x86_64-linux-gnu/ld-2.27.so | grep tls_get_addr
    27: 00000000000199f0    61 FUNC    GLOBAL DEFAULT   12 __tls_get_addr@@GLIBC_2.3
$ readelf -s /lib/x86_64-linux-gnu/ld-2.27.so | grep WEAK
     9: 000000000001b620    52 FUNC    WEAK   DEFAULT   12 free@@GLIBC_2.2.5
    10: 000000000001b7e0   135 FUNC    WEAK   DEFAULT   12 realloc@@GLIBC_2.2.5
    17: 000000000001b5e0    56 FUNC    WEAK   DEFAULT   12 calloc@@GLIBC_2.2.5
    29: 000000000001b4b0   292 FUNC    WEAK   DEFAULT   12 malloc@@GLIBC_2.2.5

Even when `dlmopen` is called, there is only a single copy of `ld-linux.so`, so when the new linker namespace calls `__tls_get_addr`, which in turn can call `malloc`, ld-linux.so resolves `malloc` into the default namespace's malloc, because `malloc@@GLIBC_2.2.5` could already have been resolved by ld-linux.so previously.

So it looks to me like the isolation is not strong enough: it is very easy to break out of the isolation when TLS is involved. If ld-linux.so did not provide the malloc family as interposable WEAK symbols, the isolation could be stronger. Hence, IMHO, ld-linux.so should not try to re-resolve the malloc family symbols.
Back to my issue attached by Florian: my program's stack overflowed by 500+ MB:

    frame #40114: 0x00007ffff7deea28 ld-linux-x86-64.so.2`__tls_get_addr at tls_get_addr.S:55
    frame #40115: 0x00007ffff7ef0fa2  // <---- DSO in new linker ns calls __tls_get_addr, via std::io::_eprintln (Rust code)
    frame #40116: 0x00007ffff764e90a libc.so.6`arena_get2(size=576, avoid_arena=0x00007ffff6338770) at arena.c:888  // <----- calls get_nprocs
    frame #40117: 0x00007ffff765354d libc.so.6`tcache_init at arena.c:879
    frame #40118: 0x00007ffff7653530 libc.so.6`tcache_init at malloc.c:2986
    frame #40119: 0x00007ffff76541cb libc.so.6`__GI___libc_malloc at malloc.c:2983
    frame #40120: 0x00007ffff76541b0 libc.so.6`__GI___libc_malloc(bytes=160) at malloc.c:3042  // <---- resolved to default namespace instead of the new one
    frame #40121: 0x00007ffff7de7b90 ld-linux-x86-64.so.2`tls_get_addr_tail at dl-tls.c:594
    frame #40122: 0x00007ffff7de7b6c ld-linux-x86-64.so.2`tls_get_addr_tail at dl-tls.c:607
    frame #40123: 0x00007ffff7de7b5e ld-linux-x86-64.so.2`tls_get_addr_tail(ti=0x00007ffff7f227e0, dtv=0x000000000060a090, the_map=0x0000000000602360) at dl-tls.c:787
    frame #40124: 0x00007ffff7deea28 ld-linux-x86-64.so.2`__tls_get_addr at tls_get_addr.S:55
    frame #40125: 0x00007ffff7ef0fa2

From the call stack you can see __tls_get_addr -> tls_get_addr_tail -> malloc resolved into the default namespace's malloc (sorry, this might not be obvious because the full /proc/<pid>/maps is rather long). The malloc calls arena_get2 -> get_nprocs, and get_nprocs might call SYS_open to read the information it needs.
Now that's an issue for me, because my DSO (in the new linker ns) actually intercepts `SYS_open` and does something like `printf` (or std::io::_eprint in Rust):

    000000000001af50 <std::io::stdio::_eprint>:
       1af50:	53                   	push   %rbx
       1af51:	48 81 ec c0 00 00 00 	sub    $0xc0,%rsp
       1af58:	0f 10 07             	movups (%rdi),%xmm0
       1af5b:	0f 10 4f 10          	movups 0x10(%rdi),%xmm1
       1af5f:	0f 10 57 20          	movups 0x20(%rdi),%xmm2
       1af63:	0f 29 94 24 a0 00 00 	movaps %xmm2,0xa0(%rsp)
       1af6a:	00
       1af6b:	0f 29 8c 24 90 00 00 	movaps %xmm1,0x90(%rsp)
       1af72:	00
       1af73:	0f 29 84 24 80 00 00 	movaps %xmm0,0x80(%rsp)
       1af7a:	00
       1af7b:	48 8d 05 d5 fa 01 00 	lea    0x1fad5(%rip),%rax        # 3aa57 <str.2+0x5b7>
       1af82:	48 89 84 24 b0 00 00 	mov    %rax,0xb0(%rsp)
       1af89:	00
       1af8a:	48 c7 84 24 b8 00 00 	movq   $0x6,0xb8(%rsp)
       1af91:	00 06 00 00 00
       1af96:	48 8d 3d 43 18 03 00 	lea    0x31843(%rip),%rdi        # 4c7e0 <_GLOBAL_OFFSET_TABLE_+0x80>
       1af9d:	e8 de d0 fe ff       	callq  8080 <__tls_get_addr@plt>
       1afa2:	48 83 b8 40 00 00 00 	cmpq   $0x1,0x40(%rax)  // <--- 0x00007ffff7ef0fa2 from the stack frame

That is what caused the huge stack overflow, and the reason why patching `get_nprocs` solved my issue, I would think. As mentioned earlier, if ld-linux.so resolved `malloc` to the new linker namespace's version, there wouldn't be any issue (because my DSO only intercepts syscalls from the *default* namespace). It may sound like this particular issue only applies to my use case, but I really think the WEAK malloc family symbols make the isolation much weaker. Do you think it would be a reasonable request to make the malloc family symbols from ld-linux.so PRIVATE?
It could be even more nuanced than that: __tls_get_addr could end up allocating memory from the default namespace, while a free from the new namespace would then release memory that was allocated in the default namespace, resulting in memory corruption. I've seen this happen sporadically, though I haven't found a solid instance where it definitely happened, as neither gdb nor lldb is friendly to core dumps with multiple linker namespaces.
(In reply to wangbj@gmail.com from comment #7)
> It could be even more nuanced than that: __tls_get_addr could cause
> mallocing memory from the default namespace, while free from the new
> namespace would end up freeing memory allocated from the default namespace,
> resulting memory corruption, I've seen this happen sporadically, though
> haven't found a solid instance it indeed happened, as neither gdb nor lldb
> is friendly to core dumps with multiple linker namespace.

This is exactly what happens if the other (easier) issues are fixed, and it is the scope of this bug report. It's currently not easy to see this because of the other issues (bug 24772, bug 24773, if I recall correctly).
Bug 26955 raises the related issue that there is just one mapping per thread from key to data, but the data structure that maintains the keys is duplicated if libc.so is loaded more than once.
*** Bug 26955 has been marked as a duplicate of this bug. ***
/tmp > cat ldr.c
#define _GNU_SOURCE
#include <dlfcn.h>
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

int main() {
    pthread_key_t key = 0xFFFFFFFF;
    pthread_key_create(&key, NULL);
    pthread_setspecific(key, (void*) 0xDEADBEEF);
    printf("[tid=%lx] Key created: %u with value %p\n",
           pthread_self(), key, pthread_getspecific(key));

    printf("[New LMID]\n");
    void *ldr2 = dlmopen(LM_ID_NEWLM, "./ldr2.so", RTLD_NOW | RTLD_LOCAL);
    if (!ldr2) {
        printf("ERR: %s\n", dlerror());
        return -1;
    }

    printf("[Base LMID]\n");
    ldr2 = dlmopen(LM_ID_BASE, "./ldr2.so", RTLD_NOW | RTLD_LOCAL);
    if (!ldr2) {
        printf("ERR: %s\n", dlerror());
        return -1;
    }
    return 0;
}

/tmp > cat ldr2.c
#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>

void init(int argc, char *argv[], char *envp[]) {
    pthread_key_t key = 0xFFFFFFFF;
    pthread_key_create(&key, NULL);
    printf("[tid=%lx] Key created: %u\n", pthread_self(), key);
    printf("[tid=%lx] Key value: %p\n", pthread_self(), pthread_getspecific(key));
}

__attribute__((section(".init_array"))) void (*__init)(int, char*[], char*[]) = init;

/tmp > gcc -o ldr ldr.c -pthread -ldl
/tmp > gcc -shared -o ldr2.so ldr2.c -pthread
/tmp > ./ldr
[tid=7fe649b0a740] Key created: 0 with value 0xdeadbeef
[New LMID]
[tid=7fe649b0a740] Key created: 0
[tid=7fe649b0a740] Key value: 0xdeadbeef
[Base LMID]
[tid=7fe649b0a740] Key created: 2
[tid=7fe649b0a740] Key value: (nil)

Pain..
I think your example is pretty much the same as mine in https://sourceware.org/bugzilla/show_bug.cgi?id=26955 I "fixed" it by using a LD_PRELOAD that provides its own implementation of pthread_key_create etc. that are not entirely correct (lowish fixed limit of keys, no freeing, implemented using the thread_local keyword) but work for my purpose. Unfortunately dlmopen still seems to be the most realistic solution when one has the misfortune of having a process using Python3 that has a library dependency that requires Python2.
There is a more fundamental problem with pthread_key_create from a dlmopen'ed copy of libpthread: it creates duplicate keys (since it looks in the local copy of __pthread_keys). The key is then used as an index into THREAD_SELF->specific_1stblock etc., which is shared between all the namespaces, with a disastrous result.

Real-life example: https://stackoverflow.com/q/70030529

Trivial repro:

// --- cut ---
#define _GNU_SOURCE
#include <assert.h>
#include <dlfcn.h>
#include <pthread.h>

typedef int (*pthread_key_create_t)(pthread_key_t *, void (*)(void *));

int main() {
    pthread_key_t k1, k2;
    int rc;

    rc = pthread_key_create(&k1, NULL);

    void *h = dlmopen(LM_ID_NEWLM, "libpthread.so.0", RTLD_LAZY);
    assert(h != NULL);

    pthread_key_create_t fn =
        (pthread_key_create_t) dlsym(h, "pthread_key_create");
    assert(fn != NULL);

    rc = fn(&k2, NULL);
    assert(rc == 0);
    assert(k2 != k1);
    return 0;
}
// --- cut ---

gcc -g t.c -pthread -ldl -Wall -Wextra
./a.out
a.out: t.c:21: main: Assertion `k2 != k1' failed.
Aborted
*** Bug 32135 has been marked as a duplicate of this bug. ***