pthread_key_create and pthread_setspecific use the malloc of the respective namespace. The memory is freed in __nptl_deallocate_tsd, again using the malloc of the namespace. These can be different if the base namespace creates a thread, another namespace calls pthread_setspecific, and then the thread exits (by returning from the start function, so via the base namespace). After some heroic patching, this is where the example in <https://sourceware.org/ml/libc-help/2019-06/msg00026.html> fails, and that part will be difficult to address.
Created attachment 11888 [details] Baojun Wang's reproducer from the mailing list thread
Is it fair to say that registered destructors must run in the same namespace they were registered under? We need to record the namespace at the point of registration, and use it again when the destructor is run. Running the destructor in the namespace that was present at registration ensures that the correct free (and any other calls) is used.
For code outside of ld.so, the relevant namespace is always implied by the code address, so I don't think that's the problem here. We probably cannot use malloc for allocating the internal data structures. If we keep using it, we need to call through a wrapper function in GLRO (or ld.so in general), to reach into the right namespace. The user-supplied destructor function is not necessarily in the same namespace as the internal allocation, either.
(In reply to Florian Weimer from comment #3)
> We probably cannot use malloc for allocating the internal data structures.
> If we keep using it, we need to call through a wrapper function in GLRO (or
> ld.so in general), to reach into the right namespace.

My opinion is that we may be able to take a hybrid approach to the problem.

I believe that for certain parts of rtld we can and should use an internal allocator that is isolated from all the allocators in the namespaces. For the purpose of consistency and accountability this allocator cannot be interposed. These allocations are almost always related to book-keeping tasks.

For other parts of rtld, where rtld acts on *behalf* of another API request, it seems logical to try to use the allocator from the namespace, tag the allocation with namespace information, and then release it back using the same namespace.

Back to the question at hand. If the user-supplied destructor calls malloc/free, then as long as we run it in the original namespace, the objects will be correctly allocated and released in the right place. If the user-supplied destructor calls a shared-rtld function (directly or indirectly) which returns a pointer that the destructor must free, then the underlying shared-rtld functions must be of the kind I describe above as "acts on *behalf* of another API" and so must also come from the enclosing namespace.

One way to solve this is:
* Consider the space used by pthread_setspecific to be an "implementation detail" and use a new rtld_calloc() function to allocate it from rtld.
* Subsequent frees in __nptl_deallocate_tsd use rtld_free().

So we invert the problem and move the allocations to the shared rtld namespace, rather than recording where they came from in order to free them correctly (which is possibly costly).
(In reply to Carlos O'Donell from comment #4)
> Back to the question at hand. If the user supplied destructor calls
> malloc/free then as long as we run it in the original namespace the objects
> will be correctly allocated and released in the right place.

Note that there is no concept of “running code in a namespace”. The code *is* the namespace (at least until we gain support for an FDPIC architecture, where things could be different).

> If the user
> supplied destructor calls a shared-rtld function (directly or indirectly)
> which returns a pointer that the destructor must free, then the underlying
> shared-rtld functions must be of the kind I describe above as "acts on
> *behalf* of another API" and so must also come from the enclosing namespace.

True, but I think we should just not add such a function that uses the generic free. There can only be one rtld, so performing tasks in rtld in the non-base namespace will always be problematic. See what we have to do with dlerror to make things work properly.

> One way to solve this is:
> * Consider the space used by pthread_setspecific to be an "implementation
>   detail" and use a new rtld_calloc() function to allocate it from rtld.
> * Subsequent frees in __nptl_deallocate_tsd use rtld_free().

Another approach would introduce per-link-map thread destructors. Which one is better is hard to tell. There are likely interactions between dlmopen and threads that are much harder to solve anyway.
Sorry for hijacking the thread. I guess the issue might be related to ld-linux.so's malloc. I know ld-linux.so itself can call the *malloc* family (including free, realloc, ...) even as a freestanding object, but it also exports the *malloc* family as WEAK symbols:

$ readelf -s /lib/x86_64-linux-gnu/ld-2.27.so | grep tls_get_addr
    27: 00000000000199f0    61 FUNC    GLOBAL DEFAULT   12 __tls_get_addr@@GLIBC_2.3
$ readelf -s /lib/x86_64-linux-gnu/ld-2.27.so | grep WEAK
     9: 000000000001b620    52 FUNC    WEAK   DEFAULT   12 free@@GLIBC_2.2.5
    10: 000000000001b7e0   135 FUNC    WEAK   DEFAULT   12 realloc@@GLIBC_2.2.5
    17: 000000000001b5e0    56 FUNC    WEAK   DEFAULT   12 calloc@@GLIBC_2.2.5
    29: 000000000001b4b0   292 FUNC    WEAK   DEFAULT   12 malloc@@GLIBC_2.2.5

Even when `dlmopen` is called, there is only a single copy of `ld-linux.so`, so when the new linker namespace calls `__tls_get_addr`, which in turn can call `malloc`, ld-linux.so resolves `malloc` into the default namespace's malloc, because `malloc@@GLIBC_2.2.5` could already have been resolved by ld-linux.so previously.

So it looks to me like the isolation is not strong enough: it is very easy to break out of the isolation when TLS is involved. If ld-linux.so did not provide the malloc family as interposable WEAK symbols, the isolation could be stronger. Hence, IMHO, ld-linux.so should not try to re-resolve the malloc family symbols.
Back to my issue attached by Florian: my program's stack overflowed by 500+ MB:

    frame #40114: 0x00007ffff7deea28 ld-linux-x86-64.so.2`__tls_get_addr at tls_get_addr.S:55
    frame #40115: 0x00007ffff7ef0fa2  // <---- DSO in new linker ns calls __tls_get_addr, via std::io::_eprintln (Rust code)
    frame #40116: 0x00007ffff764e90a libc.so.6`arena_get2(size=576, avoid_arena=0x00007ffff6338770) at arena.c:888  // <----- calls get_nprocs
    frame #40117: 0x00007ffff765354d libc.so.6`tcache_init at arena.c:879
    frame #40118: 0x00007ffff7653530 libc.so.6`tcache_init at malloc.c:2986
    frame #40119: 0x00007ffff76541cb libc.so.6`__GI___libc_malloc at malloc.c:2983
    frame #40120: 0x00007ffff76541b0 libc.so.6`__GI___libc_malloc(bytes=160) at malloc.c:3042  // <---- resolved to default namespace instead of the new one
    frame #40121: 0x00007ffff7de7b90 ld-linux-x86-64.so.2`tls_get_addr_tail at dl-tls.c:594
    frame #40122: 0x00007ffff7de7b6c ld-linux-x86-64.so.2`tls_get_addr_tail at dl-tls.c:607
    frame #40123: 0x00007ffff7de7b5e ld-linux-x86-64.so.2`tls_get_addr_tail(ti=0x00007ffff7f227e0, dtv=0x000000000060a090, the_map=0x0000000000602360) at dl-tls.c:787
    frame #40124: 0x00007ffff7deea28 ld-linux-x86-64.so.2`__tls_get_addr at tls_get_addr.S:55
    frame #40125: 0x00007ffff7ef0fa2

From the call stack you can see __tls_get_addr -> tls_get_addr_tail -> malloc resolved into the default namespace's malloc (sorry, this might not be obvious because the full /proc/<pid>/maps is rather long). The malloc calls arena_get2 -> get_nprocs, and get_nprocs might call SYS_open to read the information it needs.
Now that's an issue for me, because my DSO (in the new linker ns) actually intercepts `SYS_open` and does something like `printf` (or std::io::_eprint in Rust):

    000000000001af50 <std::io::stdio::_eprint>:
       1af50:	53                   	push   %rbx
       1af51:	48 81 ec c0 00 00 00 	sub    $0xc0,%rsp
       1af58:	0f 10 07             	movups (%rdi),%xmm0
       1af5b:	0f 10 4f 10          	movups 0x10(%rdi),%xmm1
       1af5f:	0f 10 57 20          	movups 0x20(%rdi),%xmm2
       1af63:	0f 29 94 24 a0 00 00 	movaps %xmm2,0xa0(%rsp)
       1af6a:	00
       1af6b:	0f 29 8c 24 90 00 00 	movaps %xmm1,0x90(%rsp)
       1af72:	00
       1af73:	0f 29 84 24 80 00 00 	movaps %xmm0,0x80(%rsp)
       1af7a:	00
       1af7b:	48 8d 05 d5 fa 01 00 	lea    0x1fad5(%rip),%rax        # 3aa57 <str.2+0x5b7>
       1af82:	48 89 84 24 b0 00 00 	mov    %rax,0xb0(%rsp)
       1af89:	00
       1af8a:	48 c7 84 24 b8 00 00 	movq   $0x6,0xb8(%rsp)
       1af91:	00 06 00 00 00
       1af96:	48 8d 3d 43 18 03 00 	lea    0x31843(%rip),%rdi        # 4c7e0 <_GLOBAL_OFFSET_TABLE_+0x80>
       1af9d:	e8 de d0 fe ff       	callq  8080 <__tls_get_addr@plt>
       1afa2:	48 83 b8 40 00 00 00 	cmpq   $0x1,0x40(%rax)  // <--- 0x00007ffff7ef0fa2 from the stack frame

That is what caused the huge stack overflow, and the reason why patching `get_nprocs` solved my issue, I would think. As mentioned earlier, if ld-linux.so resolved `malloc` to the new linker namespace's version, there wouldn't be any issue (because my DSO only intercepts syscalls from the *default* namespace). It may sound like this particular issue only applies to my use case, but I really think the WEAK malloc family symbols make the isolation much weaker. Do you think it would be a reasonable request to make the malloc family symbols from ld-linux.so PRIVATE?
It could be even more nuanced than that: __tls_get_addr could end up allocating memory from the default namespace, while a free from the new namespace would then release memory that was allocated in the default namespace, resulting in memory corruption. I've seen this happen sporadically, though I haven't found a solid instance where it definitely happened, as neither gdb nor lldb is friendly to core dumps with multiple linker namespaces.
(In reply to wangbj@gmail.com from comment #7)
> It could be even more nuanced than that: __tls_get_addr could cause
> mallocing memory from the default namespace, while free from the new
> namespace would end up freeing memory allocated from the default namespace,
> resulting memory corruption, I've seen this happen sporadically, though
> haven't found a solid instance it indeed happened, as neither gdb nor lldb
> is friendly to core dumps with multiple linker namespace.

This is exactly what happens if the other (easier) issues are fixed, and it is the scope of this bug report. It's currently not easy to see this because of the other issues (bug 24772, bug 24773, if I recall correctly).
Bug 26955 raises the related issue that there is just one mapping per thread from key to data, but the data structure that maintains the keys is duplicated if libc.so is loaded more than once.
*** Bug 26955 has been marked as a duplicate of this bug. ***
/tmp > cat ldr.c
#define _GNU_SOURCE
#include <dlfcn.h>
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

int main() {
    pthread_key_t key = 0xFFFFFFFF;
    pthread_key_create(&key, NULL);
    pthread_setspecific(key, (void*) 0xDEADBEEF);
    printf("[tid=%lx] Key created: %u with value %p\n",
           pthread_self(), key, pthread_getspecific(key));

    printf("[New LMID]\n");
    void *ldr2 = dlmopen(LM_ID_NEWLM, "./ldr2.so", RTLD_NOW | RTLD_LOCAL);
    if (!ldr2) {
        printf("ERR: %s\n", dlerror());
        return -1;
    }

    printf("[Base LMID]\n");
    ldr2 = dlmopen(LM_ID_BASE, "./ldr2.so", RTLD_NOW | RTLD_LOCAL);
    if (!ldr2) {
        printf("ERR: %s\n", dlerror());
        return -1;
    }
    return 0;
}

/tmp > cat ldr2.c
#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>

void init(int argc, char *argv[], char *envp[]) {
    pthread_key_t key = 0xFFFFFFFF;
    pthread_key_create(&key, NULL);
    printf("[tid=%lx] Key created: %u\n", pthread_self(), key);
    printf("[tid=%lx] Key value: %p\n", pthread_self(), pthread_getspecific(key));
}

__attribute__((section(".init_array"))) void (*__init)(int, char*[], char*[]) = init;

/tmp > gcc -o ldr ldr.c -pthread -ldl
/tmp > gcc -shared -o ldr2.so ldr2.c -pthread
/tmp > ./ldr
[tid=7fe649b0a740] Key created: 0 with value 0xdeadbeef
[New LMID]
[tid=7fe649b0a740] Key created: 0
[tid=7fe649b0a740] Key value: 0xdeadbeef
[Base LMID]
[tid=7fe649b0a740] Key created: 2
[tid=7fe649b0a740] Key value: (nil)

Pain..
I think your example is pretty much the same as mine in https://sourceware.org/bugzilla/show_bug.cgi?id=26955 I "fixed" it by using a LD_PRELOAD that provides its own implementation of pthread_key_create etc. that are not entirely correct (lowish fixed limit of keys, no freeing, implemented using the thread_local keyword) but work for my purpose. Unfortunately dlmopen still seems to be the most realistic solution when one has the misfortune of having a process using Python3 that has a library dependency that requires Python2.
There is a more fundamental problem with pthread_key_create from a dlmopen'ed copy of libpthread: it creates duplicate keys (since it looks in the local copy of __pthread_keys). The key is then used as an index into THREAD_SELF->specific_1stblock etc., which is shared between all the namespaces, with a disastrous result.

Real-life example: https://stackoverflow.com/q/70030529

Trivial repro:

// --- cut ---
#define _GNU_SOURCE
#include <assert.h>
#include <dlfcn.h>
#include <pthread.h>

typedef int (*pthread_key_create_t)(pthread_key_t *, void (*)(void *));

int main() {
    pthread_key_t k1, k2;
    int rc;

    rc = pthread_key_create(&k1, NULL);

    void *h = dlmopen(LM_ID_NEWLM, "libpthread.so.0", RTLD_LAZY);
    assert(h != NULL);

    pthread_key_create_t fn =
        (pthread_key_create_t) dlsym(h, "pthread_key_create");
    assert(fn != NULL);

    rc = fn(&k2, NULL);
    assert(rc == 0);
    assert(k2 != k1);
    return 0;
}
// --- cut ---

gcc -g t.c -pthread -ldl -Wall -Wextra
./a.out
a.out: t.c:21: main: Assertion `k2 != k1' failed.
Aborted
*** Bug 32135 has been marked as a duplicate of this bug. ***