Bug 22745

Summary: _nptl_setxid can loop forever if a dlmopen namespace tries to initialise pthreads after the main namespace does
Product: glibc Reporter: Vivek Das Mohapatra <vivek>
Component: dynamic-linkAssignee: Florian Weimer <fweimer>
Status: RESOLVED FIXED    
Severity: normal CC: adhemerval.zanella, carlos, fweimer
Priority: P2 Flags: fweimer: security-
Version: 2.24   
Target Milestone: 2.34   
Host: Target:
Build: Last reconfirmed:
Attachments: Test case - build two executables. One uses dlmopen and triggers the lock up, the other uses dlopen and does not.
Test case - build two executables. One uses dlmopen and triggers the lock up, the other uses dlopen and does not.
Test case - loop-forever has several modes of operation which can trigger (or not) the deadlock

Description Vivek Das Mohapatra 2018-01-24 21:45:26 UTC
Created attachment 10759 [details]
Test case - build two executables. One uses dlmopen and triggers the lock up, the other uses dlopen and does not.

Stumbled open this while testing pulseaudio in conjunction with
dlmopen: pulseaudio seems to lock up very soon after it starts.

A bit of digging with strace and gdb shows that when it locks up
it does so inside setresuid. A bit more digging indicates that the
code is infinite looping here:

__nptl_setxid (cmdp=0xffffd9d8) at allocatestack.c:1105
+list
1103    
1104      /* Now the list with threads using user-allocated stacks.  */
1105      list_for_each (runp, &__stack_user)
1106        {
1107          struct pthread *t = list_entry (runp, struct pthread, list);
1108          if (t == self)
1109            continue;
1110    
1111          setxid_mark_thread (cmdp, t);
1112        }

For some reason, list_for_each never terminates.

If I disable the dlmopen code path then the following holds at that
point in the code:

Breakpoint 6, __nptl_setxid (cmdp=0xffffd9e8) at allocatestack.c:1105
1105	  list_for_each (runp, &__stack_user)
+bt
#0  __nptl_setxid (cmdp=0xffffd9e8) at allocatestack.c:1105
#1  0xf7b96162 in __GI___setresuid (ruid=1000, euid=1000, suid=1000)
      at ../sysdeps/unix/sysv/linux/i386/setresuid.c:29
#2  0x5655b7f0 in pa_drop_root ()
#3  0x56558a6e in main ()

Digging into __stack_user:

+p __stack_user
$1 = {next = 0xf73a48a0, prev = 0xf73a48a0}

+p &__stack_user
$2 = (list_t *) 0xf7d1d1a4 <__stack_user>

+p (&__stack_user)->next
$3 = (struct list_head *) 0xf73a48a0

+p (&__stack_user)->next->next
$4 = (struct list_head *) 0xf7d1d1a4 <__stack_user>

+p (&__stack_user)->next->next->next
$5 = (struct list_head *) 0xf73a48a0

We find a circular linked list, which contains a pointer to __stack_user.
Since list_for_each is invoked as list_for_each(…, &__stack_user),
this means the for loop it implements will terminate, allowing setresuid
to proceed.

// ============================================================================
Note: The definition of list_for_each is this:

# define list_for_each(pos, head) \
  for (pos = (head)->next; pos != (head); pos = pos->next)
// ============================================================================

Now let's examine the same case with the dlmopen call back in place:

Breakpoint 6, __nptl_setxid (cmdp=0xffffd9d8) at allocatestack.c:1105
1105	  list_for_each (runp, &__stack_user)
 ⋮
+p __stack_user
$1 = {next = 0xf76eeb60, prev = 0xf76eeb60}

+p &__stack_user
$2 = (list_t *) 0xf7d8f1a4 <__stack_user>

+p (&__stack_user)->next
$3 = (struct list_head *) 0xf76eeb60

+p (&__stack_user)->next->next
$4 = (struct list_head *) 0xf71391a4

+p (&__stack_user)->next->next->next
$5 = (struct list_head *) 0xf76eeb60

We can see we have a circular linked list, as before, but it does
_not_ contain the element supplied as the head to list_for_each:
We're going to loop forever.

============================================================================

Next let's try and figure out when/where this happens.
Setting various breakpoints and watches we uncover the following:

+run
Starting program: /usr/bin/pulseaudio --start
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/i386-linux-gnu/libthread_db.so.1".

Breakpoint 1, __pthread_initialize_minimal_internal () at nptl-init.c:290
290	{
+break allocatestack.c:1105
Breakpoint 6 at 0xf7d78b2c: file allocatestack.c, line 1105.
+watch __stack_user
Hardware watchpoint 7: __stack_user
+watch __stack_user.next
Hardware watchpoint 8: __stack_user.next
+cont
Continuing.

Hardware watchpoint 7: __stack_user

Old value = {next = 0x0, prev = 0x0}
New value = {next = 0xf7d8f1a4 <__stack_user>, prev = 0x0}

Hardware watchpoint 8: __stack_user.next

Old value = (struct list_head *) 0x0
New value = (struct list_head *) 0xf7d8f1a4 <__stack_user>
__pthread_initialize_minimal_internal () at nptl-init.c:377
377	  list_add (&pd->list, &__stack_user);
+cont
Continuing.

Hardware watchpoint 7: __stack_user

Old value = {next = 0xf7d8f1a4 <__stack_user>, prev = 0x0}
New value = {next = 0xf7d8f1a4 <__stack_user>, prev = 0xf76eeb60}
list_add (head=<optimized out>, newp=0xf76eeb60) at ../include/list.h:64
64	  head->next = newp;
+cont
Continuing.

Hardware watchpoint 7: __stack_user

Old value = {next = 0xf7d8f1a4 <__stack_user>, prev = 0xf76eeb60}
New value = {next = 0xf76eeb60, prev = 0xf76eeb60}

Hardware watchpoint 8: __stack_user.next

Old value = (struct list_head *) 0xf7d8f1a4 <__stack_user>
New value = (struct list_head *) 0xf76eeb60
__pthread_initialize_minimal_internal () at nptl-init.c:381
381	  THREAD_SETMEM (pd, report_events, __nptl_initial_report_events);
+cont
Continuing.

Breakpoint 2, __pthread_init_static_tls (map=0x5657e040) at allocatestack.c:1210
1210	{

// ============================================================================
// At this point we step to the end of __pthread_init_static_tls and set
// an extra watch point on the address currently holding &__stack_user
// ============================================================================

+p __stack_user.next
$1 = (struct list_head *) 0xf76eeb60

+p __stack_user.next->next
$2 = (struct list_head *) 0xf7d8f1a4 <__stack_user>  ← STILL GOOD

+watch __stack_user.next->next
Hardware watchpoint 9: __stack_user.next->next
+s

// And here it is: 
Hardware watchpoint 9: __stack_user.next->next

Old value = (struct list_head *) 0xf7d8f1a4 <__stack_user>
New value = (struct list_head *) 0xf71391a4 ← >>>>> GONE WRONG HERE <<<<<
0xf7121c83 in ?? ()

// Hm, an unknown address scribbling on __stack_user.

+call calloc(1, sizeof(Dl_info))
$3 = (void *) 0x56574d18
+call dladdr(0xf7121c83, $3)
$4 = 1

+p *(Dl_info *)$3
$5 = {dli_fname = 0x565755b8 "/lib/i386-linux-gnu/libpthread.so.0",
      dli_fbase = 0xf711d000,
      dli_sname = 0xf711f617 "__pthread_initialize_minimal",
      dli_saddr = 0xf7121be0}

// Well that can't be right, can it? gdb should have figured out the name
// of 0xf7121c83, not said ?? - let's work out the address in the other
// direction:

+p __pthread_initialize_minimal
$6 = {<text variable, no debug info>} 0xf7d77be0
      <__pthread_initialize_minimal_internal>

+call dladdr(0xf7d77be0, $3)
$8 = 1

+p *(Dl_info *)$3
$10 = {dli_fname = 0xf7fd4d70 "/lib/i386-linux-gnu/libpthread.so.0",
       dli_fbase = 0xf7d73000,
       dli_sname = 0xf7d75617 "__pthread_initialize_minimal", 
       dli_saddr = 0xf7d77be0 <__pthread_initialize_minimal_internal>}

// ============================================================================

Aha! Same DSO, different base address. So the ?? instance of
__pthread_initialize_minimal_internal was from the _other_ copy of libc,
inside the dlmopen namespace - the one gdb doesn't know how to inspect.

PS: for completeness, I went back and followed the __stack_user linked list
at the "GONE WRONG HERE" point, just to be sure:

+p __stack_user
$1 = {next = 0xf76eeb60, prev = 0xf76eeb60}

+p __stack_user.next
$2 = (struct list_head *) 0xf76eeb60

+p __stack_user.next->next
$3 = (struct list_head *) 0xf71391a4

+p __stack_user.next->next->next
$4 = (struct list_head *) 0xf71391a4

+p __stack_user.next->next->next->next
$5 = (struct list_head *) 0xf71391a4

So the linked list definitely doesn't contain &__stack_user any more.

// ============================================================================

Apologies for the exegesis: It seems to me that the copy of libc in the
private namespace has somehow managed to scribble on the linked list
pointed to by __stack_user, overwriting a key address.

Is my analysis correct? Is there something I could or should have done to
avoid this?

A while ago (https://sourceware.org/ml/libc-help/2018-01/msg00002.html)
I suggested a dlmopen flag RTLD_UNIQUE or similar which would cause the
existing mapping of the target library in the main namespace/link-map to be
re-used instead of creating a new one: I believe this would prevent this
problem (and others detailed in that message) from occurring - any thoughts?
Comment 1 Carlos O'Donell 2018-01-24 22:17:34 UTC
dlmopen needs fixing to keep the implementation in a consistent state with regards to threads, and that probably means keeping the implementation set of libraries the same across all dlmopen namespaces e.g. the same ld.so, libc.so, libpthread.so, etc.
Comment 2 Florian Weimer 2018-01-25 10:04:28 UTC
It is not libc, but libpthread, which contains __stack_user, but your analysis is essentially correct.

All data which is directly related to the TCB needs to be shared across dlmopen boundaries, but we currently do not implement that.

Blanket sharing of all of libc and its related libraries is not what everyone wants.  LD_AUDIT modules tend to rely on a separate libc, for instance.
Comment 3 Vivek Das Mohapatra 2018-01-25 14:58:53 UTC
My initial plan is to implement, if I can, the RTLD_UNIQUE flag I proposed
which would make the targeted namespace re-use the mapping already in the
main link map - I think that's got to be the basis for whatever the final
behaviour and semantics turn out to be.

Before I sink a lot of time into this, does that sound reasonable?
Comment 4 Carlos O'Donell 2018-01-26 00:20:59 UTC
(In reply to Vivek Das Mohapatra from comment #3)
> My initial plan is to implement, if I can, the RTLD_UNIQUE flag I proposed
> which would make the targeted namespace re-use the mapping already in the
> main link map - I think that's got to be the basis for whatever the final
> behaviour and semantics turn out to be.
> 
> Before I sink a lot of time into this, does that sound reasonable?

No, and I wrote it out in a wiki page with some notes about why.

We should try to flesh out partial isolation first, but must keep full isolation working for LD_AUDIT.

https://sourceware.org/glibc/wiki/LinkerNamespaces
Comment 5 Vivek Das Mohapatra 2018-01-26 01:15:01 UTC
(In reply to Carlos O'Donell from comment #4)
> We should try to flesh out partial isolation first, but must keep full
> isolation working for LD_AUDIT.
> 
> https://sourceware.org/glibc/wiki/LinkerNamespaces

I take your point(s) about the limitations of RTLD_UNIQUE - my thought was
that to start with I could use it for libc/libpthread (by using the flag
explicitly in libcapsule) which don't have further dependencies - everything
else I need to share is/would be handled by creating a shim library which 
managed data sharing at a higher level.

The default behaviour would still be total isolation, as happens now.
Comment 6 Carlos O'Donell 2018-01-26 05:03:02 UTC
(In reply to Vivek Das Mohapatra from comment #5)
> (In reply to Carlos O'Donell from comment #4)
> > We should try to flesh out partial isolation first, but must keep full
> > isolation working for LD_AUDIT.
> > 
> > https://sourceware.org/glibc/wiki/LinkerNamespaces
> 
> I take your point(s) about the limitations of RTLD_UNIQUE - my thought was
> that to start with I could use it for libc/libpthread (by using the flag
> explicitly in libcapsule) which don't have further dependencies - everything
> else I need to share is/would be handled by creating a shim library which 
> managed data sharing at a higher level.
> 
> The default behaviour would still be total isolation, as happens now.

Consider the case where a new library in a new namespace dlopen's libpthread to create new threads. This dlopen will not be made with RTLD_UNIQUE and these new threads will be unknown to the base namespace. Instead the base namespace should load libpthread and make it available in the non-base namespace. This way the base namespace is aware that there are threads now, and that seteuid must iterate over them properly. I expect we need a SONAME list and all load's from that SONAME list must first happen in the base namespace, and then be linked into the non-base namespace. The SONAME list is minimally libc.so.6, libpthread.so.0, libdl.so.2, libgcc_s.so.1 (from gcc for unwinding), with maybe libm.so.6 for future proofing, basically anything that constitutes the implementation. This filtered loading should be turned off for total isolation namespaces required for LD_AUDIT using an internal dlopen flag.
Comment 7 Vivek Das Mohapatra 2018-01-30 17:45:59 UTC
Looking into this approach: Will probably be a little while before I have something concrete. I will make sure not to trample on LD_AUDIT.
Comment 8 Vivek Das Mohapatra 2018-02-09 19:34:38 UTC
So, staring at this code for a while - will the following approach to 
cloning an object into a secondary link map work:

 - allocate a new entry in the same way as _dl_new_object
 - memcpy the contents of the old entry to the new
 - set the next and previous pointers to NULL in the new struct
 - hook the next and prev pointers up to the new link map list

The existing clone operation for ld.so seems to be a special case
which pokes the real pointer into l_real in a freshly allocated
struct, but doesn't care much about the details as the linker can't
depend on anything and nothing relocates symbols from it (I think?)

[I realise I haven't addressed the dont-apply-uniqueness-to-audit
 requirement yet - will look at that once I know how to make it 
 work at all]
Comment 9 Vivek Das Mohapatra 2018-03-07 16:59:34 UTC
Never mind, figured it out. Have a working proof of concept shared-link-map-entry implementation, now to make it audit-safe.
Comment 10 Vivek Das Mohapatra 2018-04-05 14:28:17 UTC
Created attachment 10935 [details]
Test case - build two executables. One uses dlmopen and triggers the lock up, the other uses dlopen and does not.

usage: ./loop-forever
  You may set any or all of the following env vars:

  BLOCK=0 ./loop-forever # don't try to trigger the deadlock
  SHARE=0 ./loop-forever # don't try to use RTLD_SHARED
  DEBUG=1 ./loop-forever # print extra debug messages
Comment 11 Vivek Das Mohapatra 2018-04-12 14:59:20 UTC
RFC proof-of-concept implementation sent to list at:

https://sourceware.org/ml/libc-help/2018-04/threads.html#00004
Comment 12 Adhemerval Zanella 2018-04-16 20:26:03 UTC
I would advise you to post it on libc-alpha [1], since it is the usual
place for patch discussion.

[1] https://sourceware.org/ml/libc-alpha/
Comment 13 Vivek Das Mohapatra 2018-04-24 16:28:05 UTC
Sent to libc-alpha.
Comment 14 Vivek Das Mohapatra 2018-06-06 19:05:03 UTC
Created attachment 11058 [details]
Test case - loop-forever has several modes of operation which can trigger (or not) the deadlock

  usage: ./loop-forever
    You may set any or all of the following env vars:

    BLOCK=0 ./loop-forever # don't try to trigger the deadlock
    SHARE=0 ./loop-forever # don't try to use RTLD_SHARED
    DEBUG=1 ./loop-forever # print extra debug messages
Comment 15 Florian Weimer 2022-04-06 10:25:24 UTC
I believe we fixed this in glibc 2.34 with this commit:

commit 90d7e7e5bd3b0683a27c658388b6515ce950c78e
Author: Florian Weimer <fweimer@redhat.com>
Date:   Wed Apr 21 19:49:51 2021 +0200

    elf: Introduce __tls_init_tp for second-phase TCB initialization
    
    TLS_INIT_TP is processor-specific, so it is not a good place to
    put thread library initialization code (it would have to be repeated
    for all CPUs).  Introduce __tls_init_tp as a separate function,
    to be called immediately after TLS_INIT_TP.  Move the existing
    stack list setup code for NPTL to this function.
    
    Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>

It depends on GL (dl_stack_user), which was introduced by:

commit 1daccf403b1bd86370eb94edca794dc106d02039
Author: Florian Weimer <fweimer@redhat.com>
Date:   Mon Nov 16 19:33:30 2020 +0100

    nptl: Move stack list variables into _rtld_global
    
    Now __thread_gscope_wait (the function behind THREAD_GSCOPE_WAIT,
    formerly __wait_lookup_done) can be implemented directly in ld.so,
    eliminating the unprotected GL (dl_wait_lookup_done) function
    pointer.
    
    Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>

Plus some other changes that eventually eliminate late libpthread initialization. With late initialization completely gone, dlmopen or static dlmopen no longer clobbers the TCB.