Bug 17707 - nptl_db: terminated and joined threads
Summary: nptl_db: terminated and joined threads
Status: NEW
Alias: None
Product: glibc
Classification: Unclassified
Component: nptl
Version: unspecified
Importance: P2 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2014-12-12 18:04 UTC by Pedro Alves
Modified: 2014-12-18 08:50 UTC

See Also:
Host:
Target:
Build:
Last reconfirmed:
fweimer: security-


Description Pedro Alves 2014-12-12 18:04:06 UTC
I wrote a GDB test that attaches to a program that constantly spawns
short-lived threads, which exposed several issues.  This is one of
them.

GDB sometimes prints out a warning like:

 ...
 [New LWP 20700]
 warning: unable to open /proc file '/proc/-1/status'
 [New LWP 20850]
 [New LWP 21019]
 ...

That happens because when a thread exits, and is joined, glibc does:

nptl/pthread_join.c:
pthread_join ()
{
...
  if (__glibc_likely (result == 0))
    {
      /* We mark the thread as terminated and as joined.  */
      pd->tid = -1;
...
      /* Free the TCB.  */
      __free_tcb (pd);
    }

So if we stop the inferior at just the right time, and list threads
with td_ta_thr_iter / td_thr_get_info, we can find threads with kernel
thread ID -1 (td_thrinfo_t.ti_lid == -1).

Unfortunately, td_thrinfo_t.ti_state claims the thread is TD_THR_ACTIVE.
Turns out the set of states td_thr_get_info returns isn't very
complete:

 td_thr_get_info ()
 {
   if ((((int) (uintptr_t) cancelhandling) & EXITING_BITMASK) == 0)
     /* XXX For now there is no way to get more information.  */
     infop->ti_state = TD_THR_ACTIVE;
   else if ((((int) (uintptr_t) cancelhandling) & TERMINATED_BITMASK) == 0)
     infop->ti_state = TD_THR_ZOMBIE;
   else
     infop->ti_state = TD_THR_UNKNOWN;
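
The three-way decode above can be exercised in isolation.  What follows is a minimal sketch, not glibc code: the SKETCH_* names and the bit values are placeholders chosen for illustration (the real masks are defined in glibc's internal nptl headers), and the enum only stands in for the td_thr_state_e values named above.

```c
/* Sketch only: placeholder masks, NOT glibc's actual definitions.  */
#define SKETCH_EXITING_BITMASK    0x10
#define SKETCH_TERMINATED_BITMASK 0x20

/* Stand-ins for TD_THR_ACTIVE, TD_THR_ZOMBIE and TD_THR_UNKNOWN.  */
enum sketch_state { SKETCH_ACTIVE, SKETCH_ZOMBIE, SKETCH_UNKNOWN };

/* Mirrors the if / else if / else chain quoted from td_thr_get_info.  */
static enum sketch_state
sketch_decode_state (int cancelhandling)
{
  if ((cancelhandling & SKETCH_EXITING_BITMASK) == 0)
    return SKETCH_ACTIVE;    /* Not exiting: reported as active.  */
  else if ((cancelhandling & SKETCH_TERMINATED_BITMASK) == 0)
    return SKETCH_ZOMBIE;    /* Exiting but not yet terminated.  */
  else
    return SKETCH_UNKNOWN;   /* Exiting and terminated.  */
}
```

Note that any value with the exiting bit clear decodes as active, whatever else is known about the thread.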
Comment 1 Pedro Alves 2014-12-12 18:06:40 UTC
I'll add a special case to GDB: ignore threads with ti_lid == -1.
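
A rough sketch of what such a special case could look like follows.  This is not GDB's actual code: sketch_lwpid_t and both helper names are hypothetical, and the array stands in for the ti_lid values a debugger would collect through td_ta_thr_iter / td_thr_get_info.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the type of td_thrinfo_t.ti_lid.  */
typedef int sketch_lwpid_t;

/* A thread whose ti_lid is -1 has no usable kernel thread ID.  */
static bool
sketch_lid_is_usable (sketch_lwpid_t ti_lid)
{
  return ti_lid != -1;
}

/* Count the entries a debugger would keep after applying the filter.  */
static size_t
sketch_count_usable (const sketch_lwpid_t *lids, size_t n)
{
  size_t kept = 0;
  for (size_t i = 0; i < n; i++)
    if (sketch_lid_is_usable (lids[i]))
      kept++;
  return kept;
}
```

With the LWP IDs from the log above plus one -1 entry, the filter keeps three of the four threads and never tries to open /proc/-1/status.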
Comment 2 Pedro Alves 2014-12-16 14:41:52 UTC
I've investigated this some more.

I noticed that the thread's state is not actually TD_THR_ACTIVE in the window just after the thread is joined and before it is removed from the thread list, that is, at this point in the code I pasted before:

nptl/pthread_join.c:
pthread_join ()
{
...
  if (__glibc_likely (result == 0))
    {
      /* We mark the thread as terminated and as joined.  */
      pd->tid = -1;
...
      /* Free the TCB.  */
      __free_tcb (pd);
    }


But, I _am_ seeing TD_THR_ACTIVE threads with pd->tid == -1.
Turns out that nothing in __free_tcb clears pd->tid.  So later on, when a new thread reuses the old thread's tcb/stack, the new thread starts out with tid == -1 (left over from the old thread) until the kernel actually starts the new clone and fills in tid (CLONE_CHILD_SETTID), and it's _that_ thread that has TD_THR_ACTIVE state.  I don't think adding a new state, for a thread that is already in the thread list but doesn't have a kernel clone associated yet, would help here, as a debugger can always attach between glibc changing the thread state and the kernel filling in the clone's tid.
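
The sequence described above can be modelled in a few lines.  This is a toy sketch under loud assumptions, not glibc's implementation: struct sketch_pd, the one-slot sketch_cache and all function names are hypothetical, and the "kernel" step is just an assignment standing in for CLONE_CHILD_SETTID.

```c
#include <stddef.h>

/* Hypothetical miniature of a thread descriptor; only tid matters here.  */
struct sketch_pd { int tid; };

/* One-slot cache standing in for the reusable tcb/stack cache.  */
static struct sketch_pd *sketch_cache;

/* Models the join path: mark the thread as joined, then hand the
   descriptor back for reuse WITHOUT resetting tid.  */
static void
sketch_join (struct sketch_pd *pd)
{
  pd->tid = -1;
  sketch_cache = pd;
}

/* Models thread creation up to, but not including, the point where the
   kernel writes the new tid.  A cached descriptor keeps its stale tid;
   a fresh one starts out with tid 0.  */
static struct sketch_pd *
sketch_create_before_clone (struct sketch_pd *fresh)
{
  struct sketch_pd *pd = sketch_cache != NULL ? sketch_cache : fresh;
  sketch_cache = NULL;
  if (pd == fresh)
    pd->tid = 0;
  return pd;
}

/* Stand-in for the kernel filling in the tid of the new clone.  */
static void
sketch_kernel_sets_tid (struct sketch_pd *pd, int tid)
{
  pd->tid = tid;
}
```

A debugger that samples the descriptor between sketch_create_before_clone and sketch_kernel_sets_tid sees tid == -1 on a thread that is otherwise live.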

This made me wonder what happens if a detached thread's tcb/stack is reused.  Or, if a new stack is allocated for a new thread, instead of reused, and gdb lists threads before the kernel spawns the new clone.  In that case, the thread's tid field starts out as 0.  So I thought that, just as GDB can see threads with tid=-1, it should find them with tid=0 as well.  But it turns out it doesn't, because nptl_db/td_thr_get_info.c:td_thr_get_info has this:

  /* Initialization which are the same in both cases.  */
  infop->ti_ta_p = th->th_ta_p;
  infop->ti_lid = tid == 0 ? ps_getpid (th->th_ta_p->ph) : (uintptr_t) tid;
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  infop->ti_traceme = report_events != 0;

Eh.  So ti_lid (same as pd->tid inside the inferior) is never returned as zero.  Instead, for threads that are just being created, GDB is told that their kernel thread ID is the overall thread group id.  But this is wrong.  This may well confuse GDB if it decides to refresh its own thread state cache (given NPTL's 1:1 model, gdb only keeps track of threads by their kernel ID...)

(I'm guessing that the intent here was that tid == 0 indicates that this is the main thread and the pthread library isn't fully initialized yet, and so the tgid would be correct.)
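
To make the effect of that expression concrete, here is a small sketch of the same mapping.  It is only an illustration: sketch_ti_lid is a hypothetical name, and its pid parameter stands in for the value ps_getpid would return.

```c
/* Mirrors the quoted expression
     tid == 0 ? ps_getpid (...) : tid
   with pid standing in for the result of ps_getpid.  */
static long
sketch_ti_lid (long tid, long pid)
{
  return tid == 0 ? pid : tid;
}
```

Under these assumptions, a thread that is still being created (tid 0) is reported with the process ID, a reused descriptor with a stale tid is reported as -1, and any other tid passes through unchanged.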