17705 – nptl_db: stale thread create/death events if debugger detaches

Bug 17705 - nptl_db: stale thread create/death events if debugger detaches

Summary: nptl_db: stale thread create/death events if debugger detaches

Status:	NEW

Alias:	None

Product:	glibc
Classification:	Unclassified
Component:	nptl (show other bugs)
Version:	unspecified

Importance:	P2 normal
Target Milestone:	---
Assignee:	Not yet assigned to anyone

URL:
Keywords:

Depends on:
Blocks:

Reported:	2014-12-12 17:28 UTC by Pedro Alves
Modified:	2014-12-18 08:50 UTC (History)
CC List:	1 user (show)

See Also:
Host:
Target:
Build:
Last reconfirmed:

Flags:	fweimer: security-

Attachments
WIP fix (1.50 KB, patch) 2014-12-12 17:44 UTC, Pedro Alves	Details \| Diff
View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Pedro Alves 2014-12-12 17:28:20 UTC

I wrote a GDB test that attaches to a program that is constantly/quickly spawning short-lived threads.  The test makes GDB attach, have threads hit a breakpoint, detach, and then reattaches, rinse/repeat.

Sometimes, the test fails with a surprising libthread_db error:

 (gdb) continue
 Continuing.
 Cannot get thread event message: debugger service failed
 (gdb)

Investigation showed that that test exposes a libthread_db issue.

If we detach just after a thread had decided that it needs to report an event to the debugger (thread creation or death), and before the event is actually queued (in __nptl_last_event), and the event function (__nptl_create_event or __nptl_death_event) is called, the debugger won't be around to consume the event, but the thread will still be left dangling in the __nptl_last_event event queue/list.

__pthread_create_2_1():
...
  /* Start the thread.  */
  if (__glibc_unlikely (report_thread_creation (pd)))
    {
...
      retval = create_thread (pd, iattr, true, STACK_VARIABLES_ARGS,
			      &thread_ran);
      if (retval == 0)
	{
...
          pd->eventbuf.eventnum = TD_CREATE;
          pd->eventbuf.eventdata = pd;

          /* Enqueue the descriptor.  */
          do
            pd->nextevent = __nptl_last_event;
          while (atomic_compare_and_exchange_bool_acq (&__nptl_last_event,
                                                       pd, pd->nextevent)
                                                     != 0);

          /* Now call the function which signals the event.  */
          __nptl_create_event ();
...


That is, if the debugger detaches after the report_thread_creation check and before the __nptl_create_event call.

Later when the thread dies, if it has a glibc managed stack, and its stack is reused, its event buffer is cleared, but, __nptl_last_event (or a thread in the chain that itself is __nptl_last_event ultimately) still has a stale pointer to to it.

So if another GDB reattaches, when any thread pushes another event, the new GDB fetches the events out of libthread_db, with td_ta_event_getmsg.  Now td_ta_event_getmsg finds a stale pointer to the resumed thread stack in the event list, with no event, which fails with TD_DBERR:

td_err_e
td_ta_event_getmsg (const td_thragent_t *ta_arg, td_event_msg_t *msg)
{
...
  /* If the structure is on the list there better be an event recorded.  */
  if ((int) (uintptr_t) eventnum == TD_EVENT_NONE)
    return TD_DBERR;
...

And thus GDB's "debugger service failed" error message.

If the thread had been allocated on a user provided stack, then the failures modes will even be more "interesting", possibly even corrupting the inferior, as that TD_EVENT_NONE check (and a similar one in td_thr_event_getmsg) might well be fooled, for reading from a dangling pointer.

Comment 1 Pedro Alves 2014-12-12 17:44:28 UTC

Created attachment 8010 [details]
WIP fix

WIP fix.  The main idea is that whenever a __nptl_*_event event function is called, the debugger is expected to consume the event (see nptl_db/td_ta_event_getmsg.c).  If the event wasn't consumed, then it must be the debugger is either gone, or misbehaved.  The death event is to reason about -- clearly we shouldn't hang on the event forever, as the thread is about to be wiped out.  So right after calling the event function, we remove the thread from the event queue.  The complication is that a new debugger may manage to reattach just while we're doing that, and the code must be written in a way that works without any locking/synchronization between the debugger and the inferior.

Comment 2 Pedro Alves 2014-12-12 17:50:17 UTC

I'm not going to be working on this fix further, at least for now.

GDB doesn't really need to be using libthread_db's thread creation/destruction events nowadays when the kernel supports PTRACE_EVENT_CLONE, which it has for a long while.  I'll work on that instead.