[RFC PATCH 3/3] gdb/nat/linux: Fix attaching to process when it has zombie threads

Tue Apr 16 04:48:53 GMT 2024

Hello Pedro,

[ I just now noticed that I replied to you in private last time. Sorry
  about that, I thought I had cc'd the list. ]

Pedro Alves <pedro@palves.net> writes:

> On 2024-03-21 23:11, Thiago Jung Bauermann wrote:
>> When GDB attaches to a multi-threaded process, it calls
>> linux_proc_attach_tgid_threads () to go through all threads found in
>> /proc/PID/task/ and call attach_proc_task_lwp_callback () on each of
>> them.  If it does that twice without the callback reporting that a new
>> thread was found, then it considers that all inferior threads have been
>> found and returns.
>>
>> The problem is that the callback considers any thread that it hasn't
>> attached to yet as new.  This causes problems if the process has one or
>> more zombie threads, because GDB can't attach to it and the loop will
>> always "find" a new thread (the zombie one), and get stuck in an
>> infinite loop.
>>
>> This is easy to trigger (at least on aarch64-linux and powerpc64le-linux)
>> with the gdb.threads/attach-many-short-lived-threads.exp testcase, because
>> its test program constantly creates and finishes joinable threads so the
>> chance of having zombie threads is high.
>
> Hmm.  How about simply not restarting the loop if attach_lwp tries to attach to
> a zombie lwp (and silently fails)?
>
> I.e., in gdb, make attach_proc_task_lwp_callback return false/0 here:
>
>       if (ptrace (PTRACE_ATTACH, lwpid, 0, 0) < 0)
> 	{
> 	  int err = errno;
>
> 	  /* Be quiet if we simply raced with the thread exiting.
> 	     EPERM is returned if the thread's task still exists, and
> 	     is marked as exited or zombie, as well as other
> 	     conditions, so in that case, confirm the status in
> 	     /proc/PID/status.  */
> 	  if (err == ESRCH
> 	      || (err == EPERM && linux_proc_pid_is_gone (lwpid)))
> 	    {
> 	      linux_nat_debug_printf
> 		("Cannot attach to lwp %d: thread is gone (%d: %s)",
> 		 lwpid, err, safe_strerror (err));
>
>               return 0;        <<<< NEW RETURN
> 	    }
>
> Similar thing for gdbserver, of course.

Thank you for the suggestion. I tried doing that, and in fact I attached
a patch with that change in comment #17 of PR 31312 when I was
investigating a fix¹. I called it a workaround because I also had to
increase the number of iterations in linux_proc_attach_tgid_threads from
2 to 100, otherwise GDB gives up on waiting for new inferior threads too
early and the inferior dies with a SIGTRAP because some new unnoticed
thread tripped into the breakpoint.

Because of the need to increase the number of iterations, I didn't
consider it a good solution and went with the approach in this patch
series instead. Now I finally understand why I had to increase the
number of iterations (though I still don't see a way around it other
than what this patch series does):

With the approach in this patch series, even if a new thread becomes
zombie by the time GDB tries to attach to it, it still causes
new_threads_found to be set the first time GDB notices it, and the loop
in linux_proc_attach_tgid_threads starts over.

With the approach above, a new thread that becomes zombie before GDB has
a chance to attach to it never causes the loop to start over, and so it
exits earlier.

I think it's a matter of opinion whether one approach or the other can
be considered the better one.

Even with this patch series, it's not guaranteed that two iterations
without finding new threads is enough to ensure that GDB has found all
threads in the inferior. I left the test running in a loop overnight
with the patch series applied and it failed after about 2500 runs.

--
Thiago

¹ https://sourceware.org/bugzilla/attachment.cgi?id=15405&action=diff