31069 – Zombie leader detection racy

Status:	NEW

Alias:	None

Product:	gdb
Classification:	Unclassified
Component:	gdb
Version:	unknown

Importance:	P2 normal
Target Milestone:	---
Assignee:	Not yet assigned to anyone

URL:
Keywords:

Depends on:
Blocks:

Reported:	2023-11-15 17:51 UTC by Pedro Alves
Modified:	2023-11-15 18:06 UTC
CC List:	0 users

See Also:
Host:
Target:
Build:
Last reconfirmed:

Description Pedro Alves 2023-11-15 17:51:30 UTC

Simon noticed that gdb.threads/threads-after-exec.exp was racy.  You can consistently reproduce it (at git hash 319b460545dc79280e2904dcc280057cf71fb753), with:

  $ taskset -c 0 make check TESTS="gdb.threads/threads-after-exec.exp"

This is yet another case of zombie leader detection making things a bit fuzzy.

In the passing case, we have:

 continue
 Continuing.
 [New Thread 0x7ffff7bff640 (LWP 603183)]
 [Thread 0x7ffff7bff640 (LWP 603183) exited]
 process 603180 is executing new program: .../gdb.threads/threads-after-exec/threads-after-exec

While in the failing case, we have (note remarks on the rhs):

 continue
 Continuing.
 [New Thread 0x7ffff7bff640 (LWP 600205)]
 [Thread 0x7ffff7f95740 (LWP 600202) exited]   <<< gdb deletes leader thread, thread 1.
 [New LWP 600202]                              <<< gdb adds it back -- this is now thread 3.
 [Thread 0x7ffff7bff640 (LWP 600205) exited]
 process 600202 is executing new program: .../threads-after-exec/threads-after-exec
 [Switching to process 600202]
 Thread 3 "threads-after-e" hit Catchpoint 2 (exec'd .../gdb.threads/threads-after-exec/threads-after-exec), 0x00007ffff7fe3290
  in _start () from /lib64/ld-linux-x86-64.so.2

The testcase only has two threads, yet GDB presented the exec for thread 3.  This is GDB deleting the leader (the backend detected it was a zombie, due to the exec), and then adding it back when it saw the exec event.  The testcase doesn't expect the thread that remains after the exec to be anything other than thread 1.
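
To make the renumbering concrete, here is a toy model of the two event orders shown in the logs above.  It is only a sketch, not GDB's code: the class ToyThreadList and both functions are hypothetical, and it assumes thread numbers are handed out in increasing order and never reused within a run, which is what the logs suggest.

```python
from dataclasses import dataclass, field


@dataclass
class ToyThreadList:
    """Toy thread list: numbers only ever go up (assumption, see lead-in)."""

    next_num: int = 1
    by_lwp: dict = field(default_factory=dict)  # LWP id -> thread number

    def add(self, lwp: int) -> int:
        num = self.next_num
        self.next_num += 1
        self.by_lwp[lwp] = num
        return num

    def delete(self, lwp: int) -> None:
        del self.by_lwp[lwp]


def run_passing_sequence() -> int:
    """Worker exit is seen first; the leader is never deleted."""
    threads = ToyThreadList()
    leader, worker = 603180, 603183
    threads.add(leader)            # thread 1
    threads.add(worker)            # thread 2
    threads.delete(worker)         # [Thread ... (LWP 603183) exited]
    return threads.by_lwp[leader]  # exec is reported for this thread


def run_failing_sequence() -> int:
    """Zombie leader is detected before the exec event is seen."""
    threads = ToyThreadList()
    leader, worker = 600202, 600205
    threads.add(leader)            # thread 1
    threads.add(worker)            # thread 2
    threads.delete(leader)         # [Thread ... (LWP 600202) exited]
    num = threads.add(leader)      # [New LWP 600202], gets a fresh number
    threads.delete(worker)         # [Thread ... (LWP 600205) exited]
    return num                     # exec is reported for this thread


if __name__ == "__main__":
    print("passing case: exec reported for thread", run_passing_sequence())
    print("failing case: exec reported for thread", run_failing_sequence())
```

The same LWP id ends up with a different thread number purely because of the order in which the two events were observed.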

I'm not sure there's anything we can easily do on the gdb side.  Recreating the leader thread is one option, but I'm not fully sure of the consequences; e.g., the previous thread 1 will probably still exist in the thread list as THREAD_EXITED, if it was the selected thread.

Maybe we can make use of PTRACE_O_TRACEEXIT / PTRACE_EVENT_EXIT, and model a "zombie" state in the core, so if the leader exits, we keep listing it, but GDB wouldn't try to stop that thread or read its registers.  After an exec, the zombie thread would go back to being a normal thread.  The next question would be how to model this in the remote protocol.
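
A rough sketch of how such a "zombie" state could behave, purely as an illustration of the idea and not a design for GDB: ToyCore, State, and every method name below are hypothetical, and the ptrace event delivery itself is not modelled.

```python
from enum import Enum


class State(Enum):
    NORMAL = "normal"
    ZOMBIE = "zombie"  # hypothetical: still listed, never stopped or read


class ToyCore:
    """Toy core that keeps an exited leader around instead of deleting it."""

    def __init__(self):
        self.next_num = 1
        self.threads = {}  # LWP id -> [thread number, State]

    def add(self, lwp):
        self.threads[lwp] = [self.next_num, State.NORMAL]
        self.next_num += 1

    def on_exit_event(self, lwp, is_leader):
        if is_leader:
            # Keep the entry (and its number); just mark it as a zombie.
            self.threads[lwp][1] = State.ZOMBIE
        else:
            del self.threads[lwp]

    def can_stop_or_read_registers(self, lwp):
        return self.threads[lwp][1] is State.NORMAL

    def on_exec(self, leader_lwp):
        # After an exec only the leader id survives; drop everything else
        # and turn the zombie back into a normal thread.
        for lwp in list(self.threads):
            if lwp != leader_lwp:
                del self.threads[lwp]
        self.threads[leader_lwp][1] = State.NORMAL
        return self.threads[leader_lwp][0]


def demo():
    core = ToyCore()
    leader, worker = 600202, 600205
    core.add(leader)                               # thread 1
    core.add(worker)                               # thread 2
    core.on_exit_event(leader, is_leader=True)     # leader exit reported
    usable_while_zombie = core.can_stop_or_read_registers(leader)
    num_after_exec = core.on_exec(leader)
    usable_after_exec = core.can_stop_or_read_registers(leader)
    return num_after_exec, usable_while_zombie, usable_after_exec


if __name__ == "__main__":
    print(demo())
```

In this model the thread that reports the exec keeps number 1, because the leader entry is never removed from the list.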

Comment 1 Sourceware Commits 2023-11-15 18:06:11 UTC

The master branch has been updated by Pedro Alves <palves@sourceware.org>:

https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;h=d2eca84d73a66cf93acbf14522efc835e4446f57

commit d2eca84d73a66cf93acbf14522efc835e4446f57
Author: Pedro Alves <pedro@palves.net>
Date:   Tue Nov 14 11:47:15 2023 +0000

    Fix gdb.threads/threads-after-exec.exp race
    
    Simon noticed that gdb.threads/threads-after-exec.exp was racy.  You
    can consistently reproduce it (at git hash
    319b460545dc79280e2904dcc280057cf71fb753), with:
    
      $ taskset -c 0 make check TESTS="gdb.threads/threads-after-exec.exp"
    
    gdb.log shows:
    
      (...)
      Thread 3 "threads-after-e" hit Catchpoint 2 (exec'd .../gdb.threads/threads-after-exec/threads-after-exec), 0x00007ffff7fe3290
       in _start () from /lib64/ld-linux-x86-64.so.2
      (gdb) PASS: gdb.threads/threads-after-exec.exp: continue until exec
      info threads
        Id   Target Id                         Frame
      * 3    process 1443269 "threads-after-e" 0x00007ffff7fe3290 in _start () from /lib64/ld-linux-x86-64.so.2
      (gdb) FAIL: gdb.threads/threads-after-exec.exp: info threads
      (...)
      maint info linux-lwps
      LWP Ptid          Thread ID
      1443269.1443269.0 1.3
      (gdb) FAIL: gdb.threads/threads-after-exec.exp: maint info linux-lwps
    
    The FAILs happen because the .exp file expects that after the exec,
    the only thread has GDB thread number 1, but it has instead 3.
    
    This is yet another case of zombie leader detection making things a
    bit fuzzy.
    
    In the passing case, we have:
    
     continue
     Continuing.
     [New Thread 0x7ffff7bff640 (LWP 603183)]
     [Thread 0x7ffff7bff640 (LWP 603183) exited]
     process 603180 is executing new program: .../gdb.threads/threads-after-exec/threads-after-exec
    
    While in the failing case, we have (note remarks on the rhs):
    
     continue
     Continuing.
     [New Thread 0x7ffff7bff640 (LWP 600205)]
     [Thread 0x7ffff7f95740 (LWP 600202) exited]   <<< gdb deletes leader thread, thread 1.
     [New LWP 600202]                              <<< gdb adds it back -- this is now thread 3.
     [Thread 0x7ffff7bff640 (LWP 600205) exited]
     process 600202 is executing new program: .../threads-after-exec/threads-after-exec
    
    The testcase only has two threads, yet GDB presented the exec for
    thread 3.  This is GDB deleting the leader (the backend detected it
    was zombie, due to the exec), and then adding the leader back when it
    saw the exec event.
    
    I've recorded some thoughts about this in PR gdb/31069.
    
    For now, this commit just makes the testcase cope with the non-one
    thread number, as the number is not important for what this test is
    exercising.
    
    Bug: https://sourceware.org/bugzilla/show_bug.cgi?id=31069
    Change-Id: Id80b5c73f09c9e0005efeb494cca5d066ac3bbae