Bug 26754

Summary:	Race condition when resuming threads and one does an exec
Product:	gdb	Reporter:	Simon Marchi <simark>
Component:	gdb	Assignee:	Not yet assigned to anyone <unassigned>
Status:	RESOLVED OBSOLETE
Severity:	normal	CC:	pedro
Priority:	P2
Version:	HEAD
Target Milestone:	---
Host:		Target:
Build:		Last reconfirmed:
Attachments:	Reproducer (c) Reproducer (asm)

Description Simon Marchi 2020-10-19 01:19:24 UTC

I stumbled on this while trying to write a test for when a non-leader thread displace-steps an exec syscall instruction.  The thing to remember here is that on Linux, when a non-leader thread does an exec syscall, all non-main threads are  removed and only the main thread starts executing the new executable  (I don't know what happens in the kernel's data structure exactly, but at least that's how it looks from userspace, so that's the important part).

Things go wrong when GDB tries to resume multiple threads, necessarily one after the others, and one of these threads (a non-leader one) executes an exec syscall before the main thread is resumed.

I'll attach the source for a reproducer.  It can be compiled with:

  $ gcc test.c -g3 -O0 -o test -pthread test_asm.S -fPIE

I run it with

  $ ../gdb -q -nx --data-directory=data-directory ./test -ex "b the_syscall" -ex "b main" -ex r -ex c

and then just "continue" to trigger the problem.  Normally, 

Since it involves a race, different things can happen if you execute the reproducer multiple times.  I'll describe one possible outcome.  By applying this small patch to GDB, you can pretty much guarantee to have this outcome:

diff --git a/gdb/infrun.c b/gdb/infrun.c
index 8ae39a2877b3..450a7a37bc5b 100644
--- a/gdb/infrun.c
+++ b/gdb/infrun.c
@@ -2246,6 +2246,9 @@ do_target_resume (ptid_t resume_ptid, int step, enum gdb_signal sig)
 
   target_commit_resume ();
 
+  for (int i = 0; i < 500; i++)
+    usleep (1000);
+
   if (target_can_async_p ())
     target_async (1);
 }

---

Note that the use of many usleep vs one sleep is because sleep otherwise gets interrupted by incoming SIGCHILD signals.  I'm not sure it's needed but I just wanted to play it safe and really make GDB sleep for a while.

So this is the sequence of events.  Let's assume we have a process with pid 1000 and two threads with tid 1000 (the main one) and 1001 (a user-created one, which will execute the exec).  Both threads are stopped.  Thread 1001 is stopped just before an exec syscall.

1. User does "continue"
2. Since thread 1001 needs to initiate a displaced-step, it is resumed first (the displaced-step is not really at fault here, but having it makes it that we resume this particular thread first, so it helps trigger the issue).
3. The now-resumed thread 1001 does an exec syscall
4. The kernel deletes all non-main threads of process 1000 (so, deletes thread 1001).  It sets up thread 1000 with the new executable.  It sends GDB (the ptracer) a PTRACE_EVENT_EXEC.  The thread is stopped as it will need GDB to continue it before it starts executing the code of the new executable.
5. GDB, still processing the "continue" command, now resumes thread 1000 - which succeeds.

The thing is that the thread 1000 that GDB resumes now isn't the thread it thinks it is resuming.  The thread that GDB thinks it is resuming is the one stopped somewhere in the original executable.  In reality, it resumes the post-exec thread 1000, stopped on the PTRACE_EVENT_EXEC event, about to start execution of the new executable.

So GDB ends up resuming this thread 1000, and all kinds of funny things can happen after that. Normally, on exec, the linux-nat target should report the exec event to the core, which would call follow_exec, which would install breakpoints in the fresh program space, among other things.  None of that is done and the program is resumed, so one visible consequence is that any breakpoint set are not effective.

This is what it looks like when running the reproducer:

---8<---
$ ./gdb -q -nx --data-directory=data-directory ./test -ex "b the_syscall" -ex "b main" -ex r -ex c
Reading symbols from ./test...
Breakpoint 1 at 0x128a: file test_asm.S, line 11.
Breakpoint 2 at 0x11fb: file test.c, line 27.
Starting program: /home/simark/build/binutils-gdb/gdb/test 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/../lib/libthread_db.so.1".

Breakpoint 2, main (argc=1, argv=0x7fffffffe018) at test.c:27
27      {
Continuing.
[New Thread 0x7ffff7da6640 (LWP 2813391)]
[Switching to Thread 0x7ffff7da6640 (LWP 2813391)]

Thread 2 "test" hit Breakpoint 1, the_syscall () at test_asm.S:11
11              syscall
(gdb) c   <--- this is where things start to go wrong
Continuing.
Welcome
[Thread 0x7ffff7da7740 (LWP 2813387) exited]
...hangs...
--->8---

Normally, we should break on "main", but we missed it and the program ended.  I  think that at this point, GDB still believes there's a thread 2813391 executing that hasn't stopped.

Comment 1 Simon Marchi 2020-10-19 01:20:07 UTC

Created attachment 12910 [details]
Reproducer (c)

Comment 2 Simon Marchi 2020-10-19 01:21:30 UTC

Created attachment 12911 [details]
Reproducer (asm)

Comment 3 Pedro Alves 2020-11-19 20:05:50 UTC

Quite a nasty race. I'm not seeing how this can properly fixed without kernel help.

The main problem comes from the fact that when PTRACE_EVENT_EXEC is reported, the PID of the thread that exec has already changed to the tgid. This fact is described here:

https://github.com/strace/strace/blob/master/README-linux-ptrace

~~~
1.x execve under ptrace.

During execve, kernel destroys all other threads in the process, and
resets execve'ing thread tid to tgid (process id). This looks very
confusing to tracers:
~~~

Also from here, in GDB:

https://sourceware.org/git/?p=binutils-gdb.git;a=blob;f=gdb/linux-nat.c;h=f1b2c744bed9be01ff21a74e86d2a45b60d4eb65;hb=HEAD#l167

~~~
Exec events
===========

The case of a thread group (process) with 3 or more threads, and a
thread other than the leader execs is worth detailing:

On an exec, the Linux kernel destroys all threads except the execing
one in the thread group, and resets the execing thread's tid to the
tgid. No exit notification is sent for the execing thread -- from the
ptracer's perspective, it appears as though the execing thread just
vanishes. Until we reap all other threads except the leader and the
execing thread, the leader will be zombie, and the execing thread will
be in `D (disc sleep)' state. As soon as all other threads are
reaped, the execing thread changes its tid to the tgid, and the
previous (zombie) leader vanishes, giving place to the "new"
leader. */
~~~

I tried to think of workarounds, like always resuming the leader thread before any other thread, but that doesn't really work, because there are many scenarios where we can't do that. Like, for example, the program has two threads, thread 1 (leader) and thread 2, and the user only resumes thread 2. And then, after a while, the user decides to resume thread 1, but that coincides exactly when thread 2 exec and changes its tid to the tgid... Boom, GDB ends up resuming the exec'ed thread by mistake, without processing the PTRACE_EVENT_EVENT.

Always using PTRACE_SYSCALL instead of PTRACE_CONTINUE and looking for exec syscall entry doesn't work for working around this, because:
#1, it stops before the exec actually happened, and the exec may fail
#2, would be horribly inefficient as it would stop all threads for
all syscalls, including non-exec ones.
#3, can't replace PTRACE_SINGLESTEP

I think that we would need a new event, similar to PTRACE_EVENT_EXEC but that is reported after all threads in the process are exited (and report their exit) and _before_ the tid of the execing thread is changed. Let's call it PTRACE_EVENT_ALMOST_EXEC. At this point, like with PTRACE_EVENT_EXEC (before other threads are reaped), the previous leader would be in zombie state. The process had already loaded the new address space when PTRACE_EVENT_ALMOST_EXEC is reported. So if GDB happens to try to resume the leader just while some other thread execs, it would fail to ptrace-resume it with ESRCH because it was zombie. GDB would react to PTRACE_EVENT_ALMOST_EXEC like it reacts to PTRACE_EVENT_EXEC -- by loading the new symbols, and installing breakpoints in the new address space. Except it wouldn't destroy the non-leader thread yet -- that would happen on PTRACE_EVENT_EXEC.

Comment 4 Simon Marchi 2020-11-19 20:19:02 UTC

Back when discussing this with Mathieu Desnoyers, he made a prototype patch to add a ptrace event that sounds very much like what you are describing:

https://github.com/compudj/linux-dev/commits/ptrace-event-exec-begin

I haven't taken the time to try it yet, but our discussion lead to pretty much the same conclusion as yours.

The important thing is that the event happens at a point where we know the exec is going to go through, at a point of no return (unlike the syscall entry with PTRACE_SYSCALL).

Comment 5 Simon Marchi 2020-11-19 21:32:45 UTC

Just a precision about this:

> I think that we would need a new event, similar to PTRACE_EVENT_EXEC but
> that is reported after all threads in the process are exited (and report
> their exit) and _before_ the tid of the execing thread is changed.  Let's
> call it PTRACE_EVENT_ALMOST_EXEC.  At this point, like with
> PTRACE_EVENT_EXEC (before other threads are reaped), the previous leader
> would be in zombie state.  The process had already loaded the new address
> space when PTRACE_EVENT_ALMOST_EXEC is reported.  So if GDB happens to try
> to resume the leader just while some other thread execs, it would fail to
> ptrace-resume it with ESRCH because it was zombie.  GDB would react to
> PTRACE_EVENT_ALMOST_EXEC like it reacts to PTRACE_EVENT_EXEC -- by loading
> the new symbols, and installing breakpoints in the new address space. 
> Except it wouldn't destroy the non-leader thread yet -- that would happen on
> PTRACE_EVENT_EXEC.

If we look at fs/exec.c in the kernel, around function "de_thread":

https://github.com/torvalds/linux/blob/3494d58865ad4a47611dbb427b214cc5227fa5eb/fs/exec.c

It looks like other threads are deleted first, and the new address space is set up second.  What you describes implies that it's the other way around.

The way I see it could work is like this.  Let's say the leader is 100.100 and the non-leader is 100.101.

1. non-leader execs
2. all other threads are removed
3. PTRACE_EVENT_ALMOST_EXEC is fired: GDB deletes other threads from its data structures
4. GDB resumes execution of 100.101
5. kernel renames 100.101 into 100.100 and sets up new address space
6. PTRACE_EVENT_EXEC is fired: GDB updates the ptid from 100.101 to 100.100 in its data structures, loads the new symbols
7. GDB resumes execution of 100.100

Comment 6 Pedro Alves 2020-11-19 22:57:55 UTC

> It looks like other threads are deleted first, and the new address space is set 
> up second.  What you describes implies that it's the other way around.

Well, I said "that is reported after all threads in the process are exited".  By "all threads" here I was talking about all threads except the leader and the one that exceed, like how PTRACE_EVENT_EXEC event also works.

> It looks like other threads are deleted first, and the new address space is set 
> up second.  What you describes implies that it's the other way around.

It looks more like you're talking about the order of changing the exec'ing thread's tid to the tgid vs setting up the new address space.  OK.

> The way I see it could work is like this.  Let's say the leader is 100.100 and 
> the non-leader is 100.101.
> 
> 1. non-leader execs
> 2. all other threads are removed
> 3. PTRACE_EVENT_ALMOST_EXEC is fired: GDB deletes other threads from its data structures
> 4. GDB resumes execution of 100.101
> 5. kernel renames 100.101 into 100.100 and sets up new address space
> 6. PTRACE_EVENT_EXEC is fired: GDB updates the ptid from 100.101 to 100.100 in > its data structures, loads the new symbols
> 7. GDB resumes execution of 100.100

OK, yeah, the main difference is that I was saying that the new address space would be created between #2 and #3, and GDB would load the new symbols in #3 already.  I think what you have should work too, and is nicer in that the PTRACE_EVENT_EXEC handling continues the same and PTRACE_EVENT_ALMOST_EXEC is a smaller incremental change, a pure addition.  I like it.

In #3, I assume that the leader thread will still exist, but be zombie, like what happens currently with PTRACE_EVENT_EXEC until you reap all threads.  GDB is free to delete it from its data structures then, it's what we already do when the leader exits even without considering exec events -- the leader never disappears, it already refuses to die and stays as a zombie when it exits.  I would prefer to preserve gdb thread id == 1 before / after exec, but that's a GDB concern, not a kernel interface concern.

Ship it!

Comment 7 Simon Marchi 2020-11-19 23:11:00 UTC

(In reply to Pedro Alves from comment #6)
> In #3, I assume that the leader thread will still exist, but be zombie, like
> what happens currently with PTRACE_EVENT_EXEC until you reap all threads. 
> GDB is free to delete it from its data structures then, it's what we already
> do when the leader exits even without considering exec events -- the leader
> never disappears, it already refuses to die and stays as a zombie when it
> exits.

That looks right.  Notice that the new event is fired just after that loop (note that these links may become dead if the branch is forced-pushed):

https://github.com/compudj/linux-dev/blob/1c9ed9f9b432ccb58e5344c83568e1a66f52843b/fs/exec.c#L1204-L1220

In that loop, we wait for `leader->exit_state` to become true.  It would make sense that this denotes the zombie state, but I don't actually know.  The comment a few lines above says:

	/*
	 * At this point all other threads have exited, all we have to
	 * do is to wait for the thread group leader to become inactive,
	 * and to assume its PID:
	 */

By "wait for the thread group leader to become inactive", they probably mean "wait for it to become zombie", so again it matches what you say.

Also, note that the event is also sent in the else branch of:

  if (!thread_group_leader(tsk)) {
    ...
  } else {
    ptrace_event(PTRACE_EVENT_EXEC_BEGIN, old_vpid);
  }

To cover for the case where the leader does the exec.

> I would prefer to preserve gdb thread id == 1 before / after exec,
> but that's a GDB concern, not a kernel interface concern.

I don't understand what you mean here.

Comment 8 Pedro Alves 2020-11-19 23:52:58 UTC

>> I would prefer to preserve gdb thread id == 1 before / after exec,
>> but that's a GDB concern, not a kernel interface concern.
> I don't understand what you mean here.

If we delete the zombie leader in reaction to PTRACE_EVENT_EXEC_BEGIN, then when we get the PTRACE_EVENT_EXEC, we'll create a new thread for the new leader, and that will end up with a GDB tid != 1.

   1.1     main ()
 * 1.2     thread()

and after exec we have:

 * 1.3     main ()

I'm thinking that it would be better to end up with:

 * 1.1     main ()

Similar to how if a non-threaded program execs, the main thread's thread number doesn't change.

Comment 9 Simon Marchi 2020-11-19 23:55:37 UTC

(In reply to Pedro Alves from comment #8)
> >> I would prefer to preserve gdb thread id == 1 before / after exec,
> >> but that's a GDB concern, not a kernel interface concern.
> > I don't understand what you mean here.
> 
> If we delete the zombie leader in reaction to PTRACE_EVENT_EXEC_BEGIN, then
> when we get the PTRACE_EVENT_EXEC, we'll create a new thread for the new
> leader, and that will end up with a GDB tid != 1.
> 
>    1.1     main ()
>  * 1.2     thread()
> 
> and after exec we have:
> 
>  * 1.3     main ()
> 
> I'm thinking that it would be better to end up with:
> 
>  * 1.1     main ()
> 
> Similar to how if a non-threaded program execs, the main thread's thread
> number doesn't change.

Ah yeah of course, we definitely want that behavior.  I don't fully understand what you say (I would need to go study the code), but it doesn't matter right now, it's all internal GDB bookkeeping.

Comment 10 Simon Marchi 2022-09-29 16:58:41 UTC

This all ended with this behavior change in the kernel:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=dbb5afad100a828c97e012c6106566d99f041db6

I don't think we need to do anything on the GDB side.