This is the mail archive of the
gdb-patches@sourceware.org
mailing list for the GDB project.
Re: [PATCH 0/4 v3] Exec events in gdbserver on Linux
- From: Doug Evans <xdje42 at gmail dot com>
- To: Don Breazeal <donb at codesourcery dot com>
- Cc: "gdb-patches at sourceware dot org" <gdb-patches at sourceware dot org>
- Date: Sun, 25 May 2014 21:55:35 -0700
- Subject: Re: [PATCH 0/4 v3] Exec events in gdbserver on Linux
- References: <1398885482-8449-1-git-send-email-donb at codesourcery dot com> <1400885374-18915-1-git-send-email-donb at codesourcery dot com>
On Fri, May 23, 2014 at 3:49 PM, Don Breazeal <donb@codesourcery.com> wrote:
> This patch series is an update to the gdbserver Linux exec event patches based on review comments for the previous version. The changes from the previous version are summarized below.
>
> [...]
>
> RACE CONDITION
>
> This section explains why the existing techniques for detecting thread exit weren't sufficient for gdbserver exec events, necessitating the use of PTRACE_EVENT_EXIT. The short answer is that there is a race condition in the current implementation that can leave a dangling entry in the lwp list (an entry that doesn't have a corresponding actual lwp). In this case gdbserver will hang waiting for the non-existent lwp to stop. Using the exit events eliminates this race condition.
>
> The same race may exist in the native implementation, since the two implementations are similar, but I haven't verified that. It may be difficult to concoct a test case that demonstrates the race since the window is so small.
>
> Now for the long answer: in my testing I ran into a race condition in check_zombie_leaders, which detects when a thread group leader has exited and other threads still exist. On the Linux kernel, ptrace/waitpid don't allow reaping the leader thread until all other threads in the group are reaped. When the leader exits, it goes zombie, but waitpid will not return an exit status until the other threads are gone. When a non-leader thread calls exec, all other non-leader threads are destroyed, the leader becomes a zombie, and once the "other" threads have been reaped, the execing thread takes over the leader's pid (tgid) and appears to vanish. In order to handle this situation, check_zombie_leaders polls the process state in /proc and deletes thread group leaders that are in a zombie state. The replacement is added to the lwp list when the exec event is reported.
>
> See https://sourceware.org/ml/gdb-patches/2011-10/msg00704.html for a more detailed explanation of how this works.
>
> Here is the relevant part of check_zombie_leaders:
>
>     if (leader_lp != NULL
>         /* Check if there are other threads in the group, as we may
>            have raced with the inferior simply exiting.  */
>         && !last_thread_of_process_p (leader_pid)
>         && linux_proc_pid_is_zombie (leader_pid))
>       {
>         /* ...large informative comment block... */
>         delete_lwp (leader_lp);
>
> The race occurred when there were two threads in the program, and the non-leader thread called exec. In this case the leader thread passed through a very brief zombie state before being replaced by the exec'ing thread as the thread group leader. This state transition was asynchronous, with no dependency on anything gdbserver did. Because there were no other threads, there were no thread exit events, and thus there was no synchronization with the leader passing through the zombie state and the exec completing. If there had been more threads, the leader would remain in the zombie state until they were waited for. In the two-thread case, sometimes the leader exit was detected and sometimes it wasn't. (Recall that check_zombie_leaders is polling the state, via linux_proc_pid_is_zombie. The race is between the leader thread passing through the zombie state and check_zombie_leaders testing for zombie state.) If leader exit wasn't detected, gdbserver would end up with a dangling lwp entry that didn't correspond to any real lwp, and would hang waiting for that lwp to stop. Using PTRACE_EVENT_EXIT guarantees that the leader exit will be detected.
>
> Note that check_zombie_leaders works just fine for the scenarios where the leader thread exits and the other threads continue to run, with no exec calls. It is required for systems that don't support the extended ptrace events.
>
> The sequence of events resulting in the race condition was this:
>
> 1) In the program, a CLONE event for a new thread occurs.
>
> 2) In the program, both threads are resumed once gdbserver has
> completed the new thread processing.
>
> 3) In gdbserver, the function linux_wait_for_event_filtered loops until
> waitpid returns "no more events" for the SIGCHLD generated by the
> CLONE event. Then linux_wait_for_event_filtered calls
> check_zombie_leaders.
>
> 4) In the program, the new thread is doing the exec. During the exec
> the leader thread will pass through a transitory zombie state. If
> there were more than two threads, the leader thread would remain a
> zombie until all the non-leader, non-exec'ing threads were reaped by
> gdbserver. Since there are no such threads to reap, the leader just
> becomes a zombie and is replaced by the exec'ing thread on-the-fly.
> (Note that it appears that the leader thread is a zombie just for a
> very brief instant.)
>
> 5) In gdbserver, check_zombie_leaders checks whether an lwp entry
> corresponds to a zombie leader thread, and if so, deletes it. Here
> is the race: in (4) above, the leader may or may not be in the
> transitory zombie state. In the case where a zombie isn't detected,
> delete_lwp is not called.
>
> 6) In gdbserver, an EXEC event is detected and processed. When it gets
> ready to report the event to GDB, it calls stop_all_lwps, which sends
> a SIGSTOP to each lwp in the list and then waits until all the lwps in
> the list have reported a stop event. If the zombie leader wasn't
> detected and processed in step (5), gdbserver blocks forever in
> linux_wait_for_event_filtered, waiting for the undeleted lwp to be
> stopped, which will never happen.
Hi.
How do I tweak your patch so that I can see the race condition for myself?
[I realize the window is small ... I'd just like to play with it.]
I tried just disabling PTRACE_O_TRACEEXIT but that causes segvs in the
inferior (using non-ldr-exc-1 testcase).
Still digging into that.
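
For anyone who wants to look at the zombie state in isolation, outside of gdbserver, here is a minimal standalone sketch of the kind of /proc polling that the quoted text describes. It is not the gdbserver code: pid_is_zombie and observe_zombie_then_reap are hypothetical names made up for this illustration, and reading the "State:" line of /proc/PID/status is just one assumed way to perform the check. The sketch holds a child in the zombie state on purpose (waitid with WNOWAIT), so the poll always sees it. In the two-thread exec case described above nothing holds that window open, which is why a poll can miss it.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper (not GDB's linux_proc_pid_is_zombie): return 1 if
   /proc/PID/status reports state 'Z', 0 if it reports any other state,
   and -1 if the file cannot be read (for example, the pid is gone).  */
static int
pid_is_zombie (pid_t pid)
{
  char path[64], line[256];
  FILE *f;
  int result = -1;

  snprintf (path, sizeof path, "/proc/%ld/status", (long) pid);
  f = fopen (path, "r");
  if (f == NULL)
    return -1;

  while (fgets (line, sizeof line, f) != NULL)
    {
      char state;

      /* The line looks like "State:\tZ (zombie)".  */
      if (sscanf (line, "State: %c", &state) == 1)
        {
          result = (state == 'Z');
          break;
        }
    }

  fclose (f);
  return result;
}

/* Hypothetical demo: fork a child that exits at once, poll it while it
   is still unreaped, reap it, and poll again.  Returns 0 if the zombie
   was seen before the reap and not after, 1 otherwise, -1 on error.  */
static int
observe_zombie_then_reap (void)
{
  siginfo_t info;
  pid_t pid = fork ();

  if (pid < 0)
    return -1;
  if (pid == 0)
    _exit (0);

  /* WNOWAIT blocks until the child has exited but leaves it unreaped,
     so it is still a zombie when we poll.  */
  if (waitid (P_PID, pid, &info, WEXITED | WNOWAIT) != 0)
    return -1;
  int before = pid_is_zombie (pid);

  /* Now reap the child for real; its /proc entry goes away.  */
  if (waitpid (pid, NULL, 0) != pid)
    return -1;
  int after = pid_is_zombie (pid);

  return (before == 1 && after != 1) ? 0 : 1;
}
```

Calling observe_zombie_then_reap () from a small main on a Linux box with /proc mounted should return 0. Dropping the waitid call and polling straight after fork turns it into a race much like the one in step (5): the result then depends on whether the child has already exited when the poll runs.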