This is the mail archive of the gdb-patches@sourceware.org mailing list for the GDB project.
Re: [PATCH 0/4 v3] Exec events in gdbserver on Linux
- From: "Breazeal, Don" <donb at codesourcery dot com>
- To: Doug Evans <xdje42 at gmail dot com>
- Cc: "gdb-patches at sourceware dot org" <gdb-patches at sourceware dot org>
- Date: Tue, 27 May 2014 14:41:15 -0700
- Subject: Re: [PATCH 0/4 v3] Exec events in gdbserver on Linux
- References: <1398885482-8449-1-git-send-email-donb at codesourcery dot com> <1400885374-18915-1-git-send-email-donb at codesourcery dot com> <CAP9bCMSLJwYn3RHzAwvfE6T05SdsEKXY1bivXwjOjrdZsvjwoQ at mail dot gmail dot com> <5384DE3A dot 8080205 at codesourcery dot com>
On 5/27/2014 11:49 AM, Breazeal, Don wrote:
> On 5/25/2014 9:55 PM, Doug Evans wrote:
>> On Fri, May 23, 2014 at 3:49 PM, Don Breazeal <donb@codesourcery.com> wrote:
>>> This patch series is an update to the gdbserver Linux exec event patches based on review comments for the previous version. The changes from the previous version are summarized below.
>>>
>>> [...]
>>>
>>> RACE CONDITION
>>>
>>> This section explains why the existing techniques for detecting thread exit weren't sufficient for gdbserver exec events, necessitating the use of PTRACE_EVENT_EXIT. The short answer is that there is a race condition in the current implementation that can leave a dangling entry in the lwp list (an entry that doesn't have a corresponding actual lwp). In this case gdbserver will hang waiting for the non-existent lwp to stop. Using the exit events eliminates this race condition.
>>>
>>> The same race may exist in the native implementation, since the two implementations are similar, but I haven't verified that. It may be difficult to concoct a test case that demonstrates the race since the window is so small.
>>>
>>> Now for the long answer: in my testing I ran into a race condition in check_zombie_leaders, which detects when a thread group leader has exited and other threads still exist. On the Linux kernel, ptrace/waitpid don't allow reaping the leader thread until all other threads in the group are reaped. When the leader exits, it goes zombie, but waitpid will not return an exit status until the other threads are gone. When a non-leader thread calls exec, all other non-leader threads are destroyed, the leader becomes a zombie, and once the "other" threads have been reaped, the execing thread takes over the leader's pid (tgid) and appears to vanish. In order to handle this situation, check_zombie_leaders polls the process state in /proc and deletes thread group leaders that are in a zombie state. The replacement is added to the lwp list when the exec event is reported.
>>>
>>> See https://sourceware.org/ml/gdb-patches/2011-10/msg00704.html for a more detailed explanation of how this works.
>>>
>>> Here is the relevant part of check_zombie_leaders:
>>>
>>>   if (leader_lp != NULL
>>>       /* Check if there are other threads in the group, as we may
>>>          have raced with the inferior simply exiting.  */
>>>       && !last_thread_of_process_p (leader_pid)
>>>       && linux_proc_pid_is_zombie (leader_pid))
>>>     {
>>>       /* ...large informative comment block... */
>>>       delete_lwp (leader_lp);
>>>
>>> The race occurred when there were two threads in the program, and the non-leader thread called exec. In this case the leader thread passed through a very brief zombie state before being replaced by the exec'ing thread as the thread group leader. This state transition was asynchronous, with no dependency on anything gdbserver did. Because there were no other threads, there were no thread exit events, and thus there was no synchronization with the leader passing through the zombie state and the exec completing. If there had been more threads, the leader would remain in the zombie state until they were waited for. In the two-thread case, sometimes the leader exit was detected and sometimes it wasn't. (Recall that check_zombie_leaders is polling the state, via linux_proc_pid_is_zombie. The race is between the leader thread passing through the zombie state and check_zombie_leaders testing for zombie state.) If leader exit wasn't detected, gdbserver would end up with a dangling lwp entry that didn't correspond to any real lwp, and would hang waiting for that lwp to stop. Using PTRACE_EVENT_EXIT guarantees that the leader exit will be detected.
>>>
>>> Note that check_zombie_leaders works just fine for the scenarios where the leader thread exits and the other threads continue to run, with no exec calls. It is required for systems that don't support the extended ptrace events.
>>>
>>> The sequence of events resulting in the race condition was this:
>>>
>>> 1) In the program, a CLONE event for a new thread occurs.
>>>
>>> 2) In the program, both threads are resumed once gdbserver has
>>> completed the new thread processing.
>>>
>>> 3) In gdbserver, the function linux_wait_for_event_filtered loops until
>>> waitpid returns "no more events" for the SIGCHLD generated by the
>>> CLONE event. Then linux_wait_for_event_filtered calls
>>> check_zombie_leaders.
>>>
>>> 4) In the program, the new thread is doing the exec. During the exec
>>> the leader thread will pass through a transitory zombie state. If
>>> there were more than two threads, the leader thread would remain a
>>> zombie until all the non-leader, non-exec'ing threads were reaped by
>>> gdbserver. Since there are no such threads to reap, the leader just
>>> becomes a zombie and is replaced by the exec'ing thread on-the-fly.
>>> (Note that it appears that the leader thread is a zombie just for a
>>> very brief instant.)
>>>
>>> 5) In gdbserver, check_zombie_leaders checks whether an lwp entry
>>> corresponds to a zombie leader thread, and if so, deletes it. Here
>>> is the race: in (4) above, the leader may or may not be in the
>>> transitory zombie state. In the case where a zombie isn't detected,
>>> delete_lwp is not called.
>>>
>>> 6) In gdbserver, an EXEC event is detected and processed. When it gets
>>> ready to report the event to GDB, it calls stop_all_lwps, which sends
>>> a SIGSTOP to each lwp in the list and then waits until all the lwps in
>>> the list have reported a stop event. If the zombie leader wasn't
>>> detected and processed in step (5), gdbserver blocks forever in
>>> linux_wait_for_event_filtered, waiting for the undeleted lwp to be
>>> stopped, which will never happen.
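To make the polling in step (5) concrete, here is a minimal sketch of a one-shot zombie check against /proc/PID/status. It is not the gdbserver implementation of linux_proc_pid_is_zombie; the helper names state_line_is_zombie and pid_is_zombie_now are hypothetical and exist only for illustration. The point is that each call samples the state at a single instant, so a zombie state that begins and ends between two calls is never observed.

```c
#include <assert.h>   /* only needed for the usage check shown below */
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: return 1 if LINE is a "State:" line from
   /proc/PID/status that reports a zombie, 0 otherwise.  */
static int
state_line_is_zombie (const char *line)
{
  if (strncmp (line, "State:", 6) != 0)
    return 0;
  line += 6;
  /* Skip the whitespace between the field name and the state letter.  */
  while (*line == ' ' || *line == '\t')
    line++;
  return *line == 'Z';
}

/* Hypothetical helper: sample the state of PID exactly once.  Returns 1
   if the process is a zombie right now, 0 if it is in any other state
   or if the status file cannot be read.  */
static int
pid_is_zombie_now (pid_t pid)
{
  char path[64];
  char line[256];
  FILE *fp;
  int zombie = 0;

  snprintf (path, sizeof path, "/proc/%d/status", (int) pid);
  fp = fopen (path, "r");
  if (fp == NULL)
    return 0;

  while (fgets (line, sizeof line, fp) != NULL)
    if (strncmp (line, "State:", 6) == 0)
      {
        zombie = state_line_is_zombie (line);
        break;
      }

  fclose (fp);
  return zombie;
}
```

For example, assert (pid_is_zombie_now (getpid ()) == 0) holds for the calling process. Two back-to-back calls on a leader pid during an exec can return 1 and then 0, which is the pattern in the race described above.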
>>
>> Hi.
>>
>> How do I tweak your patch so that I can see the race condition for myself?
>> [I realize the window is small ... I'd just like to play with it.]
>
> Hi Doug, thanks for looking at this. I will have to look back through
> my notes to see if I can come up with a reasonable way of doing this. I
> was using the same test case that you were, running the basic test
> manually. I saw the race condition by inserting fprintf(stderr,.. into
> linux_proc_pid_has_state and check_zombie_leaders, running with no debug
> output.
>
> From my notes:
> ----------------------------------------------
> # result of if stmt condition in check_zombie_leaders
> # original leader is in zombie state
> # linux_proc_pid_has_state: procfile:
> # State: Z (zombie)
> #
> # result inside 'if' stmt in check_zombie_leaders - execing
> # thread has replaced original leader since we evaluated
> # the 'if' condition
> # linux_proc_pid_has_state: procfile:
> # State: R (running)
> #
> # printed inside if stmt, required zombie=1 to get here
> # we still think we have 2 lwps, but after the exec there
> # is only one. zombie=0 came from call to linux_proc_pid_has_state
> # above.
> # check_zombie_leaders: leader_pid=30981, leader_lp!=NULL=1, num_lwps=2,
> # zombie=0
> # ----------------------------------------------
>
My text editor "helped" me with extra comment characters. Here's the
corrected trace log from my notes. The '#' lines are my annotations,
the other lines were actual trace log output.
----------------------------------------------
# result of if stmt condition in check_zombie_leaders
# original leader is in zombie state
linux_proc_pid_has_state: procfile:
State: Z (zombie)
# result inside 'if' stmt in check_zombie_leaders - execing
# thread has replaced original leader since we evaluated
# the 'if' condition
linux_proc_pid_has_state: procfile:
State: R (running)
# printed inside if stmt, required zombie=1 to get here
# we still think we have 2 lwps, but after the exec there
# is only one. zombie=0 came from call to linux_proc_pid_has_state
# above.
check_zombie_leaders: leader_pid=30981, leader_lp!=NULL=1, num_lwps=2, zombie=0
----------------------------------------------
> The race could manifest in ways other than the one demonstrated here, of
> course.
>
>>
>> I tried just disabling PTRACE_O_TRACEEXIT but that causes segvs in the
>> inferior (using non-ldr-exc-1 testcase).
>> Still digging into that.
>>
> I made the EXEC event support depend on the EXIT events. I have no idea
> what would happen if the EXIT events were disabled.
>
> Thanks
> --Don
>
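For readers following the EXIT/EXEC event dependency discussed above, this is a minimal sketch of how an extended ptrace event can be recognized in a waitpid status. It assumes the Linux encoding documented in ptrace(2), where the event code sits above the SIGTRAP stop signal in the status word. The helper names extended_event_of and is_exit_event are hypothetical and are not taken from the patch.

```c
#include <assert.h>   /* only needed for the usage check shown below */
#include <signal.h>
#include <sys/ptrace.h>
#include <sys/wait.h>

/* Hypothetical helper: return the PTRACE_EVENT_* code carried by a
   waitpid STATUS, or 0 if STATUS is not a SIGTRAP stop.  A plain
   SIGTRAP stop with no event also yields 0.  */
static int
extended_event_of (int status)
{
  if (!WIFSTOPPED (status) || WSTOPSIG (status) != SIGTRAP)
    return 0;
  /* ptrace(2): status >> 8 == (SIGTRAP | (PTRACE_EVENT_xxx << 8)),
     so the event code is in the bits above the stop signal.  */
  return (status >> 16) & 0xffff;
}

/* Hypothetical helper: return nonzero if STATUS reports that a traced
   thread is about to exit (requires PTRACE_O_TRACEEXIT to be set).  */
static int
is_exit_event (int status)
{
  return extended_event_of (status) == PTRACE_EVENT_EXIT;
}
```

A synthetic status built as (PTRACE_EVENT_EXIT << 16) | (SIGTRAP << 8) | 0x7f satisfies assert (is_exit_event (status)). Unlike the /proc poll, this notification is delivered through waitpid, so the tracer does not have to catch the leader inside a brief state window.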