This is the mail archive of the
gdb-patches@sources.redhat.com
mailing list for the GDB project.
Re: [PATCH RFC] linux threads: prevent starvation
- To: gdb-patches at sources dot redhat dot com
- Subject: Re: [PATCH RFC] linux threads: prevent starvation
- From: Michael Snyder <msnyder at cygnus dot com>
- Date: Fri, 06 Jul 2001 12:12:35 -0700
- Newsgroups: cygnus.patches.gdb
- Organization: Red Hat
- References: <3B26BCE9.AA6731EC@cygnus.com>
Committed. Mark Kettenis, thanks for your review.
Michael Snyder wrote:
>
> OK, this deserves some explanation.
>
> Problem: Debugging with GDB can cause some threads to starve
> in a multi-threaded Linux program.
>
> Testcase: gdb/testsuite/gdb.threads/pthreads.exp
> FAIL: gdb.threads/pthreads.exp: Some threads didn't run.
>
> Cause: In lin_lwp_wait, we call waitpid (-1, __WCLONE).
> If more than one LWP is currently stopped at a breakpoint,
> the highest-numbered one will be returned. All the others
> will have their breakpoints pushed back. If we then continue,
> and the highest-numbered LWP has time to hit a breakpoint
> again, no one else will have a chance to run.
>
> Solution: Instead of always accepting the event for the
> first LWP returned by waitpid, use a monte carlo method
> to select among all LWPs that have breakpoint events.
>
> Implementation: I've made use of the lwp list and saved status
> that Mark has already implemented. I've saved the SIGTRAP events
> along with any other events accidentally fetched by stop_wait_callback,
> and I've added the event fetched by lin_lwp_wait to the same queue.
> Then I've added code that examines the list of fetched SIGTRAP events
> and chooses one at random, making that LWP the event LWP and returning
> its event to gdb.
>
> This passes the testsuite (where the old method fails), and I've
> done extensive hand testing as well. In fact, this has exposed
> a number of bugs in core gdb's thread handling, some of which I've
> already submitted patches for. Others are still to come. With
> all of these patches in place, this code seems to be quite robust.
> I can put a breakpoint in code that is shared by several threads,
> and keep them all stepping and nexting indefinitely.
>
> ----------------------------------------------------------------------------------------------------
> 2001-06-12 Michael Snyder <msnyder@redhat.com>
>
> * lin-lwp.c: Prevent thread starvation by using a monte carlo
> method to choose which of several event threads to handle next.
>
> (resume_callback): Disable code that attempts to handle
> step_resume breakpoints. Let core gdb handle this.
> (stop_wait_callback): Defer pushback of breakpoint events until
> later; add SIGTRAP events to the queue of unhandled events.
> Keep calling waitpid until SIGSTOP retrieved. If more than one
> non-SIGSTOP event is retrieved, push them back onto the process
> queue using kill.
> (count_events_callback, select_singlestep_lwp_callback,
> select_event_lwp_callback, cancel_breakpoints_callback,
> select_event_lwp): New functions. Implement monte carlo method
> for selecting which of several SIGTRAP threads to handle next.
> Push back the breakpoint event for all threads other than the
> selected one.
> (lin_lwp_wait): Call select_event_lwp to decide which of several
> sigtrapped lwps to handle next.
>
> Index: lin-lwp.c
> ===================================================================
> RCS file: /cvs/src/src/gdb/lin-lwp.c,v
> retrieving revision 1.24
> diff -c -3 -p -r1.24 lin-lwp.c
> *** lin-lwp.c 2001/06/07 19:31:10 1.24
> --- lin-lwp.c 2001/06/13 00:50:29
> *************** resume_callback (struct lwp_info *lp, vo
> *** 458,464 ****
> {
> struct thread_info *tp;
>
> ! #if 1
> /* FIXME: kettenis/2000-08-26: This should really be handled
> properly by core GDB. */
>
> --- 458,464 ----
> {
> struct thread_info *tp;
>
> ! #if 0
> /* FIXME: kettenis/2000-08-26: This should really be handled
> properly by core GDB. */
>
> *************** stop_wait_callback (struct lwp_info *lp,
> *** 585,591 ****
> pid_t pid;
> int status;
>
> - get_another_event:
> gdb_assert (lp->status == 0);
>
> pid = waitpid (GET_LWP (lp->ptid), &status,
> --- 585,590 ----
> *************** stop_wait_callback (struct lwp_info *lp,
> *** 619,631 ****
> }
>
> gdb_assert (WIFSTOPPED (status));
> - lp->stopped = 1;
>
> if (WSTOPSIG (status) != SIGSTOP)
> {
> ! if (WSTOPSIG (status) == SIGTRAP
> ! && breakpoint_inserted_here_p (read_pc_pid (pid_to_ptid (pid))
> ! - DECR_PC_AFTER_BREAK))
> {
> /* If a LWP other than the LWP that we're reporting an
> event for has hit a GDB breakpoint (as opposed to
> --- 618,627 ----
> }
>
> gdb_assert (WIFSTOPPED (status));
>
> if (WSTOPSIG (status) != SIGSTOP)
> {
> ! if (WSTOPSIG (status) == SIGTRAP)
> {
> /* If a LWP other than the LWP that we're reporting an
> event for has hit a GDB breakpoint (as opposed to
> *************** stop_wait_callback (struct lwp_info *lp,
> *** 640,660 ****
> user will delete or disable the breakpoint, but the
> thread will have already tripped on it. */
>
> - if (debug_lin_lwp)
> - fprintf_unfiltered (gdb_stdlog,
> - "Tripped breakpoint at %lx in LWP %d"
> - " while waiting for SIGSTOP.\n",
> - (long) read_pc_pid (lp->ptid), pid);
> -
> - /* Set the PC to before the trap. */
> - if (DECR_PC_AFTER_BREAK)
> - write_pc_pid (read_pc_pid (pid_to_ptid (pid))
> - - DECR_PC_AFTER_BREAK,
> - pid_to_ptid (pid));
> -
> /* Now resume this LWP and get the SIGSTOP event. */
> ptrace (PTRACE_CONT, GET_LWP (lp->ptid), 0, 0);
> ! goto get_another_event;
> }
> else if (WSTOPSIG (status) == SIGINT &&
> signal_pass_state (SIGINT) == 0)
> --- 636,657 ----
> user will delete or disable the breakpoint, but the
> thread will have already tripped on it. */
>
> /* Now resume this LWP and get the SIGSTOP event. */
> ptrace (PTRACE_CONT, GET_LWP (lp->ptid), 0, 0);
> ! if (debug_lin_lwp)
> ! {
> ! fprintf_unfiltered (gdb_stderr,
> ! "SWC: Candidate SIGTRAP event in %ld\n",
> ! GET_LWP (lp->ptid));
> ! }
> ! /* Hold the SIGTRAP for handling by lin_lwp_wait. */
> ! stop_wait_callback (lp, data);
> ! /* If there's another event, throw it back into the queue. */
> ! if (lp->status)
> ! kill (GET_LWP (lp->ptid), WSTOPSIG (lp->status));
> ! /* Save the sigtrap event. */
> ! lp->status = status;
> ! return 0;
> }
> else if (WSTOPSIG (status) == SIGINT &&
> signal_pass_state (SIGINT) == 0)
> *************** stop_wait_callback (struct lwp_info *lp,
> *** 664,690 ****
> just ignore all SIGINT events from all lwp's except for
> the one that was caught by lin_lwp_wait. */
>
> ! /* Now resume this LWP and get the SIGSTP event. */
> ptrace (PTRACE_CONT, GET_LWP (lp->ptid), 0, 0);
> ! goto get_another_event;
> }
> else
> {
> if (debug_lin_lwp)
> ! fprintf_unfiltered (gdb_stdlog,
> ! "Received %s in LWP %d while waiting for SIGSTOP.\n",
> ! strsignal (WSTOPSIG (status)), pid);
>
> ! /* The thread was stopped with a signal other than
> ! SIGSTOP, and didn't accidentally trip a breakpoint.
> ! Record the wait status. */
> ! lp->status = status;
> }
> }
> else
> {
> /* We caught the SIGSTOP that we intended to catch, so
> there's no SIGSTOP pending. */
> lp->signalled = 0;
> }
> }
> --- 661,702 ----
> just ignore all SIGINT events from all lwp's except for
> the one that was caught by lin_lwp_wait. */
>
> ! /* Now resume this LWP and get the SIGSTOP event. */
> ptrace (PTRACE_CONT, GET_LWP (lp->ptid), 0, 0);
> ! return stop_wait_callback (lp, data);
> }
> else
> {
> + /* The thread was stopped with a signal other than
> + SIGSTOP, and didn't accidentally trip a breakpoint. */
> +
> if (debug_lin_lwp)
> ! {
> ! fprintf_unfiltered (gdb_stderr,
> ! "SWC: Pending event %d in %ld\n",
> ! WSTOPSIG (status), GET_LWP (lp->ptid));
> ! }
> ! /* Now resume this LWP and get the SIGSTOP event. */
> ! ptrace (PTRACE_CONT, GET_LWP (lp->ptid), 0, 0);
>
> ! /* Hold this event/waitstatus while we check to see if
> ! there are any more (we still want to get that SIGSTOP). */
> ! stop_wait_callback (lp, data);
> ! /* If the lp->status field is still empty, use it to hold
> ! this event. If not, then this event must be returned
> ! to the event queue of the LWP. */
> ! if (lp->status == 0)
> ! lp->status = status;
> ! else
> ! kill (GET_LWP (lp->ptid), WSTOPSIG (status));
> ! return 0;
> }
> }
> else
> {
> /* We caught the SIGSTOP that we intended to catch, so
> there's no SIGSTOP pending. */
> + lp->stopped = 1;
> lp->signalled = 0;
> }
> }
> *************** running_callback (struct lwp_info *lp, v
> *** 710,715 ****
> --- 722,862 ----
> return (lp->stopped == 0);
> }
>
> + /* Count the LWP's that have had events. */
> +
> + static int
> + count_events_callback (struct lwp_info *lp, void *data)
> + {
> + int *count = data;
> +
> + /* Count only threads that have a SIGTRAP pending. */
> + if (lp->status != 0 &&
> + WIFSTOPPED (lp->status) &&
> + WSTOPSIG (lp->status) == SIGTRAP &&
> + count != NULL) /* paranoia */
> + (*count)++;
> +
> + return 0;
> + }
> +
> + /* Select the LWP (if any) that is currently being single-stepped. */
> +
> + static int
> + select_singlestep_lwp_callback (struct lwp_info *lp, void *data)
> + {
> + if (lp->step)
> + return 1;
> + else
> + return 0;
> + }
> +
> + /* Select the Nth LWP that has had a SIGTRAP event. */
> +
> + static int
> + select_event_lwp_callback (struct lwp_info *lp, void *data)
> + {
> + int *selector = data;
> +
> + /* Select only threads that have a SIGTRAP event pending. */
> + if (lp->status != 0 &&
> + WIFSTOPPED (lp->status) &&
> + WSTOPSIG (lp->status) == SIGTRAP &&
> + selector != NULL) /* paranoia */
> + if ((*selector)-- == 0)
> + return 1;
> +
> + return 0;
> + }
> +
> + static int
> + cancel_breakpoints_callback (struct lwp_info *lp, void *data)
> + {
> + struct lwp_info *event_lp = data;
> +
> + if (lp != event_lp &&
> + lp->status != 0 &&
> + WIFSTOPPED (lp->status) &&
> + WSTOPSIG (lp->status) == SIGTRAP &&
> + breakpoint_inserted_here_p (read_pc_pid (lp->ptid) -
> + DECR_PC_AFTER_BREAK))
> + {
> + if (debug_lin_lwp)
> + {
> + fprintf_unfiltered (gdb_stdlog,
> + "Push back BP for %ld\n",
> + GET_LWP (lp->ptid));
> + }
> + /* For each lp except the event lp, if there was a trap,
> + set the PC to before the trap. */
> + if (DECR_PC_AFTER_BREAK)
> + {
> + write_pc_pid (read_pc_pid (lp->ptid) - DECR_PC_AFTER_BREAK,
> + lp->ptid);
> + }
> + lp->status = 0;
> + }
> + return 0;
> + }
> +
> + /* Select one LWP out of those that have events to be the event thread. */
> +
> + static void
> + select_event_lwp (struct lwp_info **orig_lp, int *status)
> + {
> + int num_events = 0;
> + int random_selector;
> + struct lwp_info *event_lp;
> +
> + /* Give preference to any LWP that is being single-stepped. */
> + event_lp = iterate_over_lwps (select_singlestep_lwp_callback, NULL);
> + if (event_lp != NULL)
> + {
> + if (debug_lin_lwp)
> + fprintf_unfiltered (gdb_stdlog,
> + "Select singlestep lwp %ld\n",
> + GET_LWP (event_lp->ptid));
> + }
> + else
> + {
> + /* No single-stepping LWP. Select one at random, out of those
> + which have had SIGTRAP events. */
> +
> + /* First see how many SIGTRAP events we have. */
> + iterate_over_lwps (count_events_callback, (void *) &num_events);
> +
> + /* OK, now randomly pick the Nth LWP of those that have had SIGTRAP. */
> + random_selector = (int)
> + ((num_events * (double) rand ()) / (RAND_MAX + 1.0));
> +
> + if (debug_lin_lwp)
> + {
> + if (num_events > 1)
> + fprintf_unfiltered (gdb_stdlog,
> + "Found %d SIGTRAP events, selecting #%d\n",
> + num_events, random_selector);
> + else if (num_events <= 0)
> + fprintf_unfiltered (gdb_stdlog,
> + "ERROR select_event_lwp %d events!\n",
> + num_events);
> + }
> +
> + event_lp = iterate_over_lwps (select_event_lwp_callback,
> + (void *) &random_selector);
> + }
> +
> + if (event_lp != NULL)
> + {
> + /* "event_lp" is now the selected event thread.
> + If any other threads have had SIGTRAP events, these events
> + must now be cancelled, so that the respective thread will
> + trip the breakpoint again once it is resumed. */
> + iterate_over_lwps (cancel_breakpoints_callback, (void *) event_lp);
> + *orig_lp = event_lp;
> + *status = event_lp->status;
> + event_lp->status = 0;
> + }
> + }
> +
> /* Return non-zero if LP has been resumed. */
>
> static int
> *************** lin_lwp_wait (ptid_t ptid, struct target
> *** 741,757 ****
> /* First check if there is a LWP with a wait status pending. */
> if (pid == -1)
> {
> ! /* Any LWP will do. */
> lp = iterate_over_lwps (status_callback, NULL);
> if (lp)
> {
> ! if (debug_lin_lwp)
> fprintf_unfiltered (gdb_stdlog,
> ! "Using pending wait status for LWP %ld.\n",
> GET_LWP (lp->ptid));
>
> - status = lp->status;
> - lp->status = 0;
> }
>
> /* But if we don't fine one, we'll have to wait, and check both
> --- 888,907 ----
> /* First check if there is a LWP with a wait status pending. */
> if (pid == -1)
> {
> ! /* Any LWP that's been resumed will do. */
> lp = iterate_over_lwps (status_callback, NULL);
> if (lp)
> {
> ! status = lp->status;
> ! lp->status = 0;
> ! if (debug_lin_lwp && status)
> fprintf_unfiltered (gdb_stdlog,
> ! "Using pending wait status %d for LWP %ld.\n",
> ! WIFSTOPPED (status) ? WSTOPSIG (status) :
> ! WIFSIGNALED (status) ? WTERMSIG (status) :
> ! WEXITSTATUS (status),
> GET_LWP (lp->ptid));
>
> }
>
> /* But if we don't fine one, we'll have to wait, and check both
> *************** lin_lwp_wait (ptid_t ptid, struct target
> *** 772,782 ****
> status = lp->status;
> lp->status = 0;
>
> ! if (debug_lin_lwp)
> ! if (status)
> ! fprintf_unfiltered (gdb_stdlog,
> ! "Using pending wait status for LWP %ld.\n",
> ! GET_LWP (lp->ptid));
>
> /* If we have to wait, take into account whether PID is a cloned
> process or not. And we have to convert it to something that
> --- 922,934 ----
> status = lp->status;
> lp->status = 0;
>
> ! if (debug_lin_lwp && status)
> ! fprintf_unfiltered (gdb_stdlog,
> ! "Using pending wait status %d for LWP %ld.\n",
> ! WIFSTOPPED (status) ? WSTOPSIG (status) :
> ! WIFSIGNALED (status) ? WTERMSIG (status) :
> ! WEXITSTATUS (status),
> ! GET_LWP (lp->ptid));
>
> /* If we have to wait, take into account whether PID is a cloned
> process or not. And we have to convert it to something that
> *************** lin_lwp_wait (ptid_t ptid, struct target
> *** 947,952 ****
> --- 1099,1111 ----
> /* This LWP is stopped now. */
> lp->stopped = 1;
>
> + /* MVS: so save its event_status. */
> + lp->status = status;
> + if (debug_lin_lwp)
> + fprintf_unfiltered (gdb_stdlog,
> + "LLW: Candidate event %d in %ld\n",
> + WSTOPSIG (status), GET_LWP (lp->ptid));
> +
> /* Now stop all other LWP's ... */
> iterate_over_lwps (stop_callback, NULL);
>
> *************** lin_lwp_wait (ptid_t ptid, struct target
> *** 954,964 ****
> longer running. */
> iterate_over_lwps (stop_wait_callback, NULL);
>
> /* If we're not running in "threaded" mode, we'll report the bare
> process id. */
>
> if (WIFSTOPPED (status) && WSTOPSIG (status) == SIGTRAP)
> ! trap_ptid = (threaded ? lp->ptid : pid_to_ptid (GET_LWP (lp->ptid)));
> else
> trap_ptid = null_ptid;
>
> --- 1113,1135 ----
> longer running. */
> iterate_over_lwps (stop_wait_callback, NULL);
>
> + /* MVS Now choose an event thread from among those that
> + have had events. Giving equal priority to all threads
> + that have had events helps prevent starvation. */
> +
> + select_event_lwp (&lp, &status);
> +
> /* If we're not running in "threaded" mode, we'll report the bare
> process id. */
>
> if (WIFSTOPPED (status) && WSTOPSIG (status) == SIGTRAP)
> ! {
> ! trap_ptid = (threaded ? lp->ptid : pid_to_ptid (GET_LWP (lp->ptid)));
> ! if (debug_lin_lwp)
> ! fprintf_unfiltered (gdb_stdlog,
> ! "LLW: trap_ptid is %ld\n",
> ! GET_LWP (trap_ptid));
> ! }
> else
> trap_ptid = null_ptid;
>