[PATCH 5/8] gdb: don't resume vfork parent while child is still running
Andrew Burgess
aburgess@redhat.com
Thu Jun 22 13:17:25 GMT 2023
Like the last few commit, this fixes yet another vfork related issue.
Like the commit titled:
gdb: don't restart vfork parent while waiting for child to finish
which addressed a case in linux-nat where we would try to resume a
vfork parent, this commit addresses a very similar case, but this time
occurring in infrun.c. Just like with that previous commit, this bug
results in the assert:
x86-linux-dregs.c:146: internal-error: x86_linux_update_debug_registers: Assertion `lwp_is_stopped (lwp)' failed.
In this case the issue occurs when target-non-stop is on, but non-stop
is off, and again, schedule-multiple is on. As with the previous
commit, GDB is in follow-fork-mode child. If you have not done so, it
is worth reading the earlier commit as many of the problems leading to
the failure are the same, they just appear in a different part of GDB.
Here are the steps leading to the assertion failure:
1. The user performs a 'next' over a vfork, GDB stop in the vfork
child,
2. As we are planning to follow the child GDB sets the vfork_parent
and vfork_child member variables in the two inferiors, the
thread_waiting_for_vfork_done member is left as nullptr, that member
is only used when GDB is planning to follow the parent inferior,
3. The user does 'continue', our expectation is that the vfork child
should resume, and once that process has exited or execd, GDB should
detach from the vfork parent. As a result of the 'continue' GDB
eventually enters the proceed function,
4. In proceed we selected a ptid_t to resume, because
schedule-multiple is on we select minus_one_ptid (see
user_visible_resume_ptid),
5. As GDB is running in all-stop on top of non-stop mode, in the
proceed function we iterate over all threads that match the resume
ptid, which turns out to be all threads, and call
proceed_resume_thread_checked. One of the threads we iterate over
is the vfork parent thread,
6. As the thread passed to proceed_resume_thread_checked doesn't
match any of the early return conditions, GDB will set the thread
resumed,
7. As we are resuming one thread at a time, this thread is seen by
the lower layers (e.g. linux-nat) as the "event thread", which means
we don't apply any of the checks, e.g. is this thread a
vfork parent, instead we assume that GDB core knows what it's doing,
and linux-nat will resume the thread, we have now incorrectly set
running the vfork parent thread when this thread should be waiting
for the vfork child to complete,
8. Back in the proceed function GDB continues to iterate over all
threads, and now (correctly) resumes the vfork child thread,
8. As the vfork child is still alive the kernel holds the vfork
parent stopped,
9. Eventually the child performs its exec and GDB is sent and EXECD
event. However, because the parent is resumed, as soon as the child
performs its exec the vfork parent also sends a VFORK_DONE event to
GDB,
10. Depending on timing both of these events might seem to arrive in
GDB at the same time. Normally GDB expects to see the EXECD or
EXITED/SIGNALED event from the vfork child before getting the
VFORK_DONE in the parent. We know this because it is as a result of
the EXECD/EXITED/SIGNALED that GDB detaches from the parent (see
handle_vfork_child_exec_or_exit for details). Further the comment
in target/waitstatus.h on TARGET_WAITKIND_VFORK_DONE indicates that
when we remain attached to the child (not the parent) we should not
expect to see a VFORK_DONE,
11. If both events arrive at the same time then GDB will randomly
choose one event to handle first, in some cases this will be the
VFORK_DONE. As described above, upon seeing a VFORK_DONE GDB
expects that (a) the vfork child has finished, however, in this case
this is not completely true, the child has finished, but GDB has not
processed the event associated with the completion yet, and (b) upon
seeing a VFORK_DONE GDB assumes we are remaining attached to the
parent, and so resumes the parent process,
12. GDB now handles the EXECD event. In our case we are detaching
from the parent, so GDB calls target_detach (see
handle_vfork_child_exec_or_exit),
13. While this has been going on the vfork parent is executing, and
might even exit,
14. In linux_nat_target::detach the first thing we do is stop all
threads in the process we're detaching from, the result of the stop
request will be cached on the lwp_info object,
15. In our case the vfork parent has exited though, so when GDB
waits for the thread, instead of a stop due to signal, we instead
get a thread exited status,
16. Later in the detach process we try to resume the threads just
prior to making the ptrace call to actually detach (see
detach_one_lwp), as part of the process to resume a thread we try to
touch some registers within the thread, and before doing this GDB
asserts that the thread is stopped,
17. An exited thread is not classified as stopped, and so the assert
triggers!
Just like with the earlier commit, the fix is to spot the vfork parent
status of the thread, and not resume such threads. Where the earlier
commit fixed this in linux-nat, in this case I think the fix should
live in infrun.c, in proceed_resume_thread_checked. This function
already has a similar check to not resume the vfork parent in the case
where we are planning to follow the vfork parent, I propose adding a
similar case that checks for the vfork parent when we plan to follow
the vfork child.
This new check will mean that at step #6 above GDB doesn't try to
resume the vfork parent thread, which prevents the VFORK_DONE from
ever arriving. Once GDB sees the EXECD/EXITED/SIGNALLED event from
the vfork child GDB will detach from the parent.
There's no test included in this commit. In a subsequent commit I
will expand gdb.base/foll-vfork.exp which is when this bug would be
exposed.
If you do want to reproduce this failure then you will for certainly
need to run the gdb.base/foll-vfork.exp test in a loop as the failures
are all very timing sensitive. I've found that running multiple
copies in parallel makes the failure more likely to appear, I usually
run ~6 copies in parallel and expect to see a failure after within
10mins.
---
gdb/infrun.c | 24 ++++++++++++++++++++----
1 file changed, 20 insertions(+), 4 deletions(-)
diff --git a/gdb/infrun.c b/gdb/infrun.c
index b10b7c59652..3eb3b290001 100644
--- a/gdb/infrun.c
+++ b/gdb/infrun.c
@@ -3299,10 +3299,12 @@ proceed_resume_thread_checked (thread_info *tp)
}
/* When handling a vfork GDB removes all breakpoints from the program
- space in which the vfork is being handled, as such we must take care
- not to resume any thread other than the vfork parent -- resuming the
- vfork parent allows GDB to receive and handle the 'vfork done'
- event. */
+ space in which the vfork is being handled. If we are following the
+ parent then GDB will set the thread_waiting_for_vfork_done member of
+ the parent inferior. In this case we should take care to only resume
+ the vfork parent thread, the kernel will hold this thread suspended
+ until the vfork child has exited or execd, at which point the parent
+ will be resumed and a VFORK_DONE event sent to GDB. */
if (tp->inf->thread_waiting_for_vfork_done != nullptr)
{
if (target_is_non_stop_p ())
@@ -3334,6 +3336,20 @@ proceed_resume_thread_checked (thread_info *tp)
}
}
+ /* When handling a vfork GDB removes all breakpoints from the program
+ space in which the vfork is being handled. If we are following the
+ child then GDB will set vfork_child member of the vfork parent
+ inferior. Once the child has either exited or execd then GDB will
+ detach from the parent process. Until that point GDB should not
+ resume any thread in the parent process. */
+ if (tp->inf->vfork_child != nullptr)
+ {
+ infrun_debug_printf ("[%s] thread is part of a vfork parent, child is %d",
+ tp->ptid.to_string ().c_str (),
+ tp->inf->vfork_child->pid);
+ return;
+ }
+
infrun_debug_printf ("resuming %s",
tp->ptid.to_string ().c_str ());
--
2.25.4
More information about the Gdb-patches
mailing list