This is the mail archive of the gdb-patches@sourceware.org mailing list for the GDB project.



Re: [PATCH] gdb: Fix instability in thread groups test


* Pedro Alves <palves@redhat.com> [2018-08-13 10:51:44 +0100]:

> On 08/10/2018 10:26 PM, Simon Marchi wrote:
> > On 2018-08-10 05:57, Andrew Burgess wrote:
> >> In the test script gdb.mi/list-thread-groups-available.exp we ask GDB
> >> to list all thread groups, and match the output against a
> >> regexp. Occasionally, I would see this test fail.
> >>
> >> The expected output is a list of entries, each entry looking roughly
> >> like this:
> >>
> >>   {id="<DECIMAL>",type="process",description="<STRING>",
> >>    user="<STRING>",cores=["<DECIMAL>","<DECIMAL>",...]}
> >>
> >> All the fields after 'id' and 'type' are optional, and the 'cores'
> >> list can contain 1 or more "<DECIMAL>" entries.
> >>
> >> On my machine (Running Fedora 27, kernel 4.17.3-100.fc27.x86_64)
> >> usually the 'description' is a non-empty string, and the 'cores' list
> >> has at least one entry in it.  But sometimes, very rarely, I'll see an
> >> entry in the process group list where the 'description' is an empty
> >> string, the 'user' is the string "?", and the 'cores' list is empty.
> >> Such an entry looks like this:
> >>
> >>    {id="19863",type="process",description="",user="?",cores=[]}
> >>
> >> I believe that this is caused by the process exiting while GDB is
> >> scanning /proc for process information.  The current code in
> >> gdb/nat/linux-osdata.c is not (I think) resilient against exiting
> >> processes.
> >>
> >> This commit adjusts the regex that matches the 'cores' list so that an
> >> empty list is acceptable.  With this patch in place the test script
> >> gdb.mi/list-thread-groups-available.exp no longer fails for me.
> >>
> >> gdb/testsuite/ChangeLog:
> >>
> >>     * gdb.mi/list-thread-groups-available.exp: Update test regexp.
> >> ---
> >>  gdb/testsuite/ChangeLog                               | 4 ++++
> >>  gdb/testsuite/gdb.mi/list-thread-groups-available.exp | 2 +-
> >>  2 files changed, 5 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/gdb/testsuite/gdb.mi/list-thread-groups-available.exp b/gdb/testsuite/gdb.mi/list-thread-groups-available.exp
> >> index c4dab2a2c34..88f9ee9b63d 100644
> >> --- a/gdb/testsuite/gdb.mi/list-thread-groups-available.exp
> >> +++ b/gdb/testsuite/gdb.mi/list-thread-groups-available.exp
> >> @@ -45,7 +45,7 @@ set id_re "id=\"$decimal\""
> >>  set type_re "type=\"process\""
> >>  set description_re "description=\"$string_re\""
> >>  set user_re "user=\"$string_re\""
> >> -set cores_re "cores=\\\[\"$decimal\"(,\"$decimal\")*\\\]"
> >> +set cores_re "cores=\\\[(\"$decimal\"(,\"$decimal\")*)?\\\]"
> >>
> >>  # List all available processes.
> >>  set process_entry_re "{${id_re},${type_re}(,$description_re)?(,$user_re)?(,$cores_re)?}"
> > 
> > Hi Andrew,
> > 
> > The patch LGTM.  I manually reproduced this case by spawning a process (tail -f /dev/null) and noting its pid.  In linux_xfer_osdata_processes, I added:
> > 
> >   if (pid == <pid>)
> >     sleep (5);
> > 
> > and killed the process during that sleep.
> 
> But shouldn't we make GDB handle this better?  Make the output
> more "atomic" in the sense that we either show a valid complete
> entry, or no entry?  There's an inherent race
> here, since we use multiple /proc accesses to fill up a process
> entry.  If we start fetching process info for a process, and the process
> disappears midway, I'd think it better to discard that process's entry,
> as-if we had not even seen it, i.e., as if we had listed the set of
> processes a tiny moment later.

I agree.

We also need to think about PID reuse.  With multiple accesses to
/proc we might start reading one process, and end up reading a
completely new process that has been given the same PID.

I might be overthinking it, but my first guess at a reliable strategy
would be:

  1. Find each /proc/PID directory.
  2. Read /proc/PID/stat and extract the start time.  Failure to read
     this causes the process to be abandoned.
  3. Read all of the other /proc/PID/XXX files as needed.  Any failure
     results in the process being abandoned.
  4. Reread /proc/PID/stat and confirm the start time hasn't changed.
     A change would indicate a new process having slipped in.

Given the system is still running, we can never be sure that we have
"all" processes, so throwing out anything that looks wrong seems like
the right strategy.

Also in step #4 we know we've just missed a process - something new
has started, but we ignore it.  I think this is fine though given the
racy nature of this sort of thing...
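
As a rough illustration of steps 2 to 4, here is a minimal sketch in
plain C.  It is not the code in gdb/nat/linux-osdata.c; the names
proc_entry, read_start_time, read_cmdline and collect_process are
made up for this example, and /proc/PID/cmdline stands in for "all of
the other /proc/PID/XXX files".  Step 1 (scanning /proc for numeric
directory names) is left out.

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical entry type; only one extra field is collected here.  */
struct proc_entry
{
  pid_t pid;
  char cmdline[256];
};

/* Read the start time of PID (field 22 of /proc/PID/stat).  Return 1
   on success, 0 if the file is gone or cannot be parsed.  */
static int
read_start_time (pid_t pid, unsigned long long *start_time)
{
  char path[64], buf[1024], *p;
  size_t n;
  FILE *f;
  int i;

  snprintf (path, sizeof path, "/proc/%ld/stat", (long) pid);
  f = fopen (path, "r");
  if (f == NULL)
    return 0;
  n = fread (buf, 1, sizeof buf - 1, f);
  fclose (f);
  buf[n] = '\0';

  /* Field 2 (the command name) is parenthesised and may contain
     spaces or ')', so resume parsing after the last ')'.  */
  p = strrchr (buf, ')');
  if (p == NULL)
    return 0;
  p++;

  /* Skip fields 3 to 21, leaving P just before field 22.  */
  for (i = 0; i < 19; i++)
    {
      while (*p == ' ')
        p++;
      if (*p == '\0')
        return 0;
      while (*p != ' ' && *p != '\0')
        p++;
    }
  return sscanf (p, "%llu", start_time) == 1;
}

/* Stand-in for step 3: read the first argument from
   /proc/PID/cmdline.  An empty result is not treated as an error
   (kernel threads have an empty cmdline); a process that vanished
   mid-read is caught by the step 4 recheck instead.  */
static int
read_cmdline (pid_t pid, char *out, size_t size)
{
  char path[64];
  size_t n;
  FILE *f;

  snprintf (path, sizeof path, "/proc/%ld/cmdline", (long) pid);
  f = fopen (path, "r");
  if (f == NULL)
    return 0;
  n = fread (out, 1, size - 1, f);
  fclose (f);
  out[n] = '\0';
  return 1;
}

/* Fill in ENTRY for PID.  Return 1 for a complete, consistent entry,
   0 if the process should be abandoned.  */
static int
collect_process (pid_t pid, struct proc_entry *entry)
{
  unsigned long long before, after;

  if (!read_start_time (pid, &before))                  /* Step 2.  */
    return 0;
  entry->pid = pid;
  if (!read_cmdline (pid, entry->cmdline,
                     sizeof entry->cmdline))            /* Step 3.  */
    return 0;
  if (!read_start_time (pid, &after) || after != before) /* Step 4.  */
    return 0;
  return 1;
}
```

A caller would run collect_process for every numeric directory name
found under /proc and only report the entries for which it returns 1.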

The only question is, could these thoughts be dropped into a bug
report, and the original patch to remove the unstable result applied?
Or maybe the test updated to either PASS or KFAIL?
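
For reference, the effect of the original regexp change can be checked
in isolation.  The sketch below uses POSIX regcomp/regexec rather than
Tcl, so the two patterns are hand translations of the old and new
cores_re, with $decimal assumed to be [0-9]+ and anchors added; the
helper name matches is made up for this example.

```c
#include <regex.h>
#include <stddef.h>

/* POSIX ERE translations of the old and new 'cores' regexps.  */
static const char old_cores_re[]
  = "^cores=\\[\"[0-9]+\"(,\"[0-9]+\")*]$";
static const char new_cores_re[]
  = "^cores=\\[(\"[0-9]+\"(,\"[0-9]+\")*)?]$";

/* Return 1 if TEXT matches the extended regexp PATTERN, else 0.  */
static int
matches (const char *pattern, const char *text)
{
  regex_t re;
  int ok;

  if (regcomp (&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
    return 0;
  ok = regexec (&re, text, 0, NULL, 0) == 0;
  regfree (&re);
  return ok;
}
```

With these definitions, matches (old_cores_re, "cores=[]") is 0 and
matches (new_cores_re, "cores=[]") is 1, while a non-empty list such
as cores=["0","3"] is accepted by both.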

Thanks,
Andrew

