[PATCH][gdb] Fix hang after ext sigkill

Wed Mar 25 10:29:54 GMT 2020

On 24-03-2020 16:35, Simon Marchi wrote:
> Hi Tom,
> 
> The test fails for me more often than not.  I've attached a gdb.log showing one such
> failure.  There seems to be a problem matching the output of the last "continue".
> 

Hi Simon,

I've managed to reproduce that, by running the test-case in parallel
with stress -c 5.

> This is what I see when I reproduce the case by hand:
> 
> (gdb) c
> Continuing.
> Couldn't get registers: No such process.
> Couldn't get registers: No such process.
> (gdb) [Thread 0x7ffff7d99700 (LWP 514079) exited]
> 
> Program terminated with signal SIGKILL, Killed.
> The program no longer exists.
> 

I've modified the test-case to check for the amount of "Couldn't get
registers: No such process." between the continue and the following gdb
prompt, and handle that appropriately. Hopefully this will fix the FAILs
you're seeing.

> I didn't really understand why we saw a prompt coming back before the other messages,
> so I looked into it a bit and this is what I think happens:
> 
> 1. While we are stopped at the prompt, the linux-nat event handler is unregistered from the event loop
> 2. Inferior gets SIGKILL'ed, so GDB gets SIGCHLD'ed, that posts an event to the event pipe, but since it's
>    not registered in the event loop, nothing happens
> 3. User does continue, the linux-nat event handler gets registered with the event loop.  Then, the continue
>    fails (No such process), which brings us back at the prompt.  However, the linux-nat event handler stays
>    registered with the event loop.
> 4. When we come back to the event loop, we process the event for the SIGKILL, which makes GDB print the
>    thread exit message and "Program terminated" message.
> 
> Normally, after the "continue" fails, I don't think we would want to leave the linux-nat
> handler registered with the event loop: if it was not registered before, why should it be
> registered after?  However, if it wasn't left there, we wouldn't see the messages saying
> the program has terminated, so that wouldn't be good either.
> 
> Maybe there's a better way to handle it?
> 

Thanks for the investigation. Unfortunately I'm not able to comment on this.

> However, I still think it would be a good idea to merge a patch like yours, it's already a
> step forward (especially since it fixes a regression).
> 
>> diff --git a/gdb/testsuite/lib/gdb.exp b/gdb/testsuite/lib/gdb.exp
>> index 81518b9646..745694df2d 100644
>> --- a/gdb/testsuite/lib/gdb.exp
>> +++ b/gdb/testsuite/lib/gdb.exp
>> @@ -571,7 +571,7 @@ proc runto { function args } {
>>  	    }
>>  	    return 1
>>  	}
>> -	-re "Breakpoint \[0-9\]*, \[0-9xa-f\]* in .*$gdb_prompt $" { 
>> +	-re "\[Bb\]reakpoint \[0-9\]*, \[0-9xa-f\]* in .*$gdb_prompt $" {
>>  	    if { $print_pass } {
>>  		pass $test_name
>>  	    }
>> diff --git a/gdb/thread.c b/gdb/thread.c
>> index 54b59e2244..9876ca3c76 100644
>> --- a/gdb/thread.c
>> +++ b/gdb/thread.c
>> @@ -1444,6 +1444,8 @@ scoped_restore_current_thread::restore ()
>>  
>>  scoped_restore_current_thread::~scoped_restore_current_thread ()
>>  {
>> +  if (m_inf == NULL)
>> +    return;
>>    if (!m_dont_restore)
>>      {
>>        try
>> @@ -1488,7 +1490,17 @@ scoped_restore_current_thread::scoped_restore_current_thread ()
>>        else
>>  	frame = NULL;
>>  
>> -      m_selected_frame_id = get_frame_id (frame);
>> +      try
>> +       {
>> +	 m_selected_frame_id = get_frame_id (frame);
>> +       }
>> +      catch (const gdb_exception &ex)
>> +       {
>> +	 m_inf = NULL;
>> +	 m_selected_frame_id = null_frame_id;
>> +	 m_selected_frame_level = -1;
>> +	 return;
>> +       }
> 
> The indentation is a bit off here.
> 
> Instead of clearing everything, I think we should just set m_selected_frame_id to
> null_frame_id and m_selected_frame_level to -1.  m_inf and m_thread can still be
> set.  This way, the right inferior will be restored, at least, I think this is
> desirable.  scoped_restore_current_thread::restore handles well the case where the
> thread to restore has exited.  The thread_info object is refcounted for this exact
> use case, where the thread would get deleted while
> 
> In other words:
> 
>       try
> 	{
> 	  m_selected_frame_id = get_frame_id (frame);
> 	  m_selected_frame_level = frame_relative_level (frame);
> 	}
>       catch (const gdb_exception &ex)
> 	{
> 	  m_selected_frame_id = null_frame_id;
> 	  m_selected_frame_level = -1;
> 	}

Yep, that works.

Here's the updated patch.

Thanks,
- Tom
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-gdb-Fix-hang-after-ext-sigkill.patch
Type: text/x-patch
Size: 7593 bytes
Desc: not available
URL: <http://sourceware.org/pipermail/gdb-patches/attachments/20200325/448d1bc3/attachment-0001.bin>