[RFA] Factorize killing the children in linux-ptrace.c, and fix a 'process leak'.

Fri Dec 14 23:39:00 GMT 2018

On 2018-11-11 12:24, Philippe Waroquiers wrote:
> On Sat, 2018-11-10 at 18:29 +0000, Pedro Alves wrote:
>> There's some long and ugly history around PTRACE_KILL vs SIGKILL.
>> See gdb/linux-nat.c kill_wait_one_lwp and kill_one_lwp, and the 
>> comments
>> within those functions.  I don't know whether the kernels those were 
>> for
>> are relevant any more.  Likely using git blame and finding the 
>> corresponding
>> patch posts would shed some more light.  Whatever we do, it'd be nice 
>> to
>> make gdb/linux-nat.c use the unified code as well.
> 
> I have done some digging in the source code and searched some
> more info using git blame.
> => I found 7 different linux logics related to kill(SIGKILL)
> and ptrace(PTRACE_KILL).
> I found only one identification of kernels giving problems.
> ChangeLogs/commit msg were not really giving much
> background information : these are telling what is being changed,
> not why it is being changed.
> linux-low.c has the most interesting comments explaining
> the kill related problems.
> 
> Below is some updated text with the found info (sorry, you
> will have to re-read some parts already in the RFA).
> 
> I have to admit that a 'linux-ptrace.c only factorization'
> as suggested in the RFA (at least as a first step, till we have
> validated with it that a simpler kill logic is ok on various platforms)
> is somewhat less frightening me than reducing all 7 different
> logics to a single one.
> (there is a table at the end of this mail that summarises all
> 7 logics).
> 
> I must say that all this is killing me :).
> Maybe I would feel better if I would just have added my own
> local new 8th kill logic where I need it :).
> 
> Philippe
> 
> 
> nat/linux-ptrace.c has 3 different logics to kill a child process.
> So, this patch factorizes killing a child in the function kill_child.
> 
> The 3 different logics are:
> * linux_ptrace_test_ret_to_nx is calling both kill (child, SIGKILL)
> Â  and ptrace (PTRACE_KILL, child, ...), and then is calling once
> Â  waitpid.
> * linux_check_ptrace_features is calling ptrace (PTRACE_KILL, child, 
> ...)
> Â  + my_waitpid in a loop, as long as the waitpid status was WIFSTOPPED.
> * linux_test_for_tracefork is calling once ptrace (PTRACE_KILL, child, 
> ...)
> Â  + my_waitpid.
> 
> The linux ptrace documentation indicates that PTRACE_KILL is 
> deprecated,
> and tells to not use it, as it might return success but not kill the 
> tracee.
> The documentation indicates to send SIGKILL directly.
> 
> I suspect that linux_ptrace_test_ret_to_nx calls both kill and ptrace 
> just
> to be sure ...
> git blame indicates that the call to kill(SIGKILL) was added in 
> addition to the
> pre-existing ptrace PTRACE_KILL. The commit log also indicates to 
> ignore
> PTRACE_KILL errors. But there is no explanation why adding the SIGKILL 
> call
> was needed, and why PTRACE_KILL was kept.

Not sure if you got that far in your archeology, but here is the 
discussion about it:

   https://sourceware.org/ml/archer/2011-q1/msg00102.html

which is a reply to:

   https://sourceware.org/ml/archer/2010-q3/msg00209.html

Realistically, we could probably just get rid of using PTRACE_KILL.

Please push the patch.  I think it's totally fine to do one small step 
at the time.  As you said, it's already a significant improvement.

Simon