This is the mail archive of the
mailing list for the GDB project.
Re: [RFA] Factorize killing the children in linux-ptrace.c, and fix a 'process leak'.
On 11/09/2018 09:44 PM, Simon Marchi wrote:
> On 2018-11-03 4:19 p.m., Philippe Waroquiers wrote:
>> Running the gdb testsuite under Valgrind started to fail after 100+ tests,
>> due to out of memory caused by lingering processes.
>> The lingering processes are caused by the combination
>> of a limitation in Valgrind signal handling when using PTRACE_TRACEME
>> and a (minor) bug in GDB.
>> The Valgrind limitation is : when a process is ptraced and raises
>> a signal, Valgrind will replace the raised signal by SIGSTOP as other
>> signals are masked by Valgrind when executing a system call.
>> Removing this limitation seems far to be trivial, valgrind signal
>> handling is very complex.
>> Due to this valgrind limitation, GDB linux_ptrace_test_ret_to_nx gets
>> a SIGSTOP signal instead of the expected SIGTRAP or SIGSEGV.
>> In such a case, linux_ptrace_test_ret_to_nx does an early return, but
>> does not kill the child (running under valgrind), child stays in a STOP-ped
>> These lingering processes then eat the available system memory,
>> till launching a new process starts to fail.
>> This patch fixes the GDB minor bug by killing the child in case
>> linux_ptrace_test_ret_to_nx does an early return.
>> nat/linux-ptrace.c has 3 different logics to kill a child process.
>> So, this patch factorizes killing a child in the function kill_child.
>> The 3 different logics are:
>> * linux_ptrace_test_ret_to_nx is calling both kill (child, SIGKILL)
>> and ptrace (PTRACE_KILL, child, ...), and then is calling once
>> * linux_check_ptrace_features is calling ptrace (PTRACE_KILL, child, ...)
>> + my_waitpid in a loop, as long as the waitpid status was WIFSTOPPED.
>> * linux_test_for_tracefork is calling once ptrace (PTRACE_KILL, child, ...)
>> + my_waitpid.
>> The linux ptrace documentation indicates that PTRACE_KILL is deprecated,
>> and tells to not use it, as it might return success but not kill the tracee.
>> The documentation indicates to send SIGKILL directly.
>> I suspect that linux_ptrace_test_ret_to_nx calls both kill and ptrace just
>> to be sure ...
>> I suspect that linux_check_ptrace_features calls ptrace in a loop
>> to bypass the PTRACE_KILL limitation.
>> And it looks like linux_test_for_tracefork does not handle the PTRACE_KILL
>> Also, 2 of the 3 logics are calling my_waitpid, which seems better,
>> as this is protecting the waitpid syscall against EINTR.
>> So, the logic in kill_child is just using kill (child, SIGKILL)
>> + my_waitpid, and then does a few verifications to see everything worked
>> accordingly to the plan.
>> Tested on Debian/x86_64.
>> 2018-11-03 Philippe Waroquiers <email@example.com>
>> * nat/linux-ptrace.c (kill_child): New function.
>> (linux_ptrace_test_ret_to_nx): Use kill_child instead of local code.
>> Add a call to kill_child in case of early return after fork.
>> (linux_check_ptrace_features): Use kill_child instead of local code.
>> (linux_test_for_tracefork): Likewise.
> This makes sense to me, but I'd like to invoke pedro(), see if he sees
> anything wrong with it.
There's some long and ugly history around PTRACE_KILL vs SIGKILL.
See gdb/linux-nat.c kill_wait_one_lwp and kill_one_lwp, and the comments
within those functions. I don't know whether the kernels those were for
are relevant any more. Likely using git blame and finding the corresponding
patch posts would shed some more light. Whatever we do, it'd be nice to
make gdb/linux-nat.c use the unified code as well.