This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH] posix: Do not use WNOHANG in waitpid call for Linux posix_spawn
On 23/10/2017 08:38, Szabolcs Nagy wrote:
> On 23/10/17 07:32, Florian Weimer wrote:
>> On 10/22/2017 10:51 PM, Adhemerval Zanella wrote:
>>> As shown in some buildbot issues on aarch64 and powerpc, calling
>>> clone (VFORK) and waitpid (WNOHANG) does not guarantee the child
>>> is ready to be collected. This patch changes the waitpid flags
>>> back to 0, as they were before the fe05e1cb6d64 fix.
>>
>> I see it on x86-64, too. It does look like a kernel bug.
>>
>>> This change can lead to scenario 4.3 described in the commit,
>>> where the waitpid call can hang indefinitely. However, this is
>>> also a very unlikely and undefined situation, where the caller
>>> is trying to terminate a pid before posix_spawn returns and the
>>> pid reuse race is triggered. I don't see how to correctly handle
>>> this specific situation within posix_spawn.
>>
>> Agreed. I wish we could do better here, but it seems we can't.
>>
>
> musl writes to a close-on-exec pipe in the child on error;
> reading it in the parent tells whether the child died before
> exec or not (so the waitpid can be made precise).
>
> VFORK is not precise (I think under ptrace the parent can
> continue before the child execs), so using a close-on-exec
> fd is a better way to sync with exec.
>
That was my first approach, and Rasmus Villemoes proposed using
CLONE_VFORK to improve the error reporting in 4b4d4056bb1 [1].
Rich raised this very issue back then [2], and Andreas replied
that with recent, supported kernels this is not an issue [3]
(at least on CentOS 6 it seems to work properly).
Also, the CLONE_VFORK approach has the advantage of using fewer
resources and leaving fewer chances for the spawn to fail (due to EMFILE).
[1] https://sourceware.org/ml/libc-alpha/2016-09/msg00360.html
[2] https://sourceware.org/ml/libc-alpha/2016-09/msg00559.html
[3] https://sourceware.org/ml/libc-alpha/2016-09/msg00561.html
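
For readers following the thread, the close-on-exec pipe scheme that
Szabolcs describes can be sketched roughly as below. This is only an
illustration under my own assumptions, not the musl or glibc code: the
name spawn_with_pipe is hypothetical, it uses plain fork instead of
clone (VFORK), and it sets FD_CLOEXEC with fcntl after pipe, which is
not atomic in a multithreaded program. It also shows where the extra
descriptors (and hence the EMFILE risk) come from.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper, not glibc or musl code: start argv[0] and report
   an exec failure through a close-on-exec pipe.  Returns 0 and stores
   the child pid on success, or an errno value on failure.  */
static int
spawn_with_pipe (pid_t *pid_out, char *const argv[])
{
  int fds[2];
  if (pipe (fds) != 0)
    return errno;               /* For example EMFILE.  */

  /* Mark both ends close-on-exec so a successful exec closes them.  */
  if (fcntl (fds[0], F_SETFD, FD_CLOEXEC) != 0
      || fcntl (fds[1], F_SETFD, FD_CLOEXEC) != 0)
    {
      int err = errno;
      close (fds[0]);
      close (fds[1]);
      return err;
    }

  pid_t pid = fork ();
  if (pid < 0)
    {
      int err = errno;
      close (fds[0]);
      close (fds[1]);
      return err;
    }

  if (pid == 0)
    {
      /* Child: if exec succeeds the write end is closed and the parent
         reads EOF.  If it fails, send errno back before exiting.  */
      close (fds[0]);
      execvp (argv[0], argv);
      int err = errno;
      ssize_t w;
      do
        w = write (fds[1], &err, sizeof err);
      while (w < 0 && errno == EINTR);
      _exit (127);
    }

  /* Parent: close our write end, otherwise EOF is never seen.  */
  close (fds[1]);
  int child_err = 0;
  ssize_t n;
  do
    n = read (fds[0], &child_err, sizeof child_err);
  while (n < 0 && errno == EINTR);
  close (fds[0]);

  if (n == (ssize_t) sizeof child_err)
    {
      /* The child failed before exec and is known to be exiting, so a
         blocking waitpid (flags 0) reaps exactly that child.  */
      while (waitpid (pid, NULL, 0) < 0 && errno == EINTR)
        ;
      return child_err;
    }

  /* EOF: exec succeeded and the child is left for the caller to reap.  */
  *pid_out = pid;
  return 0;
}
```

The parent only calls waitpid when it has positive evidence that the
child failed before exec, which is what makes the wait precise. The
cost is two extra descriptors per spawn.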