This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [PATCH 2/4] nptl: Handle EPIPE on tst-cancel2



On 23/08/2019 09:47, Florian Weimer wrote:
> * Adhemerval Zanella:
> 
>> On 20/08/2019 12:30, Florian Weimer wrote:
>>> * Adhemerval Zanella:
>>>
>>>> For tst-cancel2.c, if I add a sleep (1) between pthread_create and 
>>>> pthread_cancel you can see this issue more clearly (dump with strace):
>>>>
>>>> [pid  2587] set_robust_list(0x7fffabccf290, 24) = 0
>>>> [pid  2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000) = 100000
>>>> [pid  2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000) = 100000
>>>> [pid  2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000) = 100000
>>>> [pid  2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000) = 100000
>>>> [pid  2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000) = 100000
>>>> [pid  2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000) = 100000
>>>> [pid  2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000) = 100000
>>>> [pid  2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000) = 100000
>>>> [pid  2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000 <unfinished ...>
>>>> [pid  2586] <... nanosleep resumed>0x7ffff0c9e7f0) = 0
>>>>
>>>> ########### Cancellation start to act here, by loading the libgcc to unwinding
>>>> [pid  2586] open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 5
>>>> [pid  2586] fstat(5, {st_mode=S_IFREG|0644, st_size=63776, ...}) = 0
>>>> [pid  2586] mmap(NULL, 63776, PROT_READ, MAP_PRIVATE, 5, 0) = 0x7fffabf00000
>>>> [pid  2586] close(5)                    = 0
>>>> [pid  2586] open("/lib64/libgcc_s.so.1", O_RDONLY|O_CLOEXEC) = 5
>>>> [pid  2586] read(5, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0\25\0\1\0\0\0\340+\0\0\0\0\0\0"..., 832) = 832
>>>> [pid  2586] fstat(5, {st_mode=S_IFREG|0755, st_size=133696, ...}) = 0
>>>> [pid  2586] mmap(NULL, 197688, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 5, 0) = 0x7fffab480000
>>>> [pid  2586] mmap(0x7fffab4a0000, 131072, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 5, 0x10000) = 0x7fffab4a0000
>>>> [pid  2586] close(5)                    = 0
>>>> [pid  2586] mprotect(0x7fffab4a0000, 65536, PROT_READ) = 0
>>>> [pid  2586] munmap(0x7fffabf00000, 63776) = 0
>>>> [pid  2586] tgkill(2586, 2587, SIGRTMIN) = 0
>>>> [pid  2586] close(3)                    = 0
>>>> [pid  2586] futex(0x7fffabccf280, FUTEX_WAIT, 2587, NULL <unfinished ...>
>>>>
>>>> ########### Write returns with broken PIPE and __pthread_disable_asynccancel is called
>>>> [pid  2587] <... write resumed>)        = -1 EPIPE (Broken pipe)
>>>> [pid  2587] --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=2586, si_uid=61684} ---
>>>> [pid  2587] --- SIGRTMIN {si_signo=SIGRTMIN, si_code=SI_TKILL, si_pid=2586, si_uid=61684} ---
>>>> [pid  2587] futex(0x7fffab4b0224, FUTEX_WAKE_PRIVATE, 2147483647) = 0
>>>>
>>>> ########### No side-effects reported back to program
>>>> [pid  2587] madvise(0x7fffab4c0000, 8257536, MADV_DONTNEED) = 0
>>>> [pid  2587] exit(0)                     = ?
>>>>
>>>> With BZ#12683 fix the cancellation is not acted upon and the testcase then fails
>>>> depending whether the write is interrupted or not by the cancellation signal.
>>>
>>> Hmm.  Which cancellation implementation is this?  At which point in the
>>> trace do we start unwinding?  I'm surprised that strace reports the
>>> EPIPE before the SIGPIPE, but maybe that's just a kernel race.  My
>>> expectation is that the current code unwinds after the system call
>>> returns with the EPIPE error, never returning it to the application.  I
>>> think this is the right behavior for the write system call.
>>
>> This is current implement, more specifically glibc 2.17, CentOS 7.6 on
>> powerpc64le.  From the trace :
>>
>>>> [pid  2586] tgkill(2586, 2587, SIGRTMIN) = 0
>>
>> This where pthread_cancel sends the SIGCANCEL signal to thread.
>>
>>>> [pid  2586] close(3)                    = 0
>>
>> This is the
>>
>>     /* This will cause the write in the child to return.  */
>>     close (fd[0]);
>>
>> In tst-cancel2.c.
>>
>> And finally:
>>
>>>> [pid  2587] <... write resumed>)        = -1 EPIPE (Broken pipe)
>>>> [pid  2587] --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=2586, si_uid=61684} ---
>>
>> SIGPIPE is receives, making the write fail with EPIPE and then
>>
>>>> [pid  2587] --- SIGRTMIN {si_signo=SIGRTMIN, si_code=SI_TKILL, si_pid=2586, si_uid=61684} ---
>>
>> sigcancel_handler is issued.  And the implementation *does* unwind after the
>> syscall is done, the problem is it ignores -1/EPIPE (and it is BZ#12683).
> 
> I looked at this some more.  It seems that in theory, we could also see
> EPIPE in the old implementation or at least observe execution of a
> signal handler, and the asynchronous cancel hides that.
> 
> In the new implementation, if there is a signal handler and a signal for
> it is delivered before SIGCANCEL (even if the signal was sent *after*
> pthread_cancel, from the same thread), I do not think there is any
> chance whatsoever that we can hide the behavior difference.  After all,
> SIGCANCEL may not trigger an asynchronous cancellation from the signal
> handler.

Yes, although SIGCANCEL would be set as SA_RESTART, it is explicit disabled
by the signal handle by removing it from ucontext_t signal mask.
> 
> System calls that are cancellation points appear to fall into these
> categories:
> 
> (A) Those that do not perform any resource allocation (write, read).
> 
> (B) Those that perform resource allocation, but the allocation can be
>     easily reverted (openat, accept4).
> 
> (C) Those that perform resource allocation, but the allocation is
>     difficult to undo (recvmsg with descriptor passing).
> 
> (D) close.
> 
> For (A), I think POSIX allows sufficient wiggle room so that we can
> exhibit a partial side effect and still act on the cancellation (without
> reporting a partial write first to the application).

I agree, the side-effects for such syscalls should not lead to possible
resources leakage.

> 
> For (B), maybe we should undo the resource allocation and then proceed
> to act on the cancellation.

Besides this requires a lot of more complexity by mapping what kind of
resources each syscall would require to free and when to free based on
returned codes, it would hide it from the application. My view we just
need to return that the syscall has visible side-effects that may result
in system resources leaks and let the application deal with.

> 
> For (C), the complexity may not be worth it.
> 
> For (A), (B), (C), we can act on the cancellation in the error case,
> after we observe the cancellation flag in the signal handler trampoline.
> (I think dropping the EINTR restriction from there achieves that.)  If
> we do that, we do not need to change the test case.

I don't see a real net gain in adding such complexity.  It is not on the 
error  case, but rather when visible side-effects has happened and we should
alert the program to act upon that. The tst-cancel2 case is an error case
because of the EPIPE semantic, but the new tst-cancel28 (based on the leak
example from BZ#12683) is a case where the open call returns a valid file
descriptor that should be freed.

If glibc starts to add extra semantics for cancellation, we will have a standard
extensions since now 'open', for instance, would close the file descriptor if 
the cancellation signal is delivered after the OS allocates the file descriptor 
resource.  IMHO this is clearly what the program itself should be handling, 
because it might the case where it is prepared to close itself the file descriptor
in a different thread.  This would lead to possible another incompatibilities 
regarding libc system implementation, besides adding more complexity.

I would like to make cancellation racy-free with minimal semantic change as 
possible.

> 
> (D) is very special.  Ideally, we would specify what happens with the
> descriptor if the close call is canceled.  POSIX does not even specify
> what the state of file descriptor is after an EINTR error, so it doesn't
> say what happens with cancellation, either.  Maybe we have to leave that
> undefined.

Indeed even after the POSIX close language clarification [1], the specification 
is messy (it really does not help that some system does have interruptible
close implementation that might return -1).

However my understanding is we can assume POSIX_CLOSE_RESTART 0 for Linux
since it shouldn't return -1/EINTR in any case (and I don't know if/when
there are kernels that actually does it) [2].  So assuming the nptl is Linux 
specific, we can assume that, although close is specified as a cancellable
entrypoint, it won't happen in practice.

[1] http://austingroupbugs.net/view.php?id=529
[2] https://lwn.net/Articles/576478/


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]