[PATCH v4 02/21] nptl: Fix Race conditions in pthread cancellation [BZ#12683]

Wed Apr 8 14:13:13 GMT 2020

On 07/04/2020 15:24, Zack Weinberg wrote:
> On Fri, Apr 3, 2020 at 4:32 PM Adhemerval Zanella via Libc-alpha
> <libc-alpha@sourceware.org> wrote:
>>
>> This patch is the initial fix for race conditions in NPTL cancellation
>> code by redefining how cancellable syscalls are defined and handled.
>> The current buggy approach is to enable asynchronous cancellation
>> before making the syscall and restore the previous cancellation
>> type once the syscall returns.
> 
> I want to see this bug fixed.  Unfortunately I don't know the guts of
> NPTL well enough to review your patches completely, but here are a few
> things I noticed:

Thanks.

> 
>> As a side note regarding SIGCANCEL and SIGTIMER being the the same,
>> it should not impact timer_create functionality.  It arranges for
>> SIGCANCEL/SIGTIMER to be sent to the internal helper thread, which
>> in turn check if the si.si_code is SI_TIMER and call pthread_exit
>> otherwise (sysdeps/unix/sysv/linux/timer_routines.c:129).
> 
> Can we be absolutely certain that SIGCANCEL/SIGTIMER will always be
> sent to a specific thread and not to a process?

With this patch pthread_cancel calls __pthread_kill_internal and
in turn it will use tgkill with process's pid and struct pthread tid.

And Linux timer_create does use the helper thread tid as the target one
(line 148).

> 
>> +  /* Add SIGCANCEL on ignored sigmask to avoid the handler to be called
>> +     again.  */
>> +  ucontext_block_sigcancel (ctx);
>> +
>> +  /* Check if asynchronous cancellation mode is set or if interrupted
>> +     instruction pointer falls within the cancellable syscall bridge.  For
>> +     interruptable syscalls that might generate external side-effects (partial
>> +     reads or writes, for instance), the kernel will set the IP to after
>> +     '__syscall_cancel_arch_end', thus disabling the cancellation and allowing
>> +     the process to handle such conditions.  */
>> +  if (self->canceltype == PTHREAD_CANCEL_ASYNCHRONOUS
>> +      || cancellation_pc_check (ctx))
>> +    __do_cancel (PTHREAD_CANCELED);
> 
> Shouldn't this check happen _before_ we block further SIGCANCELs?
> If cancellation_pc_check fails, because the signal was delivered on
> exit from a system call that has had side effects, don't we need to
> be able to receive future SIGCANCELs in order for the next cancellation
> point to trigger?

For this case the cancellation will happen on next cancellation entrypoint
since pthread_cancel has already set the CANCELED_BIT on the target 
struct pthread cancelhandling member and it will be checked on every 
cancellation entrypoint prior issuing the syscall.

> 
>>    /* Install the cancellation signal handler.  If for some reason we
>>       cannot install the handler we do not abort.  Maybe we should, but
>>       it is only asynchronous cancellation which is affected.  */
> 
> 1) I think the third sentence of this comment has always been wrong.
> 2) Perhaps, if we cannot install a handler for SIGCANCEL, we should set
>    a global flag which causes all calls to pthread_cancel to fail?
> 

The cases I see it might fail are if kernel or some filter (seccomp, bpf,
etc.) explicit disables it. And I am not sure if is up to glibc to actually
handle it, since it would require to:

  1. Set a global flag stating that *asynchronous* thread cancellation is
     not supported at runtime, as you noted.

  2. The pthread_setcanceltype should the check this flag and fail for 
     PTHREAD_CANCEL_ASYNCHRONOUS.

  3. We will need to adjust the code on thread start (START_THREAD_DEFN)
     which set PTHREAD_CANCEL_ASYNCHRONOUS (and as Florian noted, this
     code might be unnecessary).

  4. We will also need to adjust on how to handle
     pthread_cleanup_push_defer_np since __pthread_register_cancel_defer
     also set and reset cancellation type to PTHREAD_CANCEL_ASYNCHRONOUS.

So I am not sure which would be better: 

  1. Do nothing (currently).

  2. Abort in case of an handler installation error.

  3. Disable asynchronous cancellation at runtime.

In any case I think we should discuss in in another thread.

>> +    /* Install the handle to change the threads' uid/gid.  */
> 
> Typo: handle -> handler (it was wrong before, but you may as well fix it)

Ack.

> 
>> +    struct sigaction sa;
>> +    __sigemptyset (&sa.sa_mask);
>> +    sa.sa_sigaction = sighandler_setxid;
>> +    sa.sa_flags = SA_SIGINFO | SA_RESTART;
>> +    __libc_sigaction (SIGSETXID, &sa, NULL);
> 
> Unrelated preexisting bug, but I think we probably _should_ crash the
> whole process if the SIGSETXID handler cannot be installed,
> particularly when __libc_enable_secure is true.

I tend to agree, but is there a real scenario where sigaction syscall
might fail here?

> 
>> +  /* Avoid signaling when thread attempts cancel itself (pthread_kill
>> +     is expensive).  */
>> +  if (pd == THREAD_SELF)
>> +    {
>> +      if (pd->cancelstate == PTHREAD_CANCEL_ENABLE
>> +         && pd->canceltype == PTHREAD_CANCEL_ASYNCHRONOUS)
>> +       __pthread_exit (PTHREAD_CANCELED);
>> +      return 0;
> 
> This works because __pthread_exit is actually a tiny wrapper around
> __do_cancel, but I think the logic would be easier to understand,
> here, if it called __do_cancel.

The idea is eventually just remove __do_cancel and use __pthread_exit
instead internally.  I didn't do it on this patch because it would 
require first to move __pthread_exit to libc.so.