RFC: pthread pid accessor (BZ# 27880)

Tue Jun 1 14:09:51 GMT 2021

On 31/05/2021 13:51, Florian Weimer wrote:
> * Adhemerval Zanella:
> 
>> This seems trickier than it looks; some issues we should consider 
>> first:
>>
>>   1. What should we do with detached threads? As for pthread_kill, issuing 
>>      a pthread_gettid_np might use an invalid handle (since the pthread_t 
>>      identifier might be reused).  The only solution I have is to define 
>>      it as undefined behavior. This is not great, but properly supporting it
>>      would require keeping track of all possible pthread_t identifiers 
>>      (we already keep the glibc provided stacks, dl_stack_cache, so it 
>>      would be a matter of including the user provided ones in the list as 
>>      special entries).
> 
> Detached threads are fine as long as the thread is still running.  This
> is something the application can ensure using synchronization.
> 
> There are other interfaces with this property, including pthread_kill.

AFAIK pthread_kill on threads created detached, or on threads that have
called pthread_detach, is not really defined (the thread ID lifetime ends
when the detach is issued).  We even have a bug report for this,
BZ #19193 [1].

But currently calling pthread_kill on such a thread is already undefined:
it accesses the internal tid field without any extra check.  Even using
INVALID_NOT_TERMINATED_TD_P/INVALID_TD_P won't really improve things, since
it might still access invalid memory if the thread cache was empty and the
resulting 'struct thread' was deallocated.

[1] https://sourceware.org/bugzilla/show_bug.cgi?id=19193
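
To make the "application ensures it through synchronization" argument
concrete, here is a minimal sketch (not glibc code) of the only detached
case that stays defined: the creator keeps the detached thread blocked
until it has finished using the pthread_t.  The names 'struct gate',
'detached_worker' and 'probe_detached_while_running' are hypothetical and
exist only for this illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <semaphore.h>
#include <signal.h>
#include <stdlib.h>

/* Hypothetical helper state: the detached thread blocks on 'release'
   until the creator is done using its pthread_t.  */
struct gate
{
  sem_t release;
};

static void *
detached_worker (void *arg)
{
  struct gate *g = arg;
  assert (g != NULL);
  /* Stay alive until the creator says the handle is no longer needed.  */
  while (sem_wait (&g->release) != 0)
    continue;                   /* Retry if interrupted (EINTR).  */
  sem_destroy (&g->release);
  free (g);
  return NULL;
}

/* Return the pthread_kill result obtained while the detached thread is
   guaranteed to be running, or -1 if the setup failed.  */
int
probe_detached_while_running (void)
{
  struct gate *g = malloc (sizeof *g);
  if (g == NULL)
    return -1;
  if (sem_init (&g->release, 0, 0) != 0)
    {
      free (g);
      return -1;
    }

  pthread_attr_t attr;
  pthread_t thr;
  pthread_attr_init (&attr);
  pthread_attr_setdetachstate (&attr, PTHREAD_CREATE_DETACHED);
  int rc = pthread_create (&thr, &attr, detached_worker, g);
  pthread_attr_destroy (&attr);
  if (rc != 0)
    {
      sem_destroy (&g->release);
      free (g);
      return -1;
    }

  /* The thread cannot exit before the sem_post below, so 'thr' is still
     a live handle at this point.  */
  int result = pthread_kill (thr, 0);

  /* Last use of 'thr': once the thread is released, the handle must be
     treated as dead, since the thread may exit at any time.  */
  sem_post (&g->release);
  return result;
}
```

Any use of 'thr' after the sem_post is exactly the undefined case
discussed above, because nothing keeps the thread (or its descriptor)
alive any longer.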

> 
>>   2. I think that once we provide this API, developers will start to use it
>>      to query if a thread is alive and I am not sure if this is really the 
>>      proper API for this. This is the same issue as 1.
> 
> They probably use pthread_kill with a zero signal for that today.
> Here's an example for httpd:
> 
> |         /* deal with a rare timing window which affects waking up the
> |          * listener thread...  if the signal sent to the listener thread
> |          * is delivered between the time it verifies that the
> |          * listener_may_exit flag is clear and the time it enters a
> |          * blocking syscall, the signal didn't do any good...  work around
> |          * that by sleeping briefly and sending it again
> |          */
> | 
> |         iter = 0;
> |         while (iter < 10 &&
> | #ifdef HAVE_PTHREAD_KILL
> |                pthread_kill(*listener_os_thread, 0)
> | #else
> |                kill(ap_my_pid, 0)
> | #endif
> |                == 0) {
> |             /* listener not dead yet */

Right, I think the new interface might work for non-detached threads
that are not yet joined.

> 
>>   3. How do we handle the concurrent access between pthread_join and
>>   pthread_gettid_np? Once a pthread_join is issued, the pthread_t
>>   identifier might be reused and accessing it should be
>>   invalid. pthread_join first synchronizes using 'joinid' to avoid
>>   concurrent pthread_join calls and then waits for the kernel to signal
>>   on 'tid' that the thread has finished.  The straightforward
>>   'pthread_gettid_np' implementation would do an atomic load on tid,
>>   however it might read a transient value between the pthread_join
>>   'joinid' setup and the futex wait.  I am not sure how to handle it
>>   correctly.
> 
> The application must ensure through synchronization that the lifetime of
> the thread handle has not ended yet.  Concurrent calls with pthread_join
> are fine as long as the thread has not exited yet (same as for
> pthread_kill).
> 
> The question is what we should do after thread exit, but with a joinable
> thread.  I think for that we should return the original TID the kernel
> assigned (even though it could not be reused).  That would strongly
> discourage the unsafe probing behavior because the function cannot be
> used to probe if the thread is still running.

Do you mean between the thread cancel/exit and the kernel resetting the
struct thread 'tid' field?  The main problem is that the thread might be
detached in between; that's why pthread_join synchronizes first using the
'joinid' field.

But I think there is not much we can do besides a simple atomic
load on struct thread 'tid'.  Trying to synchronize with 'joinid'
won't really help, since 'pthread_detach' can't fail (not with
an intermittent error).  We might try to use either a busy wait or
a lock in pthread_detach and pthread_join over 'joinid', but I don't
think this really solves much without introducing other potential
latency issues.
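
As a rough model of the "simple atomic load" option (a sketch only, not
the glibc implementation), assume a stand-in descriptor whose 'tid' is
written at thread start and reset to zero by the kernel at exit.  The
type 'struct thread_sketch', the function names, and the acquire/release
ordering are all assumptions made for illustration; the real descriptor
layout and memory ordering may differ.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical stand-in for the internal thread descriptor.  */
struct thread_sketch
{
  _Atomic pid_t tid;
};

/* Models what the accessor would do: a single atomic load, with no
   attempt to synchronize with 'joinid'.  The caller must guarantee that
   the descriptor is still alive.  */
pid_t
sketch_gettid (struct thread_sketch *t)
{
  assert (t != NULL);
  return atomic_load_explicit (&t->tid, memory_order_acquire);
}

/* Models the kernel resetting 'tid' when the thread exits.  */
void
sketch_kernel_clears_tid (struct thread_sketch *t)
{
  atomic_store_explicit (&t->tid, 0, memory_order_release);
}

/* Small driver: publish TID, optionally simulate the exit, then read
   the field back through the accessor.  */
pid_t
sketch_observe (pid_t tid, int exited)
{
  struct thread_sketch t;
  atomic_init (&t.tid, tid);
  if (exited)
    sketch_kernel_clears_tid (&t);
  return sketch_gettid (&t);
}
```

In this model a caller sees either the original TID or zero.  The load
says nothing about whether the descriptor memory itself is still valid,
which is why the detached case cannot be made safe this way.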

Peter has suggested returning zero or -1 with ESRCH if the pthread
is detached from its underlying kernel thread, but I think 
INVALID_NOT_TERMINATED_TD_P is not valid for detached threads since
the struct thread ownership might be invalid at the time of the
call. So I think we should just make it undefined behavior and not
make any assumptions.

> 
>> Also, MacOSX signature is:
>>
>>   int pthread_gettid_np (pthread_t thread, uint64_t *thread_id)
>>
>> And it returns the current thread identification if THREAD is NULL, returns
>> ESRCH for an invalid handle (issues 1. and 2. above), and also consults 
>> the kernel if the identifier cannot be obtained.
> 
> Macos calls the interface pthread_threadid_np, actually.  It looks as if
> it returns a truly unique number that isn't reused within the process or
> system.  A Linux TID wouldn't be like that, so I think we should call
> the interface something else.

Fair enough, bionic has 

  pid_t pthread_gettid_np(pthread_t t)

So I think that might be an option. It basically returns the underlying
kernel process identifier, with none of the extra guarantees provided by
the MacOSX implementation.
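
For comparison, this is roughly what applications have to do today to
get the same pid_t without such an accessor: the thread publishes its own
kernel TID and the creator reads it while the handle is still valid.  It
is a Linux-only sketch; 'struct tid_slot', 'tid_worker' and
'spawn_and_get_tid' are hypothetical names, and the raw
syscall (SYS_gettid) is used so it does not depend on a gettid wrapper
being available.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <pthread.h>
#include <semaphore.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical slot through which a thread publishes its kernel TID.  */
struct tid_slot
{
  sem_t ready;
  pid_t tid;
};

static void *
tid_worker (void *arg)
{
  struct tid_slot *slot = arg;
  assert (slot != NULL);
  slot->tid = (pid_t) syscall (SYS_gettid);
  sem_post (&slot->ready);      /* Publish: 'tid' is now readable.  */
  return NULL;
}

/* Create a joinable thread, read the TID it published, join it, and
   return the TID (or -1 on failure).  */
pid_t
spawn_and_get_tid (void)
{
  struct tid_slot slot;
  pthread_t thr;

  if (sem_init (&slot.ready, 0, 0) != 0)
    return -1;
  if (pthread_create (&thr, NULL, tid_worker, &slot) != 0)
    {
      sem_destroy (&slot.ready);
      return -1;
    }

  /* Wait until the worker has stored its TID; retry if interrupted.  */
  while (sem_wait (&slot.ready) != 0)
    continue;
  pid_t tid = slot.tid;

  /* The thread is joinable and not yet joined, so 'thr' is still valid
     up to this call.  */
  pthread_join (thr, NULL);
  sem_destroy (&slot.ready);
  return tid;
}
```

The returned value is a plain Linux TID: it differs from getpid () for a
non-main thread, and the kernel may reuse it after the thread exits.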