Bug 26570 - One of the thread is getting hanged when two threads are trying yo join each other.
Summary: One of the thread is getting hanged when two threads are trying yo join each ...
Status: UNCONFIRMED
Alias: None
Product: glibc
Classification: Unclassified
Component: nptl (show other bugs)
Version: 2.31
: P2 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2020-09-03 08:30 UTC by Shashank Agarwal
Modified: 2020-09-04 08:01 UTC (History)
2 users (show)

See Also:
Host:
Target:
Build:
Last reconfirmed:


Attachments
Test Case describing the scenario (865 bytes, text/x-csrc)
2020-09-03 08:30 UTC, Shashank Agarwal
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Shashank Agarwal 2020-09-03 08:30:22 UTC
Created attachment 12814 [details]
Test Case describing the scenario

Hello,

Description -

I have obeserved a behaviour in pthread_join(3) where 2 threads are trying to join each other and one of the thread is getting hanged indefinitely as it stuck in the futex(2) system call for the other thread to wake.

Test Case Description -

1. Create multiple threads using pthread_create in the joinable state.
2. thread1 joins thread2 using pthread_join.
3. thread2 joins thread1 using pthread_join.

EXPECTED RESULT :
1. Thread1 should join thread2.
2. when thread2 tries to join thread1 it detects a deadlock.

Actual Result :
Thread 2 is getting hanged.

Glibc Version -
2.31, 2.28

Sample Code -
pthread_join_sample.c attached

Enviorenment -
$uname -a
Linux a1sb2-010 4.18.0-147.el8.x86_64 #1 SMP Thu Sep 26 15:52:44 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

Root Caues -

When any thread performs the pthread_join(3) on any thread, Glibc stores the ID of the former thread(performing pthread_join(3)) into the member joinid of the structure pthread of the thread on which pthread_join(3) is performed.

Hence in the expected scenario pthread_join(3) code flow works as follows -

1. thread t1 performing pthread_join(3) on thread t2, GLibc performs below checks before invoking futex wait-

Check 1 -
If the joinid of t1 contains the id of the t2 then it returns the EDEADLK if not then moves to check 2,

Check 2

Glibc checks wheather anyother thread has already performed the pthread_join on t2 or not. This is achieved with the help of joinid member of the thread t2. If this member contains the NULL value it means no other thread has performed the pthread_join on t2 and thread t1 updates its ID into the joinid member of structure pthread of t2 in nptl/pthread_join_common.c and perform futex wait. If the joinid member of thread t2 already contain some non null value it means some other thread has already performed pthread_join(3) on thread t2 and library returns EINVAL to the thread t1.

2. When thread t2 performs pthread_join(3) on t1, Glibc fails in check 2 i.e joinid of t2 contains the id of the t1 and it returns the EDEADLK.

As per analysis In Glibcv2.31, thread performing pthread_join(3) on any thread is not able to update the ID of the former thread into the member joinid of the other thread, Hence both the threads pass check 1 and check 2 and gone into the futex wait and one thread gets hanged. In glibc 2.31 below code semantics is used to update the ID of the thread -

atomic_compare_exchange_weak_acquire (&pd->joinid,&self,NULL)
This macro internally calls the builtin function __atomic_compare_exchange_n ((mem), (expected), (desired), 1, __ATOMIC_ACQUIRE, __ATOMIC_RELAXED) of gcc. This built-in function implements an atomic compare and exchange operation. This compares the contents of *ptr with the contents of *expected. If equal, the operation is a read-modify-write operation that writes desired into *ptr.
If they are not equal, the operation is a read and the current contents of *ptr are written into *expected.

The expected value passed to this builtin function is &self i.e the ID of the thread calling pthread_join(3). So as per our analysis joinid of the thread on which pthread_join(3) is performed will never be equal to the ID of the thread calling pthread_join(3) as its initial value will by NULL and this member joinid is always updated from pthread_join(3) hence in builtin function __atomic_compare_exchange_n the comaprision will always gets failed and the joinid will never be updated. Both the threads will go into the futex wait operation.

Fix -

In order to fix the issue, expected value passed to the function __atomic_compare_exchange_n should be null and the desired value should be the ID of the thread performing pthread_join(3) as below -

atomic_compare_exchange_weak_acquire (&pd->joinid,&null_ptr,self)

Regards
Shashank Agarwal
HCL Technologies Limited
shashank_agarwal@hcl.com
Comment 1 Adhemerval Zanella 2020-09-03 13:08:23 UTC
(In reply to Shashank Agarwal from comment #0)
> Created attachment 12814 [details]
> Test Case describing the scenario
> 
> Hello,
> 
> Description -
> 
> I have obeserved a behaviour in pthread_join(3) where 2 threads are trying
> to join each other and one of the thread is getting hanged indefinitely as
> it stuck in the futex(2) system call for the other thread to wake.
> 
> Test Case Description -
> 
> 1. Create multiple threads using pthread_create in the joinable state.
> 2. thread1 joins thread2 using pthread_join.
> 3. thread2 joins thread1 using pthread_join.
> 
> EXPECTED RESULT :
> 1. Thread1 should join thread2.
> 2. when thread2 tries to join thread1 it detects a deadlock.

The deadlock failure *may* be reportable by a pthread_join implementation, but it is not portable solution and relying on such behaviour is a fragile approach.  The glibc tries only to catch some possible scenarios which are the most straighforward, but there multiple other ones that current implementation does not handle.

A fully reliable deadlock implementation will either require some extra kernel support or a costly userspace implementation to maintain a user list per futex word (which I am not fully sure it is feasible).

> 
> Actual Result :
> Thread 2 is getting hanged.
> 
> Glibc Version -
> 2.31, 2.28
> 
> Sample Code -
> pthread_join_sample.c attached
> 
> Enviorenment -
> $uname -a
> Linux a1sb2-010 4.18.0-147.el8.x86_64 #1 SMP Thu Sep 26 15:52:44 UTC 2019
> x86_64 x86_64 x86_64 GNU/Linux
> 
> Root Caues -
> 
> When any thread performs the pthread_join(3) on any thread, Glibc stores the
> ID of the former thread(performing pthread_join(3)) into the member joinid
> of the structure pthread of the thread on which pthread_join(3) is performed.
> 
> Hence in the expected scenario pthread_join(3) code flow works as follows -
> 
> 1. thread t1 performing pthread_join(3) on thread t2, GLibc performs below
> checks before invoking futex wait-
> 
> Check 1 -
> If the joinid of t1 contains the id of the t2 then it returns the EDEADLK if
> not then moves to check 2,
> 
> Check 2
> 
> Glibc checks wheather anyother thread has already performed the pthread_join
> on t2 or not. This is achieved with the help of joinid member of the thread
> t2. If this member contains the NULL value it means no other thread has
> performed the pthread_join on t2 and thread t1 updates its ID into the
> joinid member of structure pthread of t2 in nptl/pthread_join_common.c and
> perform futex wait. If the joinid member of thread t2 already contain some
> non null value it means some other thread has already performed
> pthread_join(3) on thread t2 and library returns EINVAL to the thread t1.
> 
> 2. When thread t2 performs pthread_join(3) on t1, Glibc fails in check 2 i.e
> joinid of t2 contains the id of the t1 and it returns the EDEADLK.
> 
> As per analysis In Glibcv2.31, thread performing pthread_join(3) on any
> thread is not able to update the ID of the former thread into the member
> joinid of the other thread, Hence both the threads pass check 1 and check 2
> and gone into the futex wait and one thread gets hanged. In glibc 2.31 below
> code semantics is used to update the ID of the thread -
> 
> atomic_compare_exchange_weak_acquire (&pd->joinid,&self,NULL)
> This macro internally calls the builtin function __atomic_compare_exchange_n
> ((mem), (expected), (desired), 1, __ATOMIC_ACQUIRE, __ATOMIC_RELAXED) of
> gcc. This built-in function implements an atomic compare and exchange
> operation. This compares the contents of *ptr with the contents of
> *expected. If equal, the operation is a read-modify-write operation that
> writes desired into *ptr.
> If they are not equal, the operation is a read and the current contents of
> *ptr are written into *expected.
> 
> The expected value passed to this builtin function is &self i.e the ID of
> the thread calling pthread_join(3). So as per our analysis joinid of the
> thread on which pthread_join(3) is performed will never be equal to the ID
> of the thread calling pthread_join(3) as its initial value will by NULL and
> this member joinid is always updated from pthread_join(3) hence in builtin
> function __atomic_compare_exchange_n the comaprision will always gets failed
> and the joinid will never be updated. Both the threads will go into the
> futex wait operation.
> 
> Fix -
> 
> In order to fix the issue, expected value passed to the function
> __atomic_compare_exchange_n should be null and the desired value should be
> the ID of the thread performing pthread_join(3) as below -
> 
> atomic_compare_exchange_weak_acquire (&pd->joinid,&null_ptr,self)

This is wrong, the 'joinid' here is used to synchronize with the pthread_detach to indicate a waiter.  The pthread_create initializes the field to 'NULL' for default case or a self-reference to its own 'struct pthread' for the case of a detached attribute. 

This CAS checks if pthread_join is being called for a detached thread, where 'joinid' refers to the 'struct pthread' of the calling thread.  This is also UB, but glibc implementation handles this scenarios as an invalid call to caller.

I don't see a easy way to report deadlock in this scenarios in a *reliable* way without some extra kernel support.
Comment 2 Shashank Agarwal 2020-09-04 05:23:07 UTC
(In reply to Adhemerval Zanella from comment #1)
> (In reply to Shashank Agarwal from comment #0)
> > Created attachment 12814 [details]
> > Test Case describing the scenario
> > 
> > Hello,
> > 
> > Description -
> > 
> > I have obeserved a behaviour in pthread_join(3) where 2 threads are trying
> > to join each other and one of the thread is getting hanged indefinitely as
> > it stuck in the futex(2) system call for the other thread to wake.
> > 
> > Test Case Description -
> > 
> > 1. Create multiple threads using pthread_create in the joinable state.
> > 2. thread1 joins thread2 using pthread_join.
> > 3. thread2 joins thread1 using pthread_join.
> > 
> > EXPECTED RESULT :
> > 1. Thread1 should join thread2.
> > 2. when thread2 tries to join thread1 it detects a deadlock.
> 
> The deadlock failure *may* be reportable by a pthread_join implementation,
> but it is not portable solution and relying on such behaviour is a fragile
> approach.  The glibc tries only to catch some possible scenarios which are
> the most straighforward, but there multiple other ones that current
> implementation does not handle.
> 
> A fully reliable deadlock implementation will either require some extra
> kernel support or a costly userspace implementation to maintain a user list
> per futex word (which I am not fully sure it is feasible).
> 
> > 
> > Actual Result :
> > Thread 2 is getting hanged.
> > 
> > Glibc Version -
> > 2.31, 2.28
> > 
> > Sample Code -
> > pthread_join_sample.c attached
> > 
> > Enviorenment -
> > $uname -a
> > Linux a1sb2-010 4.18.0-147.el8.x86_64 #1 SMP Thu Sep 26 15:52:44 UTC 2019
> > x86_64 x86_64 x86_64 GNU/Linux
> > 
> > Root Caues -
> > 
> > When any thread performs the pthread_join(3) on any thread, Glibc stores the
> > ID of the former thread(performing pthread_join(3)) into the member joinid
> > of the structure pthread of the thread on which pthread_join(3) is performed.
> > 
> > Hence in the expected scenario pthread_join(3) code flow works as follows -
> > 
> > 1. thread t1 performing pthread_join(3) on thread t2, GLibc performs below
> > checks before invoking futex wait-
> > 
> > Check 1 -
> > If the joinid of t1 contains the id of the t2 then it returns the EDEADLK if
> > not then moves to check 2,
> > 
> > Check 2
> > 
> > Glibc checks wheather anyother thread has already performed the pthread_join
> > on t2 or not. This is achieved with the help of joinid member of the thread
> > t2. If this member contains the NULL value it means no other thread has
> > performed the pthread_join on t2 and thread t1 updates its ID into the
> > joinid member of structure pthread of t2 in nptl/pthread_join_common.c and
> > perform futex wait. If the joinid member of thread t2 already contain some
> > non null value it means some other thread has already performed
> > pthread_join(3) on thread t2 and library returns EINVAL to the thread t1.
> > 
> > 2. When thread t2 performs pthread_join(3) on t1, Glibc fails in check 2 i.e
> > joinid of t2 contains the id of the t1 and it returns the EDEADLK.
> > 
> > As per analysis In Glibcv2.31, thread performing pthread_join(3) on any
> > thread is not able to update the ID of the former thread into the member
> > joinid of the other thread, Hence both the threads pass check 1 and check 2
> > and gone into the futex wait and one thread gets hanged. In glibc 2.31 below
> > code semantics is used to update the ID of the thread -
> > 
> > atomic_compare_exchange_weak_acquire (&pd->joinid,&self,NULL)
> > This macro internally calls the builtin function __atomic_compare_exchange_n
> > ((mem), (expected), (desired), 1, __ATOMIC_ACQUIRE, __ATOMIC_RELAXED) of
> > gcc. This built-in function implements an atomic compare and exchange
> > operation. This compares the contents of *ptr with the contents of
> > *expected. If equal, the operation is a read-modify-write operation that
> > writes desired into *ptr.
> > If they are not equal, the operation is a read and the current contents of
> > *ptr are written into *expected.
> > 
> > The expected value passed to this builtin function is &self i.e the ID of
> > the thread calling pthread_join(3). So as per our analysis joinid of the
> > thread on which pthread_join(3) is performed will never be equal to the ID
> > of the thread calling pthread_join(3) as its initial value will by NULL and
> > this member joinid is always updated from pthread_join(3) hence in builtin
> > function __atomic_compare_exchange_n the comaprision will always gets failed
> > and the joinid will never be updated. Both the threads will go into the
> > futex wait operation.
> > 
> > Fix -
> > 
> > In order to fix the issue, expected value passed to the function
> > __atomic_compare_exchange_n should be null and the desired value should be
> > the ID of the thread performing pthread_join(3) as below -
> > 
> > atomic_compare_exchange_weak_acquire (&pd->joinid,&null_ptr,self)
> 
> This is wrong, the 'joinid' here is used to synchronize with the
> pthread_detach to indicate a waiter.  The pthread_create initializes the
> field to 'NULL' for default case or a self-reference to its own 'struct
> pthread' for the case of a detached attribute. 
> 
> This CAS checks if pthread_join is being called for a detached thread, where
> 'joinid' refers to the 'struct pthread' of the calling thread.  This is also
> UB, but glibc implementation handles this scenarios as an invalid call to
> caller.
> 
> I don't see a easy way to report deadlock in this scenarios in a *reliable*
> way without some extra kernel support.

Hello,

I understand your concern. But as per my understanding in Glibc v2.21 joinid was updated with the ID of the thread calling pthread_join(3) that is why I thought in v2.31 this is the issue in CAS when updating the joinid -

else if (__builtin_expect (atomic_compare_and_exchange_bool_acq (&pd->joinid,
                                                                   self,
                                                                   NULL), 0))
    /* There is already somebody waiting for the thread.  */
    result = EINVAL;

Apart from this issue, I have also observed the another issue in Glibc v2.31 in pthread_join(3). As per the POSIX implementation of pthread_join(3) EINVAL should be return to the thread when pthread_join(3) is called for the thread on which another thread is already waiting to join. In Glibc v2.31 one of the thread is getting hanged and going for the futex wait indefinately for the same reason as mention in comment#1.

Regards
Shashank Agarwal
HCL Technologies Limited
shashank_agarwal@hcl.com
Comment 3 Andreas Schwab 2020-09-04 08:01:52 UTC
This is only a recommendation, not a requirement.