This is the mail archive of the
libc-help@sourceware.org
mailing list for the glibc project.
Weird behavior observed with NPTL semaphores
- From: "Tetreault, Francois" <ftetreau at ciena dot com>
- To: "libc-help at sourceware dot org" <libc-help at sourceware dot org>
- Date: Thu, 30 Oct 2014 16:37:22 -0400
- Subject: Weird behavior observed with NPTL semaphores
Hello,
We have questions about the glibc Native POSIX Thread Library (NPTL).
We have an application with a few threads, where mutexes are used to arbitrate access to data.
The content of the mutex object is shown below.
mMutex = {
__data = {
__lock = -2147473878,
__count = 0,
__owner = 0,
__kind = 33,
__nusers = 0,
{
__spins = 0,
__list = {
__next = 0x0
}
}
},
__size = "\200\000&*", '\000' <repeats 11 times>, "!\000\000\000\000\000\000\000",
__align = -2147473878
}
Where 33 translates to:
#define PTHREAD_MUTEX_TYPE(m) ((m)->__data.__kind & 127)
PTHREAD_MUTEX_PRIO_INHERIT_NP = 32
PTHREAD_MUTEX_RECURSIVE_NP = 1
PTHREAD_MUTEX_PI_RECURSIVE_NP = PTHREAD_MUTEX_PRIO_INHERIT_NP | PTHREAD_MUTEX_RECURSIVE_NP
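As a quick check of the decoding above, here is a minimal sketch that splits a __kind value into its flag bits. The constants 127, 32 and 1 are the ones quoted above; the mask of 3 for the basic mutex type and all helper names are assumptions made for this illustration, not glibc API.

```c
#include <assert.h>
#include <stdio.h>

/* Values quoted in the message; the names here are local to this sketch. */
enum {
    KIND_TYPE_MASK    = 127, /* mask applied by PTHREAD_MUTEX_TYPE() */
    KIND_BASIC_MASK   = 3,   /* assumed: low bits select normal/recursive/... */
    KIND_RECURSIVE    = 1,   /* recursive mutex */
    KIND_PRIO_INHERIT = 32   /* priority-inheritance bit */
};

/* Hypothetical helpers: decode the __kind field of a mutex dump. */
static int kind_type(int kind)         { return kind & KIND_TYPE_MASK; }
static int kind_is_pi(int kind)        { return (kind & KIND_PRIO_INHERIT) != 0; }
static int kind_is_recursive(int kind) { return (kind & KIND_BASIC_MASK) == KIND_RECURSIVE; }

/* Print a one-line summary of a __kind value. */
static void describe_kind(int kind)
{
    printf("kind=%d type=%d pi=%d recursive=%d\n",
           kind, kind_type(kind), kind_is_pi(kind), kind_is_recursive(kind));
}
```

Calling describe_kind(33) reports both the priority-inheritance and the recursive bit as set, which matches the PI recursive type we configured.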
A problem occurs, only once in a blue moon, where the code fails to release the mutex (what we call the semaphore). The unlock fails, complaining that the mutex is not owned by any thread at the moment we try to give it away.
We have added our own instrumentation, to hopefully understand what is going on. See our trace below.
Caution: our tracing is not perfect because it is not reentrant; we could easily be preempted while we are capturing the data.
Also note that, in our trace:
. "pre" is the value of the fields prior to the mutex operation, and "post" is afterwards.
. MUTEX_GIVE is a call to pthread_mutex_unlock(), and
. MUTEX_TAKE is a call to pthread_mutex_lock().
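For readers who want to reproduce this kind of trace, the following is a rough sketch of the capture, not our actual code. The record layout and the names traced_op and current_tid are hypothetical, and reading __data.__count and __data.__owner relies on glibc's internal pthread_mutex_t layout on Linux rather than on any POSIX interface. Note that the "pre" snapshot is read without holding the mutex, so another thread may change it before the call takes effect; a "pre" value can therefore legitimately show a different owner.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <pthread.h>
#include <sys/syscall.h>
#include <unistd.h>

enum mutex_action { MUTEX_TAKE, MUTEX_GIVE };

/* Hypothetical trace record mirroring the fields in the trace. */
struct mutex_trace {
    long calling_task;            /* kernel thread ID of the caller */
    enum mutex_action action;
    int rc;                       /* return code of the pthread call */
    int pre_count, pre_owner;     /* snapshot before the call (racy) */
    int post_count, post_owner;   /* snapshot after the call */
};

/* Kernel thread ID, the same kind of value glibc stores in __owner. */
static long current_tid(void)
{
    return syscall(SYS_gettid);
}

/* Lock or unlock the mutex and record its internal fields around the call. */
static int traced_op(pthread_mutex_t *m, enum mutex_action action,
                     struct mutex_trace *t)
{
    t->calling_task = current_tid();
    t->action = action;

    /* Unsynchronized peek at glibc internals: may be stale immediately. */
    t->pre_count = (int)m->__data.__count;
    t->pre_owner = m->__data.__owner;

    t->rc = (action == MUTEX_TAKE) ? pthread_mutex_lock(m)
                                   : pthread_mutex_unlock(m);

    /* Stable after a successful TAKE (we hold the mutex);
       racy again after a GIVE, since another thread may already own it. */
    t->post_count = (int)m->__data.__count;
    t->post_owner = m->__data.__owner;
    return t->rc;
}
```

On a recursive mutex with no contention, a TAKE followed by a GIVE from the same thread should produce records shaped like [trace 2] and [trace 1]: count 0 to 1 with the owner set to the caller, then count 1 to 0 with the owner cleared.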
{ [trace 1]
calling_task = 3659,
action = MUTEX_GIVE,
pre_count = 1,
pre_owner = 3659,
post_count = 0,
post_owner = 0
}, { [trace 2]
calling_task = 4690,
action = MUTEX_TAKE,
pre_count = 0,
pre_owner = 0,
post_count = 1,
post_owner = 4690
}, { [trace 3]
calling_task = 3659,
action = MUTEX_TAKE,
pre_count = 1,
pre_owner = 4690,
post_count = 1,
post_owner = 3659
}, { [trace 4]
calling_task = 4690,
action = MUTEX_GIVE,
pre_count = 1,
pre_owner = 4690,
post_count = 0,
post_owner = 0
}, { [trace 5]
calling_task = 3659,
action = MUTEX_GIVE,
pre_count = 0,
pre_owner = 0,
post_count = 0,
post_owner = 0
}, { [trace 6]
calling_task = 4690,
action = MUTEX_TAKE,
pre_count = 0,
pre_owner = 0,
post_count = 0,
post_owner = 0
}, { [trace 7]
calling_task = 3659,
action = MUTEX_TAKE,
pre_count = 0,
pre_owner = 0,
post_count = 1,
post_owner = 0
}, { [trace 8]
calling_task = 3659,
action = MUTEX_GIVE,
pre_count = 1,
pre_owner = 0,
post_count = 1,
post_owner = 0
}
In the end [trace 8], the Mutex content is as follows:
mMutex = {
__data = {
__lock = -2147479989,
__count = 1,
__owner = 0,
__kind = 33,
__nusers = 0,
{
__spins = 0,
__list = {
__next = 0x0
}
}
},
__size = "\200\000\016K\000\000\000\001\000\000\000\000\000\000\000!\000\000\000\000\000\000\000",
__align = -2147479989
}
}
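The __lock values in the two dumps can also be decoded. For a priority-inheritance mutex, the lock word is the futex word shared with the kernel; assuming the usual Linux layout (owner thread ID in the low 30 bits, a "waiters" flag in the top bit, an "owner died" flag in the next bit), the sketch below splits the dumped values into those parts. The macro and function names are local to this example.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed futex word layout for PI futexes (values as in linux/futex.h). */
#define FW_WAITERS    0x80000000u  /* other threads are blocked on the lock */
#define FW_OWNER_DIED 0x40000000u  /* previous owner exited while holding it */
#define FW_TID_MASK   0x3fffffffu  /* thread ID of the current owner */

static uint32_t futex_word_tid(int32_t lock)
{
    return (uint32_t)lock & FW_TID_MASK;
}

static int futex_word_has_waiters(int32_t lock)
{
    return ((uint32_t)lock & FW_WAITERS) != 0;
}

static int futex_word_owner_died(int32_t lock)
{
    return ((uint32_t)lock & FW_OWNER_DIED) != 0;
}

/* Print the decoded parts of a __lock value taken from a dump. */
static void print_futex_word(const char *label, int32_t lock)
{
    printf("%s: raw=0x%08x tid=%u waiters=%d owner_died=%d\n",
           label, (unsigned)(uint32_t)lock, (unsigned)futex_word_tid(lock),
           futex_word_has_waiters(lock), futex_word_owner_died(lock));
}
```

Under that assumption, the first dump (__lock = -2147473878, i.e. 0x8000262A) decodes to thread ID 9770 with the waiters bit set, and the final dump (__lock = -2147479989, i.e. 0x80000E4B) decodes to thread ID 3659 with the waiters bit set. If this reading is right, the lock word in the final dump still names task 3659 as the holder even though __owner reads 0.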
The trace data actually triggered more questions than answers.
1. Is it ever a valid state to have a count greater than 0 while the value of owner is 0?
2. Note that our code asserts if any non-successful code is returned from calling either pthread_mutex_unlock() or pthread_mutex_lock().
3. In [trace 5], coming in (pre) we expected the mutex to be owned by 3659, but both count and owner are set to 0.
4. Starting from this point on, the content of the trace seems to be falling apart. Yet our code only asserts when it gets to [trace 8]!
5. Also notice that the owner field is always 0 from [trace 5] onwards.
6. Are there any known bugs that could lead to this weird behavior?
Info about the system.
. Linux Kernel version: 3.4.36
. Glibc version: 2.9 "stable"
. GCC version: powerpc-e500-linux-gnuspe-gcc (GCC) 4.6.3
. Processor: Freescale MPC8572
. Mode of operation: Symmetric Multi-Processing (SMP)
Thank you,
Francois