This is the mail archive of the glibc-linux@ricardo.ecn.wfu.edu mailing list for the glibc project.



Re: glibc 2.1.3pre1 with Linuxthreads 2.1.3pre2 made my application work again


On Fri, 28 Jan 2000, Kai Engert wrote:

> condition occurred, that looked like an glibc internal inconsistency:
> 
> pthread_mutex_t Handle = {__m_reserved = 1074995304, __m_count = 0,
> __m_owner = 0x0, __m_kind = 2, __m_lock = {__status = 1, __spinlock = 0}
> 
> Looking at the glibc-source, for my understanding, it's not correct to
> have this kind of value combination in a pthread_mutex_t struct. While
> __status == 1, __m_owner should contain a value != 0.

That is not strictly true. The __m_owner field is set to a non-null
value (in error-checking or recursive mutexes only) after the new owner
has acquired the fastlock, so there is a brief window of execution during
which the lock is already held but __m_owner is not yet updated. Likewise,
__m_owner is cleared just before the fastlock is given up.

Of course, the state combination is not valid under any circumstances
other than these two!
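
To make the two windows concrete, here is a minimal single-threaded sketch
of the ordering described above. It is not the LinuxThreads source: the
toy_mutex structure and all toy_* function names are hypothetical and exist
only for illustration.

```c
#include <stddef.h>

/* Hypothetical stand-in for the two fields under discussion. */
struct toy_mutex {
    int   status;  /* 1 while the fastlock is held */
    void *owner;   /* recorded only after the fastlock is taken */
};

/* Lock path: the fastlock is taken first, the owner is recorded second. */
void toy_take_fastlock(struct toy_mutex *m)            { m->status = 1; }
void toy_record_owner(struct toy_mutex *m, void *self) { m->owner = self; }

/* Unlock path: the owner is cleared first, the fastlock is released second. */
void toy_clear_owner(struct toy_mutex *m)              { m->owner = NULL; }
void toy_release_fastlock(struct toy_mutex *m)         { m->status = 0; }

/* True in exactly the state from the report: lock held, no owner recorded. */
int toy_in_window(const struct toy_mutex *m)
{
    return m->status == 1 && m->owner == NULL;
}
```

Between toy_take_fastlock and toy_record_owner, and again between
toy_clear_owner and toy_release_fastlock, toy_in_window returns 1. Seeing
that state at any other moment is what would point to a real inconsistency.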

> I recognized this problem because pthread_mutex_unlock returned "EPERM
> (the calling thread does not own the mutex)".

Which strongly suggests that two (or more) threads were fooled by the bug into
thinking they all have the mutex. The first one of them released it; and the
subsequent release by one of the other threads produced the error.

The only other way that could happen is if a thread tried to release a mutex
which it does not own; that can't be the case because you sound like you know
very well what you are doing, and the application runs fine for weeks with a
fixed library. 
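
For reference, the "release by a thread that does not own it" case can be
reproduced on purpose. The sketch below assumes a POSIX system that provides
PTHREAD_MUTEX_ERRORCHECK (the standard name, not the _NP variant); the
function names are hypothetical. One thread locks an error-checking mutex
and a second thread tries to unlock it.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <pthread.h>

static pthread_mutex_t m;

/* Runs in a second thread, which does not own the mutex. */
static void *try_unlock(void *arg)
{
    int *rc = arg;
    *rc = pthread_mutex_unlock(&m);
    return NULL;
}

/* Returns the error code seen by the non-owning thread. */
int unlock_by_non_owner(void)
{
    pthread_mutexattr_t attr;
    pthread_t t;
    int rc = -1;

    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    pthread_mutex_init(&m, &attr);
    pthread_mutexattr_destroy(&attr);

    pthread_mutex_lock(&m);                   /* this thread is the owner */
    pthread_create(&t, NULL, try_unlock, &rc);
    pthread_join(t, NULL);
    pthread_mutex_unlock(&m);                 /* the legitimate release */
    pthread_mutex_destroy(&m);
    return rc;
}
```

On a conforming implementation the second thread's unlock should fail with
EPERM, which is the same error the application reported.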

> I was able to reproduce this condition multiple times.
> 
> Reading the mailing list, I found your recent bugfix
>        * spinlock.c: __pthread_lock queues back any received restarts
>          that don't belong to it instead of assuming ownership of lock
>          upon any restart; fastlock can no longer be acquired by two
>          threads simultaneously.

You are most likely having the same problem that is addressed by this
fix.  The problems that are addressed only occur in programs that:

-- cancel threads that are waiting on pthread_cond_wait,
   pthread_cond_timedwait, sem_wait or pthread_join; or

-- use pthread_cond_timedwait

Is this the case in your application? If so, I'm quite confident that you
ran into the bugs that were addressed.
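
To check whether your code matches the second case, this is the kind of
call pattern meant by "use pthread_cond_timedwait". It is only an
illustrative sketch: the function name and the millisecond parameter are my
own assumptions, and nothing ever signals the condition, so the wait always
runs to its deadline.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <pthread.h>
#include <time.h>

static pthread_mutex_t mu = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cv = PTHREAD_COND_INITIALIZER;

/* Waits on cv for at most ms milliseconds and returns the final error code. */
int wait_with_timeout(long ms)
{
    struct timespec deadline;
    int rc = 0;

    /* pthread_cond_timedwait takes an absolute time, not an interval. */
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec  += ms / 1000;
    deadline.tv_nsec += (ms % 1000) * 1000000L;
    if (deadline.tv_nsec >= 1000000000L) {
        deadline.tv_sec++;
        deadline.tv_nsec -= 1000000000L;
    }

    pthread_mutex_lock(&mu);
    /* Loop so that a spurious wakeup does not end the wait early. */
    while (rc == 0)
        rc = pthread_cond_timedwait(&cv, &mu, &deadline);
    pthread_mutex_unlock(&mu);
    return rc;
}
```

A call such as wait_with_timeout(10) should come back with ETIMEDOUT. A
wait that ends by timing out is the kind of timed-out condition wait the
fix is concerned with.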

A program that uses neither timed condition waits nor thread cancellation
should not have problems, because it will not trigger the race conditions
that cause a dangling restart.

For example, recycling threads using a thread pool instead of cancelling
threads and making new ones should make the problem go away (assuming you don't
also use pthread_cond_timedwait).
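
A rough sketch of such a pool follows, assuming a fixed number of workers
and a plain job counter in place of a real work queue. All names here
(run_pool, NWORKERS, pending, and so on) are hypothetical. The workers block
in pthread_cond_wait only, and they are told to exit through a flag rather
than being cancelled.

```c
#include <pthread.h>

#define NWORKERS 4

static pthread_mutex_t mu = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cv = PTHREAD_COND_INITIALIZER;
static int pending;        /* jobs waiting to be picked up */
static int done;           /* jobs completed */
static int shutting_down;  /* set once; workers exit after draining */

static void *worker(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&mu);
    for (;;) {
        /* Untimed wait: no pthread_cond_timedwait anywhere. */
        while (pending == 0 && !shutting_down)
            pthread_cond_wait(&cv, &mu);
        if (pending == 0)
            break;                    /* shutting down and queue drained */
        pending--;
        pthread_mutex_unlock(&mu);
        /* ... the real work would happen here, outside the lock ... */
        pthread_mutex_lock(&mu);
        done++;
    }
    pthread_mutex_unlock(&mu);
    return NULL;
}

/* Runs njobs dummy jobs on the pool and returns how many completed. */
int run_pool(int njobs)
{
    pthread_t tid[NWORKERS];
    int i;

    pending = done = shutting_down = 0;
    for (i = 0; i < NWORKERS; i++)
        pthread_create(&tid[i], NULL, worker, NULL);

    pthread_mutex_lock(&mu);
    pending += njobs;
    shutting_down = 1;                /* ask for exit once the jobs are done */
    pthread_cond_broadcast(&cv);
    pthread_mutex_unlock(&mu);

    for (i = 0; i < NWORKERS; i++)
        pthread_join(tid[i], NULL);   /* threads end on their own, no cancel */
    return done;
}
```

In a long-running server the shutdown flag would of course be set much
later; it is set immediately here only to keep the sketch short.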

This is probably the reason why I haven't seen crash reports from the
maintainers of certain high profile projects; for example, I didn't hear about
any issues related to the AOL multithreaded web server. You'd expect them to
run into these bugs; but then again they probably create a pool of threads that
are never canceled, and don't use pthread_cond_timedwait.  So they avoid
triggering the problems by dumb luck.

